ITEM RESPONSE THEORY MODELS &
COMPOSITE MEASURES
Sharon-Lise T. Normand
Harvard Medical School and School of Public Health
Funded, in part, by R01-MH54693 from the
National Institute of Mental Health
Normand (AcademyHealth 2009)
THE PROBLEM
• Multiple measures are collected to
characterize an endpoint.
• Want to summarize:
– The relationship of the measures to a covariate
(say, treatment) OR
– The underlying trait or construct the measures
represent
• How to combine or pool the multiple
measures into a single summary measure?
ASSUMPTIONS
1. The multiple measures or items are
measured on the same scale.
2. The multiple items measure the same
underlying construct.
(1) and (2) imply the measures are commensurate.
FOCUS: DISCRETE DATA
How to combine data that are
dichotomous or ordinal?
Standard Pooling Rules:
• “OR” rule (if any true, then composite is
set to true)
• “AND” rule (all-or-none)
These rules have very poor statistical properties!
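For concreteness, a minimal Python sketch of the two rules with made-up item values; the missing-data handling shown is one arbitrary choice, which previews the problems listed on the next slide.

```python
import numpy as np

def or_rule(items):
    """'OR' rule: composite is 1 if any observed item equals 1."""
    return int(np.nanmax(np.asarray(items, dtype=float)) == 1)

def and_rule(items):
    """'AND' rule (all-or-none): composite is 1 only if every observed item equals 1."""
    return int(np.nanmin(np.asarray(items, dtype=float)) == 1)

# Three dichotomous items for one subject; the third item is missing.
items = [1, 1, np.nan]
print(or_rule(items), and_rule(items))   # prints "1 1", but only because the NaN was skipped
# The AND composite is 1 here only because the missing item was ignored;
# imputing 0 for it would flip the composite to 0, while imputing 1 would not.
```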
PROBLEMS WITH POOLING RULES
• Optimal rule depends on type of
measurement error
• Pooling is difficult if the items or methods are not identical across the various sources
• Not clearly defined in the presence of
missing data
• Forces the investigator to put all measures on the same scale
WHAT IS AN ITEM RESPONSE
THEORY (IRT) MODEL?
A statistical model that relates the probability
of response to item-specific parameters
and to the subject’s underlying latent trait.
Provides a formal mechanism to pool the data
in order to obtain a composite.
Classical Test Theory
• Estimate reliability of items (coefficient α).
• Model: Ysj = θs + εsj, where
  Ysj = response of the sth subject to the jth item,
  θs = underlying trait (the “true” unobserved score),
  εsj = error, Normal with expectation 0 and constant variance.

Item Response Theory
• Estimate the discriminating ability of items using item-specific parameters.
• Responses within a subject are independent conditional on the latent trait.
• Normality & constant variance are not assumed.
• θs ~ N(0,1) is the “true” unobserved score.
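To make the classical test theory model above concrete, here is a minimal sketch of coefficient α computed from data simulated under Ysj = θs + εsj (the number of subjects, number of items, and error variance are invented for illustration):

```python
import numpy as np

def cronbach_alpha(X):
    """Coefficient alpha for an (n_subjects x n_items) matrix of item scores."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]                              # number of items
    item_vars = X.var(axis=0, ddof=1)           # variance of each item
    total_var = X.sum(axis=1).var(ddof=1)       # variance of the sum score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Simulate Y_sj = theta_s + eps_sj for 500 subjects and 5 items.
rng = np.random.default_rng(0)
theta = rng.normal(0, 1, size=(500, 1))         # latent "true" scores
Y = theta + rng.normal(0, 0.8, size=(500, 5))   # item responses with constant-variance error
print(round(cronbach_alpha(Y), 2))              # reliability of the 5-item sum score
```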
DICHOTOMOUS OR ORDINAL
RESPONSES
Observed response is ysj; generalized
linear model formulation:
h(P(ysj = 1 | θs)) = αj(θs - βj)
• h = link function (logit or probit)
• αj and βj are “item” parameters.
RASCH MODEL
(1-PARAMETER LOGISTIC)
Simplest IRT Model
• Ysj = 1 if subject s responds correctly to item j and 0 otherwise.
• θs = latent trait for subject s.
• βj = difficulty of the jth item.
Probability that subject s responds correctly to the jth item:
P(Ysj = 1 | θs) = exp(θs - βj) / (1 + exp(θs - βj))
RASCH MODEL: 3 SUBJECTS WITH DIFFERENT TRAITS
[Figure: item characteristic curves (ICCs) showing the probability of a correct response as a function of item difficulty β, one curve per subject trait (θ = 1, 0, -1).]
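A minimal sketch of the Rasch probability, evaluated over a grid of difficulties to trace out ICCs like those in the figure (the θ values and the grid simply mirror the plot):

```python
import numpy as np

def rasch_prob(theta, beta):
    """Rasch model: P(Y = 1 | theta) = exp(theta - beta) / (1 + exp(theta - beta))."""
    return 1.0 / (1.0 + np.exp(-(theta - beta)))

# One ICC per subject: probability of a correct response as item difficulty varies.
beta_grid = np.linspace(-4, 4, 9)
for theta in (1.0, 0.0, -1.0):
    print(f"theta = {theta:+.0f}:", np.round(rasch_prob(theta, beta_grid), 2))
```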
2-PARAMETER LOGISTIC
• Ysj = 1 if subject s responds correctly to item j and 0 otherwise.
• θs = latent trait for subject s.
• βj = difficulty of the jth item.
• αj = discrimination of the jth item (αj > 0).
Probability that subject s responds correctly to the jth item:
P(Ysj = 1 | θs) = exp(αj(θs - βj)) / (1 + exp(αj(θs - βj)))
2-PARAMETER LOGISTIC: 3 ITEMS (β = 1)
[Figure: ICCs showing the probability of a correct response as a function of θ for three items with common difficulty β = 1 and discrimination α = 0.5, 1, and 3.]
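The corresponding sketch for the 2-parameter logistic, holding β = 1 and varying α as in the figure:

```python
import numpy as np

def two_pl_prob(theta, alpha, beta):
    """2-PL: P(Y = 1 | theta) = exp(alpha*(theta - beta)) / (1 + exp(alpha*(theta - beta)))."""
    return 1.0 / (1.0 + np.exp(-alpha * (theta - beta)))

# Same difficulty (beta = 1), three discrimination values; theta on the x-axis.
theta_grid = np.linspace(-4, 4, 9)
for alpha in (0.5, 1.0, 3.0):
    print(f"alpha = {alpha}:", np.round(two_pl_prob(theta_grid, alpha, 1.0), 2))
# Larger alpha gives a steeper ICC around theta = beta, i.e. the item separates
# subjects just below and just above its difficulty more sharply.
```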
2-PARAMETER LOGISTIC: 3 ITEMS & DIFFERENT β’s
[Figure: ICCs as a function of θ for three items: α = 1, β = -1; α = 3, β = 0; α = 0.5, β = 1.]
APPLICATIONS
1. Depression: use a set of items endorsed by
a subject to determine level of depression.
2. Physician-level quality: use a set of quality indicators at the patient level to learn about physician-level quality.
3. Hospital-level quality: use a set of quality indicators at the patient level to learn about hospital-level quality.
COMPOSITE INTERNATIONAL
DIAGNOSTIC INTERVIEW (CIDI)
• Several binary-valued questions: lost appetite; lost weight; trouble falling asleep; lack energy; felt worthless; felt inferior; felt guilty; etc.
• Does measurement bias in the CIDI exist? Bias occurs when one group is less likely to meet DSM criteria despite a similar level of underlying disorder.
• Breslau et al. (2008) studied this for depression.
Illustration of differential item functioning (DIF) as variation in item characteristic curves (ICCs)*
[Figure: probability of endorsing the 'Felt Worthless' item as a function of level of depression, with separate ICCs for Whites and for Blacks.]
*ICCs estimated using 2-parameter logistic model (Breslau et al. Differential item functioning
between ethnic groups in the epidemiological assessment of depression. The Journal of
Nervous and Mental Disease, 2008;196(4)).
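To make DIF concrete: under the 2-parameter logistic model used in the footnote above, DIF means the item parameters differ across groups even though θ is on a common scale. A minimal sketch with invented parameter values (not the estimates from Breslau et al.):

```python
import numpy as np

def two_pl_prob(theta, alpha, beta):
    """2-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + np.exp(-alpha * (theta - beta)))

# Hypothetical 'felt worthless' item parameters for two groups. Under DIF, the same
# level of depression (theta) yields different endorsement probabilities because
# alpha and/or beta differ by group.
params = {"group A": (1.2, 0.0), "group B": (1.2, 0.6)}   # (alpha, beta)
theta = 0.5                                               # same underlying severity
for group, (alpha, beta) in params.items():
    print(group, round(two_pl_prob(theta, alpha, beta), 2))
```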
HOSPITAL-LEVEL MEASUREMENT
• Multiple measures of hospital quality
promulgated
• Process and outcome measures for
different diseases
• How to determine the quality of a hospital for the treatment of a particular disease?
HOSPITAL QUALITY FOR HEART FAILURE
(3376 US Hospitals; 2005)

Measure                     Median # Eligible Patients [10th; 90th]   Median % Compliant [10th; 90th]
LVF assessment              200 [51; 580]                              89 [64; 98]
ACE or ARB for LVSD          34 [5; 120]                               83 [60; 100]
Smoking cessation advice     25 [1; 98]                                79 [40; 100]
Discharge instructions      135 [0; 469]                               55 [15; 87]

LVF = left ventricular function; LVSD = left ventricular systolic dysfunction
Teixeira-Pinto and Normand – Statistics in Medicine (2008)
Hospital Performance
Ysj | βj, θs ~ Bin(nsj, psj), where logit(psj) = β1j(θs - β0j) and θs ~ N(0,1), iid
• Ysj = number of eligible cases in the sth hospital receiving process measure j (out of nsj eligible).
• β0j = “difficulty” of the jth process measure.
• β1j = “discriminating” ability of the jth process measure.
• θs = underlying quality of care for the sth hospital.
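A minimal simulation sketch of this model in Python; the β values and eligible-patient counts are invented for illustration, and the fitting step (Bayesian in the paper, hence the credible intervals discussed below) is not shown:

```python
import numpy as np

rng = np.random.default_rng(2008)

S, J = 3376, 4                              # hospitals and process measures
theta = rng.normal(0, 1, size=S)            # latent hospital quality, theta_s ~ N(0,1)
beta0 = np.array([-2.0, -1.5, -1.2, 0.2])   # "difficulty" of each measure (illustrative)
beta1 = np.array([1.0, 1.3, 0.8, 1.5])      # "discrimination" of each measure (illustrative)
n = rng.integers(10, 300, size=(S, J))      # eligible patients per hospital and measure

# p_sj = inverse-logit( beta1_j * (theta_s - beta0_j) )
p = 1.0 / (1.0 + np.exp(-beta1 * (theta[:, None] - beta0)))
Y = rng.binomial(n, p)                      # counts of eligible patients receiving each measure

print(Y[:3])                                # observed counts for the first three hospitals
print(np.round(p[:3], 2))                   # their underlying compliance probabilities
```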
[Page 1344 reproduced from Teixeira-Pinto and Normand, Statistics in Medicine 2008; 27:1329-1350.]
[Figure 6: distribution of the raw average score and raw-weighted average score across hospitals, and estimated probability of receiving each therapy (discharge instructions, evaluation of LVS function, ACEI or ARB for LVSD, adult smoking cessation advice) as a function of the latent score for patients hospitalized with heart failure.]
[...] for the superior performance category is chosen so that 10 per cent of the hospitals are classified as having superior performance. The comparison between the classification based on the RAS and RWAS point estimate, and the LS credible interval is obviously not fair because they are based on different classification rules. However, it serves the purpose of illustrating how the classification changes depending on the methodology used. For AMI only 103 out of 245 hospitals (42 per cent) are classified in the superior performance category both by the RAS point estimate and the LS credible interval procedure. For this condition, the classification of superior performance based on the RWAS and LS coincided in 169 out of 245 (69 per cent) hospitals. For HF, 247 out of 333 hospitals (74 per cent) are classified in the superior performance category both by the RAS point estimate and the LS credible interval procedure. The classification of superior performance based on the RWAS and LS coincided in 280 out of 338 (83 per cent) hospitals for HF. Regarding small hospitals, the approach based on the credible intervals for the LS is more restrictive than the one based on the point estimate. For the classification using the point estimate alone of the LS, the proportion of small volume hospitals in the higher category of performance [...]
Comparing Composites:
(Teixeira-Pinto and Normand, Statistics in Medicine (2008))
2005 Data
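A sketch of one plausible construction of the raw composites being compared; the exact definitions are in the paper, so treat these formulas as an assumption: the raw average score (RAS) averages the per-measure compliance proportions with equal weights, while the raw weighted average score (RWAS) weights them by the number of eligible patients.

```python
import numpy as np

def raw_average_score(Y, n):
    """Unweighted mean of the per-measure compliance proportions (per cent)."""
    return 100 * (Y / n).mean(axis=1)

def raw_weighted_average_score(Y, n):
    """Total treated over total eligible, i.e. proportions weighted by eligibility (per cent)."""
    return 100 * Y.sum(axis=1) / n.sum(axis=1)

# Hypothetical counts for two hospitals and the four heart-failure measures:
# rows = hospitals, columns = measures; Y = treated, n = eligible.
Y = np.array([[180, 30, 20, 70], [45, 8, 5, 30]])
n = np.array([[200, 34, 25, 135], [50, 10, 8, 60]])
print(np.round(raw_average_score(Y, n), 1))           # [77.5 70.6]
print(np.round(raw_weighted_average_score(Y, n), 1))  # [76.1 68.8]
```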
Concluding Remarks
(Kaplan & Normand 2006)

Composite - Pros
• Summarizes a large amount of information into a simpler measure
• Facilitates provider ranking
• Improves reliability of the provider measure and thus reduces the number of individual quality measures that need to be collected
• Fairer to providers – different ways to get good composite scores
• Reduces the time frame over which quality is assessed by effectively increasing the sample size

Composite - Cons
• May be difficult to interpret – what do the units mean?
• Difficult to validate
• Does not necessarily guide quality improvement; the individual quality measures are needed
• Quality information may be wasted or hidden in the composite measure
• The weighting scheme used to create the composite score may not be transparent (scoring)

Not new or unique to health care: composites have been used for intelligence, aptitude, mental illness, and personality for over a century, as well as in economics/business, in education (student and teacher performance), and in clinical trials.