ITEM RESPONSE THEORY MODELS & COMPOSITE MEASURES
Sharon-Lise T. Normand
Harvard Medical School and School of Public Health
Funded, in part, by R01-MH54693 from the National Institute of Mental Health

THE PROBLEM
• Multiple measures are collected to characterize an endpoint.
• We want to summarize either:
  – the relationship of the measures to a covariate (say, treatment), OR
  – the underlying trait or construct the measures represent.
• How should the multiple measures be combined or pooled into a single summary measure?

ASSUMPTIONS
1. The multiple measures or items are measured on the same scale.
2. The multiple items measure the same underlying construct.
(1) and (2) imply the measures are commensurate.

FOCUS: DISCRETE DATA
How do we combine data that are dichotomous or ordinal? Standard pooling rules:
• "OR" rule (if any item is true, the composite is set to true)
• "AND" rule (all-or-none)
These rules have very poor statistical properties!

PROBLEMS WITH POOLING RULES
• The optimal rule depends on the type of measurement error.
• Pooling is difficult if the items or methods are not identical across sources.
• The rules are not clearly defined in the presence of missing data.
• They force the investigator to put all measures on the same scale.

WHAT IS AN ITEM RESPONSE THEORY (IRT) MODEL?
A statistical model that relates the probability of response to item-specific parameters and to the subject's underlying latent trait. It provides a formal mechanism for pooling the data to obtain a composite.

Classical Test Theory
• Estimates the reliability of items (coefficient α).
• Model: Ysj = θs + εsj, where Ysj is the response of subject s to item j, θs is the underlying ("true" unobserved) score, and εsj is an error term, Normal with mean 0 and constant variance.

Item Response Theory
• Estimates the discriminating ability of items using item-specific parameters.
• Responses within a subject are independent conditional on the latent trait.
• Normality and constant variance are not assumed.
• θs ~ N(0, 1), where θs is the "true" unobserved score.

DICHOTOMOUS OR ORDINAL RESPONSES
The observed response is ysj; in a generalized linear model formulation,
h(P(ysj = 1 | θs)) = αj(θs − βj),
where h is the link function (logit or probit) and αj and βj are "item" parameters.

RASCH MODEL (1-PARAMETER LOGISTIC)
The simplest IRT model.
• Ysj = 1 if subject s responds correctly to item j, and 0 otherwise.
• θs = latent trait for subject s.
• βj = difficulty of the jth item.
Probability that subject s responds correctly to the jth item:
P(Ysj = 1 | θs) = exp(θs − βj) / (1 + exp(θs − βj))

[Figure: Rasch model item characteristic curves (ICCs) for three subjects with different latent traits (θ = 1, 0, −1), plotting the probability of a correct response against item difficulty β.]
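To make the Rasch formula concrete, here is a minimal Python sketch (not part of the original slides) that evaluates P(Ysj = 1 | θs) for three hypothetical subjects across a grid of item difficulties, mirroring the ICC figure above; all parameter values are made up for illustration.

```python
import numpy as np

def rasch_prob(theta, beta):
    """P(correct response) under the Rasch (1-PL) model: inverse logit of (theta - beta)."""
    return 1.0 / (1.0 + np.exp(-(theta - beta)))

# Hypothetical values: three subjects (theta = -1, 0, 1) and a grid of item difficulties.
thetas = [-1.0, 0.0, 1.0]
betas = np.linspace(-4, 4, 9)

for theta in thetas:
    print(f"theta = {theta:+.0f}:", np.round(rasch_prob(theta, betas), 2))
```

Note that the probability equals 0.5 exactly when the item difficulty matches the subject's trait (β = θ), which is where each ICC crosses the midpoint.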
2-PARAMETER LOGISTIC
• Ysj = 1 if subject s responds correctly to item j, and 0 otherwise.
• θs = latent trait for subject s.
• βj = difficulty of the jth item.
• αj = discrimination of the jth item (αj > 0).
Probability that subject s responds correctly to the jth item:
P(Ysj = 1 | θs) = exp(αj(θs − βj)) / (1 + exp(αj(θs − βj)))

[Figure: 2-parameter logistic ICCs for three items with equal difficulty (β = 1) and discriminations α = 0.5, 1, and 3, plotting the probability of a correct response against θ.]

[Figure: 2-parameter logistic ICCs for three items with different discriminations and difficulties (α = 1, 3, and 0.5), plotting the probability of a correct response against θ.]

APPLICATIONS
1. Depression: use the set of items endorsed by a subject to determine the level of depression.
2. Physician-level quality: use a set of patient-level quality indicators to learn about physician-level quality.
3. Hospital-level quality: use a set of patient-level quality indicators to learn about hospital-level quality.

COMPOSITE INTERNATIONAL DIAGNOSTIC INTERVIEW (CIDI)
• Several binary-valued questions: lost appetite; lost weight; trouble falling asleep; lack of energy; felt worthless; felt inferior; felt guilty; etc.
• Does measurement bias exist in the CIDI? It occurs when a group is less likely to meet DSM criteria despite a similar level of the underlying disorder.
• Breslau et al. (2008) studied this for depression.

[Figure: Illustration of differential item functioning (DIF) as variation in item characteristic curves: the probability of endorsing the "Felt Worthless" item as a function of the level of depression, plotted separately for Whites and Blacks. ICCs estimated using a 2-parameter logistic model (Breslau et al. Differential item functioning between ethnic groups in the epidemiological assessment of depression. The Journal of Nervous and Mental Disease, 2008;196(4)).]

HOSPITAL-LEVEL MEASUREMENT
• Multiple measures of hospital quality have been promulgated.
• Process and outcome measures exist for different diseases.
• How should the quality of a hospital for the treatment of a particular disease be determined?

HOSPITAL QUALITY FOR HEART FAILURE (3,376 US hospitals; 2005)

Measure                     Median # Eligible Patients [10th; 90th]   Median % Compliant [10th; 90th]
LVF assessment              200 [51; 580]                              89 [64; 98]
ACE or ARB for LVSD         34 [5; 120]                                83 [60; 100]
Smoking cessation advice    25 [1; 98]                                 79 [40; 100]
Discharge instructions      135 [0; 469]                               55 [15; 87]

LVF = left ventricular function; LVSD = left ventricular systolic dysfunction.
Source: Teixeira-Pinto and Normand, Statistics in Medicine (2008).

Hospital Performance
Ysj | βj, θs ~ Bin(nsj, psj), where logit(psj) = β1j(θs − β0j) and θs ~ iid N(0, 1).
• Ysj = number of eligible cases in the sth hospital receiving treatment j.
• β0j = "difficulty" of the jth process measure.
• β1j = "discriminating" ability of the jth process measure.
• θs = underlying quality of care for the sth hospital.
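As a concrete illustration of that binomial IRT formulation, here is a minimal Python simulation sketch (not from the slides or the paper); the item parameters, hospital count, and eligible-patient counts are all made-up values, not estimates from Teixeira-Pinto and Normand (2008).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical item parameters for four process measures (illustrative only):
beta0 = np.array([-2.0, -1.5, -1.0, 0.0])   # "difficulty" of each process measure
beta1 = np.array([1.0, 1.5, 0.8, 1.2])      # "discrimination" of each process measure

n_hospitals = 1000
theta = rng.normal(0.0, 1.0, size=n_hospitals)            # latent hospital quality, theta_s ~ N(0, 1)
n_elig = rng.integers(20, 300, size=(n_hospitals, 4))     # eligible patients n_sj per hospital and measure

# logit(p_sj) = beta1_j * (theta_s - beta0_j)
p = 1.0 / (1.0 + np.exp(-(beta1 * (theta[:, None] - beta0))))

# Y_sj | theta_s ~ Binomial(n_sj, p_sj)
y = rng.binomial(n_elig, p)

print("Observed compliance proportions for the first three simulated hospitals:")
print(np.round(y[:3] / n_elig[:3], 2))
```

Fitting the model to real data goes in the other direction: estimate β0j, β1j, and each hospital's latent score θs from the observed Ysj and nsj, typically via Bayesian or marginal maximum likelihood methods.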
[Slide reproducing p. 1344 of Teixeira-Pinto and Normand, Statistics in Medicine (2008). Figure 6 on that page shows the distribution of the raw average score and raw-weighted average score, and the estimated probability of receiving each therapy (discharge instructions, evaluation of LVS function, ACEI or ARB for LVSD, adult smoking cessation advice) as a function of the latent score, for patients hospitalized with heart failure.]

Excerpt from the same page: "... for the superior performance category is chosen so that 10 per cent of the hospitals are classified as having superior performance. The comparison between the classification based on the RAS and RWAS point estimate, and the LS credible interval is obviously not fair because they are based on different classification rules. However, it serves the purpose of illustrating how the classification changes depending on the methodology used. For AMI only 103 out of 245 hospitals (42 per cent) are classified in the superior performance category both by the RAS point estimate and the LS credible interval procedure. For this condition, the classification of superior performance based on the RWAS and LS coincided in 169 out of 245 (69 per cent) hospitals. For HF, 247 out of 333 hospitals (74 per cent) are classified in the superior performance category both by the RAS point estimate and the LS credible interval procedure. The classification of superior performance based on the RWAS and LS coincided in 280 out of 338 (83 per cent) hospitals for HF. Regarding small hospitals, the approach based on the credible intervals for the LS is more restrictive than the one based on the point estimate. For the classification using the point estimate alone of the LS, the proportion of small volume hospitals in the higher category of performance ..."

Comparing Composites (Teixeira-Pinto and Normand, Statistics in Medicine, 2008): 2005 data.

Concluding Remarks (Kaplan & Normand 2006)

Composite: Pros
• Summarizes a large amount of information into a simpler measure.
• Facilitates provider ranking.
• Improves the reliability of the provider measure and thus reduces the number of individual quality measures that need to be collected.
• Fairer to providers: there are different ways to get a good composite score.
• Reduces the time frame over which quality is assessed by effectively increasing the sample size.

Composite: Cons
• May be difficult to interpret: what do the units mean?
• Difficult to validate.
• Does not necessarily guide quality improvement; the individual quality measures are needed.
• Quality information may be wasted or hidden in the composite measure.
• The weighting scheme used to create the composite score (scoring) may not be transparent.

Composites are not new or unique to health care: they have been used to measure intelligence, aptitude, mental illness, and personality for over a century, and are also used in economics/business, education (student and teacher performance), and clinical trials.