Evaluating Health-Related Quality of Life Measures
Ron D. Hays, Ph.D.
UCLA GIM & HSR
February 9, 2015 (9:00-11:50 am)
HPM 214, Los Angeles, CA

Where are we now in HPM 214?
http://hpm214.med.ucla.edu/
1. Introduction to Outcomes and Effectiveness
2. HRQOL Profile Measures
3. HRQOL Preference-Based Measures
4. Designing HRQOL Measures
5. Evaluating HRQOL Measures
6. PROMIS/IRT/Internet Panels
7. Responding to reviews
8. Course Review (Cognitive interview assignment due)
9. Final Exam (3/16/15)

The 2nd class assignment is to conduct and summarize 5 cognitive interviews with a self-administered HRQOL survey instrument. Your written summary should be no more than 3 pages in length. Longer summaries will not be accepted. You are required to conduct 5 (and no more than 5) cognitive interviews with every item in your selected instrument. If you have a long instrument, you can split it up so that each respondent does not have to be interviewed on every item, but 5 people need to be exposed to each item.
http://www.chime.ucla.edu/qualitativemethods.htm
The cognitive interview write-up is due at 9am on 03/09/15.

Extra credit can be obtained by writing a 2-page review of a published HRQOL article. The article selected needs to be cleared with the instructor in advance.

Four Levels of Measurement
• Nominal (categorical)
• Ordinal (rank)
• Interval (numerical)
• Ratio (numerical)

Levels of Measurement and Their Properties

Level      Magnitude   Equal Interval   Absolute 0
Nominal    No          No               No
Ordinal    Yes         No               No
Interval   Yes         Yes              No
Ratio      Yes         Yes              Yes

Ordinal Scale
• In general, how would you rate your health?
  – Excellent
  – Very good
  – Good
  – Fair
  – Poor

Ordinal Scale
• In general, would you say your health is …
  – 100 = Excellent?
  – 075 = Very good?
  – 050 = Good?
  – 025 = Fair?
  – 000 = Poor?
[84] [61] [76] [52] [26]

Interval Scales
• Fahrenheit and Centigrade temperature
  – $T(^\circ\mathrm{C}) = (T(^\circ\mathrm{F}) - 32) \times 5/9$
• 40°C ≠ 2 times as hot as 20°C
• 104°F ≠ 2 times as hot as 68°F

Ratio Scales
• Kelvin Temperature Scale (absolute 0)
• Days spent in hospital in last 30 days
• Age

A 4-year-old is twice as old as a 2-year-old. If you subtract 1 from both of their ages, then 4 becomes 3 and 2 becomes 1. The 4-year-old is still twice as old as the 2-year-old even though the new values are 3 versus 1, because “0” on the shifted scale no longer means zero years.

Measurement Range for HRQOL Measures
Nominal    Ordinal    Interval    Ratio

Levels of Measurement and Their Properties
(Item = property; Person = level)

Level      Magnitude   Equal Interval   Absolute 0   Total Score
Nominal    No          No               No           0
Ordinal    Yes         No               No           1
Interval   Yes         Yes              No           2
Ratio      Yes         Yes              Yes          3

Four Types of Data Collection Errors
• Coverage Error
  – Does each person in the target population have an equal chance of selection?
• Sampling Error
  – Only some members of the target population are sampled.
• Nonresponse Error
  – Do people in the sample who respond differ from those who do not?
• Measurement Error
  – Inaccuracy in answers given to survey questions.

Characteristics of Good Measures
• Acceptability
• Variability
• Reliability
• Validity
• Interpretability

Indicators of Acceptability
• Response rate
• Administration time
• Missing data (item, scale)

Variability
• Responses fall in each response category
• Distribution approximates bell-shaped “normal” curve (68.2%, 95.4%, and 99.7% of scores within 1, 2, and 3 SDs of the mean)

Reliability
Reliability is the degree to which the same score is obtained for the thing being measured (person, plant, or whatever) when that thing hasn’t changed.
– Ratio of signal to noise

Observed Score is:
$\text{observed score} = \text{“true” score} + \text{systematic error} + \text{random error}$

Flavors of Reliability
• Inter-rater (rater)
  – Need 2 or more raters of the thing being measured
• Test-retest (administrations)
  – Need 2 or more time points
• Internal consistency (items)
  – Need 2 or more items

Reliability Minimum Standards
• 0.70 or above (for group comparisons)
• 0.90 or higher (for individual assessment)

$\mathrm{SEM} = \mathrm{SD}\,(1 - \text{reliability})^{1/2}$
$95\%\ \mathrm{CI} = \text{true score} \pm 1.96 \times \mathrm{SEM}$
If z-score = 0, then CI: -.62 to +.62 when reliability = 0.90
Width of CI is 1.24 z-score units

Hypothetical Ratings of Performance of Six Students in HPM 214 by Two Raters Using Excellent to Poor Scale
[1 = Poor; 2 = Fair; 3 = Good; 4 = Very good; 5 = Excellent]
1 = Julian (Good, Very Good)
2 = Narissa (Very Good, Excellent)
3 = Alina (Good, Good)
4 = Greg (Fair, Poor)
5 = Linda (Excellent, Very Good)
6 = Caroline (Fair, Fair)
(Target = 6 students; assessed by 2 raters)

Kappa Coefficient of Agreement (Corrects for Chance)
“Quality Index”
$\kappa = \dfrac{\text{observed} - \text{chance}}{1 - \text{chance}}$

Cross-Tab of Ratings
(rows = Rater 1; columns = Rater 2)

Rater 1    P    F    G    VG   E    Total
P          0    0    0    0    0    0
F          1    1    0    0    0    2
G          0    0    1    1    0    2
VG         0    0    0    0    1    1
E          0    0    0    1    0    1
Total      1    1    1    2    1    6

Calculating KAPPA
$P_C = \dfrac{(0 \times 1) + (2 \times 1) + (2 \times 1) + (1 \times 2) + (1 \times 1)}{6 \times 6} = 0.19$
$P_{obs} = \dfrac{2}{6} = 0.33$
$\kappa = \dfrac{0.33 - 0.19}{1 - 0.19} = 0.17$

Guidelines for Interpreting Kappa

Fleiss (1981)
Conclusion        Kappa
Poor              < .40
Fair              .40 - .59
Good              .60 - .74
Excellent         > .74

Landis and Koch (1977)
Conclusion        Kappa
Poor              < 0.0
Slight            .00 - .20
Fair              .21 - .40
Moderate          .41 - .60
Substantial       .61 - .80
Almost perfect    .81 - 1.00

Weighted Kappa (Linear and Quadratic)
Cell entries are linear weights, with quadratic weights in parentheses.

      P            F            G            VG           E
P     1            .75 (.937)   .50 (.750)   .25 (.437)   0
F     .75 (.937)   1            .75 (.937)   .50 (.750)   .25 (.437)
G     .50 (.750)   .75 (.937)   1            .75 (.937)   .50 (.750)
VG    .25 (.437)   .50 (.750)   .75 (.937)   1            .75 (.937)
E     0            .25 (.437)   .50 (.750)   .75 (.937)   1

$W_l = 1 - \dfrac{i}{k - 1}$
$W_q = 1 - \dfrac{i^2}{(k - 1)^2}$
i = number of categories ratings differ by
k = n of categories
Linear weighted kappa = 0.52; Quadratic weighted kappa = 0.77

Intraclass Correlation and Reliability

One-way
  Reliability: $\dfrac{MS_{BMS} - MS_{WMS}}{MS_{BMS}}$
  Intraclass correlation: $\dfrac{MS_{BMS} - MS_{WMS}}{MS_{BMS} + (k - 1)\,MS_{WMS}}$
Two-way mixed
  Reliability: $\dfrac{MS_{BMS} - MS_{EMS}}{MS_{BMS}}$
  Intraclass correlation: $\dfrac{MS_{BMS} - MS_{EMS}}{MS_{BMS} + (k - 1)\,MS_{EMS}}$
Two-way random
  Reliability: $\dfrac{N\,(MS_{BMS} - MS_{EMS})}{N\,MS_{BMS} + MS_{JMS} - MS_{EMS}}$
  Intraclass correlation: $\dfrac{MS_{BMS} - MS_{EMS}}{MS_{BMS} + (k - 1)\,MS_{EMS} + k\,(MS_{JMS} - MS_{EMS})/N}$

BMS = Between Ratee Mean Square
WMS = Within Mean Square
JMS = Item or Rater Mean Square
EMS = Ratee x Item (Rater) Mean Square
N = n of ratees
k = n of items or raters

Data (id, rater, rating):
01 13   01 24
02 14   02 25
03 13   03 23
04 12   04 21
05 15   05 24
06 12   06 22

Two-Way Random Effects (Reliability of Performance Ratings)

Source                  df    SS      MS
Students (BMS)          5     15.67   3.13
Raters (JMS)            1     0.00    0.00
Stud. x Raters (EMS)    5     2.00    0.40
Total                   11    17.67

$\text{2-way } R = \dfrac{6\,(3.13 - 0.40)}{6\,(3.13) + 0.00 - 0.40} = 0.89$
ICC = 0.80

Responses of Students to Two Questions about Their Health
1 = Julian (Good, Very Good)
2 = Narissa (Very Good, Excellent)
3 = Alina (Good, Good)
4 = Greg (Fair, Poor)
5 = Linda (Excellent, Very Good)
6 = Caroline (Fair, Fair)
(Target = 6 students; assessed by 2 items)

Data (id, item 1, item 2):
01 34
02 45
03 33
04 21
05 54
06 22

Two-Way Mixed Effects (Cronbach’s Alpha)

Source                  df    SS      MS
Respondents (BMS)       5     15.67   3.13
Items (JMS)             1     0.00    0.00
Resp. x Items (EMS)     5     2.00    0.40
Total                   11    17.67

$\text{Alpha} = \dfrac{3.13 - 0.40}{3.13} = \dfrac{2.73}{3.13} = 0.87$
ICC = 0.77

Satisfaction of 12 Family Members with 6 Students (2 per student)
1. Julian (fam1: Good, fam2: Very Good)
2. Narissa (fam3: Very Good, fam4: Excellent)
3. Alina (fam5: Good, fam6: Good)
4. Greg (fam7: Fair, fam8: Poor)
5. Linda (fam9: Excellent, fam10: Very Good)
6. Caroline (fam11: Fair, fam12: Fair)
(Target = 6 students; assessed by 2 family members each)

Data (id, last digit of family member number, rating):
01 13   01 24
02 34   02 45
03 53   03 63
04 72   04 81
05 95   05 04
06 12   06 22

One-Way ANOVA (Reliability of Ratings of Students)

Source                  df    SS      MS
Respondents (BMS)       5     15.67   3.13
Within (WMS)            6     2.00    0.33
Total                   11    17.67

$\text{1-way} = \dfrac{3.13 - 0.33}{3.13} = \dfrac{2.80}{3.13} = 0.89$

Standardized Alpha for Different Numbers of Items and Average Inter-item Correlation

                        Average Inter-item Correlation ( r )
Number of Items (k)     .0      .2      .4      .6      .8      1.0
2                       .000    .333    .571    .750    .889    1.000
4                       .000    .500    .727    .857    .941    1.000
6                       .000    .600    .800    .900    .960    1.000
8                       .000    .667    .842    .923    .970    1.000

$\text{Alpha}_{st} = \dfrac{k \cdot r}{1 + (k - 1) \cdot r}$

Spearman-Brown Prophecy Formula
$\text{alpha}_y = \dfrac{N \cdot \text{alpha}_x}{1 + (N - 1) \cdot \text{alpha}_x}$
N = how much longer scale y is than scale x

Example Spearman-Brown Calculations
Estimating the reliability of the MHI-18 from the MHI-32:
$\dfrac{(18/32)(0.98)}{1 + (18/32 - 1)(0.98)} = \dfrac{0.55125}{0.57125} = 0.96$

Number of Items and Reliability: Three Versions of the Mental Health Inventory (MHI)

Measure    Number of Items   Completion Time (min.)   Reliability
MHI-32     32                5-8                      .98
MHI-18     18                3-5                      .96
MHI-5      5                 1 or less                .90

Data from McHorney et al. 1992

Multitrait Scaling Analysis
• Internal consistency reliability
  – Item convergence
• Item discrimination

Item-scale correlation matrix

           Depress   Anxiety   Anger
Item #1    0.80*     0.20      0.20
Item #2    0.80*     0.20      0.20
Item #3    0.80*     0.20      0.20
Item #4    0.20      0.80*     0.20
Item #5    0.20      0.80*     0.20
Item #6    0.20      0.80*     0.20
Item #7    0.20      0.20      0.80*
Item #8    0.20      0.20      0.80*
Item #9    0.20      0.20      0.80*

*Item-scale correlation, corrected for overlap.

Item-scale correlation matrix

           Depress   Anxiety   Anger
Item #1    0.50*     0.50      0.50
Item #2    0.50*     0.50      0.50
Item #3    0.50*     0.50      0.50
Item #4    0.50      0.50*     0.50
Item #5    0.50      0.50*     0.50
Item #6    0.50      0.50*     0.50
Item #7    0.50      0.50      0.50*
Item #8    0.50      0.50      0.50*
Item #9    0.50      0.50      0.50*

*Item-scale correlation, corrected for overlap.

Validity
• Does the instrument measure what it is supposed to measure?
• A “validated” instrument is a holy grail

Reliability and Validity

Threats to Validity
• Acquiescent Response Set
• Socially Desirable Response Set

Listed below are a few statements about your relationships with others. How much is each statement TRUE or FALSE for you?
1. I am always courteous even to people who are disagreeable.
2. There have been occasions when I took advantage of someone.
3. I sometimes try to get even rather than forgive and forget.
4. I sometimes feel resentful when I don’t get my way.
5. No matter who I’m talking to, I’m always a good listener.
Definitely true; Mostly true; Don’t know; Mostly false; Definitely false

Two Types of Validity
• Content Validity
  – Includes face validity
• Construct Validity
  – Many synonyms

Content Validity
• Does the measure adequately represent the domain?
  – Do items operationalize concept?
  – Do items cover all aspects of concept?
  – Does scale name represent item content?
• Face validity is the extent to which a measure “appears” to reflect what it is intended to measure
  – E.g., as judged by experts or by patient focus groups

Construct Validity
• Do scores on a measure relate to other variables in ways consistent with hypotheses?

Evaluating Construct Validity

Scale                  Age          Obesity     ESRD        Nursing Home Resident
Physical Functioning   Medium (-)   Small (-)   Large (-)   Large (-)
Depressive Symptoms    ?            Small (+)   ?           Medium (+)

Cohen effect size rules of thumb (d = 0.2, 0.5, and 0.8):
Small correlation = 0.100
Medium correlation = 0.243
Large correlation = 0.371
$r = \dfrac{d}{(d^2 + 4)^{.5}} = \dfrac{0.8}{(0.8^2 + 4)^{.5}} = \dfrac{0.8}{(0.64 + 4)^{.5}} = \dfrac{0.8}{(4.64)^{.5}} = \dfrac{0.8}{2.154} = 0.371$
(Beware: r’s of 0.10, 0.30, and 0.50 are often cited as small, medium, and large.)

Relative Validity Analyses
• Form of "known groups" validity
• Relative sensitivity of measure to important clinical difference
• One-way between group ANOVA

Relative Validity Example

            Severity of Heart Disease
            None   Mild   Severe   F-ratio   Relative Validity
Scale #1    87     90     91       2         --
Scale #2    74     78     88       10        5
Scale #3    77     87     95       20        10

Responsiveness to Change
• HRQOL measures should be responsive to interventions that change HRQOL
• Need external indicators of change (Anchors)

Self-Report Indicator of Change
• Overall, has there been any change in your asthma since the beginning of the study?
Much improved; Moderately improved; Minimally improved
No change
Minimally worse; Moderately worse; Much worse

Clinical Indicator of Change
• “changed” group = seizure free (100% reduction in seizure frequency)
• “unchanged” group = <50% change in seizure frequency

Responsiveness Indices
(1) Effect size (ES) = D/SD
(2) Standardized Response Mean (SRM) = D/SD†
(3) Guyatt responsiveness statistic (RS) = D/SD‡
D = raw score change in “changed” group;
SD = baseline SD;
SD† = SD of D;
SD‡ = SD of D among “unchanged”

Effect Size Benchmarks
• Small: 0.20 to 0.49
• Moderate: 0.50 to 0.79
• Large: 0.80 or above

Minimally Important Difference (MID)
• External anchors
  – Self-report
  – Provider report
  – Clinical measure
  – Intervention
• Anchor correlated with change on target measure at 0.371 or higher
• Anchor indicates “minimal” change

Change in Physical Function
Baseline = 100 (U.S. males mean = 87, SD = 20)
• Hit by Bike causes me to be limited a lot in vigorous activities, limited a little in moderate activities, and limited a lot in climbing several flights of stairs. Physical functioning drops to 75 (-1.25 SD)
• Hit by Rock causes me to be limited a little in vigorous activities and physical functioning drops to 95 (-0.25 SD)

Example with Multiple Anchors
• 693 RA clinical trial participants evaluated at baseline and 6 weeks post-treatment.
• Five anchors:
  1. patient global self-report;
  2. physician global report;
  3. pain self-report;
  4. joint swelling;
  5. joint tenderness
Kosinski, M. et al. (2000). Determining minimally important changes in generic and disease-specific health-related quality of life questionnaires in clinical trials of rheumatoid arthritis. Arthritis and Rheumatism, 43, 1478-1487.

Patient and Physician Global Reports
How are you (is the patient) doing, considering all the ways that RA affects you (him/her)?
• Very good (asymptomatic and no limitation of normal activities)
• Good (mild symptoms and no limitation of normal activities)
• Fair (moderate symptoms and limitation of normal activities)
• Poor (severe symptoms and inability to carry out most normal activities)
• Very poor (very severe symptoms that are intolerable and inability to carry out normal activities)
--> Improvement of 1 level over time

Global Pain, Joint Swelling and Tenderness
• 0 = no pain, 10 = severe pain
• Number of swollen and tender joints
-> 1-20% improvement over time

Effect Sizes (mean = 0.34) for SF-36 Changes Linked to Minimal Change in Anchors

Scale     Self-R   Clin.-R   Pain   Swell   Tender   Mean
PF        .35      .33       .34    .26     .32      .32
Role-P    .56      .52       .29    .35     .36      .42
Pain      .83      .70       .47    .69     .42      .62
GH        .20      .12       .09    .12     .04      .12
EWB       .39      .26       .25    .18     .05      .23
Role-E    .41      .28       .18    .38     .26      .30
SF        .43      .34       .28    .29     .38      .34
EF        .50      .47       .22    .22     .35      .35
PCS       .49      .48       .34    .29     .36      .39
MCS       .42      .27       .19    .27     .20      .27

Appendix-ANOVA Computations
• A. Student’s SS
  $(7^2 + 9^2 + 6^2 + 3^2 + 9^2 + 4^2)/2 - 38^2/12 = 15.67$
• B. Rater/Item SS
  $(19^2 + 19^2)/6 - 38^2/12 = 0.00$
• C. Total SS
  $(3^2 + 4^2 + 4^2 + 5^2 + 3^2 + 3^2 + 2^2 + 1^2 + 5^2 + 4^2 + 2^2 + 2^2) - 38^2/12 = 17.67$
• Student x Item SS = C – (A + B) = 17.67 – (15.67 + 0.00) = 2.00

options ls=130 ps=52 nocenter;
options nofmterr;
* Long format: one record per student (id) by rater;
data one;
input id 1-2 rater 4 rating 5;
CARDS;
01 13
01 24
02 14
02 25
03 13
03 23
04 12
04 21
05 15
05 24
06 12
06 22
;
run;
**************;
proc freq;
tables rater rating;
run;
*******************;
proc means;
var rater rating;
run;
*******************************************;
* Two-way ANOVA: students by raters;
proc anova;
class id rater;
model rating=id rater id*rater;
run;
*******************************************;

data one;
input id 1-2 rater 4 rating 5;
CARDS;
01 13
01 24
02 14
02 25
03 13
03 23
04 12
04 21
05 15
05 24
06 12
06 22
;
run;
******************************************************************;
%GRIP(indata=one,targetv=id,repeatv=rater,dv=rating,
type=1,t1=test of GRIP macro,t2=);

GRIP macro is available at: http://gim.med.ucla.edu/FacultyPages/Hays/util.htm

* Wide format: one record per student with both ratings;
data one;
input id 1-2 rater1 4 rater2 5;
control=1;
CARDS;
01 34
02 45
03 33
04 21
05 54
06 22
;
run;
**************;
DATA DUMMY;
INPUT id 1-2 rater1 4 rater2 5;
CARDS;
01 11
02 22
03 33
04 44
05 55
;
RUN;
DATA NEW;
SET ONE DUMMY;
* AGREE requests kappa and weighted kappa;
PROC FREQ;
TABLES CONTROL*RATER1*RATER2 /NOCOL NOROW NOPERCENT AGREE;
*******************************************;
data one;
set one;
*****************************************;
proc means;
var rater1 rater2;
run;
*******************************************;
* ALPHA requests Cronbach's coefficient alpha;
proc corr alpha;
var rater1 rater2;
run;
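
For readers without SAS, the appendix computations can also be checked by hand in a few lines of Python. This is a minimal sketch, not the GRIP macro or any course-supplied code: the function names (`mean_squares`, `reliability_table`, `kappa`) are hypothetical, and it assumes the six (rater 1, rater 2) rating pairs from the slides coded 1 = Poor to 5 = Excellent. It applies the sums-of-squares formulas from the appendix, the one-way, two-way mixed, and two-way random formulas from the Intraclass Correlation and Reliability slide, and the kappa definition with the linear and quadratic weights given earlier.

```python
from collections import Counter

# (rater 1, rater 2) ratings for the six students, 1 = Poor ... 5 = Excellent.
RATINGS = [(3, 4), (4, 5), (3, 3), (2, 1), (5, 4), (2, 2)]


def mean_squares(rows):
    """Return (BMS, JMS, EMS, WMS) for a table of n targets by k raters/items."""
    n, k = len(rows), len(rows[0])
    grand_total = sum(sum(row) for row in rows)
    correction = grand_total ** 2 / (n * k)
    ss_total = sum(x * x for row in rows for x in row) - correction
    ss_between = sum(sum(row) ** 2 for row in rows) / k - correction
    ss_raters = sum(sum(col) ** 2 for col in zip(*rows)) / n - correction
    ss_error = ss_total - ss_between - ss_raters
    bms = ss_between / (n - 1)
    jms = ss_raters / (k - 1)
    ems = ss_error / ((n - 1) * (k - 1))
    wms = (ss_total - ss_between) / (n * (k - 1))
    return bms, jms, ems, wms


def reliability_table(rows):
    """Reliability and ICC under the three ANOVA models on the slide."""
    n, k = len(rows), len(rows[0])
    bms, jms, ems, wms = mean_squares(rows)
    return {
        "one_way": (bms - wms) / bms,
        "one_way_icc": (bms - wms) / (bms + (k - 1) * wms),
        "two_way_mixed": (bms - ems) / bms,  # Cronbach's alpha
        "two_way_mixed_icc": (bms - ems) / (bms + (k - 1) * ems),
        "two_way_random": n * (bms - ems) / (n * bms + jms - ems),
        "two_way_random_icc": (bms - ems)
        / (bms + (k - 1) * ems + k * (jms - ems) / n),
    }


def kappa(rows, categories=5, power=0):
    """Kappa for two raters: power=0 unweighted, 1 linear, 2 quadratic."""
    n = len(rows)

    def weight(a, b):
        if power == 0:
            return 1.0 if a == b else 0.0
        return 1 - (abs(a - b) / (categories - 1)) ** power

    margin1 = Counter(a for a, _ in rows)
    margin2 = Counter(b for _, b in rows)
    observed = sum(weight(a, b) for a, b in rows) / n
    chance = sum(
        margin1[a] * margin2[b] * weight(a, b) for a in margin1 for b in margin2
    ) / n ** 2
    return (observed - chance) / (1 - chance)


if __name__ == "__main__":
    for name, value in reliability_table(RATINGS).items():
        print(f"{name}: {value:.2f}")
    for label, power in (("kappa", 0), ("linear", 1), ("quadratic", 2)):
        print(f"{label}: {kappa(RATINGS, power=power):.2f}")
```

Rounded to two decimals, the sketch gives the figures reported on the slides: 0.89 (one-way), 0.87 and 0.77 (two-way mixed alpha and ICC), 0.89 and 0.80 (two-way random reliability and ICC), and kappa values of 0.17, 0.52, and 0.77. The one-way ICC is not reported on the slides, so it has nothing to be compared against here.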