Evaluating Health-Related
Quality of Life Measures
Ron D. Hays, Ph.D.
UCLA GIM & HSR
February 9, 2015 (9:00-11:50 am)
HPM 214, Los Angeles, CA
Where are we now in HPM 214?
http://hpm214.med.ucla.edu/
1. Introduction to Outcomes and Effectiveness
2. HRQOL Profile Measures
3. HRQOL Preference-Based Measures
4. Designing HRQOL Measures
5. Evaluating HRQOL Measures
6. PROMIS/IRT/Internet Panels
7. Responding to reviews
8. Course Review (Cognitive interview assignment due)
9. Final Exam (3/16/15)
The 2nd class assignment is to conduct and summarize 5
cognitive interviews with a self-administered HRQOL
survey instrument. Your written summary should be no
more than 3 pages in length. Longer summaries will not be
accepted. You are required to conduct 5 (and no more than
5) cognitive interviews with every item in your selected
instrument. If you have a long instrument you can parse it
up so that each respondent does not have to be interviewed
on every item but 5 people need to be exposed to each item.
See http://www.chime.ucla.edu/qualitativemethods.htm. The
cognitive interview write-up is due at 9am on 03/09/15.
----------------------------------------------------------------------
Extra credit can be obtained by writing a 2-page review of
a published HRQOL article. The article selected needs to
be cleared with the instructor in advance.
Four Levels of Measurement
• Nominal (categorical)
• Ordinal (rank)
• Interval (numerical)
• Ratio (numerical)
Levels of Measurement
and Their Properties
Level      Magnitude   Equal Interval   Absolute 0
Nominal    No          No               No
Ordinal    Yes         No               No
Interval   Yes         Yes              No
Ratio      Yes         Yes              Yes
Ordinal Scale
• In general, how would you rate your health?
– Excellent
– Very good
– Good
– Fair
– Poor
Ordinal Scale
• In general, how would you rate your health? Is it …
– 100 = Excellent?
– 075 = Very good?
– 050 = Good?
– 025 = Fair?
– 000 = Poor?
[84]
[61]
[76]
[52]
[26]
Interval Scales
• Fahrenheit and Centigrade temperature
– T(°C) = (T(°F) - 32) × 5/9
• 40°C ≠ 2 times as hot as 20°C
• 104°F ≠ 2 times as hot as 68°F
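A small Python sketch (illustrative only; the function name is an assumption, not part of the course materials) shows why ratios are not meaningful on an interval scale. The same two temperatures give different ratios in °F and °C, and neither matches the ratio on the Kelvin scale, which has an absolute zero.

```python
def f_to_c(temp_f):
    """Convert Fahrenheit to Celsius: T(°C) = (T(°F) - 32) × 5/9."""
    return (temp_f - 32) * 5 / 9

hot_f, warm_f = 104.0, 68.0
hot_c, warm_c = f_to_c(hot_f), f_to_c(warm_f)    # 40 °C and 20 °C
hot_k, warm_k = hot_c + 273.15, warm_c + 273.15  # same temperatures in Kelvin

ratio_f = hot_f / warm_f  # ratio of the Fahrenheit values
ratio_c = hot_c / warm_c  # ratio of the Celsius values
ratio_k = hot_k / warm_k  # ratio on a scale with an absolute zero

print(round(ratio_f, 2), round(ratio_c, 2), round(ratio_k, 2))
```

The three ratios disagree because the zero points of °F and °C are arbitrary.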
Ratio Scales
• Kelvin Temperature Scale (absolute 0)
• Days spent in hospital in last 30 days
• Age
A 4-year-old is twice as old as a 2-year-old. If
you subtract 1 from both of their ages, then 4
becomes 3 and 2 becomes 1. The 4-year-old
is still twice as old as the 2-year-old even though
the new values are 3 versus 1: on the shifted scale “0”
no longer means zero years, so ratios of the shifted
values are not meaningful.
Measurement Range for
HRQOL Measures
Nominal   Ordinal   Interval   Ratio
Levels of Measurement
and Their Properties
                       Item
Person     Magnitude   Equal Interval   Absolute 0   Total Score
Nominal    No          No               No           0
Ordinal    Yes         No               No           1
Interval   Yes         Yes              No           2
Ratio      Yes         Yes              Yes          3
Four Types of Data Collection Errors
Coverage Error
• Does each person in target population have an equal chance of
selection?
Sampling Error
• Only some members of the target population are sampled.
Nonresponse Error
• Do people in the sample who respond differ from those who do not?
Measurement Error
• Inaccuracy in answers given to survey questions.
Characteristics of Good Measures
• Acceptability
• Variability
• Reliability
• Validity
• Interpretability
Indicators of Acceptability
• Response rate
• Administration time
• Missing data (item, scale)
Variability
• Responses fall in each response category
• Distribution approximates bell-shaped “normal”
curve (68.2%, 95.4%, and 99.7% of scores within 1, 2, and 3 SDs of the mean)
Reliability
Reliability is the degree to which the same score
is obtained for the thing being measured (person,
plant, or whatever) when that thing hasn’t
changed.
– Ratio of signal to noise
Observed Score is:
observed score = “true” score + systematic error + random error
Flavors of Reliability
• Inter-rater (rater)
– Need 2 or more raters of the thing being measured
• Test-retest (administrations)
– Need 2 or more time points
• Internal consistency (items)
– Need 2 or more items
Reliability Minimum Standards
• 0.70 or above (for group comparisons)
• 0.90 or higher (for individual assessment)
• $\mathrm{SEM} = \mathrm{SD}\,(1 - \text{reliability})^{1/2}$
• 95% CI = true score +/- 1.96 x SEM
• If z-score = 0, then CI: -.62 to +.62 when reliability = 0.90
• Width of CI is 1.24 z-score units
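The SEM and confidence-interval arithmetic above can be checked with a minimal Python sketch. The function names are illustrative assumptions; scores are in z-score units, so SD = 1.

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD × (1 - reliability)^(1/2)."""
    return sd * math.sqrt(1.0 - reliability)

def ci95(score, sd, reliability):
    """95% confidence interval: score +/- 1.96 × SEM."""
    half_width = 1.96 * sem(sd, reliability)
    return score - half_width, score + half_width

# z-score of 0 with reliability = 0.90 (the case on the slide)
low, high = ci95(0.0, 1.0, 0.90)
print(round(low, 2), round(high, 2), round(high - low, 2))

# Same score with reliability at the group-comparison minimum of 0.70
low70, high70 = ci95(0.0, 1.0, 0.70)
print(round(high70 - low70, 2))
```

Lower reliability widens the interval, which is why the standard for individual assessment is higher than for group comparisons.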
Hypothetical Ratings of Performance of Six
Students in HPM 214 by Two Raters Using
Excellent to Poor Scale
[1 = Poor; 2 = Fair; 3 = Good; 4 = Very good; 5 = Excellent]
1= Julian (Good, Very Good)
2= Narissa (Very Good, Excellent)
3= Alina (Good, Good)
4= Greg (Fair, Poor)
5= Linda (Excellent, Very Good)
6= Caroline (Fair, Fair)
(Target = 6 students; assessed by 2 raters)
Kappa Coefficient of Agreement
(Corrects for Chance)
$\kappa = \dfrac{\text{observed} - \text{chance}}{1 - \text{chance}}$   (“Quality Index”)
Cross-Tab of Ratings
                    Rater 2
Rater 1    P    F    G    VG   E    Total
P          0    0    0    0    0    0
F          1    1    0    0    0    2
G          0    0    1    1    0    2
VG         0    0    0    0    1    1
E          0    0    0    1    0    1
Total      1    1    1    2    1    6
Calculating KAPPA
PC = [(0 x 1) + (2 x 1) + (2 x 1) + (1 x 2) + (1 x 1)] / (6 x 6) = 7/36 = 0.19
Pobs. = 2/6 = 0.33
Kappa = (0.33 – 0.19) / (1 – 0.19) = 0.17
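The kappa calculation can be reproduced with a short Python sketch. It assumes the six pairs of ratings listed earlier, coded 1 = Poor through 5 = Excellent; the function name is illustrative, not part of the course materials.

```python
from collections import Counter

# Ratings of the six students (Julian, Narissa, Alina, Greg, Linda, Caroline)
rater1 = [3, 4, 3, 2, 5, 2]
rater2 = [4, 5, 3, 1, 4, 2]

def cohen_kappa(a, b, categories):
    """Return observed agreement, chance agreement, and kappa."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: products of the two raters' marginal counts
    p_chance = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    return p_obs, p_chance, (p_obs - p_chance) / (1 - p_chance)

p_obs, p_chance, kappa = cohen_kappa(rater1, rater2, range(1, 6))
print(round(p_obs, 2), round(p_chance, 2), round(kappa, 2))
```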
Guidelines for Interpreting Kappa
Fleiss (1981)               Landis and Koch (1977)
Conclusion   Kappa          Conclusion       Kappa
Poor         < .40          Poor             < 0.0
Fair         .40 - .59      Slight           .00 - .20
Good         .60 - .74      Fair             .21 - .40
Excellent    > .74          Moderate         .41 - .60
                            Substantial      .61 - .80
                            Almost perfect   .81 - 1.00
Weighted Kappa
(Linear and Quadratic)
      P            F            G            VG           E
P     1            .75 (.937)   .50 (.750)   .25 (.437)   0
F     .75 (.937)   1            .75 (.937)   .50 (.750)   .25 (.437)
G     .50 (.750)   .75 (.937)   1            .75 (.937)   .50 (.750)
VG    .25 (.437)   .50 (.750)   .75 (.937)   1            .75 (.937)
E     0            .25 (.437)   .50 (.750)   .75 (.937)   1
Each cell shows the linear weight, with the quadratic weight in parentheses.
$W_l = 1 - i/(k - 1)$
$W_q = 1 - i^2/(k - 1)^2$
i = number of categories ratings differ by
k = n of categories
Linear weighted kappa = 0.52; Quadratic weighted kappa = 0.77
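A minimal Python sketch of weighted kappa, assuming the same six rating pairs coded 1 = Poor through 5 = Excellent. The function name and the `power` parameter are illustrative assumptions: `power=1` applies the linear weights and `power=2` the quadratic weights defined above.

```python
rater1 = [3, 4, 3, 2, 5, 2]
rater2 = [4, 5, 3, 1, 4, 2]

def weighted_kappa(a, b, k, power):
    """Weighted kappa for categories coded 1..k."""
    n = len(a)

    def weight(i, j):
        # 1 - i/(k-1) when power=1; 1 - i^2/(k-1)^2 when power=2
        return 1 - (abs(i - j) ** power) / ((k - 1) ** power)

    # Weighted observed agreement across the rated targets
    p_obs = sum(weight(x, y) for x, y in zip(a, b)) / n
    # Weighted chance agreement from all pairings of the two raters' marginals
    p_exp = sum(weight(x, y) for x in a for y in b) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

linear = weighted_kappa(rater1, rater2, k=5, power=1)
quadratic = weighted_kappa(rater1, rater2, k=5, power=2)
print(round(linear, 2), round(quadratic, 2))
```

Quadratic weights penalize one-category disagreements less than linear weights do, so they give the larger coefficient for these data.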
Intraclass Correlation and Reliability
Model: One-way
  Reliability: $\dfrac{MS_{BMS} - MS_{WMS}}{MS_{BMS}}$
  Intraclass Correlation: $\dfrac{MS_{BMS} - MS_{WMS}}{MS_{BMS} + (k - 1)\,MS_{WMS}}$
Model: Two-way mixed
  Reliability: $\dfrac{MS_{BMS} - MS_{EMS}}{MS_{BMS}}$
  Intraclass Correlation: $\dfrac{MS_{BMS} - MS_{EMS}}{MS_{BMS} + (k - 1)\,MS_{EMS}}$
Model: Two-way random
  Reliability: $\dfrac{N\,(MS_{BMS} - MS_{EMS})}{N\,MS_{BMS} + MS_{JMS} - MS_{EMS}}$
  Intraclass Correlation: $\dfrac{MS_{BMS} - MS_{EMS}}{MS_{BMS} + (k - 1)\,MS_{EMS} + k\,(MS_{JMS} - MS_{EMS})/N}$
BMS = Between Ratee Mean Square
WMS = Within Mean Square
JMS = Item or Rater Mean Square
EMS = Ratee x Item (Rater) Mean Square
N = n of ratees
k = n of items or raters
01 13
01 24
02 14
02 25
03 13
03 23
04 12
04 21
05 15
05 24
06 12
06 22
Two-Way Random Effects
(Reliability of Performance Ratings)
Source                  df    SS      MS
Students (BMS)           5   15.67   3.13
Raters (JMS)             1    0.00   0.00
Stud. x Raters (EMS)     5    2.00   0.40
Total                   11   17.67
2-way R = 6 (3.13 - 0.40) / [6 (3.13) + 0.00 - 0.40] = 0.89
ICC = 0.80
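The two-way random effects results can be reproduced from the raw ratings with a minimal Python sketch. It assumes ratings coded 1 = Poor through 5 = Excellent, one (rater 1, rater 2) pair per student; function and variable names are illustrative, not part of the course materials.

```python
# One (rater 1, rater 2) pair per student
ratings = [(3, 4), (4, 5), (3, 3), (2, 1), (5, 4), (2, 2)]

def two_way_mean_squares(rows):
    """Return (BMS, JMS, EMS) for a targets x raters table with one rating per cell."""
    n, k = len(rows), len(rows[0])
    correction = sum(sum(r) for r in rows) ** 2 / (n * k)
    ss_between = sum(sum(r) ** 2 for r in rows) / k - correction
    ss_raters = sum(sum(r[j] for r in rows) ** 2 for j in range(k)) / n - correction
    ss_total = sum(x ** 2 for r in rows for x in r) - correction
    ss_error = ss_total - ss_between - ss_raters
    return (ss_between / (n - 1),
            ss_raters / (k - 1),
            ss_error / ((n - 1) * (k - 1)))

bms, jms, ems = two_way_mean_squares(ratings)
n, k = len(ratings), len(ratings[0])

# Two-way random effects: reliability of the k-rater average, and single-rater ICC
reliability = n * (bms - ems) / (n * bms + jms - ems)
icc = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)
print(round(bms, 2), round(jms, 2), round(ems, 2))
print(round(reliability, 2), round(icc, 2))
```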
Responses of Students to Two
Questions about Their Health
1= Julian (Good, Very Good)
2= Narissa (Very Good, Excellent)
3= Alina (Good, Good)
4= Greg (Fair, Poor)
5= Linda (Excellent, Very Good)
6= Caroline (Fair, Fair)
(Target = 6 students; assessed by 2 items)
01 34
02 45
03 33
04 21
05 54
06 22
Two-Way Mixed Effects (Cronbach’s Alpha)
Source                  df    SS      MS
Respondents (BMS)        5   15.67   3.13
Items (JMS)              1    0.00   0.00
Resp. x Items (EMS)      5    2.00   0.40
Total                   11   17.67
Alpha = (3.13 - 0.40) / 3.13 = 2.73 / 3.13 = 0.87
ICC = 0.77
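As a cross-check, coefficient alpha can also be computed from item and total-score variances, alpha = k/(k - 1) × (1 - sum of item variances / variance of total score). This is a swapped-in but algebraically equivalent route, not the ANOVA route shown on the slide. The sketch assumes the six students' responses coded 1 = Poor through 5 = Excellent; names are illustrative.

```python
from statistics import variance

# Responses of the six students to the two health items
item1 = [3, 4, 3, 2, 5, 2]
item2 = [4, 5, 3, 1, 4, 2]

def cronbach_alpha(items):
    """Coefficient alpha from item variances and total-score variance."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

alpha = cronbach_alpha([item1, item2])
print(round(alpha, 2))
```

The result agrees with (BMS - EMS) / BMS from the two-way mixed model.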
Satisfaction of 12 Family Members with
6 Students (2 per student)
1. Julian (fam1: Good, fam2: Very Good)
2. Narissa (fam3: Very Good, fam4: Excellent)
3. Alina (fam5: Good, fam6: Good)
4. Greg (fam7: Fair, fam8: Poor)
5. Linda (fam9: Excellent, fam10: Very Good)
6. Caroline (fam11: Fair, fam12: Fair)
(Target = 6 students; assessed by 2 family
members each)
01 13
01 24
02 34
02 45
03 53
03 63
04 72
04 81
05 95
05 04
06 12
06 22
One-Way ANOVA
(Reliability of Ratings of Students)
Source               df    SS      MS
Respondents (BMS)     5   15.67   3.13
Within (WMS)          6    2.00   0.33
Total                11   17.67
1-way = (3.13 - 0.33) / 3.13 = 2.80 / 3.13 = 0.89
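A minimal Python sketch of the one-way model, assuming the twelve family-member ratings coded 1 = Poor through 5 = Excellent and grouped by student. Because each student is rated by different family members, rater variance cannot be separated from error and both fall into the within mean square. Names are illustrative assumptions.

```python
# Two family-member ratings per student (different raters for each student)
family_ratings = [(3, 4), (4, 5), (3, 3), (2, 1), (5, 4), (2, 2)]

def one_way_mean_squares(groups):
    """Return (BMS, WMS) for equally sized groups of ratings."""
    n, k = len(groups), len(groups[0])
    correction = sum(sum(g) for g in groups) ** 2 / (n * k)
    ss_between = sum(sum(g) ** 2 for g in groups) / k - correction
    ss_total = sum(x ** 2 for g in groups for x in g) - correction
    bms = ss_between / (n - 1)
    wms = (ss_total - ss_between) / (n * (k - 1))
    return bms, wms

bms, wms = one_way_mean_squares(family_ratings)
k = len(family_ratings[0])
reliability = (bms - wms) / bms             # reliability of the k-rating average
icc = (bms - wms) / (bms + (k - 1) * wms)   # reliability of a single rating
print(round(reliability, 2), round(icc, 2))
```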
Standardized Alpha for Different Numbers of
Items and Average Inter-item Correlation
                       Average Inter-item Correlation (r)
Number of Items (k)    .0     .2     .4     .6     .8     1.0
2                      .000   .333   .571   .750   .889   1.000
4                      .000   .500   .727   .857   .941   1.000
6                      .000   .600   .800   .900   .960   1.000
8                      .000   .667   .842   .923   .970   1.000
$\alpha_{st} = \dfrac{k\,\bar{r}}{1 + (k - 1)\,\bar{r}}$
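The table can be regenerated from the formula with a few lines of Python. This is an illustrative sketch; the function name is an assumption.

```python
def standardized_alpha(k, r):
    """Standardized alpha for k items with average inter-item correlation r."""
    return k * r / (1 + (k - 1) * r)

correlations = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
for k in (2, 4, 6, 8):
    row = [standardized_alpha(k, r) for r in correlations]
    print(k, " ".join(f"{a:.3f}" for a in row))
```

Alpha rises both with the number of items and with the average inter-item correlation.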
Spearman-Brown Prophecy Formula
$\alpha_y = \dfrac{N\,\alpha_x}{1 + (N - 1)\,\alpha_x}$
N = how much longer scale y is than scale x
Example Spearman-Brown
Calculations
Estimating the reliability of the
MHI-18 from the MHI-32
(18/32)(0.98) / [1 + (18/32 – 1)(0.98)] = 0.55125 / 0.57125 = 0.96
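The same calculation as a minimal Python sketch; the function and parameter names are illustrative assumptions. The second call applies the formula to a 5-item version purely as an illustration of shortening further.

```python
def spearman_brown(alpha_x, length_ratio):
    """Predicted reliability of scale y that is length_ratio times as long as scale x."""
    return length_ratio * alpha_x / (1 + (length_ratio - 1) * alpha_x)

# MHI-18 predicted from the MHI-32 (reliability 0.98)
mhi18 = spearman_brown(0.98, 18 / 32)
# Illustration: a 5-item version predicted from the same 32-item scale
mhi5 = spearman_brown(0.98, 5 / 32)
print(round(mhi18, 2), round(mhi5, 2))
```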
Number of Items and Reliability:
Three Versions of the
Mental Health Inventory (MHI)
Measure   Number of Items   Completion Time (min.)   Reliability
MHI-32    32                5-8                      .98
MHI-18    18                3-5                      .96
MHI-5     5                 1 or less                .90
Data from McHorney et al. 1992
Multitrait Scaling Analysis
• Internal consistency reliability
– Item convergence
• Item discrimination
Item-scale correlation matrix
           Depress   Anxiety   Anger
Item #1    0.80*     0.20      0.20
Item #2    0.80*     0.20      0.20
Item #3    0.80*     0.20      0.20
Item #4    0.20      0.80*     0.20
Item #5    0.20      0.80*     0.20
Item #6    0.20      0.80*     0.20
Item #7    0.20      0.20      0.80*
Item #8    0.20      0.20      0.80*
Item #9    0.20      0.20      0.80*
*Item-scale correlation, corrected for overlap.
Item-scale correlation matrix
           Depress   Anxiety   Anger
Item #1    0.50*     0.50      0.50
Item #2    0.50*     0.50      0.50
Item #3    0.50*     0.50      0.50
Item #4    0.50      0.50*     0.50
Item #5    0.50      0.50*     0.50
Item #6    0.50      0.50*     0.50
Item #7    0.50      0.50      0.50*
Item #8    0.50      0.50      0.50*
Item #9    0.50      0.50      0.50*
*Item-scale correlation, corrected for overlap.
Validity
• Does instrument measure what it is supposed
to measure?
• A “validated” instrument is a holy grail
Reliability and Validity
Threats to Validity
• Acquiescent Response Set
• Socially Desirable Response Set
Listed below are a few statements about your relationships
with others. How much is each statement TRUE or FALSE for
you?
1. I am always courteous even to people who are
disagreeable.
2. There have been occasions when I took advantage of
someone.
3. I sometimes try to get even rather than forgive and
forget.
4. I sometimes feel resentful when I don’t get my way.
5. No matter who I’m talking to, I’m always a good listener.
Definitely true; Mostly true; Don’t know;
Mostly false; Definitely false
Two Types of Validity
• Content Validity
– Includes face validity
• Construct Validity
– Many synonyms
Content Validity
• Does the measure adequately represent the
domain?
– Do items operationalize concept?
– Do items cover all aspects of concept?
– Does scale name represent item content?
• Face validity is the extent to which a measure
“appears” to reflect what it is intended to measure
– E.g., by expert judges or by patient focus groups
Construct Validity
• Do scores on a measure relate to other
variables in ways consistent with hypotheses?
Evaluating Construct Validity
Scale                 Age          Obesity     ESRD        Nursing Home Resident
Physical Functioning  Medium (-)   Small (-)   Large (-)   Large (-)
Depressive Symptoms   ?            Small (+)   ?           Medium (+)
Cohen effect size rules of thumb (d = 0.2, 0.5, and 0.8):
Small correlation = 0.100
Medium correlation = 0.243
Large correlation = 0.371
$r = d / \sqrt{d^2 + 4} = 0.8 / \sqrt{0.8^2 + 4} = 0.8 / \sqrt{0.64 + 4} = 0.8 / \sqrt{4.64} = 0.8 / 2.154 = 0.371$
(Beware: r’s of 0.10, 0.30 and 0.50 are often cited as small, medium, and
large.)
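The conversion from d to r can be checked for all three rules of thumb with a minimal Python sketch (the function name is an illustrative assumption).

```python
import math

def d_to_r(d):
    """Convert a standardized mean difference d to a correlation: r = d / sqrt(d^2 + 4)."""
    return d / math.sqrt(d ** 2 + 4)

for label, d in (("Small", 0.2), ("Medium", 0.5), ("Large", 0.8)):
    print(label, d, round(d_to_r(d), 3))
```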
Relative Validity Analyses
• Form of "known groups" validity
• Relative sensitivity of measure to important
clinical difference
• One-way between group ANOVA
Relative Validity Example
Severity of Heart Disease
           None   Mild   Severe   F-ratio   Relative Validity
Scale #1   87     90     91       2         --
Scale #2   74     78     88       10        5
Scale #3   77     87     95       20        10
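Relative validity here is each scale's F-ratio divided by the F-ratio of a reference scale (Scale #1 in the table). A minimal Python sketch follows; the names are illustrative assumptions, and the reference scale comes out as 1.0 where the table shows "--".

```python
# F-ratios from the one-way ANOVA across heart disease severity groups
f_ratios = {"Scale #1": 2.0, "Scale #2": 10.0, "Scale #3": 20.0}

def relative_validity(f_by_scale, reference):
    """Divide each scale's F-ratio by the reference scale's F-ratio."""
    reference_f = f_by_scale[reference]
    return {name: f / reference_f for name, f in f_by_scale.items()}

rv = relative_validity(f_ratios, "Scale #1")
for name, value in rv.items():
    print(name, value)
```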
Responsiveness to Change
• HRQOL measures should be responsive to
interventions that change HRQOL
• Need external indicators of change (Anchors)
Self-Report Indicator of Change
• Overall, has there been any change in your asthma
since the beginning of the study?
Much improved;
Moderately improved;
Minimally improved;
No change;
Minimally worse;
Moderately worse;
Much worse
Clinical Indicator of Change
• “changed” group = seizure free
(100% reduction in seizure frequency)
• “unchanged” group = <50% change
in seizure frequency
Responsiveness Indices
(1) Effect size (ES) = D/SD
(2) Standardized Response Mean (SRM) = D/SD†
(3) Guyatt responsiveness statistic (RS) = D/SD‡
D = raw score change in “changed” group;
SD = baseline SD;
SD† = SD of D;
SD‡ = SD of D among “unchanged”
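The three indices share the numerator D and differ only in the denominator. The sketch below uses hypothetical scores invented purely for illustration (they are not from any study cited in this lecture), and the variable names are assumptions.

```python
from statistics import mean, stdev

# Hypothetical data, for illustration only
baseline_changed = [40, 50, 60, 50, 45, 55]   # "changed" group at baseline
followup_changed = [50, 58, 72, 55, 57, 62]   # same people at follow-up
change_unchanged = [1, -2, 3, 0, -1, 2]       # change scores in the "unchanged" group

change = [after - before for before, after in zip(baseline_changed, followup_changed)]
d = mean(change)                          # D: raw score change in the "changed" group

effect_size = d / stdev(baseline_changed)  # ES: D / baseline SD
srm = d / stdev(change)                    # SRM: D / SD of D
guyatt = d / stdev(change_unchanged)       # RS: D / SD of D among "unchanged"
print(round(effect_size, 2), round(srm, 2), round(guyatt, 2))
```

With the same D, the three indices can differ substantially because the three standard deviations measure different things.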
Effect Size Benchmarks
• Small: 0.20 to 0.49
• Moderate: 0.50 to 0.79
• Large: 0.80 or above
Minimally Important
Difference (MID)
• External anchors
– Self-report
– Provider report
– Clinical measure
– Intervention
• Anchor correlated with change on target
measure at 0.371 or higher
• Anchor indicates “minimal” change
Change in Physical Function
Baseline = 100 (U.S. males mean = 87, SD = 20)
• Hit by Bike causes me to be limited a lot in vigorous
activities, limited a little in moderate activities, and
limited a lot in climbing several flights of stairs.
Physical functioning drops to 75 (-1.25 SD)
• Hit by Rock causes me to be limited a little in vigorous
activities and physical functioning drops to 95 (- 0.25
SD)
Example with Multiple Anchors
• 693 RA clinical trial participants evaluated at baseline and 6 weeks post-treatment.
• Five anchors:
1. patient global self-report;
2. physician global report;
3. pain self-report;
4. joint swelling;
5. joint tenderness
Kosinski, M. et al. (2000). Determining minimally important changes in generic and disease-specific health-related quality of life questionnaires in clinical trials of rheumatoid arthritis. Arthritis and Rheumatism, 43, 1478-1487.
Patient and Physician
Global Reports
How are you (is the patient) doing, considering all the ways that
RA affects you (him/her)?
• Very good (asymptomatic and no limitation of normal activities)
• Good (mild symptoms and no limitation of normal activities)
• Fair (moderate symptoms and limitation of normal activities)
• Poor (severe symptoms and inability to carry out most normal activities)
• Very poor (very severe symptoms that are intolerable and inability to carry out normal activities)
--> Improvement of 1 level over time
Global Pain, Joint Swelling
and Tenderness
• 0 = no pain, 10 = severe pain
• Number of swollen and tender joints
-> 1-20% improvement over time
Effect Sizes (mean = 0.34) for SF-36
Changes Linked to Minimal Change in Anchors
Scale    Self-R   Clin.-R   Pain   Swell   Tender   Mean
PF       .35      .33       .34    .26     .32      .32
Role-P   .56      .52       .29    .35     .36      .42
Pain     .83      .70       .47    .69     .42      .62
GH       .20      .12       .09    .12     .04      .12
EWB      .39      .26       .25    .18     .05      .23
Role-E   .41      .28       .18    .38     .26      .30
SF       .43      .34       .28    .29     .38      .34
EF       .50      .47       .22    .22     .35      .35
PCS      .49      .48       .34    .29     .36      .39
MCS      .42      .27       .19    .27     .20      .27
Appendix-ANOVA Computations
• A. Student’s SS
$(7^2 + 9^2 + 6^2 + 3^2 + 9^2 + 4^2)/2 - 38^2/12 = 15.67$
• B. Rater/Item SS
$(19^2 + 19^2)/6 - 38^2/12 = 0.00$
• C. Total SS
$(3^2 + 4^2 + 4^2 + 5^2 + 3^2 + 3^2 + 2^2 + 1^2 + 5^2 + 4^2 + 2^2 + 2^2) - 38^2/12 = 17.67$
• Student x Item SS = C – (A + B) = 17.67 – (15.67 + 0.00) = 2.00
options ls=130 ps=52 nocenter;
options nofmterr;
data one;
input id 1-2 rater 4 rating 5;
CARDS;
01 13
01 24
02 14
02 25
03 13
03 23
04 12
04 21
05 15
05 24
06 12
06 22
;
run;
**************;
proc freq;
tables rater rating;
run;
*******************;
proc means;
var rater rating;
run;
*******************************************;
proc anova;
class id rater;
model rating=id rater id*rater;
run;
*******************************************;
data one;
input id 1-2 rater 4 rating 5;
CARDS;
01 13
01 24
02 14
02 25
03 13
03 23
04 12
04 21
05 15
05 24
06 12
06 22
;
run;
******************************************************************;
%GRIP(indata=one,targetv=id,repeatv=rater,dv=rating,
type=1,t1=test of GRIP macro,t2=);
GRIP macro is available at: http://gim.med.ucla.edu/FacultyPages/Hays/util.htm
data one;
input id 1-2 rater1 4 rater2 5;
control=1;
CARDS;
01 34
02 45
03 33
04 21
05 54
06 22
;
run;
**************;
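* Dummy diagonal records so every rating level (1-5) occurs for both raters;
* control=0 is an assumed flag that keeps the dummy rows in their own stratum;
DATA DUMMY;
INPUT id 1-2 rater1 4 rater2 5;
control=0;
CARDS;
01 11
02 22
03 33
04 44
05 55
;
RUN;
DATA NEW;
SET ONE DUMMY;
RUN;
* AGREE requests kappa statistics for the rater1 by rater2 tables;
PROC FREQ;
TABLES CONTROL*RATER1*RATER2
/NOCOL NOROW NOPERCENT AGREE;
RUN;
*******************************************;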
data one;
set one;
*****************************************;
proc means;
var rater1 rater2;
run;
*******************************************;
proc corr alpha;
var rater1 rater2;
run;