QSU Seminar
Reliability and Validity:
Design and Analytic Approaches
Practical Considerations
Rita Popat, PhD
Dept of Health Research & Policy
Division of Epidemiology
rpopat@stanford.edu
What do we want to know about the measurements? Why?
• Dependent variable (outcome)
• Independent variable (risk factor or predictor)
Example: JAMA. 2004;292:1188-1194

What are other possible explanations for not detecting an
association?
Example: JAMA. 2004;291:1978-1986
Outline
• Definitions: Measurement error, reliability, validity
• Why should we care about measurement error?
• Effects of measurement error on Study Validity
(Categorical Exposures)
• Effects of measurement error on Study Validity
(Continuous Exposures)
• Measures (or indices) for reliability and validity
Measurement error
• For an individual, measurement error is the difference
between his/her observed and true measurement.
• Measurement error can occur in dependent (outcome)
or independent (predictor or exposure) variables
• For categorical variables, measurement error is
referred to as misclassification
• Measurement error is an important source of bias that
can threaten internal validity of a study
Reliability
(aka reproducibility, consistency)
• Reliability is the extent to which repeated
measurements of a stable phenomenon (by the same
person or different people and instruments, at different
times and places) obtain similar results.
• A precise measurement is reproducible, that is, has
the same (or nearly the same) value each time it is
measured.
• The higher the reliability, the greater the statistical
power for a fixed sample size
• Reliability is affected by random error
Validity or Accuracy
• The accuracy of a variable is the degree to
which it actually represents what it is intended to
represent
• That is: The extent to which the measurement
represents the true value of the attribute being
assessed.
Precise (Reliable) and Accurate (Valid)
measurements are key to minimizing
measurement error
[Figure: four targets illustrating precision without accuracy; neither
precision nor accuracy; accuracy with low precision; and both precision
and accuracy]
Measurement error in Categorical
Variables
• Referred to as misclassification; it can occur in the
• Outcome variables, or
• Exposure variables
• How do we know misclassification exists?
• When the method used for classifying exposure lacks
accuracy
Assessment of Accuracy
Criterion validity (compare against a reference or
gold standard)

                                  True classification
Imperfect classification       Present        Absent
            +                  a (TP)         b (FP)        a+b
            -                  c (FN)         d (TN)        c+d
                               a+c            b+d

Sensitivity = a / (a+c)
Specificity = d / (b+d)
False negative = c / (a+c)
False positive = b / (b+d)
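To make the definitions above concrete, here is a minimal Python sketch (not part of the original slides; the function name and the example counts are hypothetical):

```python
def diagnostic_accuracy(a, b, c, d):
    """Accuracy of an imperfect classification against a gold standard.

    a = true positives, b = false positives,
    c = false negatives, d = true negatives.
    """
    sensitivity = a / (a + c)        # P(classified + | truly present)
    specificity = d / (b + d)        # P(classified - | truly absent)
    false_negative = c / (a + c)     # 1 - sensitivity
    false_positive = b / (b + d)     # 1 - specificity
    return sensitivity, specificity, false_negative, false_positive

# Hypothetical counts
print(diagnostic_accuracy(a=90, b=20, c=10, d=80))   # (0.9, 0.8, 0.1, 0.2)
```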
Misclassification of exposure

                Cases (outcome +)    Controls (outcome -)
Exposure +              a                     b
Exposure -              c                     d

• Non-differential
• Differential
Misclassification of exposure

True exposure:
                Cases    Controls
Exposure +        50        20
Exposure -        50        80

OR = (50)(80) / (20)(50) = 4.0

Reported exposure: 90% sensitivity & 80% specificity in cases & controls
                Cases    Controls
Exposure +        55        34
Exposure -        45        66

OR = (55)(66) / (34)(45) = 2.4

Attenuation of the true association due to misclassification of exposure
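The reported counts above follow from applying the stated sensitivity and specificity to the true counts. A minimal Python sketch of that calculation (the helper names misclassify and odds_ratio are illustrative, not from the slides; the same helpers are reused for the differential examples below):

```python
def misclassify(exposed, unexposed, sens, spec):
    """Expected exposed/unexposed counts after imperfect exposure classification."""
    obs_exposed = sens * exposed + (1 - spec) * unexposed
    obs_unexposed = (1 - sens) * exposed + spec * unexposed
    return obs_exposed, obs_unexposed

def odds_ratio(case_exp, case_unexp, ctrl_exp, ctrl_unexp):
    return (case_exp * ctrl_unexp) / (ctrl_exp * case_unexp)

# True counts: cases 50/50 exposed/unexposed, controls 20/80
print(odds_ratio(50, 50, 20, 80))                    # 4.0 (true OR)

# Non-differential: 90% sensitivity, 80% specificity in both groups
cases = misclassify(50, 50, sens=0.90, spec=0.80)    # (55, 45)
ctrls = misclassify(20, 80, sens=0.90, spec=0.80)    # (34, 66)
print(odds_ratio(*cases, *ctrls))                    # about 2.4 (attenuated)
```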
Misclassification of the exposure

                Cases (outcome +)    Controls (outcome -)
Exposure +              a                     b
Exposure -              c                     d

• Non-differential misclassification occurs when the
degree of misclassification of exposure is independent of
outcome/disease status
• Tends to bias the association toward the null
• Occurs when the sensitivity and specificity of the classification
of exposure are the same for those with and without the outcome,
but less than 100%
Underestimation of a relative risk or odds ratio
[Figure: for a risk factor (true value above 1), the observed value lies
between the true value and 1; for a protective factor (true value below 1),
the observed value also lies between the true value and 1. In both cases
the bias is toward the null hypothesis.
Modified from Greenberg. Fig 10-4, chapter 10]
Misclassification of the exposure

True exposure:
                Cases    Controls
Exposure +        50        20
Exposure -        50        80

OR = (50)(80) / (20)(50) = 4.0

Reported exposure:
Cases - 96% sensitivity and 100% specificity
Controls - 70% sensitivity and 100% specificity
                Cases    Controls
Exposure +        48        14
Exposure -        52        86

OR = (48)(86) / (14)(52) = 5.7
Misclassification of the exposure

True exposure:
                Cases    Controls
Exposure +        50        20
Exposure -        50        80

OR = (50)(80) / (20)(50) = 4.0

Reported exposure:
Cases - 96% sensitivity and 100% specificity
Controls - 70% sensitivity and 80% specificity
                Cases    Controls
Exposure +        48        30
Exposure -        52        70

OR = (48)(70) / (30)(52) = 2.1
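Continuing the illustrative sketch from the non-differential example (reusing the hypothetical misclassify and odds_ratio helpers), the two differential scenarios above are reproduced by letting sensitivity and specificity differ between cases and controls:

```python
# Differential misclassification: accuracy differs between cases and controls
cases = misclassify(50, 50, sens=0.96, spec=1.00)    # (48, 52)

ctrls = misclassify(20, 80, sens=0.70, spec=1.00)    # (14, 86)
print(odds_ratio(*cases, *ctrls))                    # about 5.7 (away from the null)

ctrls = misclassify(20, 80, sens=0.70, spec=0.80)    # (30, 70)
print(odds_ratio(*cases, *ctrls))                    # about 2.1 (toward the null)
```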
Misclassification of the exposure

                Cases (outcome +)    Controls (outcome -)
Exposure +              a                     b
Exposure -              c                     d

Differential misclassification occurs when the degree
of misclassification differs between the groups being
compared.
• May bias the association either toward or away from the null
hypothesis
• Occurs when the sensitivity and specificity of the classification
of exposure differ for those with and without the outcome
Overestimation of a relative risk or odds ratio
[Figure: for a risk factor (true value above 1), the observed value lies
farther above 1 than the true value; for a protective factor (true value
below 1), the observed value lies farther below 1 than the true value.
In both cases the bias is away from the null hypothesis.
Modified from Greenberg. Fig 10-4, chapter 10]
Hormone therapy example
[Figure: case-control study of hormone therapy (never / former / current use).
Exposure is reported by the index subject or by a proxy (~25% of cases and of
controls); what is the accuracy of these reports, e.g., against a pharmacy
database?]
Summary so far….
• Misclassification of exposure is an important source
of bias
• Good to know something about the validity of
measurement for exposure classification before the
study begins
• Almost impossible to avoid misclassification, but try
to avoid differential misclassification
• If the study has already been conducted, develop
analytic strategies that explore exposure
misclassification as a possible explanation of the
observed results (especially for a “primary”
exposure of interest)
Measurement error in Continuous
Variables
• Physiologic measures (SBP, BMI)
• Biomarkers (hormone levels, lipids)
• Nutrients
• Environmental exposures
• Outcome measures (QOL, function)
Model of measurement error
Measurement theory: example

Observed measurement for subject i = Ti + b + Ei

where Ti is the true value, b is the systematic error of the instrument
(bias), and Ei is the additional random error for subject i.

EXAMPLE: One measured diastolic blood pressure (DBP) as an indicator of
2-year average DBP.
• b (systematic error): BP cuff miscalibrated - measures everyone's diastolic
BP as 10 mm Hg less
• Ei (random error):
  + randomness in BP cuff mechanics
  + subject i's 10 mmHg increase over the 2-year average
  + subject intimidated - diastolic BP 20 mmHg higher than usual
  + misreading by interviewer
  + random fluctuations in current BP
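A small simulation can make this model concrete. The Python sketch below is illustrative only (the distributions and numbers are assumptions that mirror the DBP example): it adds a constant bias and random error to true values and shows their effect on the observed measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

true_dbp = rng.normal(80, 8, n)       # Ti: true 2-year average DBP (hypothetical)
bias = -10                            # b: miscalibrated cuff reads 10 mmHg low
random_error = rng.normal(0, 6, n)    # Ei: visit-to-visit and device noise

observed = true_dbp + bias + random_error

print(observed.mean() - true_dbp.mean())      # about -10 (systematic error)
print(true_dbp.std(), observed.std())         # random error inflates the observed SD
print(np.corrcoef(true_dbp, observed)[0, 1])  # validity coefficient rho_TX, about 0.8
```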
Measurement theory
Validity of X…
[Figures: differential measurement error and the OR; differential bias;
non-differential measurement error]
Effects of non-differential measurement error
The effects of non-differential measurement error on the odds
ratio: OR_T is the true odds ratio for exposure versus reference
level r, and OR_O is the observable odds ratio for exposure versus
reference level r.

For a continuous exposure measured with non-differential error,
RR_T = RR_O^(1/ρ²_XT)
where ρ_XT is the validity coefficient (the correlation between the measured
and the true exposure). Example: if the observed RR is 0.82 and ρ²_XT = 0.62,
the true RR is 0.82^(1/0.62) = 0.73.
Summary so far….
• Measurement error is an important source of bias
• Good to know something about the validity of
measurement for exposure before the study begins
• Almost impossible to avoid misclassification, but try
to avoid differential misclassification!
• Non-differential measurement error will attenuate
the results towards the null, resulting in loss of
power for a fixed sample size
• This should be taken into account when estimating sample
size during the planning stage, and
• When interpreting results and assessing the internal validity
of a study
So why should we evaluate reliability
and validity of measurements?
• If it precedes the actual study, it tells us whether
the instrument/method we are using is reliable and
valid
• This information can help us run sensitivity analysis
or correct for the measurement error in the
variables after the study has been completed
Outline
• Definitions: Measurement error, reliability, validity
• Why should we care about measurement error?
• Effects of measurement error on Study Validity
(Categorical Exposures)
• Effects of measurement error on Study Validity
(Continuous Exposures)
• Measures (or indices) for reliability and validity
Choice of reliability and validity measures depends
on type of variable . . .

Type of Variable   Reliability Measure(s)       Validity Measure(s)
Dichotomous        Kappa                        Sensitivity, specificity
Ordinal            Weighted kappa, ICC*         Misclassification matrix
Continuous         ICC*, Bland-Altman plots     Pearson correlation (see note),
                                                Bland-Altman plots

*ICC - intraclass correlation coefficient
Note: in inter-method reliability studies, inferences about validity can be
made from coefficients of reproducibility (such as the Pearson's correlation)
Assessing Accuracy (Validity) of
continuous measures
• Bias: difference between the mean value as measured (x̄) and
the mean of the true values (X̄)
• So bias = x̄ - X̄
• Standardized bias = (x̄ - X̄) / SD_X
• Bland-Altman plots
Bland and Altman plots
• Take two measurements (different methods or
instrument) on the same subject
• For each subject, plot the difference b/w the
two measures (y axis) vs. the mean of the two
measures
• We expect the mean difference to be 0
• We expect 95% of the differences to be within 2
standard deviations (SD)
[Example Bland-Altman plots: Yoong et al. BMC Medical Research Methodology 2013, 13:38]
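A minimal Bland-Altman plot sketch in Python (illustrative; the simulated data and variable names are assumptions, not from the slides or from Yoong et al.):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
m1 = rng.normal(160, 10, 50)          # measurements by method 1 (simulated)
m2 = m1 + rng.normal(2, 4, 50)        # method 2: small systematic bias + random error

diff = m2 - m1                        # difference between the two methods
mean_pair = (m1 + m2) / 2             # mean of the two measurements
bias = diff.mean()
loa = 1.96 * diff.std(ddof=1)         # 95% limits of agreement

plt.scatter(mean_pair, diff)
plt.axhline(bias, color="k")                   # mean difference
plt.axhline(bias + loa, linestyle="--")        # upper limit of agreement
plt.axhline(bias - loa, linestyle="--")        # lower limit of agreement
plt.xlabel("Mean of the two measures")
plt.ylabel("Difference (method 2 - method 1)")
plt.show()
```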
Suppose there is no gold standard, then how
do we evaluate validity?
…..
We make inferences from inter-method
reliability studies!
Note: will not be able to estimate bias when
the two measures are based on different
scales
Inferences about validity from inter-method
reliability studies
• Suppose two different methods (instruments) are used to
measure the same continuous exposure. Let X1 denote the
measure of interest (i.e., the one to be used to measure the
exposure in the study) and X2 the comparison measure
• We have the reliability coefficient ρ_X1X2
• However, we are actually interested in the validity coefficient
ρ_TX1 (the correlation of X1 with the true value T)
• Example: Is self-reported physical activity valid? Compare it to
the 4-week diary.
Relationship of Reliability to Validity

Errors of X1 and X2 are:                                 Relationship b/w reliability and validity   Usual application
1. Uncorrelated and both measures are equally precise    ρ_TX1 = ρ_TX2 = √ρ_X1X2                      Intramethod study
2. Uncorrelated, X2 is more precise than X1              ρ_X1X2 ≤ ρ_TX1 ≤ √ρ_X1X2                     Intermethod study
3. Uncorrelated, X1 is more precise than X2              ρ_TX1 ≥ √ρ_X1X2                              Intermethod study
4. Correlated errors and both measures are equally       ρ_TX1 ≤ √ρ_X1X2                              Intramethod or
   precise                                                                                            intermethod study

(These relationships follow from the fact that, when the errors are
uncorrelated, ρ_X1X2 = ρ_TX1 × ρ_TX2.)

Take home message: In most situations the square root of the reliability
coefficient can provide an upper limit to the validity coefficient
Inferences about validity from inter-method
reliability studies
• In our example, X1 is the measure of interest (i.e., the one to be
used to measure the exposure in the study: self-reported
activity) and X2 is the comparison measure (4-wk diaries)
• We have the reliability coefficient ρ_X1X2 = 0.79
• Errors in X1 and X2 are likely to be uncorrelated and X2 is more
precise than X1, so
0.79 < ρ_TX1 < √0.79 ≈ 0.89
• So, self-reported activity appears to be a valid measure
Summary of Inferences From
Reliability to Validity
• Reliability studies are used to interpret validity of x.
• Reliability is necessary for validity (instrument cannot be
valid if it is not reproducible).
• Reliability is not sufficient for validity - repetition of test
may yield same result because both X1 and X2 measure
some systematic error (i.e., errors are correlated).
• Reliability can only give an upper limit on validity. If the
upper limit is low, then the instrument is not valid.
• An estimate of reliability (or validity) depends on the
sample (i.e., may vary by age, gender, etc.)
Reliability of continuously distributed
variables
• Pearson product-moment correlation?
• Spearman rank correlation?
But…does correlation tell you about
relationship or agreement?

Measure 1    Measure 2
   150          155
   155          158
   160          165
   163          170
   170          176
   174          184

[Figure: scatter plot of Measure 2 against Measure 1 for these six pairs]

Pearson’s Correlation coefficient = 0.99
Is this measure reliable?
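To see why the answer is no, here is an illustrative Python check using the six pairs above: the correlation is near 1 even though Measure 2 runs systematically higher than Measure 1, so a high correlation does not by itself demonstrate agreement.

```python
import numpy as np

m1 = np.array([150, 155, 160, 163, 170, 174])
m2 = np.array([155, 158, 165, 170, 176, 184])

print(round(np.corrcoef(m1, m2)[0, 1], 2))   # about 0.99: near-perfect linear relationship

diff = m2 - m1
print(diff)                                  # every difference is positive
print(round(diff.mean(), 1))                 # about 6 units of systematic offset: poor agreement
```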
Reliability of continuously distributed
variables
• Other methods are generally preferred for intra- or inter-observer
reliability when the same method/instrument is used
- Intraclass correlation coefficients (ICC): is calculated
using variance estimates obtained through an
analysis of variance (ANOVA)
- Bland-Altman plots
• Correlation coefficient useful for inter-method reliability
to make inferences about validity (especially when the
measurement scale differs for the two methods)
Intraclass Correlation Coefficient (ICC)

ICC = Between-person variance / Total variance

where Total variance = Between-person variance + Within-person variance
• If within-person variance is very high, then measurement error
can "overwhelm" the measurement of between person
differences.
• If between-person differences are obscured by measurement
error, it becomes difficult to demonstrate a correlation between
the imperfectly measured characteristic and any other variable
of interest.
• ICC is computed using ANOVA
ANalysis Of Variance (ANOVA)
in a reliability study
In a reliability study, we are not studying associations
b/w predictors and outcome, so we will express the
overall variability in the measurement as a function
of between-subjects and within-subjects
variability
SST = SSB + SSW
• So let’s consider a test-retest reliability study, where
multiple measurements are taken for each subject
Total Variation

SST = Σi Σj (Xij - X̄)²

i.e., the sum of squared deviations of every measurement Xij from the
grand mean X̄, over all n subjects and their repeated measurements.

[Figure: response X plotted for Subjects 1, 2, and 3, showing each
measurement's deviation from the grand mean X̄]
Between-Subject Variation

SSB = k1 (X̄1 - X̄)² + k2 (X̄2 - X̄)² + ... + kn (X̄n - X̄)²

where ki = number of measurements taken on subject i and X̄i = mean of
subject i's measurements.

[Figure: response X for Subjects 1, 2, and 3, showing each subject's mean
X̄1, X̄2, X̄3 relative to the grand mean X̄]
Within-Subject Variation

SSW = Σi Σj (Xij - X̄i)²

i.e., the sum of squared deviations of each measurement from that
subject's own mean X̄i.

[Figure: response X for Subjects 1, 2, and 3, showing each measurement's
deviation from that subject's mean]
One-way analysis of variance for computation of ICC: test-retest study

Source of variance               Sum of squares (SS)    Degrees of freedom (df)   Mean square (MS=SS/df)
Between subjects                 k Σi (X̄i - X̄)²          n - 1                     BMS
Within subjects (random error)   Σi Σj (Xij - X̄i)²       n(k - 1)                  WMS
Total                            Σi Σj (Xij - X̄)²        nk - 1

Here, each subject is a group.
k = # times measure is repeated

ICC = ρ̂_x = (BMS - WMS) / [BMS + (k - 1) WMS]
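A minimal Python sketch of this one-way ICC computation for a test-retest study (the data below are hypothetical; n subjects in rows, k repeated measurements in columns):

```python
import numpy as np

# Hypothetical test-retest data: rows = subjects, columns = repeated measurements
X = np.array([[150, 152],
              [160, 163],
              [170, 168],
              [155, 158],
              [165, 166]], dtype=float)
n, k = X.shape

grand_mean = X.mean()
subject_means = X.mean(axis=1)

ssb = k * ((subject_means - grand_mean) ** 2).sum()   # between-subject sum of squares
ssw = ((X - subject_means[:, None]) ** 2).sum()       # within-subject sum of squares

bms = ssb / (n - 1)            # between-subjects mean square
wms = ssw / (n * (k - 1))      # within-subjects mean square

icc = (bms - wms) / (bms + (k - 1) * wms)
print(round(icc, 2))           # one-way (test-retest) ICC
```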
Interpretation of ICC
ICC = ρ̂_x = (BMS - WMS) / [BMS + (k - 1) WMS]
• If within-person variance is very high, then measurement error
can "overwhelm" the measurement of between person
differences.
• If between-person differences are obscured by measurement
error, it becomes difficult to demonstrate a correlation between
the imperfectly measured characteristic and any other variable
of interest.
Interpretation of ICC
• The ICC ranges b/w 0 and 1 and is a measure of
reliability adjusted for chance agreement
• An ICC of 1 is obtained when there is perfect
agreement and in general a higher ICC is obtained
when the within-subject error (i.e., random error) is
small.
• Hence, ICC=1 only when there is exact agreement between
measures (i.e., Xi1=Xi2=...Xik for each subject).
• Generally, ICCs greater than 0.7 are considered to
indicate good reliability.
Two-way fixed effects ANOVA for computation of ICC (inter-rater reliability)

Source of variance               Sum of squares (SS)    Degrees of freedom (df)   Mean square (MS=SS/df)
Between subjects                 k Σi (X̄i - X̄)²          n - 1                     SMS
Between measures                 n Σj (X̄j - X̄)²          k - 1                     MMS
Within subjects (random error)   by subtraction          (n - 1)(k - 1)            EMS
Total                            Σi Σj (Xij - X̄)²        nk - 1

ρ̂_x = n(SMS - EMS) / [n·SMS + (k - 1)·MMS + (n - 1)(k - 1)·EMS]
Measuring reliability of categorical
variables
• Percent agreement or concordance rate
• Kappa statistic
Reliability of categorical variables
• Concordance rate is the proportion of
observations on which the two observers agree
• Example: Agreement matrix for radiologists
reading mammography for breast cancer
                    Radiologist B
Radiologist A       Yes (+)     No (-)     Total
Yes (+)                a           b        a+b
No (-)                 c           d        c+d
Total                 a+c         b+d

Overall % agreement = (a+d) / (a+b+c+d)
Concordance rates: limitations
• Considerable agreement could be expected by
chance alone.
• Misleading when the observations are not
evenly distributed among the categories (i.e.,
when the proportion “abnormal” on a
dichotomous test is substantially different from
50%)
So, what reliability measures should we use?
Kappa
• Kappa is another measurement of reliability
• Kappa measures the extent of agreement beyond what would
be expected by chance alone
• Can be used for binary variables or variables with >2
levels
Cohen’s Kappa (κ): some notation
• A reliability study in which n subjects have each
been measured twice, where each measure is a
nominal variable with k categories.
• It is assumed that the two measures are equally
accurate.
• κ is a measure of agreement that corrects for the
agreement that would be expected by chance.
Cohen’s Kappa
Table. Layout of data for computations of Cohen’s κ and weighted κ

                                 Measure 2 (or Rater 2)
                            1       2      ...     k      Total
Measure 1       1          p11     p12     ...    p1k      r1
(or Rater 1)    2          p21     p22     ...    p2k      r2
                .           .       .              .        .
                k          pk1     pk2     ...    pkk      rk
                Total      c1      c2      ...    ck        1
Cohen’s Kappa
The observed proportion of agreement, Po, is the sum of the
proportions on the diagonal of the table above:

Po = Σi pii,  i = 1, …, k
Cohen’s Kappa
The expected proportion of agreement (on the diagonal), Pe, is:

Pe = Σi ri·ci,  i = 1, …, k

where ri and ci are the marginal proportions for the 1st and 2nd
measure, respectively.
Kappa
Then, kappa is estimated by:

κ̂ = (Po - Pe) / (1 - Pe)

which is:
[Observed agreement (%) - Expected agreement (%)] / [100% - Expected agreement (%)]

• 1 - Pe = maximum possible nonchance agreement, or 100% less the
contribution of chance
• Po - Pe = proportion of observations that can be attributed to
reliable measurement (i.e., not due to chance)
• So kappa is the ratio of the number of observed nonchance
agreements to the number of possible nonchance agreements
Pictorial of kappa statistic
[Figure: a bar running from 0 to 100% agreement, divided into the agreement
expected by chance, the observed agreement, and the remaining potential
improvement beyond chance. Kappa = % of maximum possible improvement over
that expected by chance alone (kappa = 0.50 here)]
Kappa

κ̂ = (Po - Pe) / (1 - Pe)

• Kappa ranges from -1 (perfect disagreement) to +1
(perfect agreement)
• Kappa of 0 means that: observed agreement =
expected agreement
Reliability of categorical variables
• Example 1: Agreement matrix for radiologists
reading mammography for breast cancer

                    Radiologist B
Radiologist A       Yes (+)     No (-)     Total
Yes (+)             21 (a)      43 (b)       64
No (-)               3 (c)      83 (d)       86
Total                 24         126        150

Overall % agreement = (a+d) / (a+b+c+d) = (21+83)/150 = 0.69

κ̂ = (Po - Pe) / (1 - Pe) = (0.69 - 0.55) / (1 - 0.55) = 0.31
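A short Python check of this example, using the counts from the table above (illustrative sketch):

```python
import numpy as np

# Rows = Radiologist A (yes/no), columns = Radiologist B (yes/no)
counts = np.array([[21, 43],
                   [3, 83]], dtype=float)
p = counts / counts.sum()                     # cell proportions

po = np.trace(p)                              # observed agreement (diagonal)
pe = (p.sum(axis=1) * p.sum(axis=0)).sum()    # agreement expected by chance
kappa = (po - pe) / (1 - pe)

print(round(po, 2), round(pe, 2), round(kappa, 2))
# 0.69 0.55 0.32 (the slide's 0.31 rounds Po and Pe before dividing)
```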