Chapter 3-13. Sensitivity Analysis (Bias Analysis)

In this chapter, we discuss using sensitivity analysis as a way to address misclassification error.
We cannot simply recode the data, because that would be guessing. We can, however, determine
what impact a possible misclassification error would have on the result.
Sensitivity analysis provides “what if” scenarios. For example, “Assuming that the outcome
variable is misclassified 10% of the time, so that 10% of the cases were missed, what would the
odds ratio be if this misclassification had not occurred?”
A chapter on this subject, written by Greenland, is found in the Rothman and Greenland text
(1998, chapter 19), updated in Rothman, Greenland, and Lash (2008, chapter 19).
Greenland (1998, p. 343) advises “Potential biases due to unmeasured confounders, classification
errors, and selection bias need to be addressed in any thorough discussion of study results.”
Basic Approach
In a sensitivity analysis, the investigator imputes what the data would be if the bias did not occur
and then analyzes the imputed data.
In this chapter, we apply this basic approach to two types of bias:
1) verification bias (diagnostic test problem)
2) misclassification bias
_____________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah
School of Medicine, 2010.
Correction for Verification Bias Problem in Evaluating a Diagnostic Test
Verification bias is a problem in cohort studies where the screened positive cases are more
completely verified for true disease status than are the screened negative cases. That is, the gold
standard variable is collected a greater proportion of the time when the diagnostic test is positive
than when it is negative. Other terms for verification bias are work-up bias, referral bias,
selection bias, and ascertainment bias (Pepe, 2003, pp. 168-169).
The bias makes the sensitivity estimate too high and the specificity estimate too low when
ordinary formulas (naïve estimates) for these test characteristics are applied to your sample data.
You can quote Pepe (2003, p.169) as a citation for this:
“When screen positives are more likely to be verified for disease than screen negatives,
the bias in naïve estimates is always to increase sensitivity and to decrease specificity
from their true values.”
Pepe (2003, p. 168) gives the following example.
The table on the left shows data for a cohort where all screens are verified with the gold
standard. On the right, all screen positives are verified but only 10% of screen negatives
are verified.
Fully observed                                           Selected data
             Disease = 1   Disease = 0                                Disease = 1   Disease = 0
               (gold)        (gold)                                     (gold)        (gold)
Screen = 1        40            95       135            Screen = 1        40            95       135
  (test)
Screen = 0        10           855       865            Screen = 0         1            85        86
  (test)
                  50           950      1000                              41           180       221
Fully observed:
Sensitivity = True Positive Fraction (TPF) = 40/50 = 80%
Specificity = 1 – FPF = 855/950 = 90%
(Biased) naïve estimates based on the selected data
Sensitivity = TPF = 40/41 = 97.6%
Specificity = 1 – FPF = 85/180 = 47.2%
In this example, the naïve sample estimates are biased, with sensitivity being too high and
specificity being too low, which is consistent with the known direction of this bias. If the
estimates were not biased, the sample (selected data table), which is assumed representative of
the population (fully observed table), should provide estimates that accurately reflect the
population values.
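
As a quick check of this arithmetic, the naïve and fully observed estimates can be reproduced
with Stata display commands (a minimal sketch only; the begggreenes command introduced below
performs the actual bias correction):

display "naive sensitivity          = " %6.4f 40/41
display "naive specificity          = " %6.4f 85/180
display "fully observed sensitivity = " %6.4f 40/50
display "fully observed specificity = " %6.4f 855/950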
Inverse Probability Weighting/Imputation
One approach to this problem is to use the Begg and Greenes bias-adjusted estimates, which
apply Bayes’ theorem (Begg and Greenes, 1983; Pepe, 2003).
More simply, we can also obtain an unbiased estimate by re-creating the fully observed table from
the selected data, based on the probability of verification given the screening result. This is called
the inverse probability weighting (IPW) approach. The IPW and Begg-Greenes approaches provide
identical unbiased estimates.
Defining a variable V, which is an indicator variable for disease verification status (Pepe, 2003,
p.169),

    V = 1 if D is ascertained
    V = 0 if D is not ascertained

we then multiply every cell in the selected data table by

    1 / P̂[V = 1 | screen result]

which is the inverse of the estimated selection probability.
This formula is just a fancy way of saying: multiply the observed table frequencies by
1/(proportion for which disease is verified).
The result is called the inverse probability weighted table or the imputed data table (Pepe, 2003,
p.171).
In the example,

    P̂[V = 1 | Screen = 1] = 1.0 , since all screened positives were verified
    P̂[V = 1 | Screen = 0] = 0.1 , since 10% of screened negatives were verified.
Inverse weighting the cells of the selected data table,
Selected data                                      Imputed data
          D = 1    D = 0                                     D = 1    D = 0
  S = 1     40       95     135     × 1/1.0          S = 1     40       95     135
  S = 0      1       85      86     × 1/0.1          S = 0     10      850     860
            41      180     221                                50      945     995
and then calculating the test characteristics using ordinary formulas,
Sensitivity[IPW] = 40/50 = 80%
Specificity[IPW] = 850/945 = 90%
which are equal to the original population values, and so are unbiased.
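
The same imputation arithmetic can be sketched directly in Stata by weighting each cell of the
selected data table by the inverse of its verification probability (a minimal sketch of the
calculation; the begggreenes command described in the next section does this for you):

* inverse probability weights for each screening result
local w1 = 1/1.0     // screen positives: all verified
local w0 = 1/0.1     // screen negatives: 10% verified

* imputed (inverse probability weighted) cell frequencies
local a = 40*`w1'    // screen +, disease +
local b = 95*`w1'    // screen +, disease -
local c =  1*`w0'    // screen -, disease +
local d = 85*`w0'    // screen -, disease -

display "IPW sensitivity = " %6.4f `a'/(`a'+`c')    // 40/50   = 0.8000
display "IPW specificity = " %6.4f `d'/(`b'+`d')    // 850/945 = 0.8995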
Confidence intervals can be computed for these IPW estimates using a formula by Begg and
Greenes, which uses the observed sample size rather than the imputed sample size (Begg and
Greenes, 1983; Pepe, 2003). Pepe (2003, p.174), however, suggests using a bootstrap approach
to obtain the confidence interval, because the Begg and Greenes formula does not appear to be
reliable for sample sizes that investigators normally collect.
Software
I programmed this for you. The Begg and Greenes unbiased estimates for sensitivity and
specificity, as well as the IPW estimates, along with the Begg-Greenes confidence interval, can
be computed using the Stata ado-file begggreenes.ado, found in the ado files subdirectory.
Note: Using begggreenes.ado
In the command window, execute the command
sysdir
This tells you the directories Stata searches to find commands, or ado files.
It will look like:
STATA: C:\Program Files\Stata10\
UPDATES: C:\Program Files\Stata10\ado\updates\
BASE: C:\Program Files\Stata10\ado\base\
SITE: C:\Program Files\Stata10\ado\site\
PLUS: c:\ado\plus\
PERSONAL: c:\ado\personal\
OLDPLACE: c:\ado\
I suggest you copy the file begggreenes.ado and begggreenes.hlp from the ado files subdirectory
to the c:\ado\personal\ directory. Alternatively, you can simply make sure these two files are in
your working directory (the directory shown in bottom left-corner of Stata screen). Having done
that, begggreenes becomes an executable command in your installation of Stata. If the directory
c:\ado\personal\ does not exist, then you should create it using Windows Explorer (My
Documents icon), and then copy the two files into this directory. The directory is normally
created by Stata the first time you update Stata.
To get help for begggreenes, use help begggreenes in the command window.
To execute, use the command begggreenes followed by the two required variable names and
three options.
The syntax is found in the help file.
help begggreenes
Syntax for begggreenes
----------------------

    [by byvar:] begggreenes yvar dvar [if] [in] , cohortsize( ) pv1( ) pv0( )

    where yvar is name of dichotomous test variable
          dvar is name of dichotomous disease variable (gold standard)
          cohortsize(n), where n = size of study cohort
          pv1(x), where x = number between 0 and 1 is the proportion of
                  the yvar=1 subjects in the cohort that have nonmissing
                  dvar (have verification of disease)
          pv0(y), where y = number between 0 and 1 is the proportion of
                  the yvar=0 subjects in the cohort that have nonmissing
                  dvar (have verification of disease)

    Note: the two variables and 3 options are required.

Description
-----------
    begggreenes computes the Begg and Greenes (1983) unbiased estimators for
    sensitivity and specificity, along with both asymptotic and bootstrapped CIs.

Reference
---------
    Begg CB, Greenes RA. Assessment of diagnostic tests when disease
    verification is subject to selection bias. Biometrics 1983;39:207-215.

Example
-------
    begggreenes yvar dvar , cohortsize(1000) pv1(1) pv0(.1)
To obtain the statistics discussed in the example, first bring the data into Stata using
clear
input yvar dvar count
1 1 40
1 0 95
0 1 1
0 0 85
end
expand count
drop count
Then, to compute the statistics, use
begggreenes yvar dvar , cohortsize(1000) pv1(1) pv0(.1)
Sample Data
                     disease (gold)
   test            +           -
   --------------------------------------
     +    |       40          95  |    135
     -    |        1          85  |     86
   --------------------------------------
          |       41         180  |    221

Imputed Inverse Probability Weighting Population Data
                     disease (gold)
   test            +           -
   --------------------------------------
     +    |       40          95  |    135
     -    |       10         850  |    860
   --------------------------------------
          |       50         945  |    995

Sensitivity (Begg & Greenes) = 0.7991     95% CI (0.3571 , 0.9661)
Specificity (Begg & Greenes) = 0.9000     95% CI (0.8791 , 0.9176)

Sensitivity (Inverse Probability Weighted) = 0.8000
Specificity (Inverse Probability Weighted) = 0.8995

Cohort N = 1000
Proportion cohort with positive test disease verified = 1.0000
Proportion cohort with negative test disease verified = 0.1000
These results agree with the results in the above text, as well as agree with those shown in Pepe
(2003) where she presented this example, which should give you some confidence that it was
programmed correctly.
To get bootstrapped confidence intervals, as suggested by Pepe (2003), use the following
command. It will use four bootstrapping methods. The most commonly reported approach is the
bias-corrected CI, although the bias-corrected and accelerated CI is supposed to be superior.
bootstrap r(unbiased_sensitivity_BG) r(unbiased_specificity_BG), ///
reps(1000) size(221) seed(999) bca: ///
begggreenes yvar dvar , cohortsize(1000) pv1(1) pv0(.1)
estat bootstrap, all
Bootstrap results                               Number of obs      =       221
                                                Replications       =      1000

      command:  begggreenes yvar dvar, cohortsize(1000) pv1(1) pv0(.1)
        _bs_1:  r(unbiased_sensitivity)
        _bs_2:  r(unbiased_specificity)

------------------------------------------------------------------------------
             |    Observed               Bootstrap
             |       Coef.       Bias    Std. Err.  [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |   .79907085   .0366465    .15247002   .5002351   1.097907   (N)
             |                                        .5238683          1   (P)
             |                                        .4854757          1  (BC)
             |                                        .3863797          1 (BCa)
       _bs_2 |   .89999388   5.30e-06     .0074108    .885469   .9145188   (N)
             |                                        .8849489   .9142065   (P)
             |                                        .8843828   .9136888  (BC)
             |                                        .8843828   .9136316 (BCa)
------------------------------------------------------------------------------
(N)    normal confidence interval
(P)    percentile confidence interval
(BC)   bias-corrected confidence interval
(BCa)  bias-corrected and accelerated confidence interval
Article Suggestion
Here is a suggestion for reporting this approach in the Statistical Methods section of your
article.
Given that 100% of the patients who tested positive on the screening test had their disease
verified using the gold standard test, while only 10% of the patients who tested negative on the
screening test had their disease verified, ordinary estimates of sensitivity and specificity are
subject to verification bias (Pepe, 2003). Therefore, we report Begg and Greenes estimates of
sensitivity and specificity, where the estimates are corrected for verification bias using a Bayes
Theorem approach (Begg and Greenes, 1983; Pepe, 2003). Pepe has shown the asymptotic
confidence intervals to be unreliable for sample sizes used in research studies and instead
recommends bootstrapped confidence intervals (Pepe, 2003). Thus, we report bootstrapped
confidence intervals using the “bias-corrected” method (Carpenter and Bithell, 2000), where
“bias” in this sense refers not to verification bias, but rather to an adjustment that brings the
confidence intervals closer to their expected values.
Misclassification Bias
In a misclassification problem, the following probabilities are useful for quantifying the
misclassification error (Greenland, 1998, p.347):
Exposure (predictor variable) Classification
Sensitivity = probability someone exposed is classified as exposed
= Prob(data: exposed = 1 | truth: exposed = 1)
Specificity = probability someone unexposed is classified as unexposed
= Prob(data: exposed = 0 | truth: exposed = 0)
Disease (outcome variable) Classification
Sensitivity = probability someone diseased is classified as diseased
= Prob(data: diseased = 1 | truth: diseased = 1)
Specificity = probability someone nondiseased is classified as nondiseased
= Prob(data: diseased = 0 | truth: diseased = 0)
These definitions are identical to the standard definitions of sensitivity and specificity used when
comparing a diagnostic test to a gold standard (see box).
In Greenland’s chapter, he gives a method that involves solving a system of simultaneous
equations to derive the adjusted cell frequencies. After that, you use standard methods for
calculating the odds ratio and confidence interval.
This approach is available in PEPI and in Stata.
Test Characteristics: Sensitivity and Specificity
With the data in the required form for Stata’s diagt command:
                                     Gold Standard “true value”
                              disease present ( + )    disease absent ( - )
Test “probable value”
   disease present ( + )       a (true positives)       c (false positives)      a + c
   disease absent  ( - )       b (false negatives)      d (true negatives)       b + d
                                     a + b                    c + d

We define the following terminology, expressed as percents:

sensitivity = (true positives)/(true positives plus false negatives)
            = (true positives)/(all those with the disease)
            = a / (a + b) × 100

specificity = (true negatives)/(true negatives plus false positives)
            = (true negatives)/(all those without the disease)
            = d / (c + d) × 100
PEPI Software: Misclassification of Exposure, Disease, or Both
A sensitivity analysis of nondifferential (assumed same for cases and controls) misclassification
of exposures, as well as differential misclassification (different for cases and controls), as
described in Greenland’s Chapter 19, can be computed using the PEPI software (Abramson and
Gahlinger, 2001). The software performs a sensitivity analysis for any of the three basic study
designs: cross-sectional, case-control, or cohort. These can be matched or non-matched analyses.
It allows for misclassification in the disease variable, the exposure variable, or both.
The module we need, misclass.exe, is in the PEPI Windows Programs subdirectory. It is a DOS
program that runs in Windows, but a Macintosh version is not available.
Greenland (1998, p.344, Table 19-1) presents the following crude data for an unmatched
case-control study of occupational resin exposure and lung cancer mortality, with an odds ratio of
1.76.

                        Cases (D = 1)    Controls (D = 0)
  Exposed (E = 1)            45                257
  Nonexposed (E = 0)         94                945
  Total                     139               1202
source: Greenland S, Salvan A, Wegman DH, et al. A case-control study of
cancer mortality at a transformer-assembly facility. Int Arch Occup
Env Health 1994;66:49-54.
In Greenland’s (1998, p.349) Table 19-4, he presents a sensitivity analysis for a variety of
scenarios. For the following scenario of differential misclassification of resin exposure:
    Controls:  Sensitivity = 0.80,  Specificity = 0.90
    Cases:     Sensitivity = 0.90,  Specificity = 0.90
the corrected odds ratio is 2.00.
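
Before turning to PEPI, it helps to see where the corrected frequencies come from. With no
misclassification of the disease variable, only the exposure classification is mixed, so within each
disease group the observed number exposed, a*, satisfies a* = Se×A + (1 - Sp)×(N - A), where A
is the true number exposed and N is the group total. Solving for A gives
A = (a* - (1 - Sp)×N) / (Se + Sp - 1). A minimal Stata sketch of this back-calculation (not the
PEPI program itself) reproduces the corrected odds ratio:

* corrected number exposed among cases (Se = .90, Sp = .90, a* = 45, N = 139)
local Acase = (45 - (1-.90)*139)/(.90 + .90 - 1)
* corrected number exposed among controls (Se = .80, Sp = .90, a* = 257, N = 1202)
local Acont = (257 - (1-.90)*1202)/(.80 + .90 - 1)
local orcorr = (`Acase'/(139-`Acase'))/(`Acont'/(1202-`Acont'))

display "corrected exposed cases    = " `Acase'     // 38.875 (PEPI shows 38.87)
display "corrected exposed controls = " `Acont'     // approximately 195.43
display "corrected odds ratio       = " `orcorr'    // approximately 2.00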
We will practice with the PEPI software, by seeing if we can duplicate Greenland’s result.
Since PEPI is a DOS program, it cannot execute unless the directory name is no more than eight
characters with no embedded spaces. We can accomplish this by copying misclass.exe from the
datasets & do files subdirectory to the desktop. Do that now.
Double clicking on the misclass.exe icon on the desktop,
Screen 1.
Press <Ent> to put the results on the screen only
just hit the enter key
Screen 2.
Type of study
enter 2, since this is an unmatched case-control study design
Screen 3.
Were the same sampling ratios used for both samples (Y/N)?
enter N, since there are an unequal number of cases and controls
Enter an estimate of the disease rate (per 1000) in the
target population, according to the diagnostic method used in
this study.
This is referring to the population disease prevalence. It does not seem to matter,
so just enter a number. [The PEPI manual states that this number is used to
base the calculations on the population frequencies, but the program then scales the corrected
frequencies down to match the sizes of the samples.] Since it does this scaling
down, it really does not matter what is entered here.
enter 1
Screen 4.
Enter the sensitivity of the disease measure (percent):
Among the exposed (or just press <Ent> if 100%)
just hit the enter key – in this example we are assuming no misclassification of
the disease measure
Among the nonexposed (or just press <Ent> if the same)
just hit the enter key
Enter the specificity of the disease measure (percent):
Among the exposed (or just press <Ent> if 100%)
just hit the enter key
Among the nonexposed (or just press <Ent> if the same)
just hit the enter key
Enter the sensitivity of the exposure measure (percent):
Among the diseased (or just press <Ent> if 100%)
enter 90
Among the nondiseased (or just press <Ent> if the same)
enter 80
Enter the specificity of the exposure measure (percent):
Among the diseased (or just press <Ent> if 100%)
enter 90
Among the nondiseased (or just press <Ent> if the same)
just hit the enter key, to repeat the 90
Screen 5.
Enter the observed frequencies:
Diseased and exposed
enter 45
Diseased and not exposed
enter 94
Not diseased, exposed
enter 257
Not diseased, not exposed
enter 945
Screen 6.
Shows output
-----------------------------------------------------------------
Sensitivity of disease measure:   exposed 100%, nonexposed 100%
Specificity of disease measure:   exposed 100%, nonexposed 100%
Sensitivity of exposure measure:  diseased 90%, nondiseased 80%
Specificity of exposure measure:  diseased 90%, nondiseased 90%

                              Observed frequency   Adjusted frequency
Diseased and exposed                  45                  38.87
Diseased and not exposed              94                 100.13
Not diseased, exposed                257                 195.43
Not diseased, not exposed            945                1006.57

Observed odds ratio  = 1.760
Corrected odds ratio = 2.000

If the adjusted frequencies are rounded off:
  corrected OR = 2.014 (approx. 95% CI = 1.349 to 3.006)

enter Q to quit
-----------------------------------------------------------------
Enter Q to quit PEPI.
The PEPI software uses the method given in Greenland’s chapter, which involves solving a
system of simultaneous equations to derive the adjusted cell frequencies, after which standard
methods are used for calculating the effect estimate (odds ratio) and confidence interval.
We can verify this using Stata. Open Stata, and use the menus
Statistics
Epidemiology and related
Tables for epidemiologists
Case-control odds ratio calculator
39 100
195 1007
Exact confidence intervals (the default)
OK
cci 39 100 195 1007
                                                         Proportion
                 |   Exposed   Unexposed  |      Total     Exposed
-----------------+------------------------+------------------------
           Cases |        39         100  |        139       0.2806
        Controls |       195        1007  |       1202       0.1622
-----------------+------------------------+------------------------
           Total |       234        1107  |       1341       0.1745
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |         2.014          |    1.31109    3.046162 (exact)
 Attr. frac. ex. |      .5034757          |   .2372758    .6717181 (exact)
 Attr. frac. pop |       .141263          |
                 +-------------------------------------------------
                               chi2(1) =    12.11  Pr>chi2 = 0.0005
We see the odds ratio for rounded adjusted frequencies matches the PEPI output. The
confidence interval does not, because PEPI uses a Woolf confidence interval. To get that, we
can use,
Statistics
Epidemiology and related
Tables for epidemiologists
Case-control odds ratio calculator
39 100
195 1007
Woolf approximation
OK
cci 39 100 195 1007 , woolf
                                                         Proportion
                 |   Exposed   Unexposed  |      Total     Exposed
-----------------+------------------------+------------------------
           Cases |        39         100  |        139       0.2806
        Controls |       195        1007  |       1202       0.1622
-----------------+------------------------+------------------------
           Total |       234        1107  |       1341       0.1745
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |         2.014          |   1.349303    3.006142 (Woolf)
 Attr. frac. ex. |      .5034757          |   .2588765    .6673477 (Woolf)
 Attr. frac. pop |       .141263          |
                 +-------------------------------------------------
                               chi2(1) =    12.11  Pr>chi2 = 0.0005
This time the CI agrees with PEPI.
The result agrees with Greenland’s Table 19-4. The OR = 2.00 (exact adjusted frequencies) or
OR = 2.01 (adjusted frequencies rounded) is the odds ratio that we would expect if no
misclassification were present.
That is, in our manuscript, we show the result from our observed data. Then we use PEPI to
come up with the effect measure that we would get if our data did not have the possible
misclassification error. We usually do this for a number of scenarios to provide a broad
sensitivity analysis.
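
One way to produce such a table of scenarios is to loop over assumed values of the bias
parameters, reusing the back-calculation sketched earlier. The grid of sensitivity values below is
illustrative only, with the exposure specificity held at 0.90 in both groups:

* illustrative grid of assumed exposure sensitivities (specificity fixed at .90)
foreach seca in .80 .90 1 {
    foreach seco in .80 .90 1 {
        local A1 = (45  - .10*139 )/(`seca' - .10)    // corrected exposed cases
        local A0 = (257 - .10*1202)/(`seco' - .10)    // corrected exposed controls
        local or = (`A1'/(139-`A1'))/(`A0'/(1202-`A0'))
        display "Se(cases) = " %4.2f `seca' "   Se(controls) = " %4.2f `seco' "   corrected OR = " %4.2f `or'
    }
}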
Using PEPI With Confounding Variable(s)
In his chapter, Greenland points out that the same approach can be used when confounders are
present. You simply use PEPI to compute the bias-corrected frequencies within each stratum of
the confounder, and then obtain the summary effect measure by the Mantel-Haenszel approach.
That is, we use PEPI to get the stratum-specific adjusted frequencies, and then enter these into
Stata to get the summary effect measure.
To illustrate, suppose the following data are the bias-corrected cell frequencies from PEPI.
Infants with congenital heart disease and Down syndrome and healthy
controls, by maternal spermicide use before conception and maternal
age at delivery
                          Age < 35 years                  Age ≥ 35 years
                      Used          Not Used          Used          Not Used
                    Spermicide     Spermicide       Spermicide     Spermicide
   Cases                 3              9                1              3
   Controls            104           1059                5             86
   Odds Ratio               3.39                             5.73
We would then enter this data into Stata using a do-file as follows:
clear
input age case spermicide cellcount
0 1 1    3
0 1 0    9
0 0 1  104
0 0 0 1059
1 1 1    1
1 1 0    3
1 0 1    5
1 0 0   86
end
drop if cellcount==0
expand cellcount
drop cellcount
To get the Mantel-Haenszel pooled odds ratio (also called the summary odds ratio), we use
Statistics
Epidemiology and related
Tables for epidemiologists
Case-control odds ratio
Main tab: Case variable: case
Exposed variable: spermicide
Options tab: Stratify on variable: age
Within-stratum weights: Use Mantel-Haenszel
OK
cc case spermicide, by(age)
             age |        OR       [95% Conf. Interval]    M-H Weight
-----------------+--------------------------------------------------
               0 |  3.394231     .5811941    13.87411       .7965957 (exact)
               1 |  5.733333     .0911619    85.89589       .1578947 (exact)
-----------------+--------------------------------------------------
           Crude |  3.501529      .808085    11.78958                (exact)
    M-H combined |  3.781172      1.18734    12.04142
---------------------------------------------------------------------
Test of homogeneity (M-H)      chi2(1) =     0.14  Pr>chi2 = 0.7105

                   Test that combined OR = 1:
                                Mantel-Haenszel chi2(1) =      5.81
                                                Pr>chi2 =    0.0159
The summary OR = 3.78 would be our bias-corrected odds ratio after adjusting for, or stratifying
on, age.
Logistic Regression When the Outcome is Measured with Uncertainty
Software for computing a sensitivity analysis for logistic regression when the outcome variable is
potential misclassification bias is available in Stata. This is the logitem command, which you
have to update Stata to get (see box).
Update your Stata now.
Updating Stata to get logitem command
In the Command Window, run the command,
findit logitem
This will display,
SJ-5-3   sg139_1 . . . . . . . . . . . . . . . . . Software update for logitem
         (help logitem if installed) . . . . . . . . . M. Cleves and A. Tosetto
         Q3/05   SJ 5(3):470
         updated to include prediction program file logite_p

STB-55   sg139 . Logistic reg. when binary outcome is measured with uncertainty
         (help logitem if installed) . . . . . . . . . M. Cleves and A. Tosetto
         5/00    pp.20--23; STB Reprints Vol 10, pp.152--156
         estimates a maximum-likelihood logit regression model using an
         EM algorithm when the outcome variable is measured imperfectly
         but with known sensitivity and specificity
Click on the sg139_1 link,

    INSTALLATION FILES                    (click here to install)
    sg139_1/logitem.ado
    sg139_1/logitem.hlp
    sg139_1/logite_p.ado

and then click on the (click here to install) link.

This will install everything, including what was available from the link sg139, which it updates.
This logistic regression approach is described in Magder and Hughes (1997). The article is a
good citation for your manuscript if you use the logitem command.
Magder and Hughes (1997, p. 198) provide the following example.
In a randomized trial of a smoking cessation program among pregnant women, the
subgroup of women randomized to the smoking cessation program are considered. It is
of interest to know what patient characteristics predict successful smoking cessation. The
outcome of smoking cessation is measured by self-report, and there were concerns about
the accuracy of the patients’ reports. The researchers believed that among those who
actually quit smoking, there was no reason to lie, so the probability that they reported
quitting was quite high (sensitivity 100%). However, among those who did not quit,
some might report that they did (specificity <100%). Assuming that up to 10% of these
women lied about quitting smoking, the task is to compute a sensitivity analysis to assess
the impact that an outcome specificity of 90% would have on estimates of the various
characteristics used to predict smoking cessation.
Using Magder and Hughes’ Example 2 (p.198) data, not yet adjusted for uncertainty, and entering
it into Stata,
clear
input habit quit cellfreq
1 1 101
1 0 153
0 1  15
0 0  92
end
drop if cellfreq==0   // not needed this time, but necessary sometimes
expand cellfreq
drop cellfreq
save magder, replace
This step has been completed [magder.dta]
The variables are scored as:
habit 1 = reported smoking less than one pack per day
0 = reported smoking at least one pack per day
quit 1 = reported quitting smoking
0 = reported not quitting smoking
Reading the data into Stata,
File
Open
Find the directory where you copied the course CD
Change to the subdirectory datasets & do-files
Single click on magder.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\datasets & do-files\magder.dta", clear
*
which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\”
cd “Biostats & Epi With Stata\datasets & do-files"
use magder.dta, clear
Computing a logistic regression,
Statistics
Binary outcome
Logistic regression (reporting odds ratios)
Model tab: Dependent variable: quit
Independent variables: habit
OK
logistic quit habit
Logistic regression                               Number of obs   =        361
                                                  LR chi2(1)      =      25.19
                                                  Prob > chi2     =     0.0000
Log likelihood = -214.06611                       Pseudo R2       =     0.0556

------------------------------------------------------------------------------
        quit | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       habit |   4.048802    1.24112     4.56   0.000     2.220236    7.383358
------------------------------------------------------------------------------
We observe a four-fold increase in the odds of quitting smoking if one’s smoking habit was less
than one pack per day, relative to a habit of at least one pack per day.
Performing a sensitivity analysis for our logistic regression model, to correct for a potential
misclassification of 10% error in overreporting having quit (specificity = 90%),
logitem quit habit ,sens(1.0) spec(.9)
logistic regression when outcome is uncertain     Number of obs   =        361
                                                  LR chi2(1)      =      -0.00
Log likelihood = -214.06611                       Prob > chi2     =     1.0000

------------------------------------------------------------------------------
             | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       habit |   10.57174   9.384482     2.66   0.008     1.855835    60.22181
------------------------------------------------------------------------------
Whereas the odds ratio for our observed data was OR = 4.05, if our data did not have this much
misclassification in the outcome, we would expect to get OR = 10.57.
The advantage of the logitem command is that it allows for as many predictor variables as we
want to model.
The shortcoming of the logitem command is that it does not allow you to adjust for
misclassification in the exposure (predictor) variables.
A More Up-to-Date Approach
The Greenland chapter on sensitivity analysis (Greenland, 1998), which the PEPI software was
based on, was expanded considerably in the later edition of the textbook (Greenland, 2008).
The approach implemented in PEPI-4 is called deterministic sensitivity analysis.
Deterministic sensitivity analysis assumes that the bias parameters (our guesses of sensitivity and
specificity of the exposure and disease variables) are specified without error. That is, the bias
parameters are treated as if they are perfectly measured themselves, or as if they can only take on
a limited combination of values, being the different combinations we choose to consider if we
show a table of possibilities. Neither treatment is likely to be correct. (Greenland, 2008, p.364).
The deterministic sensitivity analysis, then, does not adjust the p value and confidence intervals
to take into account the uncertainty of the bias parameters. Probabilistic sensitivity analysis does
make this adjustment. It chooses the bias parameters from a distribution, computes the unbiased
estimates, and then repeats the process a number of times, a type of Monte Carlo simulation. The
distribution of adjusted values can then be summarized using percentiles. The 50th percentile, or
median, represents the adjusted estimate, such as a risk ratio. The 2.5th and 97.5th percentiles
represent a 95% confidence interval around this estimate. (Greenland, 2008, p.365)
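
To make the idea concrete, here is a minimal Monte Carlo sketch for the resin example (assuming
a reasonably current Stata), drawing the exposure sensitivities from illustrative, assumed uniform
distributions, back-correcting the cell counts in each draw, and summarizing the adjusted odds
ratios by percentiles. It ignores random sampling error and is only meant to show the logic; the
episens command described next implements the method properly:

clear
set seed 12345
set obs 5000                         // number of simulation draws
gen se_ca = .85 + .10*runiform()     // Se among cases    ~ Uniform(.85, .95)  (assumed)
gen se_co = .75 + .10*runiform()     // Se among controls ~ Uniform(.75, .85)  (assumed)
gen sp    = .90                      // Sp held fixed in both groups (assumed)

* back-corrected exposed counts and adjusted odds ratio in each draw
gen A1 = (45  - (1-sp)*139 )/(se_ca + sp - 1)
gen A0 = (257 - (1-sp)*1202)/(se_co + sp - 1)
gen or_adj = (A1/(139-A1))/(A0/(1202-A0))

_pctile or_adj, percentiles(2.5 50 97.5)
display "median adjusted OR = " %4.2f r(r2)
display "2.5th to 97.5th percentiles: " %4.2f r(r1) " to " %4.2f r(r3)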
The methods described in Greenland (2008) are now implemented in Stata 10. This software
performs both deterministic and probabilistic sensitivity analysis. It also permits simultaneous
bias adjustment for misclassification bias, selection bias, and uncontrolled confounders.
To use this in Stata, you first have to update your Stata to include it. While inside Stata, run the
following command:
findit episens
SJ-8-1   st0138 . . . . . Deterministic and probabilistic sensitivity analysis
         . . . . . N. Orsini, R. Bellocco, M. Bottai, A. Wolk, and S. Greenland
         (help episens, episensi if installed)
         Q1/08   SJ 8(1):29--48
         A tool for deterministic and probabilistic sensitivity
         analysis of epidemiologic studies that adjusts the
         relative risk for exposure misclassification, selection
         bias, and an unmeasured confounder
Clicking on the “SJ 8(1):29--48” link will take you to the StataCorp website, where you can
purchase the article for $7.50.
http://www.stata-journal.com/article.html?article=st0138
It will also provide you with the abstract of the article (Orsini et al, 2008), which is:
Abstract. Classification errors, selection bias, and uncontrolled confounders are likely
to be present in most epidemiologic studies, but the uncertainty introduced by these types
of biases is seldom quantified. The authors present a simple yet easy-to-use Stata
command to adjust the relative risk for exposure misclassification, selection bias, and an
unmeasured confounder. This command implements both deterministic and probabilistic
sensitivity analysis. It allows the user to specify a variety of probability distributions for
the bias parameters, which are used to simulate distributions for the bias-adjusted
exposure–disease relative risk. We illustrate the command by applying it to a case–
control study of occupational resin exposure and lung-cancer deaths. By using plausible
probability distributions for the bias parameters, investigators can report results that
incorporate their uncertainties regarding systematic errors and thus avoid overstating their
certainty about the effect under study. These results can supplement conventional results
and can help pinpoint major sources of conflict in study interpretations.
Next, click on the st0138 link to install the episens command.
Returning to the example, we will first see if the deterministic sensitivity analysis using episensi
gives the same result as PEPI.
Recall the 2 x 2 table was:
                        Cases (D = 1)    Controls (D = 0)
  Exposed (E = 1)            45                257
  Nonexposed (E = 0)         94                945
  Total                     139               1202
and we wanted to make the following misclassification bias adjustments for the exposure
variable:
    Controls:  Sensitivity = 0.80,  Specificity = 0.90
    Cases:     Sensitivity = 0.90,  Specificity = 0.90
Doing that in Stata, we use (use the /// to specify a continuation inside the do-file editor, omit that
and put everything on one line in the command window):
episensi 45 94 257 945 , st(cc) dsenc(c(.80)) dspnc(c(.90)) ///
	dseca(c(.90)) dspca(c(.90))
where:
st(cc) = study is case-control
dsenc( ) = sensitivity noncases
dspnc( ) = specificity noncases
dseca( ) = sensitivity cases
dspca( ) = specificity cases
the c(#) means a constant value (deterministic) rather than specifying a
probability distribution (probabilistic).
Stata outputs the result:
Se|Cases   :  Constant(.9)
Sp|Cases   :  Constant(.9)
Se|No-Cases:  Constant(.8)
Sp|No-Cases:  Constant(.9)

Observed Odds Ratio [95% Conf. Interval] = 1.76 [1.20, 2.58]

Deterministic sensitivity analysis for misclassification of the exposure

External adjusted Odds Ratio = 2.00
Percent bias = -12%
The “percent bias” comes from: (1.76 – 2.00)/2.00 × 100 = –12%.
Comparing this to the PEPI output produced above:
-----------------------------------------------------------------
Sensitivity of disease measure:   exposed 100%, nonexposed 100%
Specificity of disease measure:   exposed 100%, nonexposed 100%
Sensitivity of exposure measure:  diseased 90%, nondiseased 80%
Specificity of exposure measure:  diseased 90%, nondiseased 90%

                              Observed frequency   Adjusted frequency
Diseased and exposed                  45                  38.87
Diseased and not exposed              94                 100.13
Not diseased, exposed                257                 195.43
Not diseased, not exposed            945                1006.57

Observed odds ratio  = 1.760
Corrected odds ratio = 2.000

If the adjusted frequencies are rounded off:
  corrected OR = 2.014 (approx. 95% CI = 1.349 to 3.006)

enter Q to quit
-----------------------------------------------------------------
We see that we get the same result.
Exercise: Look at Takkouche’s simple approach for a sensitivity analysis (Takkouche et al,
2002).
Read the last two paragraphs of the Results section (page 856).
This allowed them to say in the second paragraph of the Discussion section on page 857,
“Thus, the inverse association we found is not easily ascribed to confounding,
misclassification, or selection bias due to loss to follow-up.”
Exercise: Look at Rauscher’s simple approach for a sensitivity analysis (Rauscher et al, 2000).
In the Discussion section, 2nd paragraph of 2nd column, they state,
“Methods of sensitivity analyses described by Greenland (42) were used to
explore the effect of potential control selection bias and differential recall
of cases and controls....Sensitivity analyses of potential control selection bias
revealed this explanation to be highly unlikely for the association observed
(data not shown).”
Exercise: For an example of a sensitivity analysis where the results are shown, look at the Korte
paper (Korte et al, 2002).
The approach is described in the section called “Simulations of misclassification” on
page 501, and the results are shown in Tables 5 and 6 on page 502.
References
Abramson JH, Gahlinger PM. (2001). Programs for Epidemiologists: PEPI Version 4.0. Salt
Lake City, UT, Sagebrush Press.
Begg CB, Greenes RA. (1983). Assessment of diagnostic tests when disease verification is
subject to selection bias. Biometrics 39:207-215.
Carpenter J, Bithell J. (2000). Bootstrap confidence intervals: when, which, what? A practical
guide for medical statisticians. Statist. Med. 19:1141-1164.
Greenland S. (1998). Chapter 19. Basic methods for sensitivity analysis and external adjustment.
In, Rothman KJ, Greenland S. Modern Epidemiology, 2nd ed. Philadelphia PA,
Lippincott-Raven Publishers, pp. 343-357.
Greenland S. (2008). Chapter 19. Bias analysis. In, Rothman KJ, Greenland S, Lash TL. Modern
Epidemiology, 3rd ed. Philadelphia PA, Lippincott Williams & Wilkins, pp. 345-380.
Korte JE, Brennan P, Henley SJ, Boffetta P. (2002). Dose-specific meta-analysis and sensitivity
analysis of the relation between alcohol consumption and lung cancer risk. Am J
Epidemiol 155(6):496-506.
Magder LS, Hughes JP. (1997). Logistic regression when the outcome is measured with
uncertainty. Am J Epidemiol 146(2):195-203.
Orsini N, Bellocco R, Bottai M, Wolk A, Greenland S. (2008). A tool for deterministic and
probabilistic sensitivity analysis of epidemiologic studies. The Stata Journal 8(1):29-48.
Pepe MS. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction,
New York, Oxford University Press, p.168-173.
Rauscher GH, Mayne ST, Janerich DT. (2000). Relation between body mass index and lung
cancer risk in men and women never and former smokers. Am J Epidemiol 152:506-13.
Takkouche B, Regueira-Mendez C, Garcia-Closas R, et al (2002). Intake of wine, beer, and
spirits and the risk of clinical common cold. Am J Epidemiol 155:853-8.