
Supplementary Material
Methods
Logistic Regression
Logistic regression (LogR) forms a best-fitting equation or function using the maximum
likelihood method, which maximizes the probability of classifying the observed data into the
appropriate category given the regression coefficients. The goal is to correctly predict the
category of outcome for individual cases using the most parsimonious model. To accomplish this,
a model is created that includes all predictor variables that are useful in predicting the
response variable.
Logistic regression generates the coefficients (and their standard errors and significance
levels) of a formula to predict a logit transformation of the probability of presence of the
characteristic of interest:

\mathrm{logit}(p) = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k
where p is the probability of presence of the characteristic of interest. The logit transformation is
defined as the logged odds:

\mathrm{odds} = \frac{p}{1 - p} = \frac{\text{probability of presence of characteristic}}{\text{probability of absence of characteristic}}

and

\mathrm{logit}(p) = \ln\left[\frac{p}{1 - p}\right]
Rather than choosing parameters that minimize the sum of squared errors (like in
ordinary regression), estimation in logistic regression chooses parameters that maximize the
likelihood of observing the sample values.
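For illustration only (not part of the original analysis), a minimal Python sketch of maximum likelihood estimation of such a model is given below; the data arrays X and y are hypothetical, and the use of the statsmodels library is an assumption made for the example.

```python
# Minimal sketch (not the authors' code): fit a logistic regression by
# maximum likelihood and inspect the estimated coefficients b0..bk.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(103, 3))              # hypothetical predictors (e.g. marker levels)
y = (rng.random(103) < 0.1).astype(int)    # hypothetical binary outcome (1 = characteristic present)

model = sm.Logit(y, sm.add_constant(X))    # logit(p) = b0 + b1*X1 + b2*X2 + b3*X3
fit = model.fit(disp=0)                    # coefficients chosen to maximize the likelihood
print(fit.params)                          # b0, b1, b2, b3
print(fit.bse)                             # their standard errors
```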
Model Validation
In logistic regression, two hypotheses are of interest: the null hypothesis, under which all
the coefficients in the regression equation take the value zero, and the alternative hypothesis
that the model with the predictors currently under consideration fits the data significantly better
than the null model.
The likelihood ratio test is based on the -2 log-likelihood (-2LL) statistic. It tests the difference
between -2LL for the full model with predictors and -2LL for the null model containing only a
constant, and thus measures the improvement in fit that the explanatory variables provide over
the null model.
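Continuing the illustrative sketch above (fit is the hypothetical fitted model from that example, not the authors' model), the likelihood ratio test could be computed as follows.

```python
# Likelihood ratio test sketch: (-2LL_null) - (-2LL_full) compared to a
# chi-square distribution with one degree of freedom per predictor.
from scipy import stats

lr_stat = 2 * (fit.llf - fit.llnull)            # difference in -2LL
p_value = stats.chi2.sf(lr_stat, fit.df_model)  # df = number of predictors
print(lr_stat, p_value)                         # statsmodels also exposes fit.llr, fit.llr_pvalue
```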
The Wald statistic, akin to the t-test in linear regression, is an alternative way of assessing
the contribution of each predictor to a model. It is given by:
\mathrm{Wald} = \left[\frac{B}{S.E.}\right]^2

where B is the regression coefficient and S.E. is the standard error of the regression
coefficient. Each Wald statistic is compared to a chi-square distribution with one degree of
freedom.
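As a further sketch using the same hypothetical fit, the Wald statistic for each coefficient can be obtained directly from the estimated coefficients and their standard errors.

```python
# Wald statistic sketch: (B / S.E.)^2 for every coefficient, each compared
# to a chi-square distribution with 1 degree of freedom.
from scipy import stats

wald = (fit.params / fit.bse) ** 2
wald_p = stats.chi2.sf(wald, df=1)
print(wald, wald_p)
```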
The Nagelkerke R square is defined as:

R^2_{\mathrm{Nagelkerke}} = \frac{R^2}{R^2_{\max}}, \quad \text{where } R^2 = 1 - \left(\frac{L(0)}{L(\hat{\theta})}\right)^{2/n} \text{ and } R^2_{\max} = 1 - \left(L(0)\right)^{2/n}

where L(0) is the likelihood of obtaining the observations if the independent variables had no
effect on the outcome, L(\hat{\theta}) is the likelihood of the model with a given set of parameter
estimates, and n is the sample size.
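A minimal sketch of this computation, reusing the hypothetical fit and y from the first example, is shown below; it works from the log-likelihoods reported by statsmodels.

```python
# Nagelkerke R-square sketch from the fitted and null log-likelihoods.
import numpy as np

n = len(y)
ll_full = fit.llf      # log L(theta_hat)
ll_null = fit.llnull   # log L(0)
r2 = 1 - np.exp(2 * (ll_null - ll_full) / n)   # 1 - (L(0)/L(theta_hat))^(2/n)
r2_max = 1 - np.exp(2 * ll_null / n)           # 1 - L(0)^(2/n)
print(r2 / r2_max)                             # Nagelkerke R-square
```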
The Hosmer-Lemeshow test [1] is a statistical test for goodness of fit for the logistic
regression model. The data are divided into approximately ten groups defined by increasing order
of estimated risk. The observed and expected number of cases in each group is calculated and a
chi-squared statistic is calculated as follows:

H = \sum_{g=1}^{n} \frac{(O_g - E_g)^2}{E_g (1 - E_g / n_g)}

with O_g, E_g and n_g the observed events, expected events and number of observations for
the gth risk decile group, and n the number of groups. The test statistic follows a chi-squared
distribution with n - 2 degrees of freedom. A large value of chi-squared (with small p-value <
0.05) indicates poor fit, and small chi-squared values (with larger p-value closer to 1) indicate a
good logistic regression model fit.
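The sketch below illustrates one way this statistic could be computed for the hypothetical fit from the earlier examples; the decile grouping via pandas is an assumption made for illustration.

```python
# Hosmer-Lemeshow sketch: group cases into ~10 deciles of predicted risk,
# then compare observed and expected events per group.
import pandas as pd
from scipy import stats

p_hat = fit.predict()                        # estimated probabilities
d = pd.DataFrame({"y": y, "p": p_hat})
d["decile"] = pd.qcut(d["p"], 10, labels=False, duplicates="drop")

grouped = d.groupby("decile")
observed = grouped["y"].sum()                # observed events O_g
expected = grouped["p"].sum()                # expected events E_g
n_g = grouped["y"].count()                   # group sizes n_g

H = (((observed - expected) ** 2) / (expected * (1 - expected / n_g))).sum()
n_groups = len(n_g)
print(H, stats.chi2.sf(H, n_groups - 2))     # chi-square with (groups - 2) df
```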
The receiver operating characteristic (ROC) curve is very useful for evaluating the
predictive accuracy of a chosen model in logistic regression [2]. The curve is obtained by
plotting sensitivity (true positive rate) against 1 - specificity (false positive rate). A perfect
classification, with a 100% true positive rate and a 0% false positive rate, would have an area
under the curve equal to 1.
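For the same hypothetical fit, the ROC curve and its area could be obtained with scikit-learn, as sketched below; the use of scikit-learn is an assumption, not the authors' procedure.

```python
# ROC sketch: sensitivity (TPR) against 1 - specificity (FPR), and the AUC.
from sklearn.metrics import roc_curve, roc_auc_score

p_hat = fit.predict()
fpr, tpr, thresholds = roc_curve(y, p_hat)
print(roc_auc_score(y, p_hat))   # an AUC of 1.0 would indicate perfect separation
```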
Model Classification
By taking the exponential of both sides of the regression equation, the equation can be
rewritten as:

\mathrm{odds} = \frac{p}{1 - p} = e^{b_0} \times e^{b_1 X_1} \times e^{b_2 X_2} \times e^{b_3 X_3} \times \dots \times e^{b_k X_k}

Once the logit is estimated, the probability that the outcome is skin or menstrual blood is
given by:

P = \frac{\exp(\ln(O))}{1 + \exp(\ln(O))}

where the odds O = \frac{p}{1 - p}.
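A short sketch of this back-transformation is given below; the predictor values for the new case are hypothetical and serve only to show the arithmetic.

```python
# Convert an estimated logit into a probability: P = exp(logit) / (1 + exp(logit)).
import numpy as np

def probability_from_logit(logit):
    odds = np.exp(logit)           # exp(b0) * exp(b1*X1) * ... * exp(bk*Xk)
    return odds / (1.0 + odds)

new_case = np.r_[1.0, 0.2, -0.5, 1.3]           # constant term plus three hypothetical predictor values
logit_new = float(np.dot(fit.params, new_case))
print(probability_from_logit(logit_new))
```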
Outliers and Influential Cases
The residual is defined as the difference between the observed response and the
predicted probability, and the standardized residual Z is the residual divided by its estimated
standard deviation:

\mathrm{Residual} = y - \text{Predicted Probability}

Z = \frac{\mathrm{Residual}}{\sqrt{\text{Predicted Probability} \times (1 - \text{Predicted Probability})}}

The Cook distance is defined as:

\mathrm{Cook's\;} D = \frac{h Z^2}{(1 - h)^2}

where Z is the standardized residual and h is the leverage value. The Cook distance measures
the influence of a case, i.e., the effect that deleting the case has on the model residuals.
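The sketch below illustrates these diagnostics for the hypothetical fit from the earlier examples; obtaining the leverage values through a statsmodels GLM refit and its get_influence() method is an assumption made for this example.

```python
# Standardized (Pearson) residuals and the Cook-type influence measure above.
import numpy as np
import statsmodels.api as sm

p_hat = fit.predict()
z = (y - p_hat) / np.sqrt(p_hat * (1 - p_hat))   # standardized residuals

# Leverage h is the diagonal of the hat matrix; the model is refit here as a
# binomial GLM because its results object exposes get_influence().
glm_fit = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial()).fit()
h = glm_fit.get_influence().hat_matrix_diag
cook_d = (h * z ** 2) / (1 - h) ** 2
print(cook_d.max())                              # most influential case
```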
References
1. Hosmer, D. W., Lemeshow, S., Applied Logistic Regression, John Wiley & Sons, Inc.,
Hoboken, NJ 2000.
2. Agresti, A., Categorical Data Analysis, John Wiley & Sons, Hoboken 2002.
Supplementary Table 1. Logistic Regression Model – Menstrual Blood

Menstrual Blood                         B          S.E.      Wald     df    Sig.
miR185-5p by miR144-5p by miR144-3p     -0.005     0.002     5.600    1     0.018
miR185-5p                               3.718      1.540     5.828    1     0.016
Constant                                -32.017    12.907    6.153    1     0.013

B = regression coefficient
S.E. = standard error of regression coefficient
df = degrees of freedom
Sig. = significance
Supplementary Table 2. Area under the ROC (Receiver Operating Characteristic) Curve for the MB Assay.

Menstrual Blood
Area     Std. Error   Asymptotic Sig.   Asymptotic 95% CI Lower Bound   Asymptotic 95% CI Upper Bound
0.994    0.006        0.000             0.982                           1.000
Supplementary Table 3. Case List – Menstrual Blood Model

Case   Selected Status   Observed y   Predicted   Predicted Group   Resid   ZResid
100    S                 1**          0.052       0                 0.948   4.285

Resid and ZResid are the temporary variables (residual and standardized residual) for the listed case.
Supplementary Table 4. Summary Statistics for the Menstrual Blood Model

Model Validation
  Model chi-square                        46.429 (p < 0.001)
  Nagelkerke R-square                     86.2%
  Hosmer-Lemeshow Test                    3.492 (p = 0.900)
Area Under Receiver Operating Characteristic Curve
  Estimated Area under the Curve          99.4%
  95 Percent Confidence Interval          98.2%-100%
  Optimal Threshold                       0.5
Classification: Training Set
  True Positive Rate                      87.5% (7/8)
  False Positive Rate                     0%
  False Negative Rate                     12.5% (1/8)
  Correct Classification Rate             99% (102/103)
Classification: Test Set
  True Positive Rate                      100%
  False Positive Rate                     0%
  False Negative Rate                     0%
Supplementary Figure Legends
Supplementary Figure 1.
A) Receiver Operating Characteristic (ROC) curve for the Menstrual Blood LogR model.
B) Histogram of estimated probabilities. The symbol of each case represents the group to
which the case actually belongs: green symbols represent menstrual blood cases and red
symbols represent non-menstrual blood cases. X-axis: LogR P value. Y-axis: number of
cases (samples).