Re et al. Additional File 1:
Tests of independence of maximum-scoring reading frames
Several of the features used here rely on characteristics expected of coding and non-coding
sequences as a result of evolutionary constraints implied by the genetic code. Several such
measures are calculated for each possible reading frame of an alignment, and either the score of the
highest-scoring frame or a measure of the difference between the highest-scoring frame and the other
possible reading frames is used as an indicator of coding potential (see Materials and Methods).
Furthermore, some pairs of features share elements of the formulae used in their calculation. For
example, the Coding Potential Score (CPS) incorporates the ratio between synonymous and
non-synonymous substitution frequencies (Ns/Nns) and amino acid similarity scores (AAsim). It is
therefore expected that such measures should consistently identify the same maximum-scoring
reading frames in the CSTs analysed. We performed statistical independence tests of the frames in
which a subset of features (amino acid identity (AAid), amino acid similarity (AAsim), ratio of
synonymous to non-synonymous substitution frequencies (Ns/Nns-best), frequency of stop codons
(Stop-best) and CPS) produced the maximum scores for long and short CSTs. For each pair of
descriptors we constructed a 6 by 6 contingency table containing the number of the observed
maximum scoring frame co-occurrences for all the possible frames and tested the hypothesis:
H0: D(X|Y)=D(X)
using the Cochran-Mantel-Haenszel test to assess the independence of the maximum-scoring
frame of feature X given the observed maximum-scoring frame of feature Y. Results are reported
in Table S1.
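
As an illustration of the table construction described above, the following is a minimal sketch that counts maximum-scoring frame co-occurrences for one pair of features and computes a chi-squared statistic with df = (6 - 1) x (6 - 1) = 25. Note that it uses the plain Pearson chi-squared statistic as a stand-in, not the Cochran-Mantel-Haenszel test used for Table S1. The function names and the toy frame lists are hypothetical, and the 1% threshold is the value quoted with Table S1.

```python
from collections import Counter

FRAMES = range(6)              # the six possible reading frames
CRITICAL_1PCT_DF25 = 44.3141   # 1% threshold for df = 25, as quoted with Table S1


def contingency_table(frames_x, frames_y):
    """Count co-occurrences of the max-scoring frame of X (rows) and Y (columns)."""
    counts = Counter(zip(frames_x, frames_y))
    return [[counts[(i, j)] for j in FRAMES] for i in FRAMES]


def pearson_chi2(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table.

    This is a stand-in for the CMH test named in the text, not a reimplementation.
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return stat, df


# Toy data (hypothetical): 600 CSTs where both features always pick the same frame,
# and 720 CSTs where every frame combination occurs equally often.
same_x = [i % 6 for i in range(600)]
dep_stat, dep_df = pearson_chi2(contingency_table(same_x, same_x))

ind_x = [i % 6 for i in range(720)]
ind_y = [(i // 6) % 6 for i in range(720)]
ind_stat, ind_df = pearson_chi2(contingency_table(ind_x, ind_y))

dependent_at_1pct = dep_stat > CRITICAL_1PCT_DF25      # independence rejected
independent_at_1pct = ind_stat <= CRITICAL_1PCT_DF25   # independence not rejected
```

In this sketch a statistic above the threshold corresponds to rejecting H0, which is how the rows of Table S1 with P-values below 0.01 would be read.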
It is interesting to note that the maximum-scoring reading frames of some feature pairs are clearly
dependent for both coding and non-coding CSTs (e.g. CPS-best and Ns/Nns-best, CPS-best and
AAsim-best, AAsim-best and AAid-best) because of shared components of their formulae, whereas
other pairs of descriptors are statistically independent (with respect to the observed maximum-scoring
frame) for non-coding CSTs and show clear dependence only for coding CSTs (as in the case of the
Stop-best and AAid-best, Stop-best and Ns/Nns-best, and CPS-best and Stop-best feature pairs).
The SpectrAlign feature does not provide a frame-specific measure but detects inhomogeneities
of substitution frequencies occurring between all frames. The observed frame dependency
patterns do not depend on the length class (long or short) employed.
Comparisons of CSTminer with the current method:
To compare the performance of CSTminer with that of the method proposed here, we computed
the area under the ROC curve (AUC) from the coding potential (CP) predicted by CSTminer and by
the new method for all CSTs in the evaluation set. The comparison was performed separately for
long CSTs, short CSTs and all evaluation-set CSTs regardless of length. The statistical significance
of the differences observed between the AUC scores was then assessed using the non-parametric
Mann-Whitney test at a significance cutoff of 0.01.
Test set comparison    CSTminer AUC    Current AUC    Mann-Whitney test
----------------------------------------------------------------------
Long CSTs              0.933892        0.99269        2.65240e-22
Short CSTs             0.810800        0.95458        6.30707e-29
ALL                    0.871634        0.980134       2.08694e-61
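
For readers unfamiliar with the link between the AUC and the Mann-Whitney statistic, the following is a minimal sketch of how an AUC can be computed from predicted scores: the AUC equals the Mann-Whitney U statistic of the positive class divided by the number of positive-negative pairs. It only illustrates the AUC itself, not the significance comparison between the two methods reported above. The function name, the choice of coding CSTs as the positive class and the toy scores are assumptions made for illustration.

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the probability that a positive outranks a negative (ties count 1/2).

    Computed through the rank-sum form of the Mann-Whitney U statistic,
    using midranks for tied scores.
    """
    scored = sorted([(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores])
    rank_sum_pos = 0.0
    i = 0
    while i < len(scored):
        # Find the run of equal scores starting at position i.
        j = i
        while j < len(scored) and scored[j][0] == scored[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of the 1-based ranks i+1 .. j
        rank_sum_pos += midrank * sum(label for _, label in scored[i:j])
        i = j
    n_pos, n_neg = len(pos_scores), len(neg_scores)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


# Toy scores (hypothetical): predicted coding potential for coding (positive)
# and non-coding (negative) CSTs.
coding = [0.9, 0.8, 0.4]
non_coding = [0.7, 0.3, 0.2]
auc = auc_from_scores(coding, non_coding)  # 8 of the 9 pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the values in the table above are reported.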
Figure S1:
ROC curve comparisons for CSTminer and the current method using the test dataset: Solid lines show the performance of CSTminer and
dashed lines the performance of the new method.
Supplementary Tables
Table S1: Cochran-Mantel-Haenszel tests of independence of the maximum-scoring frame of feature X given the observed maximum-scoring
frame of feature Y. Chi-squared scores and associated P-values for independence of the maximum-scoring frame of any feature, given the observed
maximum-scoring frame of another feature, are reported.
Feature A     Given         Short Coding            Short Non-Coding
                            Chisq      Pval         Chisq      Pval
AAsim-best    AAid-best     1966.97    < 2.2e-16    1641.05    < 2.2e-16
Ns/Nns-best   AAid-best     2091.13    < 2.2e-16    2467.63    < 2.2e-16
Stop-best     AAid-best     374.65     < 2.2e-16    23.32      0.56
CPS-best      AAid-best     1971.59    < 2.2e-16    2313.54    < 2.2e-16
Ns/Nns-best   AAsim-best    1681.92    < 2.2e-16    1298.21    < 2.2e-16
Stop-best     AAsim-best    599.59     < 2.2e-16    257.15     < 2.2e-16
CPS-best      AAsim-best    2047.6     < 2.2e-16    2050.97    < 2.2e-16
Stop-best     Ns/Nns-best   343.75     < 2.2e-16    22.84      0.59
CPS-best      Ns/Nns-best   2692.38    < 2.2e-16    4946.33    < 2.2e-16
CPS-best      Stop-best     418.24     < 2.2e-16    31.15      0.18

Feature A     Given         Long Coding             Long Non-Coding
                            Chisq      Pval         Chisq      Pval
AAsim-best    AAid-best     4521.97    < 2.2e-16    934.65     < 2.2e-16
Ns/Nns-best   AAid-best     4427.17    < 2.2e-16    1476.21    < 2.2e-16
Stop-best     AAid-best     1780.36    < 2.2e-16    30.08      0.52
CPS-best      AAid-best     4589.72    < 2.2e-16    1441.88    < 2.2e-16
Ns/Nns-best   AAsim-best    4008.56    < 2.2e-16    742.93     < 2.2e-16
Stop-best     AAsim-best    2467.66    < 2.2e-16    232.29     < 2.2e-16
CPS-best      AAsim-best    4713.59    < 2.2e-16    1109.84    < 2.2e-16
Stop-best     Ns/Nns-best   1653.25    < 2.2e-16    20.51      0.49
CPS-best      Ns/Nns-best   5316.43    < 2.2e-16    3191.77    < 2.2e-16
CPS-best      Stop-best     1945.38    < 2.2e-16    27.39      0.21

In all cases df = 25, 1% significance threshold = 44.3141
Table S2: Area Under Curve (AUC) statistics for trained SVM and individual features using validation sets.
Feature            AUC: Long    AUC: Short
SVM                0.990921     0.954042
CPS-ratio          0.934761     0.835806
SpectrAlign        0.925164     0.829978
CPS-best           0.874976     0.823951
Ns/Nns-best        0.874533     0.810212
GC-probe           0.849544     0.756367
GC-target          0.848471     0.763691
Ns/Nns-ratio       0.8265       0.726098
AAsim-best         0.820721     0.816972
Codon-sim-ratio    0.763903     0.590796
AAID-best          0.756144     0.757125
AAsim-ratio        0.728704     0.621789
AAID-ratio         0.703765     0.603012
GFB-length         0.55648      0.615006
Tv/subs            0.542212     0.526844
GFB-ntID           0.528153     0.609075
Stop-delta         0.474269     0.402685
Stop-best          0.136263     0.335132
Table S3: Significance of SVM performance with respect to that of individual features for short and long validation tests using a non-parametric
test of ROC curve characteristics.
                             Long CST validation set       Short CST validation set
Method  vs Feature           SVM Lose  Tie  SVM Win        SVM Lose  Tie  SVM Win
SVM     SpectrAlign          0         0    1              0         0    1
SVM     GFB-best             0         0    1              0         0    1
SVM     GFB-delta            0         0    1              0         0    1
SVM     CPS-best             0         0    1              0         0    1
SVM     CPS-ratio            0         0    1              0         0    1
SVM     Codon-sim-ratio      0         0    1              0         0    1
SVM     AAsim-best           0         0    1              0         0    1
SVM     AAsim-delta          0         0    1              0         0    1
SVM     AAID-best            0         0    1              0         0    1
SVM     AAID-ratio           0         0    1              0         0    1
SVM     Ns/Nns-best          0         0    1              0         0    1
SVM     Ns/Nns-ratio         0         0    1              0         0    1
SVM     GFB-ntID             0         0    1              0         0    1
SVM     Tv/subs              0         0    1              0         0    1
SVM     GC-probe             0         0    1              0         0    1
SVM     GC-target            0         0    1              0         0    1
SVM     GFB-length           0         0    1              0         0    1
At the 1% significance level, the final SVM model significantly outperforms all of the individual component features
for both the long and short CST evaluation datasets according to the test proposed by DeLong et al. [16] and
implemented in the software STaR [17].
Table S4: Significance of difference of performance between pairs of individual features for short and long validation tests using a non-parametric
test of ROC curve characteristics.
                                  Long CSTs    Short CSTs
Feature1         Feature2         W  T  L      W  T  L
Stop-best        SpectrAlign      0  0  1      0  0  1
Stop-delta       SpectrAlign      0  0  1      0  0  1
Stop-delta       Stop-best        1  0  0      1  0  0
CPS-best         SpectrAlign      0  0  1      0  1  0
CPS-best         Stop-best        1  0  0      1  0  0
CPS-best         Stop-delta       1  0  0      1  0  0
CPS-ratio        SpectrAlign      0  1  0      0  1  0
CPS-ratio        Stop-best        1  0  0      1  0  0
CPS-ratio        Stop-delta       1  0  0      1  0  0
CPS-ratio        CPS-best         1  0  0      0  1  0
Codon-sim-ratio  SpectrAlign      0  0  1      0  0  1
Codon-sim-ratio  Stop-best        1  0  0      1  0  0
Codon-sim-ratio  Stop-delta       1  0  0      1  0  0
Codon-sim-ratio  CPS-best         0  0  1      0  0  1
Codon-sim-ratio  CPS-ratio        0  0  1      0  0  1
AAID-best        SpectrAlign      0  0  1      0  1  0
AAID-best        Stop-best        1  0  0      1  0  0
AAID-best        Stop-delta       1  0  0      1  0  0
AAID-best        CPS-best         0  0  1      0  1  0
AAID-best        CPS-ratio        0  0  1      0  1  0
AAID-best        Codon-sim-ratio  1  0  0      1  0  0
AAsim-ratio      SpectrAlign      0  0  1      0  0  1
AAsim-ratio      Stop-best        1  0  0      1  0  0
AAsim-ratio      Stop-delta       1  0  0      1  0  0
AAsim-ratio      CPS-best         0  0  1      0  0  1
AAsim-ratio      CPS-ratio        0  0  1      0  0  1
AAsim-ratio      Codon-sim-ratio  0  1  0      0  1  0
AAsim-ratio      AAID-best        0  0  1      0  0  1
AAsim-best       SpectrAlign      0  0  1      0  0  1
AAsim-best       Stop-best        1  0  0      1  0  0
AAsim-best       Stop-delta       1  0  0      1  0  0
AAsim-best       CPS-best         0  0  1      0  0  1
AAsim-best       CPS-ratio        0  0  1      0  0  1
AAsim-best       Codon-sim-ratio  0  1  0      1  0  0
AAsim-best       AAID-best        0  0  1      0  0  1
AAsim-best       AAsim-ratio      0  1  0      1  0  0
AAID-ratio       SpectrAlign      0  0  1      0  0  1
AAID-ratio       Stop-best        1  0  0      1  0  0
AAID-ratio       Stop-delta       1  0  0      1  0  0
AAID-ratio       CPS-best         0  0  1      0  0  1
AAID-ratio       CPS-ratio        0  0  1      0  0  1
AAID-ratio       Codon-sim-ratio  0  0  1      0  1  0
AAID-ratio       AAID-best        0  0  1      0  0  1
AAID-ratio       AAsim-ratio      0  0  1      0  0  1
AAID-ratio       AAsim-best       0  0  1      0  0  1
Ns/Nns-best      SpectrAlign      0  0  1      0  1  0
Ns/Nns-best      Stop-best        1  0  0      1  0  0
Ns/Nns-best      Stop-delta       1  0  0      1  0  0
Ns/Nns-best      CPS-best         0  1  0      0  0  1
Ns/Nns-best      CPS-ratio        0  0  1      0  0  1
Ns/Nns-best      Codon-sim-ratio  1  0  0      1  0  0
Ns/Nns-best      AAID-best        1  0  0      0  1  0
Ns/Nns-best      AAsim-ratio      1  0  0      1  0  0
Ns/Nns-best      AAsim-best       1  0  0      1  0  0
Ns/Nns-best      AAID-ratio       1  0  0      1  0  0
Ns/Nns-ratio     SpectrAlign      0  0  1      0  0  1
Ns/Nns-ratio     Stop-best        1  0  0      1  0  0
Ns/Nns-ratio     Stop-delta       1  0  0      1  0  0
Ns/Nns-ratio     CPS-best         0  0  1      0  0  1
Ns/Nns-ratio     CPS-ratio        0  0  1      0  0  1
Ns/Nns-ratio     Codon-sim-ratio  1  0  0      1  0  0
Ns/Nns-ratio     AAID-best        0  1  0      0  0  1
Ns/Nns-ratio     AAsim-ratio      1  0  0      1  0  0
Ns/Nns-ratio     AAsim-best       1  0  0      0  0  1
Ns/Nns-ratio     AAID-ratio       1  0  0      1  0  0
Ns/Nns-ratio     Ns/Nns-best      0  0  1      0  0  1
GFB-ntID         SpectrAlign      0  0  1      0  0  1
GFB-ntID         Stop-best        1  0  0      1  0  0
GFB-ntID         Stop-delta       0  1  0      1  0  0
GFB-ntID         CPS-best         0  0  1      0  0  1
GFB-ntID         CPS-ratio        0  0  1      0  0  1
GFB-ntID         Codon-sim-ratio  0  0  1      0  1  0
GFB-ntID         AAID-best        0  0  1      0  0  1
GFB-ntID         AAsim-ratio      0  0  1      0  1  0
GFB-ntID         AAsim-best       0  0  1      0  0  1
GFB-ntID         AAID-ratio       0  0  1      0  1  0
GFB-ntID         Ns/Nns-best      0  0  1      0  0  1
GFB-ntID         Ns/Nns-ratio     0  0  1      0  0  1
Tv/Subs          SpectrAlign      0  0  1      0  0  1
Tv/Subs          Stop-best        1  0  0      1  0  0
Tv/Subs          Stop-delta       1  0  0      1  0  0
Tv/Subs          CPS-best         0  0  1      0  0  1
Tv/Subs          CPS-ratio        0  0  1      0  0  1
Tv/Subs          Codon-sim-ratio  0  0  1      0  0  1
Tv/Subs          AAID-best        0  0  1      0  0  1
Tv/Subs          AAsim-ratio      0  0  1      0  0  1
Tv/Subs          AAsim-best       0  0  1      0  0  1
Tv/Subs          AAID-ratio       0  0  1      0  0  1
Tv/Subs          Ns/Nns-best      0  0  1      0  0  1
Tv/Subs          Ns/Nns-ratio     0  0  1      0  0  1
Tv/Subs          GFB-ntID         0  1  0      0  0  1
GC-probe         SpectrAlign      0  0  1      0  0  1
GC-probe         Stop-best        1  0  0      1  0  0
GC-probe         Stop-delta       1  0  0      1  0  0
GC-probe         CPS-best         0  1  0      0  0  1
GC-probe         CPS-ratio        0  0  1      0  0  1
GC-probe         Codon-sim-ratio  1  0  0      1  0  0
GC-probe         AAID-best        0  1  0      0  0  1
GC-probe         AAsim-ratio      1  0  0      1  0  0
GC-probe         AAsim-best       1  0  0      0  1  0
GC-probe         AAID-ratio       1  0  0      1  0  0
GC-probe         Ns/Nns-best      0  1  0      0  1  0
GC-probe         Ns/Nns-ratio     0  1  0      0  1  0
GC-probe         GFB-ntID         1  0  0      1  0  0
GC-probe         Tv/Subs          1  0  0      1  0  0
GC-target        SpectrAlign      0  0  1      0  0  1
GC-target        Stop-best        1  0  0      1  0  0
GC-target        Stop-delta       1  0  0      1  0  0
GC-target        CPS-best         0  1  0      0  0  1
GC-target        CPS-ratio        0  0  1      0  0  1
GC-target        Codon-sim-ratio  1  0  0      1  0  0
GC-target        AAID-best        0  1  0      0  0  1
GC-target        AAsim-ratio      1  0  0      1  0  0
GC-target        AAsim-best       1  0  0      0  1  0
GC-target        AAID-ratio       1  0  0      1  0  0
GC-target        Ns/Nns-best      0  1  0      0  0  1
GC-target        Ns/Nns-ratio     0  1  0      0  1  0
GC-target        GFB-ntID         1  0  0      1  0  0
GC-target        Tv/Subs          1  0  0      1  0  0
GC-target        GC-probe         0  1  0      0  1  0
GFB-length       SpectrAlign      0  0  1      0  0  1
GFB-length       Stop-best        1  0  0      1  0  0
GFB-length       Stop-delta       1  0  0      1  0  0
GFB-length       CPS-best         0  0  1      0  0  1
GFB-length       CPS-ratio        0  0  1      0  0  1
GFB-length       Codon-sim-ratio  0  0  1      0  1  0
GFB-length       AAID-best        0  0  1      0  0  1
GFB-length       AAsim-ratio      0  0  1      0  1  0
GFB-length       AAsim-best       0  0  1      0  0  1
GFB-length       AAID-ratio       0  0  1      0  1  0
GFB-length       Ns/Nns-best      0  0  1      0  0  1
GFB-length       Ns/Nns-ratio     0  0  1      0  0  1
GFB-length       GFB-ntID         0  1  0      0  1  0
GFB-length       Tv/Subs          0  1  0      1  0  0
GFB-length       GC-probe         0  0  1      0  0  1
GFB-length       GC-target        0  0  1      0  0  1