Re et al. Additional File 1

Tests of independence of maximum-scoring reading frames

Several of the features used here rely on characteristics expected of coding and non-coding sequences as a result of evolutionary constraints implied by the genetic code. Several such measures are calculated for each possible reading frame of an alignment, and either the score of the highest-scoring frame or a measure of the difference between that score and the scores of the other possible reading frames is used as an indicator of coding potential (see Materials and Methods). Furthermore, some pairs of features share elements of the formulae used in their calculation. For example, the Coding Potential Score (CPS) incorporates the ratio between synonymous and nonsynonymous substitution frequencies (Ns/Nns) and amino acid similarity scores (AAsim). Such measures are therefore expected to consistently identify the same maximum-scoring reading frames in the CSTs analysed.

We performed statistical independence tests of the frames in which a subset of features (amino acid identity (AAid), amino acid similarity (AAsim), ratio of synonymous to nonsynonymous substitution frequencies (Ns/Nns-best), frequency of stop codons (Stop-best) and CPS) produced the maximum scores for long and short CSTs. For each pair of descriptors we constructed a 6 by 6 contingency table containing the number of observed maximum-scoring frame co-occurrences for all possible frames and tested the hypothesis

H0: D(X|Y) = D(X)

using the Cochran-Mantel-Haenszel test to assess the independence of the maximum-scoring frame of feature X given the observed maximum-scoring frame of feature Y. Results are reported in Table S1.

It is interesting to note that, while the dependence of maximum-scoring reading frames of some feature pairs is evident for both coding and non-coding CSTs owing to shared components of formulae (e.g. CPS-best and Ns/Nns-best, CPS-best and AAsim-best, AAsim-best and AAid-best), some pairs of descriptors are statistically independent (with respect to the observed maximum-scoring frame) for non-coding CSTs and show clear dependence only for coding CSTs (as in the case of the Stop-best and AAid-best, Stop-best and Ns/Nns-best, and CPS-best and Stop-best feature pairs). The SpectrAlign feature does not provide a frame-specific measure but detects inhomogeneities of substitution frequencies occurring between all frames. Other observed frame dependency patterns are independent of the length classes employed.

Comparisons of CSTminer with the current method

In order to compare the performance of CSTminer and the method proposed here, we computed the areas under the curve (AUC) using the predicted CP of all the CSTs belonging to the evaluation set, as produced by CSTminer and by the new method. The comparison was performed separately for long CSTs, short CSTs and all evaluation-set CSTs regardless of length. The statistical significance of the differences observed between the AUC scores was then assessed using the non-parametric Mann-Whitney test at a significance cutoff of 0.01.

Test set      CSTminer AUC   Current AUC   Mann-Whitney test
----------------------------------------------------------------------
Long CSTs     0.933892       0.99269       2.65240e-22
Short CSTs    0.810800       0.95458       6.30707e-29
ALL           0.871634       0.980134      2.08694e-61

Figure S1: ROC curve comparisons for CSTminer and the current method using the test dataset. Solid lines show the performance of CSTminer and dashed lines the performance of the new method.

Supplementary Tables

Table S1: Cochran-Mantel-Haenszel tests of independence of the maximum-scoring frame of feature X given the observed maximum-scoring frame of feature Y.
Chi-squared scores and associated P-values for independence of the maximum-scoring frame of any feature, given the observed maximum-scoring frame of another feature, are reported.

Feature A    given        Short Coding          Short Non-Coding      Long Coding           Long Non-Coding
                          Chisq     Pval        Chisq     Pval        Chisq     Pval        Chisq     Pval
------------------------------------------------------------------------------------------------------------------
AAsim-best   AAid-best    1966.97   < 2.2e-16   1641.05   < 2.2e-16   4521.97   < 2.2e-16   934.65    < 2.2e-16
Ns/Nns-best  AAid-best    2091.13   < 2.2e-16   2467.63   < 2.2e-16   4427.17   < 2.2e-16   1476.21   < 2.2e-16
Stop-best    AAid-best    374.65    < 2.2e-16   23.32     0.56        1780.36   < 2.2e-16   30.08     0.52
CPS-best     AAid-best    1971.59   < 2.2e-16   2313.54   < 2.2e-16   4589.72   < 2.2e-16   1441.88   < 2.2e-16
Ns/Nns-best  AAsim-best   1681.92   < 2.2e-16   1298.21   < 2.2e-16   4008.56   < 2.2e-16   742.93    < 2.2e-16
Stop-best    AAsim-best   599.59    < 2.2e-16   257.15    < 2.2e-16   2467.66   < 2.2e-16   232.29    < 2.2e-16
CPS-best     AAsim-best   2047.6    < 2.2e-16   2050.97   < 2.2e-16   4713.59   < 2.2e-16   1109.84   < 2.2e-16
Stop-best    Ns/Nns-best  343.75    < 2.2e-16   22.84     0.59        1653.25   < 2.2e-16   20.51     0.49
CPS-best     Ns/Nns-best  2692.38   < 2.2e-16   4946.33   < 2.2e-16   5316.43   < 2.2e-16   3191.77   < 2.2e-16
CPS-best     Stop-best    418.24    < 2.2e-16   31.15     0.18        1945.38   < 2.2e-16   27.39     0.21

In all cases df = 25, 1% significance threshold = 44.3141.

Table S2: Area Under Curve (AUC) statistics for trained SVM and individual features using validation sets.
Feature           AUC: Long   AUC: Short
----------------------------------------
SVM               0.990921    0.954042
CPS-ratio         0.934761    0.835806
SpectrAlign       0.925164    0.829978
CPS-best          0.874976    0.823951
Ns/Nns-best       0.874533    0.810212
GC-probe          0.849544    0.756367
GC-target         0.848471    0.763691
Ns/Nns-ratio      0.8265      0.726098
AAsim-best        0.820721    0.816972
Codon-sim-ratio   0.763903    0.590796
AAID-best         0.756144    0.757125
AAsim-ratio       0.728704    0.621789
AAID-ratio        0.703765    0.603012
GFB-length        0.55648     0.615006
Tv/subs           0.542212    0.526844
GFB-ntID          0.528153    0.609075
Stop-delta        0.474269    0.402685
Stop-best         0.136263    0.335132

Table S3: Significance of SVM performance with respect to that of individual features for short and long validation tests using a non-parametric test of ROC curve characteristics.

Method   vs Feature         Long CST validation set      Short CST validation set
                            SVM Win   Tie   SVM Lose     SVM Win   Tie   SVM Lose
---------------------------------------------------------------------------------
SVM      SpectrAlign        1         0     0            1         0     0
SVM      GFB-best           1         0     0            1         0     0
SVM      GFB-delta          1         0     0            1         0     0
SVM      CPS-best           1         0     0            1         0     0
SVM      CPS-ratio          1         0     0            1         0     0
SVM      Codon-sim-ratio    1         0     0            1         0     0
SVM      AAsim-best         1         0     0            1         0     0
SVM      AAsim-delta        1         0     0            1         0     0
SVM      AAID-best          1         0     0            1         0     0
SVM      AAID-ratio         1         0     0            1         0     0
SVM      Ns/Nns-best        1         0     0            1         0     0
SVM      Ns/Nns-ratio       1         0     0            1         0     0
SVM      GFB-ntID           1         0     0            1         0     0
SVM      Tv/subs            1         0     0            1         0     0
SVM      GC-probe           1         0     0            1         0     0
SVM      GC-target          1         0     0            1         0     0
SVM      GFB-length         1         0     0            1         0     0

At the 1% significance level, the final SVM model significantly outperforms all of the individual component features for both the LONG and SHORT CST evaluation datasets according to the test proposed by DeLong et al. [16] and implemented in the software STaR [17].

Table S4: Significance of difference of performance between pairs of individual features for short and long validation tests using a non-parametric test of ROC curve characteristics.
Feature1         Feature2         Long CSTs    Short CSTs
                                  W  T  L      W  T  L
---------------------------------------------------------
Stop-best        SpectrAlign      0  0  1      0  0  1
Stop-delta       SpectrAlign      0  0  1      0  0  1
Stop-delta       Stop-best        1  0  0      1  0  0
CPS-best         SpectrAlign      0  0  1      0  1  0
CPS-best         Stop-best        1  0  0      1  0  0
CPS-best         Stop-delta       1  0  0      1  0  0
CPS-ratio        SpectrAlign      0  1  0      0  1  0
CPS-ratio        Stop-best        1  0  0      1  0  0
CPS-ratio        Stop-delta       1  0  0      1  0  0
CPS-ratio        CPS-best         1  0  0      0  1  0
Codon-sim-ratio  SpectrAlign      0  0  1      0  0  1
Codon-sim-ratio  Stop-best        1  0  0      1  0  0
Codon-sim-ratio  Stop-delta       1  0  0      1  0  0
Codon-sim-ratio  CPS-best         0  0  1      0  0  1
Codon-sim-ratio  CPS-ratio        0  0  1      0  0  1
AAID-best        SpectrAlign      0  0  1      0  1  0
AAID-best        Stop-best        1  0  0      1  0  0
AAID-best        Stop-delta       1  0  0      1  0  0
AAID-best        CPS-best         0  0  1      0  1  0
AAID-best        CPS-ratio        0  0  1      0  1  0
AAID-best        Codon-sim-ratio  1  0  0      1  0  0
AAsim-ratio      SpectrAlign      0  0  1      0  0  1
AAsim-ratio      Stop-best        1  0  0      1  0  0
AAsim-ratio      Stop-delta       1  0  0      1  0  0
AAsim-ratio      CPS-best         0  0  1      0  0  1
AAsim-ratio      CPS-ratio        0  0  1      0  0  1
AAsim-ratio      Codon-sim-ratio  0  1  0      0  1  0
AAsim-ratio      AAID-best        0  0  1      0  0  1
AAsim-best       SpectrAlign      0  0  1      0  0  1
AAsim-best       Stop-best        1  0  0      1  0  0
AAsim-best       Stop-delta       1  0  0      1  0  0
AAsim-best       CPS-best         0  0  1      0  0  1
AAsim-best       CPS-ratio        0  0  1      0  0  1
AAsim-best       Codon-sim-ratio  0  1  0      1  0  0
AAsim-best       AAID-best        0  0  1      0  0  1
AAsim-best       AAsim-ratio      0  1  0      1  0  0
AAID-ratio       SpectrAlign      0  0  1      0  0  1
AAID-ratio       Stop-best        1  0  0      1  0  0
AAID-ratio       Stop-delta       1  0  0      1  0  0
AAID-ratio       CPS-best         0  0  1      0  0  1
AAID-ratio       CPS-ratio        0  0  1      0  0  1
AAID-ratio       Codon-sim-ratio  0  0  1      0  1  0
AAID-ratio       AAID-best        0  0  1      0  0  1
AAID-ratio       AAsim-ratio      0  0  1      0  0  1
AAID-ratio       AAsim-best       0  0  1      0  0  1
Ns/Nns-best      SpectrAlign      0  0  1      0  1  0
Ns/Nns-best      Stop-best        1  0  0      1  0  0
Ns/Nns-best      Stop-delta       1  0  0      1  0  0
Ns/Nns-best      CPS-best         0  1  0      0  0  1
Ns/Nns-best      CPS-ratio        0  0  1      0  0  1
Ns/Nns-best      Codon-sim-ratio  1  0  0      1  0  0
Ns/Nns-best      AAID-best        1  0  0      0  1  0
Ns/Nns-best      AAsim-ratio      1  0  0      1  0  0
Ns/Nns-best      AAsim-best       1  0  0      1  0  0
Ns/Nns-best      AAID-ratio       1  0  0      1  0  0
Ns/Nns-ratio     SpectrAlign      0  0  1      0  0  1
Ns/Nns-ratio     Stop-best        1  0  0      1  0  0
Ns/Nns-ratio     Stop-delta       1  0  0      1  0  0
Ns/Nns-ratio     CPS-best         0  0  1      0  0  1
Ns/Nns-ratio     CPS-ratio        0  0  1      0  0  1
Ns/Nns-ratio     Codon-sim-ratio  1  0  0      1  0  0
Ns/Nns-ratio     AAID-best        0  1  0      0  0  1
Ns/Nns-ratio     AAsim-ratio      1  0  0      1  0  0
Ns/Nns-ratio     AAsim-best       1  0  0      0  0  1
Ns/Nns-ratio     AAID-ratio       1  0  0      1  0  0
Ns/Nns-ratio     Ns/Nns-best      0  0  1      0  0  1
GFB-ntID         SpectrAlign      0  0  1      0  0  1
GFB-ntID         Stop-best        1  0  0      1  0  0
GFB-ntID         Stop-delta       0  1  0      1  0  0
GFB-ntID         CPS-best         0  0  1      0  0  1
GFB-ntID         CPS-ratio        0  0  1      0  0  1
GFB-ntID         Codon-sim-ratio  0  0  1      0  1  0
GFB-ntID         AAID-best        0  0  1      0  0  1
GFB-ntID         AAsim-ratio      0  0  1      0  1  0
GFB-ntID         AAsim-best       0  0  1      0  0  1
GFB-ntID         AAID-ratio       0  0  1      0  1  0
GFB-ntID         Ns/Nns-best      0  0  1      0  0  1
GFB-ntID         Ns/Nns-ratio     0  0  1      0  0  1
Tv/Subs          SpectrAlign      0  0  1      0  0  1
Tv/Subs          Stop-best        1  0  0      1  0  0
Tv/Subs          Stop-delta       1  0  0      1  0  0
Tv/Subs          CPS-best         0  0  1      0  0  1
Tv/Subs          CPS-ratio        0  0  1      0  0  1
Tv/Subs          Codon-sim-ratio  0  0  1      0  0  1
Tv/Subs          AAID-best        0  0  1      0  0  1
Tv/Subs          AAsim-ratio      0  0  1      0  0  1
Tv/Subs          AAsim-best       0  0  1      0  0  1
Tv/Subs          AAID-ratio       0  0  1      0  0  1
Tv/Subs          Ns/Nns-best      0  0  1      0  0  1
Tv/Subs          Ns/Nns-ratio     0  0  1      0  0  1
Tv/Subs          GFB-ntID         0  1  0      0  0  1
GC-probe         SpectrAlign      0  0  1      0  0  1
GC-probe         Stop-best        1  0  0      1  0  0
GC-probe         Stop-delta       1  0  0      1  0  0
GC-probe         CPS-best         0  1  0      0  0  1
GC-probe         CPS-ratio        0  0  1      0  0  1
GC-probe         Codon-sim-ratio  1  0  0      1  0  0
GC-probe         AAID-best        0  1  0      0  0  1
GC-probe         AAsim-ratio      1  0  0      1  0  0
GC-probe         AAsim-best       1  0  0      0  1  0
GC-probe         AAID-ratio       1  0  0      1  0  0
GC-probe         Ns/Nns-best      0  1  0      0  1  0
GC-probe         Ns/Nns-ratio     0  1  0      0  1  0
GC-probe         GFB-ntID         1  0  0      1  0  0
GC-probe         Tv/Subs          1  0  0      1  0  0
GC-target        SpectrAlign      0  0  1      0  0  1
GC-target        Stop-best        1  0  0      1  0  0
GC-target        Stop-delta       1  0  0      1  0  0
GC-target        CPS-best         0  1  0      0  0  1
GC-target        CPS-ratio        0  0  1      0  0  1
GC-target        Codon-sim-ratio  1  0  0      1  0  0
GC-target        AAID-best        0  1  0      0  0  1
GC-target        AAsim-ratio      1  0  0      1  0  0
GC-target        AAsim-best       1  0  0      0  1  0
GC-target        AAID-ratio       1  0  0      1  0  0
GC-target        Ns/Nns-best      0  1  0      0  0  1
GC-target        Ns/Nns-ratio     0  1  0      0  1  0
GC-target        GFB-ntID         1  0  0      1  0  0
GC-target        Tv/Subs          1  0  0      1  0  0
GC-target        GC-probe         0  1  0      0  1  0
GFB-length       SpectrAlign      0  0  1      0  0  1
GFB-length       Stop-best        1  0  0      1  0  0
GFB-length       Stop-delta       1  0  0      1  0  0
GFB-length       CPS-best         0  0  1      0  0  1
GFB-length       CPS-ratio        0  0  1      0  0  1
GFB-length       Codon-sim-ratio  0  0  1      0  1  0
GFB-length       AAID-best        0  0  1      0  0  1
GFB-length       AAsim-ratio      0  0  1      0  1  0
GFB-length       AAsim-best       0  0  1      0  0  1
GFB-length       AAID-ratio       0  0  1      0  1  0
GFB-length       Ns/Nns-best      0  0  1      0  0  1
GFB-length       Ns/Nns-ratio     0  0  1      0  0  1
GFB-length       GFB-ntID         0  1  0      0  1  0
GFB-length       Tv/Subs          0  1  0      1  0  0
GFB-length       GC-probe         0  0  1      0  0  1
GFB-length       GC-target        0  0  1      0  0  1

W = win, T = tie, L = loss for Feature1 against Feature2.
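
The frame co-occurrence test summarised in Table S1 can be illustrated with a minimal sketch. It builds the 6 by 6 contingency table of maximum-scoring frames for two features and compares a chi-squared statistic with the 1% threshold for df = 25 quoted with Table S1. Note the assumptions: the plain Pearson chi-squared statistic is used here as a simple stand-in for the Cochran-Mantel-Haenszel test reported in the text, the frame labels 0 to 5 and the function names are hypothetical, and the toy inputs are invented for illustration rather than taken from the study.

```python
from collections import Counter

N_FRAMES = 6                  # six possible reading frames, labelled 0..5 (labelling is an assumption)
CRITICAL_1PCT_DF25 = 44.3141  # 1% threshold for df = 25, as quoted with Table S1


def contingency_table(frames_x, frames_y):
    """Count co-occurrences of the maximum-scoring frames of two features."""
    counts = Counter(zip(frames_x, frames_y))
    return [[counts[(i, j)] for j in range(N_FRAMES)] for i in range(N_FRAMES)]


def pearson_chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table.

    Stand-in for the Cochran-Mantel-Haenszel test used in the text.
    Assumes every row and column total is non-zero, so that df = (r-1)(c-1).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


# Toy inputs (invented for illustration, not data from the study).
# Case 1: both features always pick the same frame.
agree_x = [f for f in range(N_FRAMES) for _ in range(100)]
agree_y = list(agree_x)
# Case 2: every frame pair occurs equally often, so the frames are unrelated.
indep_x = [i for i in range(N_FRAMES) for j in range(N_FRAMES) for _ in range(10)]
indep_y = [j for i in range(N_FRAMES) for j in range(N_FRAMES) for _ in range(10)]

for label, xs, ys in [("identical frames", agree_x, agree_y),
                      ("balanced frames", indep_x, indep_y)]:
    stat, df = pearson_chi_squared(contingency_table(xs, ys))
    verdict = "dependent" if stat > CRITICAL_1PCT_DF25 else "independent"
    print(f"{label}: chi-squared = {stat:.2f}, df = {df}, {verdict} at 1%")
```

With six frames per feature the table has (6 - 1) x (6 - 1) = 25 degrees of freedom, which is why a single threshold applies to every feature pair in Table S1.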