Supplementary Material Performance measures Performance of the HBC-Evo system is reported using following quality measures of accuracy (Acc), sensitivity (Sn), specificity (Sp), and F-score. These quality measures are defined as: Sn ( TP ) TP FN TN ) TN FP TP TN Acc ( ) TP TN FP FN Sp ( Prc ( TP ) TP FP Fscore 2( Prc * Sn ) ( Prc Sn) (S1) (S2) (S3) (S4) (S5) where, TP = true positives, TN = true negatives, FP = false positives, and FN = false negatives. Yule’s Q -statistic is used to measure the diversity of individual classifier. This diversity will enhance the prediction performance of the proposed ensemble. This statistic is used for two base classifiers C i and C j as Qi , j ad bc , where a and d represent the scores of ad bc Ci and C j classifiers for correct and incorrect predictions, respectively. Whereas, b shows the score when Ci classifier is correct and C j is incorrect; c is the score of C j classifier being correct and Ci incorrect. This measure is related to the distance measure that finds the normalized difference between the agreement and disagreement of the two classifiers. Q values lie in the range [-1, 1]. The two classifiers are more independent, if the value of Q is closer to zero. Positive value of Q indicates two classifiers tend to agree on same decisions. However, its negative value specifies two classifiers tend to commit errors on different objects. To generalize the pairwise diversity measures to an entire ensemble, we took average Q values of all pair of individual classifiers. For m base classifiers, the averaged Q over all pairs of classifiers is calculated as: m 1 m 2 Qavg (S6) Qi ,k m(m 1) i 1 k i 1 S1 Supplementary Table S1 Single letter codes and numerical values of Hydrophobicity (Hd) and Hydrophilicity (Hb) of twenty native amino acid molecules of protein 1. Alanine A Hd (Tanford 1962) 0.62 2. Cysteine C 0.29 -1 3. Aspartic acid D -0.9 3 4. Glutamic acid E -0.74 3.0 5. Phenylalanine F 1.19 -2.5 6. Glycine G 0.48 0 7. Histidine H -0.4 -0.5 8. Isoleucine I 1.38 -1.8 9. Lysine K -1.5 3.0 10. Leucine L 1.06 -1.8 11. Methionine M 0.64 -1.3 12. Asparagine N -0.78 0.2 13. Proline P 0.12 0 14. Glutamine Q -0.85 0.2 15. Arginine R -2.53 3.0 16. Serine S -0.18 0.3 17. Threonine T -0.05 -0.4 18. Valine V 1.08 -1.5 19. Tryptophan W 0.81 -3.4 20. Tyrosine Y 0.26 -2.3 Sr. No. Amino acid Single letter code Hb (Hopp and Woods 1981) -0.5 S2 60 Breast-cancer protein Cancer protein 50 40 Percentage difference 30 20 10 0 -10 -20 -30 -40 A C D E F G H I K L M N P Q R S T V W Y Variation of amino acids in cancerous proteins Supplementary Fig. S1 Variation of amino acid molecules in datasets related to cancer and breast cancer protein sequences with reference to non-cancer We have trained individually four diverse types of computational algorithms (NB, KNN, SVM, and RF) in various feature spaces of AAC (60 dimensions), SAAC (60 dimensions), PseAAC-S (60 dimensions) and PseAAC-P (60 dimensions). Detail information of NB, KNN, SVM, and RF algorithms is available in the literature of computational intelligence. Libsvm software is employed to implement SVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm) (Chang and Lin 2011). Supplementary Table S2 provides the summary of necessary parameters to generate optimal numerical expressions. Several GP simulations were performed to adjust the parameters using GPLAB3 toolbox. This toolbox can be found at http://gplab.sourceforge.net/download.html. At the end, these numerical expressions are used to predict healthy and cancerous tissues. S3 Supplementary Table S2 Summary of GP parameters setting for the HBC-Evo system GP Parameters Values ˆ X X X X , where Terminals/nonSet of predictions of based-level classifiers X 1 2 3 4 terminal set X 1 ,X 2 ,X 3 , and X 4 indicate the predicted values of four based-level classifiers. Parameters set, +, -,, , log, sin, cos, exp, power, constants, Fitness criterion Selection method Pop. size Gen. Pop. Init. Max. Tree depth Sampling Expected offspring Operators probabilities Survival criterion where constants are random numbers in the range of [0, 1] “AUC” “Generational” 80 180 and 140 for C/NC and B/NBC datasets, respectively “Ramped half and half” 30 “Tournament” "rank89” Automatically adjustable, variable crossover/mutation ratio Keep the best candidate solution Supplementary Table S3 Performance of the base-level classifiers in terms of AUC, accuracy (Acc), specificity (Sp), sensitivity (Sn), and F-score measures for C/NC and B/NBC datasets of amino acid sequences C/NC B/NBC AUC Acc Sp Sn F-score AUC Acc Sp Sn F-score AAC NB KNN SVM RF 96.59 98.02 98.29 99.33 90.76 93.11 96.99 96.76 82.06 93.29 95.50 96.78 99.40 92.97 98.50 96.78 91.50 93.10 97.10 96.78 97.98 98.20 98.30 99.01 93.82 92.60 94.01 96.45 88.96 93.60 91.31 97.30 98.70 91.66 96.69 95.60 94.14 92.52 94.18 96.42 SAAC NB KNN SVM RF 97.82 96.88 98.92 99.40 88.98 91.06 96.63 96.84 77.90 90.73 95.28 97.20 100.0 91.31 98.02 96.40 90.10 91.10 96.70 96.79 98.70 97.70 98.89 98.96 93.46 91.88 94.10 96.40 86.92 90.49 88.20 96.80 100.0 93.27 100.0 96.03 93.85 91.98 94.42 96.39 PseAAC-S NB KNN SVM RF 97.29 98.57 99.10 99.56 90.82 95.41 92.12 97.93 96.43 96.51 84.53 98.28 85.19 94.32 99.78 97.56 90.25 95.36 92.69 97.93 94.32 98.82 98.50 99.20 88.10 94.90 94.07 96.82 90.50 94.98 90.89 97.62 85.67 94.85 97.20 96.04 87.78 94.90 94.22 96.84 PseAAC-P Feature/Method NB KNN SVM RF 97.47 98.30 99.10 99.49 93.14 94.50 96.20 96.92 86.21 96.29 93.20 97.70 100.0 92.71 99.20 96.20 93.58 94.40 96.29 96.90 98.21 98.85 98.66 99.20 94.31 95.47 94.31 97.10 90.17 94.85 89.81 96.89 98.49 96.03 98.84 97.33 94.57 95.50 94.58 97.10 S4 Supplementary Table S4 Prediction performance of the proposed HBC-Evo in terms of specificity (Sp), sensitivity (Sn), and F-score measures for C/NC and B/NBC datasets of amino acid sequences Proposed HBC-Evo EnsAAC EnsSAAC EnsPseAAC-S EnsPseAAC-P C/NC B/NBC Sp Sn F-score Sp Sn F-score 98.60 98.60 98.74 97.33 99.10 99.20 99.31 99.41 98.85 98.89 99.01 98.39 98.73 97.65 99.35 98.20 96.91 97.74 97.45 98.39 97.80 97.69 98.36 98.30 Best GP individual complexity For C/NC dataset, Supplementary Fig.S1a shows the complexity of the best GP individual, in each generation, against the number of generations. This figure demonstrates the complexity of the best individual in various feature spaces. It is observed that after 18 generations, the best-fit individual, for PseAAC-S space, converges to the optimal point. Fig.S1b depicts the complexity of the best individual with the number of generations. It is observed that for SAAC space, the average tree depth is large. However, PseAAC-S has reveled average nodes and tree depth relatively small as compared to other feature spaces. S5 (a) (b) Supplementary Fig. S2 For cancer dataset, complexity of the best GP individual in each generation with respect to (a) fitness criterion (b) number of nodes and level of tree depth against the number of generations S6 ˆ) FPseAAC P ( X Supplementary Fig. S3 For cancer dataset, tree of the best individual of the HBC-Evo in PseAAC-P feature space Supplementary Reference Chang C-C, Lin C-J (2011) LIBSVM : a library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2 (27):1-27 Hopp TP, Woods KR (1981) Prediction of protein antigenic determinants from amino acid sequences. National Acad Sciences 78 (6):3824-3828 Tanford C (1962) Contribution of hydrophobic interactions to the stability of the globular conformation of proteins. Journal of the American Chemical Society 84 (22):4240-4247 S7