A semi-supervised learning approach to predict synthetic genetic interactions by combining functional and topological properties of functional gene network

Zhu-Hong You 1,2,3, Zheng Yin 3, Kyungsook Han 4, De-Shuang Huang 1§ and Xiaobo Zhou 3§

1 Intelligent Computing Lab, Institute of Intelligent Machine, Chinese Academy of Science, P.O. Box 1130, Hefei, Anhui 230031, China
2 Department of Automation, University of Science and Technology of China, Hefei, Anhui 230027, China
3 The Methodist Hospital Research Institute, Weill Medical College, Cornell University, Houston, TX 77030, USA
4 School of Computer Science and Engineering, Inha University, Incheon, South Korea

Part One: Brief descriptions of the SVM classifier

The SVM problem can be solved using quadratic programming techniques, with an optimization algorithm in which the working set selection is based on steepest feasible descent. SVM has several useful properties, including the ability to handle large feature spaces and effective avoidance of overfitting. Specifically, the quadratic programming problem can be formulated in its dual form as:

$$\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{N} \alpha_i y_i = 0$$

where $x_i$ denotes an input vector and $y_i = +1$ or $-1$ indicates whether $x_i$ belongs to the $+1$ or the $-1$ class, e.g. the synthetic genetic interaction class or the non-interaction class in our case. $N$ is the number of training samples. $C$ is a regularization parameter that controls the trade-off between margin and classification error. $K$ represents the kernel function. $\alpha = (\alpha_1, \dots, \alpha_N)$ is the solution of the dual formulation. An unlabeled input vector $x$ can be classified by the discriminant function below, where $b$ is the bias term:

$$f(x) = \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b$$

The input vector $x$ is assigned to the $+1$ class (the synthetic genetic interaction class in our case) if $f(x)$ is positive, and to the $-1$ class (the non-interaction class) otherwise.

Parameter settings: Choosing a correct kernel is no free lunch, and research on optimizing kernel design is ongoing. The kernel function can be linear or non-linear (Gaussian). The linear kernel function reduces to a linear equation on the original attributes in the training data.
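As a minimal sketch of the discriminant function and the two kernel types described above, the following pure-Python example evaluates the sign of the kernel expansion for a new input. The function names (`linear_kernel`, `rbf_kernel`, `discriminant`, `classify`) and all numeric values (support vectors, multipliers, bias, `gamma`) are illustrative assumptions, not values from this work; in practice the multipliers come from solving the quadratic programming problem.

```python
import math


def linear_kernel(u, v):
    """Linear kernel: the inner product of the two attribute vectors."""
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def discriminant(x, support_vectors, labels, alphas, bias, kernel):
    """Kernel expansion: sum_i alpha_i * y_i * K(x_i, x) + bias."""
    return sum(
        alpha * y * kernel(x_i, x)
        for x_i, y, alpha in zip(support_vectors, labels, alphas)
    ) + bias


def classify(x, support_vectors, labels, alphas, bias, kernel):
    """Return +1 if the discriminant is positive, otherwise -1."""
    value = discriminant(x, support_vectors, labels, alphas, bias, kernel)
    return 1 if value > 0 else -1


# Hypothetical toy model: one support vector per class (assumed values).
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]          # -1: non-interaction, +1: interaction
alphas = [1.0, 1.0]       # satisfies sum(alpha_i * y_i) == 0
bias = 0.0

near_positive = classify((1.8, 2.1), support_vectors, labels, alphas, bias, rbf_kernel)
near_negative = classify((0.1, -0.2), support_vectors, labels, alphas, bias, rbf_kernel)
print(near_positive, near_negative)
```

Passing `linear_kernel` instead of `rbf_kernel` gives the linear case, in which the discriminant is a linear function of the original attributes.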
Based on our experience, the linear kernel works well when there are many attributes (more than 100) in the training data; otherwise the Gaussian (RBF) kernel is used. The Gaussian (RBF) kernel non-linearly maps samples into a higher-dimensional space and, unlike the linear kernel, can handle the case in which the relation between class labels and attributes is non-linear. Indeed, most researchers suggest that the RBF kernel is in general a reasonable first choice (REF: A Practical Guide to Support Vector Classification. Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin). Furthermore, the linear kernel can be regarded as a special case of the RBF kernel, since a linear kernel with a given penalty parameter performs the same as the RBF kernel with some parameters (C, Gamma) (REF: Keerthi SS, Lin CJ: Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Comput 2003, 15(7):1667-1689). In our case, prior knowledge of the problem at hand guides us to choose the Gaussian (RBF) kernel.

SVMs with the Gaussian (RBF) kernel have been popular in practical use. Model selection in this class of SVM involves two hyperparameters: the penalty parameter C and the kernel width Gamma. If complete model selection using the Gaussian kernel has been conducted, there is no need to consider the linear SVM.

A grid search method can be used to try values of each parameter across the specified search range using geometric steps. Grid searches are computationally expensive because the model must be evaluated at many points within the grid for each parameter. For example, if a grid search is used with 10 search intervals and an RBF kernel function with two parameters (C and Gamma), then the model must be evaluated at 10*10 = 100 grid points. The grid search locates a region near the global optimum within the searched range. In our case, its computational cost is affordable. Using the linear kernel would also require a search over the parameter C, and with full model selection the RBF kernel performs at least as well as the linear kernel. In the current work, we consider the SVM with the RBF kernel sufficient for a baseline comparison.
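The grid search described above can be sketched as follows. This is a minimal illustration, not the procedure actually used in this work: the exponent ranges for C and Gamma are assumptions (they follow the ranges commonly suggested in the practical guide cited above), and `toy_score` is a hypothetical stand-in for a cross-validated accuracy that would normally be obtained by training an SVM at each grid point.

```python
import math


def grid_search(score_fn, c_values, gamma_values):
    """Evaluate score_fn at every (C, Gamma) pair and keep the best one."""
    best_score, best_c, best_gamma = None, None, None
    evaluations = 0
    for c in c_values:
        for gamma in gamma_values:
            score = score_fn(c, gamma)
            evaluations += 1
            if best_score is None or score > best_score:
                best_score, best_c, best_gamma = score, c, gamma
    return best_c, best_gamma, best_score, evaluations


def toy_score(c, gamma):
    """Hypothetical score surface peaking at C = 2^3, Gamma = 2^-3.

    A real run would return cross-validated accuracy of an RBF SVM here.
    """
    return -((math.log2(c) - 3) ** 2 + (math.log2(gamma) + 3) ** 2)


# Geometric steps: each value is 4 times the previous one (assumed ranges).
c_grid = [2.0 ** e for e in range(-5, 14, 2)]       # 10 values, 2^-5 .. 2^13
gamma_grid = [2.0 ** e for e in range(-15, 4, 2)]   # 10 values, 2^-15 .. 2^3

best_c, best_gamma, best_score, evaluations = grid_search(
    toy_score, c_grid, gamma_grid
)
print(best_c, best_gamma, evaluations)
```

With 10 values per parameter the score function is called 10*10 = 100 times, which is why the cost grows quickly as parameters or grid resolution are added. A common refinement is to rerun a finer grid around the best coarse point.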
Part Two: Figures show the probability density distribution of different network properties across synthetic genetic interactions and non-interaction gene pairs.

Supplementary Figure S1
The figure shows the probability density distribution of the average of degree centrality across a pair of genes in case of the synthetic genetic interaction pairs (blue line) and non-synthetic genetic interaction pairs (red dashed line) in the binary network (protein interaction network). Numbers in each plot indicate the D-statistic associated with the Kolmogorov-Smirnov test for the difference between the two distributions and the corresponding P-value.

Supplementary Figure S2
The figure shows the probability density distribution of the absolute difference of degree centrality across a pair of genes in case of the synthetic genetic interaction pairs (blue line) and non-synthetic genetic interaction pairs (red dashed line) in the binary network (protein interaction network). Numbers in each plot indicate the D-statistic associated with the Kolmogorov-Smirnov test for the difference between the two distributions and the corresponding P-value.

Supplementary Figure S3
The figure shows the probability density distribution of the average of closeness centrality across a pair of genes in case of the synthetic genetic interaction pairs (blue line) and non-synthetic genetic interaction pairs (red dashed line) in the binary network (protein interaction network). Numbers in each plot indicate the D-statistic associated with the Kolmogorov-Smirnov test for the difference between the two distributions and the corresponding P-value.

Supplementary Figure S4
The figure shows the probability density distribution of the absolute difference of closeness centrality across a pair of genes in case of the synthetic genetic interaction pairs (blue line) and non-synthetic genetic interaction pairs (red dashed line) in the binary network (protein interaction network).
Numbers in each plot indicate the D-statistic associated with the Kolmogorov-Smirnov test for the difference between the two distributions and the corresponding P-value.

Supplementary Figure S5
The figure shows the probability density distribution of the average of betweenness centrality across a pair of genes in case of the synthetic genetic interaction pairs (blue line) and non-synthetic genetic interaction pairs (red dashed line) in the binary network (protein interaction network). Numbers in each plot indicate the D-statistic associated with the Kolmogorov-Smirnov test for the difference between the two distributions and the corresponding P-value.

Supplementary Figure S6
The figure shows the probability density distribution of the absolute difference of betweenness centrality across a pair of genes in case of the synthetic genetic interaction pairs (blue line) and non-synthetic genetic interaction pairs (red dashed line) in the binary network (protein interaction network). Numbers in each plot indicate the D-statistic associated with the Kolmogorov-Smirnov test for the difference between the two distributions and the corresponding P-value.

Supplementary Figure S7
The figure shows the probability density distribution of the average of clustering coefficient across a pair of genes in case of the synthetic genetic interaction pairs (blue line) and non-synthetic genetic interaction pairs (red dashed line) in the binary network (protein interaction network). Numbers in each plot indicate the D-statistic associated with the Kolmogorov-Smirnov test for the difference between the two distributions and the corresponding P-value.

Supplementary Figure S8
The figure shows the probability density distribution of the absolute difference of clustering coefficient across a pair of genes in case of the synthetic genetic interaction pairs (blue line) and non-synthetic genetic interaction pairs (red dashed line) in the binary network (protein interaction network).
Numbers in each plot indicate the D-statistic associated with the Kolmogorov-Smirnov test for the difference between the two distributions and the corresponding P-value.

Supplementary Figure S9
The figure shows the probability density distribution of the average of degree centrality across a pair of genes in case of the synthetic genetic interaction pairs (blue line) and non-synthetic genetic interaction pairs (red dashed line) in the weighted network (functional gene network). Numbers in each plot indicate the D-statistic associated with the Kolmogorov-Smirnov test for the difference between the two distributions and the corresponding P-value.

Supplementary Figure S10
The figure shows the probability density distribution of the absolute difference of degree centrality across a pair of genes in case of the synthetic genetic interaction pairs (blue line) and non-synthetic genetic interaction pairs (red dashed line) in the weighted network (functional gene network). Numbers in each plot indicate the D-statistic associated with the Kolmogorov-Smirnov test for the difference between the two distributions and the corresponding P-value.

Supplementary Figure S11
The figure shows the probability density distribution of the average of closeness centrality across a pair of genes in case of the synthetic genetic interaction pairs (blue line) and non-synthetic genetic interaction pairs (red dashed line) in the weighted network (functional gene network). Numbers in each plot indicate the D-statistic associated with the Kolmogorov-Smirnov test for the difference between the two distributions and the corresponding P-value.

Supplementary Figure S12
The figure shows the probability density distribution of the absolute difference of closeness centrality across a pair of genes in case of the synthetic genetic interaction pairs (blue line) and non-synthetic genetic interaction pairs (red dashed line) in the weighted network (functional gene network).
Numbers in each plot indicate the D-statistic associated with the Kolmogorov-Smirnov test for the difference between the two distributions and the corresponding P-value.

Supplementary Figure S13
The figure shows the probability density distribution of the average of betweenness centrality across a pair of genes in case of the synthetic genetic interaction pairs (blue line) and non-synthetic genetic interaction pairs (red dashed line) in the weighted network (functional gene network). Numbers in each plot indicate the D-statistic associated with the Kolmogorov-Smirnov test for the difference between the two distributions and the corresponding P-value.

Supplementary Figure S14
The figure shows the probability density distribution of the absolute difference of betweenness centrality across a pair of genes in case of the synthetic genetic interaction pairs (blue line) and non-synthetic genetic interaction pairs (red dashed line) in the weighted network (functional gene network). Numbers in each plot indicate the D-statistic associated with the Kolmogorov-Smirnov test for the difference between the two distributions and the corresponding P-value.

Supplementary Figure S15
The figure shows the probability density distribution of the average of clustering coefficient across a pair of genes in case of the synthetic genetic interaction pairs (blue line) and non-synthetic genetic interaction pairs (red dashed line) in the weighted network (functional gene network). Numbers in each plot indicate the D-statistic associated with the Kolmogorov-Smirnov test for the difference between the two distributions and the corresponding P-value.

Supplementary Figure S16
The figure shows the probability density distribution of the absolute difference of clustering coefficient across a pair of genes in case of the synthetic genetic interaction pairs (blue line) and non-synthetic genetic interaction pairs (red dashed line) in the weighted network (functional gene network).
Numbers in each plot indicate the D-statistic associated with the Kolmogorov-Smirnov test for the difference between the two distributions and the corresponding P-value.

[Plots for Figures S1 to S16. In each plot the vertical axis is the probability density and the horizontal axis is the quantity named below.]
Figure S1. Degree (Binary Network); the average of gene-pairs; KSStat=0.0364, P-value=0.9
Figure S2. Degree (Binary Network); the absolute difference of gene-pairs; KSStat=0.0587, P-value=1.00E-17
Figure S3. Closeness Centrality (Binary Network); the average of gene-pairs; KSStat=0.0212, P-value=0.0108
Figure S4. Closeness Centrality (Binary Network); the absolute difference of gene-pairs; KSStat=0.0587, P-value=1.00E-17
Figure S5. Betweenness Centrality (Binary Network); the average of gene-pairs; KSStat=0.0319, P-value=1.55E-05
Figure S6. Betweenness Centrality (Binary Network); the absolute difference of gene-pairs; KSStat=0.0313, P-value=2.35E-05
Figure S7. Clustering Coefficients (Binary Network); the average of gene-pairs; KSStat=0.0679, P-value=1.41E-23
Figure S8. Clustering Coefficients (Binary Network); the absolute difference of gene-pairs; KSStat=0.0615, P-value=1.97E-19
Figure S9. Degree (Weighted Network); the average of gene-pairs; KSStat=0.0261, P-value=0.0011
Figure S10. Degree (Weighted Network); the absolute difference of gene-pairs; KSStat=0.0715, P-value=5.7137e-025
Figure S11. Closeness Centrality (Weighted Network); the average of gene-pairs; KSStat=0.0385, P-value=1.4833e-007
Figure S12. Closeness Centrality (Weighted Network); the absolute difference of gene-pairs; KSStat=0.0441, P-value=9.3646e-010
Figure S13. Betweenness Centrality (Weighted Network); the average of gene-pairs; KSStat=0.0529, P-value=7.0540e-014
Figure S14. Betweenness Centrality (Weighted Network); the absolute difference of gene-pairs; KSStat=0.0414, P-value=1.1590e-008
Figure S15. Clustering Coefficients (Weighted Network); the average of gene-pairs; KSStat=0.0691, P-value=2.3655e-024
Figure S16. Clustering Coefficients (Weighted Network); the absolute difference of gene-pairs; KSStat=0.0571, P-value=4.4346e-026

Part Three: Figures show the empirical cumulative distributions of different network properties across synthetic genetic interactions and non-interaction gene pairs.

Supplementary Figure S17
The figure shows the empirical cumulative distribution of the average of degree centrality across a pair of genes in case of the synthetic genetic interaction pairs and non-synthetic genetic interaction pairs in the binary network.

Supplementary Figure S18
The figure shows the empirical cumulative distribution of the average of degree centrality across a pair of genes in case of the synthetic genetic interaction pairs and non-synthetic genetic interaction pairs in the weighted functional gene network.

Supplementary Figure S19
The figure shows the empirical cumulative distribution of the absolute difference of degree centrality across a pair of genes in case of the synthetic genetic interaction pairs and non-synthetic genetic interaction pairs in the binary network.
Supplementary Figure S20
The figure shows the empirical cumulative distribution of the absolute difference of degree centrality across a pair of genes in case of the synthetic genetic interaction pairs and non-synthetic genetic interaction pairs in the weighted functional gene network.

Supplementary Figure S21
The figure shows the empirical cumulative distribution of the average of closeness centrality across a pair of genes in case of the synthetic genetic interaction pairs and non-synthetic genetic interaction pairs in the binary network.

Supplementary Figure S22
The figure shows the empirical cumulative distribution of the average of closeness centrality across a pair of genes in case of the synthetic genetic interaction pairs and non-synthetic genetic interaction pairs in the weighted functional gene network.

Supplementary Figure S23
The figure shows the empirical cumulative distribution of the absolute difference of closeness centrality across a pair of genes in case of the synthetic genetic interaction pairs and non-synthetic genetic interaction pairs in the binary network.

Supplementary Figure S24
The figure shows the empirical cumulative distribution of the absolute difference of closeness centrality across a pair of genes in case of the synthetic genetic interaction pairs and non-synthetic genetic interaction pairs in the weighted functional gene network.

Supplementary Figure S25
The figure shows the empirical cumulative distribution of the average of betweenness centrality across a pair of genes in case of the synthetic genetic interaction pairs and non-synthetic genetic interaction pairs in the binary network.

Supplementary Figure S26
The figure shows the empirical cumulative distribution of the average of betweenness centrality across a pair of genes in case of the synthetic genetic interaction pairs and non-synthetic genetic interaction pairs in the weighted functional gene network.
Supplementary Figure S27
The figure shows the empirical cumulative distribution of the absolute difference of betweenness centrality across a pair of genes in case of the synthetic genetic interaction pairs and non-synthetic genetic interaction pairs in the binary network.

Supplementary Figure S28
The figure shows the empirical cumulative distribution of the absolute difference of betweenness centrality across a pair of genes in case of the synthetic genetic interaction pairs and non-synthetic genetic interaction pairs in the weighted functional gene network.

Supplementary Figure S29
The figure shows the empirical cumulative distribution of the average of clustering coefficient across a pair of genes in case of the synthetic genetic interaction pairs and non-synthetic genetic interaction pairs in the binary network.

Supplementary Figure S30
The figure shows the empirical cumulative distribution of the average of clustering coefficient across a pair of genes in case of the synthetic genetic interaction pairs and non-synthetic genetic interaction pairs in the weighted functional gene network.

Supplementary Figure S31
The figure shows the empirical cumulative distribution of the absolute difference of clustering coefficient across a pair of genes in case of the synthetic genetic interaction pairs and non-synthetic genetic interaction pairs in the binary network.

Supplementary Figure S32
The figure shows the empirical cumulative distribution of the absolute difference of clustering coefficient across a pair of genes in case of the synthetic genetic interaction pairs and non-synthetic genetic interaction pairs in the weighted functional gene network.

[Plots for Figures S17 to S32. In each plot the vertical axis is F(x), the horizontal axis is the quantity named below, and the two curves are SGI Gene Pairs and Non-SGI Gene Pairs.]
Figure S17. Empirical CDF (Degree of Binary Networks); the average of gene pairs
Figure S18. Empirical CDF (Degree of Weighted Networks); the average of gene-pairs
Figure S19. Empirical CDF (Degree of Binary Networks); the absolute difference of gene pairs
Figure S20. Empirical CDF (Degree of Weighted Networks); the absolute difference of gene-pairs
Figure S21. Empirical CDF (Closeness Centrality of Binary Networks); the average of gene pairs
Figure S22. Empirical CDF (Closeness Centrality of Weighted Networks); the average of gene-pairs
Figure S23. Empirical CDF (Closeness Centrality of Binary Networks); the absolute difference of gene pairs
Figure S24. Empirical CDF (Closeness Centrality of Weighted Networks); the absolute difference of gene pairs
Figure S25. Empirical CDF (Betweenness Centrality of Binary Network); the average of gene pairs (axis scaled by 10^-3)
Figure S26. Empirical CDF (Betweenness Centrality of Weighted Networks); the average of gene-pairs (axis scaled by 10^5)
Figure S27. Empirical CDF (Betweenness Centrality of Binary Networks); the absolute difference of gene pairs (axis scaled by 10^-3)
Figure S28. Empirical CDF (Betweenness Centrality of Weighted Networks); the absolute difference of gene pairs (axis scaled by 10^5)
Figure S29. Empirical CDF (Clustering Coefficient of Binary Networks); the average of gene pairs
Figure S30. Empirical CDF (Clustering Coefficient of Weighted Networks); the average of gene pairs
Figure S31. Empirical CDF (Clustering Coefficient of Binary Networks); the absolute difference of gene pairs
Figure S32. Empirical CDF (Clustering Coefficient of Weighted Networks); the absolute difference of gene pairs
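
The D-statistic reported in Figures S1 to S16 and the empirical cumulative distributions plotted in Figures S17 to S32 are closely related: the two-sample Kolmogorov-Smirnov D-statistic is the largest vertical gap between the two empirical CDFs. The sketch below illustrates this in pure Python. The function names (`ecdf`, `ks_statistic`) and the sample values are hypothetical and chosen only for illustration; they are not data from this work, and the P-value computation is omitted.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return F, where F(x) is the fraction of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def cumulative(x):
        return bisect_right(ordered, x) / n

    return cumulative


def ks_statistic(sample_a, sample_b):
    """Two-sample KS D-statistic: max |F_a(x) - F_b(x)| over all x.

    Both empirical CDFs are step functions that only change at observed
    values, so checking the pooled observations is sufficient.
    """
    f_a, f_b = ecdf(sample_a), ecdf(sample_b)
    pooled = set(sample_a) | set(sample_b)
    return max(abs(f_a(x) - f_b(x)) for x in pooled)


# Hypothetical network-property values for the two classes of gene pairs.
sgi_pairs = [1.0, 2.0, 3.0, 4.0]
non_sgi_pairs = [3.0, 4.0, 5.0, 6.0]

d_statistic = ks_statistic(sgi_pairs, non_sgi_pairs)
print(d_statistic)
```

A D-statistic near 0 means the two empirical CDFs nearly coincide, while a value of 1 means the two samples do not overlap at all.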