file - BioMed Central

A semi-supervised learning approach to predict synthetic
genetic interactions by combining functional and
topological properties of functional gene network
Zhu-Hong You1,2,3, Zheng Yin3, Kyungsook Han4, De-Shuang Huang1§ and Xiaobo Zhou3§
1 Intelligent Computing Lab, Institute of Intelligent Machine, Chinese Academy of Science, P.O. Box 1130, Hefei, Anhui 230031, China
2 Department of Automation, University of Science and Technology of China, Hefei, Anhui 230027, China
3 The Methodist Hospital Research Institute, Weill Medical College, Cornell University, Houston, TX 77030, USA
4 School of Computer Science and Engineering, Inha University, Incheon, South Korea
Part One: Brief descriptions of the SVM classifier
The SVM problem can be solved with quadratic programming techniques, using an
optimization algorithm in which the working set is selected by steepest feasible
descent. SVMs have several useful properties, including the ability to handle large
feature spaces and effective avoidance of overfitting. Specifically, the quadratic
programming problem can be formulated, in its standard dual form, as:

$$\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{N} \alpha_i y_i = 0$$

where $x_i$ denotes an input vector and $y_i \in \{+1, -1\}$ is its label, corresponding
to whether $x_i$ belongs to the $+1$ class or the $-1$ class, e.g. the synthetic genetic
interaction class or the non-interaction class in our case. $N$ is the number of training
samples. $C$ is a regularization parameter that controls the trade-off between margin and
classification error. $K(\cdot, \cdot)$ represents the kernel function. $\alpha$ is the
solution of the dual formulation. An unlabeled input vector $x$ can be classified by the
discriminant function below, where $b$ is the bias term.

$$f(x) = \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b$$

The input vector $x$ is assigned to the $+1$ class (the synthetic genetic interaction
class in our case) if $f(x)$ is positive, and to the $-1$ class (the non-interaction
class) otherwise.
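As an illustration of the discriminant function, the following is a minimal sketch in plain Python. It assumes the standard kernel SVM form f(x) = sum_i alpha_i * y_i * K(x_i, x) + b with a Gaussian (RBF) kernel. The support vectors, multipliers, bias, and Gamma value are hypothetical toy numbers chosen for the example, not values from this study.

```python
import math

def rbf_kernel(u, v, gamma):
    """Gaussian (RBF) kernel: K(u, v) = exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def decision_value(x, support_vectors, labels, alphas, bias, gamma):
    """Discriminant f(x) = sum_i alpha_i * y_i * K(x_i, x) + bias."""
    return bias + sum(
        alpha * y * rbf_kernel(sv, x, gamma)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )

def classify(x, support_vectors, labels, alphas, bias, gamma):
    """Return +1 (e.g. interaction) if f(x) > 0, otherwise -1."""
    f = decision_value(x, support_vectors, labels, alphas, bias, gamma)
    return 1 if f > 0 else -1

# Hypothetical toy model: one support vector per class.
support_vectors = [(0.0, 0.0), (1.0, 1.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]   # satisfies sum_i alpha_i * y_i = 0
bias = 0.0
gamma = 1.0

print(classify((0.9, 0.9), support_vectors, labels, alphas, bias, gamma))
print(classify((0.1, 0.1), support_vectors, labels, alphas, bias, gamma))
```

In this toy model a point close to the positive support vector receives a positive decision value, and a point close to the negative one receives a negative value.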
Parameter settings:
Choosing a suitable kernel is not straightforward (there is no free lunch), and research
on optimizing kernel design is ongoing. The kernel function can be linear or non-linear
(e.g. Gaussian). With a linear kernel, the decision function reduces to a linear equation
on the original attributes of the training data. In our experience, a linear kernel works
well when the training data have many attributes (more than 100); otherwise the Gaussian
(RBF) kernel is used.
The Gaussian (RBF) kernel non-linearly maps samples into a higher-dimensional space and,
unlike the linear kernel, can handle the case where the relation between class labels and
attributes is non-linear. It is commonly suggested that the RBF kernel is, in general,
a reasonable first choice (REF: A Practical Guide to Support Vector Classification.
Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin). Furthermore, the linear kernel is
a special case of the RBF kernel (REF: Keerthi SS, Lin CJ: Asymptotic behaviors of support
vector machines with Gaussian kernel. Neural Comput 2003, 15(7):1667-1689). In our case,
prior knowledge of the problem at hand guided us to choose the Gaussian (RBF) kernel.
SVMs with the Gaussian (RBF) kernel have been popular in practice. Model
selection for this class of SVM involves two hyperparameters: the penalty parameter C
and the kernel width Gamma. If complete model selection using the Gaussian kernel has
been conducted, there is no need to consider the linear SVM. A grid search can be
used to try values of each parameter across a specified search range using geometric
steps. Grid searches are computationally expensive because the model must be evaluated
at every point of the grid. For example, if a grid search with 10 search intervals per
parameter is used with an RBF kernel, which has two parameters (C and Gamma), then the
model must be evaluated at 10*10 = 100 grid points. Within the search range, the grid
search finds a region near the global optimum. In our case, its computational cost is
affordable. With the linear kernel, the parameter C still needs to be searched. With
properly tuned parameters, the RBF kernel is at least as good as the linear kernel. In
the current work, we consider the SVM with the RBF kernel sufficient as a baseline
comparison.
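The grid search described above can be sketched as follows. This is a minimal illustration in plain Python, not the procedure used in this work. The search ranges and the `toy_score` function are hypothetical stand-ins for real ranges and for a cross-validated accuracy estimate.

```python
import itertools
import math

def geometric_grid(low, high, steps):
    """Return `steps` values from low to high, each a constant factor apart."""
    ratio = (high / low) ** (1.0 / (steps - 1))
    return [low * ratio ** k for k in range(steps)]

def grid_search(score, c_values, gamma_values):
    """Evaluate score(C, Gamma) at every grid point and keep the best pair."""
    best_score, best_c, best_gamma = -math.inf, None, None
    evaluations = 0
    for c, gamma in itertools.product(c_values, gamma_values):
        evaluations += 1
        s = score(c, gamma)
        if s > best_score:
            best_score, best_c, best_gamma = s, c, gamma
    return best_c, best_gamma, evaluations

def toy_score(c, gamma):
    # Hypothetical stand-in for cross-validated accuracy; peaks at C=10, Gamma=0.01.
    return -((math.log10(c) - 1.0) ** 2 + (math.log10(gamma) + 2.0) ** 2)

# Assumed search ranges with 10 geometric steps per parameter.
c_values = geometric_grid(1e-2, 1e7, 10)
gamma_values = geometric_grid(1e-6, 1e3, 10)
best_c, best_gamma, evaluations = grid_search(toy_score, c_values, gamma_values)
print(best_c, best_gamma, evaluations)
```

With 10 values per parameter the score function is called 100 times, which matches the 10*10 count given in the text. In practice the score would come from training and validating an SVM at each grid point.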
Part Two: Figures show the probability density distribution of different
network properties across synthetic genetic interactions and non-interaction
gene pairs.
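Each figure in this part plots a per-pair feature built from a per-gene network property: either the average or the absolute difference of that property across the two genes of a pair. The sketch below shows this construction for degree centrality on a binary network. The gene names and edges are hypothetical, and normalizing the degree by (n - 1) is an assumption made for the example.

```python
def degree_centrality(edges):
    """Normalized degree centrality: number of neighbors divided by (n - 1)."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    n = len(neighbors)
    return {gene: len(nb) / (n - 1) for gene, nb in neighbors.items()}

def pair_features(centrality, gene_a, gene_b):
    """Return (average, absolute difference) of a property across a gene pair."""
    a, b = centrality[gene_a], centrality[gene_b]
    return (a + b) / 2.0, abs(a - b)

# Hypothetical toy binary network with four genes.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
centrality = degree_centrality(edges)
average, difference = pair_features(centrality, "A", "D")
print(average, difference)
```

The same `pair_features` step applies to any per-gene property, such as closeness centrality, betweenness centrality, or clustering coefficient.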
Supplementary Figure S1
The figure shows the probability density distribution of the average of
degree centrality across a pair of genes in case of the synthetic genetic
interaction pairs (blue line) and non-synthetic genetic interaction pairs (red
dashed line) in the binary network (protein interaction network). Numbers in
each plot indicate the D-statistic associated with the Kolmogorov-Smirnov
test for the difference between the two distributions and the corresponding
P-value.
Supplementary Figure S2
The figure shows the probability density distribution of the absolute
difference of degree centrality across a pair of genes in case of the synthetic
genetic interaction pairs (blue line) and non-synthetic genetic interaction
pairs (red dashed line) in the binary network (protein interaction network).
Numbers in each plot indicate the D-statistic associated with the
Kolmogorov-Smirnov test for the difference between the two distributions
and the corresponding P-value.
Supplementary Figure S3
The figure shows the probability density distribution of the average of
closeness centrality across a pair of genes in case of the synthetic genetic
interaction pairs (blue line) and non-synthetic genetic interaction pairs (red
dashed line) in the binary network (protein interaction network). Numbers in
each plot indicate the D-statistic associated with the Kolmogorov-Smirnov
test for the difference between the two distributions and the corresponding
P-value.
Supplementary Figure S4
The figure shows the probability density distribution of the absolute
difference of closeness centrality across a pair of genes in case of the
synthetic genetic interaction pairs (blue line) and non-synthetic genetic
interaction pairs (red dashed line) in the binary network (protein interaction
network). Numbers in each plot indicate the D-statistic associated with the
Kolmogorov-Smirnov test for the difference between the two distributions
and the corresponding P-value.
Supplementary Figure S5
The figure shows the probability density distribution of the average of
betweenness centrality across a pair of genes in case of the synthetic genetic
interaction pairs (blue line) and non-synthetic genetic interaction pairs (red
dashed line) in the binary network (protein interaction network). Numbers in
each plot indicate the D-statistic associated with the Kolmogorov-Smirnov
test for the difference between the two distributions and the corresponding
P-value.
Supplementary Figure S6
The figure shows the probability density distribution of the absolute
difference of betweenness centrality across a pair of genes in case of the
synthetic genetic interaction pairs (blue line) and non-synthetic genetic
interaction pairs (red dashed line) in the binary network (protein interaction
network). Numbers in each plot indicate the D-statistic associated with the
Kolmogorov-Smirnov test for the difference between the two distributions
and the corresponding P-value.
Supplementary Figure S7
The figure shows the probability density distribution of the average of
clustering coefficient across a pair of genes in case of the synthetic genetic
interaction pairs (blue line) and non-synthetic genetic interaction pairs (red
dashed line) in the binary network (protein interaction network). Numbers in
each plot indicate the D-statistic associated with the Kolmogorov-Smirnov
test for the difference between the two distributions and the corresponding
P-value.
Supplementary Figure S8
The figure shows the probability density distribution of the absolute
difference of clustering coefficient across a pair of genes in case of the
synthetic genetic interaction pairs (blue line) and non-synthetic genetic
interaction pairs (red dashed line) in the binary network (protein interaction
network). Numbers in each plot indicate the D-statistic associated with the
Kolmogorov-Smirnov test for the difference between the two distributions
and the corresponding P-value.
Supplementary Figure S9
The figure shows the probability density distribution of the average of
degree centrality across a pair of genes in case of the synthetic genetic
interaction pairs (blue line) and non-synthetic genetic interaction pairs (red
dashed line) in the weighted network (functional gene network). Numbers in
each plot indicate the D-statistic associated with the Kolmogorov-Smirnov
test for the difference between the two distributions and the corresponding
P-value.
Supplementary Figure S10
The figure shows the probability density distribution of the absolute
difference of degree centrality across a pair of genes in case of the synthetic
genetic interaction pairs (blue line) and non-synthetic genetic interaction
pairs (red dashed line) in the weighted network (functional gene network).
Numbers in each plot indicate the D-statistic associated with the
Kolmogorov-Smirnov test for the difference between the two distributions
and the corresponding P-value.
Supplementary Figure S11
The figure shows the probability density distribution of the average of
closeness centrality across a pair of genes in case of the synthetic genetic
interaction pairs (blue line) and non-synthetic genetic interaction pairs (red
dashed line) in the weighted network (functional gene network). Numbers in
each plot indicate the D-statistic associated with the Kolmogorov-Smirnov
test for the difference between the two distributions and the corresponding
P-value.
Supplementary Figure S12
The figure shows the probability density distribution of the absolute
difference of closeness centrality across a pair of genes in case of the
synthetic genetic interaction pairs (blue line) and non-synthetic genetic
interaction pairs (red dashed line) in the weighted network (functional gene
network). Numbers in each plot indicate the D-statistic associated with the
Kolmogorov-Smirnov test for the difference between the two distributions
and the corresponding P-value.
Supplementary Figure S13
The figure shows the probability density distribution of the average of
betweenness centrality across a pair of genes in case of the synthetic genetic
interaction pairs (blue line) and non-synthetic genetic interaction pairs (red
dashed line) in the weighted network (functional gene network). Numbers in
each plot indicate the D-statistic associated with the Kolmogorov-Smirnov
test for the difference between the two distributions and the corresponding
P-value.
Supplementary Figure S14
The figure shows the probability density distribution of the absolute
difference of betweenness centrality across a pair of genes in case of the
synthetic genetic interaction pairs (blue line) and non-synthetic genetic
interaction pairs (red dashed line) in the weighted network (functional gene
network). Numbers in each plot indicate the D-statistic associated with the
Kolmogorov-Smirnov test for the difference between the two distributions
and the corresponding P-value.
Supplementary Figure S15
The figure shows the probability density distribution of the average of
clustering coefficient across a pair of genes in case of the synthetic genetic
interaction pairs (blue line) and non-synthetic genetic interaction pairs (red
dashed line) in the weighted network (functional gene network). Numbers in
each plot indicate the D-statistic associated with the Kolmogorov-Smirnov
test for the difference between the two distributions and the corresponding
P-value.
Supplementary Figure S16
The figure shows the probability density distribution of the absolute
difference of clustering coefficient across a pair of genes in case of the
synthetic genetic interaction pairs (blue line) and non-synthetic genetic
interaction pairs (red dashed line) in the weighted network (functional gene
network). Numbers in each plot indicate the D-statistic associated with the
Kolmogorov-Smirnov test for the difference between the two distributions
and the corresponding P-value.
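The D-statistic quoted in these captions is the largest vertical gap between the empirical cumulative distributions of the two samples. The following is a minimal plain-Python sketch of that statistic. The input samples in the example are hypothetical, and the P-value computation is omitted. In practice a library routine such as `scipy.stats.ks_2samp` returns both the statistic and the P-value.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: max |F_a(x) - F_b(x)| over all x."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, so ties are handled together.
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

# Hypothetical feature values for interacting and non-interacting gene pairs.
sgi_values = [1, 2, 3, 4]
non_sgi_values = [3, 4, 5, 6]
print(ks_statistic(sgi_values, non_sgi_values))
```

Once one sample is exhausted its empirical distribution equals 1, so the gap can only shrink afterwards and the loop can stop.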
[Figures S1 to S16: probability density plots. The vertical axis of each plot is labelled "Probability density". The plot label, the horizontal-axis label, and the Kolmogorov-Smirnov statistics printed in each plot are listed below.]
Figure S1. Degree (Binary Network); the average of gene-pairs; KSStat=0.0364, P-value=0.9
Figure S2. Degree (Binary Network); the absolute difference of gene-pairs; KSStat=0.0587, P-value=1.00E-17
Figure S3. Closeness Centrality (Binary Network); the average of gene-pairs; KSStat=0.0212, P-value=0.0108
Figure S4. Closeness Centrality (Binary Network); the absolute difference of gene-pairs; KSStat=0.0587, P-value=1.00E-17
Figure S5. Betweenness Centrality (Binary Network); the average of gene-pairs; KSStat=0.0319, P-value=1.55E-05
Figure S6. Betweenness Centrality (Binary Network); the absolute difference of gene-pairs; KSStat=0.0313, P-value=2.35E-05
Figure S7. Clustering Coefficients (Binary Network); the average of gene-pairs; KSStat=0.0679, P-value=1.41E-23
Figure S8. Clustering Coefficients (Binary Network); the absolute difference of gene-pairs; KSStat=0.0615, P-value=1.97E-19
Figure S9. Degree (Weighted Network); the average of gene-pairs; KSStat=0.0261, P-value=0.0011
Figure S10. Degree (Weighted Network); the absolute difference of gene-pairs; KSStat=0.0715, P-value=5.7137e-025
Figure S11. Closeness Centrality (Weighted Network); the average of gene-pairs; KSStat=0.0385, P-value=1.4833e-007
Figure S12. Closeness Centrality (Weighted Network); the absolute difference of gene-pairs; KSStat=0.0441, P-value=9.3646e-010
Figure S13. Betweenness Centrality (Weighted Network); the average of gene-pairs; KSStat=0.0529, P-value=7.0540e-014
Figure S14. Betweenness Centrality (Weighted Network); the absolute difference of gene-pairs; KSStat=0.0414, P-value=1.1590e-008
Figure S15. Clustering Coefficients (Weighted Network); the average of gene-pairs; KSStat=0.0691, P-value=2.3655e-024
Figure S16. Clustering Coefficients (Weighted Network); the absolute difference of gene-pairs; KSStat=0.0571, P-value=4.4346e-026
Part Three: Figures show the empirical cumulative distributions of
different network properties across synthetic genetic interactions and
non-interaction gene pairs.
Supplementary Figure S17
The figure shows the empirical cumulative distribution of the average of
degree centrality across a pair of genes in case of the synthetic genetic
interaction pairs and non-synthetic genetic interaction pairs in the binary
network.
Supplementary Figure S18
The figure shows the empirical cumulative distribution of the average of
degree centrality across a pair of genes in case of the synthetic genetic
interaction pairs and non-synthetic genetic interaction pairs in the weighted
functional gene network.
Supplementary Figure S19
The figure shows the empirical cumulative distribution of the absolute
difference of degree centrality across a pair of genes in case of the
synthetic genetic interaction pairs and non-synthetic genetic interaction
pairs in the binary network.
Supplementary Figure S20
The figure shows the empirical cumulative distribution of the absolute
difference of degree centrality across a pair of genes in case of the
synthetic genetic interaction pairs and non-synthetic genetic interaction
pairs in the weighted functional gene network.
Supplementary Figure S21
The figure shows the empirical cumulative distribution of the average of
closeness centrality across a pair of genes in case of the synthetic genetic
interaction pairs and non-synthetic genetic interaction pairs in the binary
network.
Supplementary Figure S22
The figure shows the empirical cumulative distribution of the average of
closeness centrality across a pair of genes in case of the synthetic genetic
interaction pairs and non-synthetic genetic interaction pairs in the weighted
functional gene network.
Supplementary Figure S23
The figure shows the empirical cumulative distribution of the absolute
difference of closeness centrality across a pair of genes in case of the
synthetic genetic interaction pairs and non-synthetic genetic interaction
pairs in the binary network.
Supplementary Figure S24
The figure shows the empirical cumulative distribution of the absolute
difference of closeness centrality across a pair of genes in case of the
synthetic genetic interaction pairs and non-synthetic genetic interaction
pairs in the weighted functional gene network.
Supplementary Figure S25
The figure shows the empirical cumulative distribution of the average of
betweenness centrality across a pair of genes in case of the synthetic
genetic interaction pairs and non-synthetic genetic interaction pairs in
the binary network.
Supplementary Figure S26
The figure shows the empirical cumulative distribution of the average of
betweenness centrality across a pair of genes in case of the synthetic
genetic interaction pairs and non-synthetic genetic interaction pairs in
the weighted functional gene network.
Supplementary Figure S27
The figure shows the empirical cumulative distribution of the absolute
difference of betweenness centrality across a pair of genes in case of the
synthetic genetic interaction pairs and non-synthetic genetic interaction
pairs in the binary network.
Supplementary Figure S28
The figure shows the empirical cumulative distribution of the absolute
difference of betweenness centrality across a pair of genes in case of the
synthetic genetic interaction pairs and non-synthetic genetic interaction
pairs in the weighted functional gene network.
Supplementary Figure S29
The figure shows the empirical cumulative distribution of the average of
clustering coefficient across a pair of genes in case of the synthetic
genetic interaction pairs and non-synthetic genetic interaction pairs in
the binary network.
Supplementary Figure S30
The figure shows the empirical cumulative distribution of the average of
clustering coefficient across a pair of genes in case of the synthetic
genetic interaction pairs and non-synthetic genetic interaction pairs in
the weighted functional gene network.
Supplementary Figure S31
The figure shows the empirical cumulative distribution of the absolute
difference of clustering coefficient across a pair of genes in case of the
synthetic genetic interaction pairs and non-synthetic genetic interaction
pairs in the binary network.
Supplementary Figure S32
The figure shows the empirical cumulative distribution of the absolute
difference of clustering coefficient across a pair of genes in case of the
synthetic genetic interaction pairs and non-synthetic genetic interaction
pairs in the weighted functional gene network.
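The empirical cumulative distribution F(x) plotted in these figures is the fraction of gene pairs whose feature value is at most x. A minimal plain-Python sketch follows. The sample values are hypothetical and stand in for one feature measured over a set of gene pairs.

```python
import bisect

def ecdf(sample):
    """Return F, where F(x) is the fraction of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect.bisect_right(ordered, x) / n

    return F

# Hypothetical feature values for a small set of gene pairs.
F = ecdf([0.1, 0.2, 0.2, 0.4])
print(F(0.0), F(0.2), F(1.0))
```

Evaluating one such function for the interacting pairs and one for the non-interacting pairs over a common range of x gives the two curves compared in each figure.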
[Figures S17 to S32: empirical CDF plots. The vertical axis of each plot is F(x), and each plot shows two curves, "SGI Gene Pairs" and "Non-SGI Gene Pairs". The plot title and the horizontal-axis label are listed below.]
Figure S17. Empirical CDF (Degree of Binary Networks); the average of gene pairs
Figure S18. Empirical CDF (Degree of Weighted Networks); the average of gene-pairs
Figure S19. Empirical CDF (Degree of Binary Networks); the absolute difference of gene pairs
Figure S20. Empirical CDF (Degree of Weighted Networks); the absolute difference of gene-pairs
Figure S21. Empirical CDF (Closeness Centrality of Binary Networks); the average of gene pairs
Figure S22. Empirical CDF (Closeness Centrality of Weighted Networks); the average of gene-pairs
Figure S23. Empirical CDF (Closeness Centrality of Binary Networks); the absolute difference of gene pairs
Figure S24. Empirical CDF (Closeness Centrality of Weighted Networks); the absolute difference of gene pairs
Figure S25. Empirical CDF (Betweenness Centrality of Binary Network); the average of gene pairs (axis scaled by 10^-3)
Figure S26. Empirical CDF (Betweenness Centrality of Weighted Networks); the average of gene-pairs (axis scaled by 10^5)
Figure S27. Empirical CDF (Betweenness Centrality of Binary Networks); the absolute difference of gene pairs (axis scaled by 10^-3)
Figure S28. Empirical CDF (Betweenness Centrality of Weighted Networks); the absolute difference of gene pairs (axis scaled by 10^5)
Figure S29. Empirical CDF (Clustering Coefficient of Binary Networks); the average of gene pairs
Figure S30. Empirical CDF (Clustering Coefficient of Weighted Networks); the average of gene pairs
Figure S31. Empirical CDF (Clustering Coefficient of Binary Networks); the absolute difference of gene pairs
Figure S32. Empirical CDF (Clustering Coefficient of Weighted Networks); the absolute difference of gene pairs