Global optimization-based inference of chemogenomic features
Zu,S. et al.
Supplementary Materials
Songpeng Zu1,a, Ting Chen2, Shao Li1*
1 Bioinformatics Division, TNLIST and Department of Automation, Tsinghua University, Beijing 10084, China
2 Molecular and Computational Biology Program, Department of Biological Science, University of Southern California, Los Angeles, California 90089, USA
Table of Contents
Supplementary Materials
    Data Sources
    Methods
        The EM framework
        Derivation of EM algorithm in our model
        The Association Method
        Variance Estimation of the EM results
        Combinations of drug chemical substructures
    Estimation of fn and fp
    Results of the predicted drug-domain interactions
    Reference
a Contact: zsp07@mails.tsinghua.edu.cn
Data Sources
The drug-protein interactions, drug substructures, and protein domains were obtained from Tabei et al. (2012). A total of 1862 drugs are represented by 881-dimensional binary vectors of chemical substructures from the PubChem database [2], and 1554 proteins are represented by 876-dimensional binary vectors of protein domains from the Pfam database [3]. There are 4809 interactions between these drugs and proteins. We deleted the drug chemical substructures and protein domains that never appeared in any drug or protein, and we merged the substructures (or domains) that appeared in the same set of drugs (or proteins).
The drug-domain interactions were extracted from the PDB database using the script from Kruger et al. (2012). We kept only proteins that had multiple domains. In the end, 53 drug-protein interaction pairs with recorded drug-domain interactions were used.
Methods
The EM framework
Here we propose a probabilistic model to infer substructure-domain interactions. It is inspired by the work of Deng et al. [7] and rests on the following assumptions:
1. The interactions between drug chemical substructures and protein domains within a drug-target interaction pair are independent.
2. A drug and a protein interact if and only if at least one pair of a drug chemical substructure and a protein domain interacts.
Let $Y_i$ represent drug $i$, $P_j$ represent protein $j$, $Z_m$ represent chemical substructure $m$, and $D_n$ represent domain $n$. Let $ZD_{mn}^{(ij)}$ denote whether chemical substructure $m$ from drug $i$ and protein domain $n$ from protein $j$ interact (one) or not (zero). We use $\theta_{mn}$ to denote the interaction probability of chemical substructure $m$ and protein domain $n$, that is,
\[ \theta_{mn} = \Pr\bigl(ZD_{mn}^{(ij)} = 1\bigr) \]
Our aim is then to estimate the interaction probabilities $\theta = \{\theta_{mn}\}$ of the chemical substructures and the protein domains.
Under Assumptions 1 and 2, we get
\[ \Pr(YP_{ij} = 1 \mid \theta) = 1 - \prod_{ZD_{mn}^{(ij)}} (1 - \theta_{mn}) \]
where the product runs over the substructure-domain pairs $ZD_{mn}^{(ij)}$ of drug $i$ and protein $j$, and $YP_{ij}$ denotes whether drug $Y_i$ and protein $P_j$ interact: $YP_{ij}$ equals 1 if they interact and 0 otherwise.
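The interaction probability above is a noisy-OR over substructure-domain pairs. A minimal sketch, assuming θ is stored as a dictionary keyed by (substructure, domain) and that pairs missing from the dictionary have probability zero (both are illustrative choices, not taken from the original work):

```python
from math import prod

def interaction_probability(theta, substructures, domains):
    """Pr(YP_ij = 1 | theta) under Assumptions 1 and 2.

    theta maps (m, n) -> theta_mn; pairs absent from theta count as 0
    (an assumption made for this sketch).
    """
    # Probability that no substructure-domain pair interacts.
    no_pair_interacts = prod(
        1.0 - theta.get((m, n), 0.0)
        for m in substructures
        for n in domains
    )
    return 1.0 - no_pair_interacts

# Hypothetical drug with substructures Z1, Z2 and protein with domain D1.
theta = {("Z1", "D1"): 0.5, ("Z2", "D1"): 0.2}
p = interaction_probability(theta, ["Z1", "Z2"], ["D1"])
print(p)  # 1 - 0.5 * 0.8, i.e. about 0.6
```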
Because many drug-protein interactions remain unknown, $YP$ cannot be identified directly with the observed drug-protein interaction data. We therefore use $O = \{O_{ij}\}$, where $O_{ij}$ denotes whether drug $i$ and protein $j$ are observed to interact (one) or not (zero), to represent the given drug-protein interactions. To connect $YP$ with $O$, we introduce two parameters, the false negative rate fn and the false positive rate fp, defined as
\[ \mathrm{fp} = \Pr(O_{ij} = 1 \mid YP_{ij} = 0) \]
\[ \mathrm{fn} = \Pr(O_{ij} = 0 \mid YP_{ij} = 1) \]
Then we get
\[ \Pr(O_{ij} = 1 \mid \theta) = \sum_{t=0,1} \Pr(O_{ij} = 1 \mid YP_{ij} = t)\,\Pr(YP_{ij} = t \mid \theta) = (1 - \mathrm{fn})\Pr(YP_{ij} = 1 \mid \theta) + \mathrm{fp}\,\bigl(1 - \Pr(YP_{ij} = 1 \mid \theta)\bigr) \]
Moreover, the log-likelihood function, i.e., the logarithm of the total probability of the observed drug-protein interaction data, is
\[ l(\theta) = \log \Pr(O \mid \theta) = \log \prod_{i,j} \Pr(O_{ij} = 1 \mid \theta)^{O_{ij}}\, \Pr(O_{ij} = 0 \mid \theta)^{1 - O_{ij}} \]
where $\theta = (\{\theta_{mn}\}, \mathrm{fn}, \mathrm{fp})$, and fn and fp are predefined.
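The observation model and the log-likelihood can be sketched in a few lines. The data layout (a list of `(substructures, domains, o_ij)` tuples) and the function names are assumptions made for illustration. The sketch also assumes fn and fp are strictly between 0 and 1 so that the logarithms are defined.

```python
from math import log, prod

def prob_observed(theta, subs, doms, fn, fp):
    """Pr(O_ij = 1 | theta) = (1 - fn) * p + fp * (1 - p),
    where p = Pr(YP_ij = 1 | theta)."""
    p = 1.0 - prod(1.0 - theta.get((m, n), 0.0) for m in subs for n in doms)
    return (1.0 - fn) * p + fp * (1.0 - p)

def log_likelihood(theta, pairs, fn, fp):
    """Sum over drug-protein pairs of
    O_ij * log Pr(O_ij = 1) + (1 - O_ij) * log Pr(O_ij = 0)."""
    total = 0.0
    for subs, doms, o in pairs:
        mu = prob_observed(theta, subs, doms, fn, fp)
        total += o * log(mu) + (1 - o) * log(1.0 - mu)
    return total

# Hypothetical toy data: the same substructure-domain content observed
# once as interacting and once as non-interacting.
theta = {("Z1", "D1"): 0.5}
pairs = [(["Z1"], ["D1"], 1), (["Z1"], ["D1"], 0)]
ll = log_likelihood(theta, pairs, fn=0.4, fp=0.01)
```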
Our aim is then to estimate $\theta$ by maximum likelihood estimation (MLE). However, because we do not know whether $ZD_{mn}^{(ij)}$ equals 1 or 0 (that is, whether chemical substructure $m$ from drug $i$ interacts with protein domain $n$ from protein $j$), this is a missing-data problem. It is natural to solve it with the EM algorithm [6], as follows.
• The E step is:
\[ E\bigl(ZD_{mn}^{(ij)} \mid O, \theta^{(t-1)}\bigr) = E\bigl(ZD_{mn}^{(ij)} \mid O_{ij}, \theta^{(t-1)}\bigr) = \Pr\bigl(ZD_{mn}^{(ij)} = 1 \mid O_{ij}, \theta^{(t-1)}\bigr) = \frac{\Pr\bigl(ZD_{mn}^{(ij)} = 1, O_{ij} \mid \theta^{(t-1)}\bigr)}{\Pr\bigl(O_{ij} \mid \theta^{(t-1)}\bigr)} = \frac{\theta_{mn}^{(t-1)} (1 - \mathrm{fn})^{O_{ij}}\, \mathrm{fn}^{1 - O_{ij}}}{\Pr\bigl(O_{ij} \mid \theta^{(t-1)}\bigr)} \]
• The M step is:
\[ \theta_{mn}^{(t)} = \frac{1}{N_{mn}} \sum_{i,j} E\bigl(ZD_{mn}^{(ij)} \mid O, \theta^{(t-1)}\bigr) \]
Note that $N_{mn}$ is the total number of drug-protein pairs that contain chemical substructure $m$ and protein domain $n$.
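One EM iteration following the E and M steps above might look like the sketch below. The data layout and the name `em_step` are illustrative assumptions; `theta` is assumed to hold an entry for every substructure-domain pair that occurs in the data, and fn and fp are assumed to keep $\Pr(O_{ij} \mid \theta)$ away from zero.

```python
from collections import defaultdict
from math import prod

def em_step(theta, pairs, fn, fp):
    """One EM update. pairs: list of (substructures, domains, o_ij)."""
    sums = defaultdict(float)   # accumulated E-step expectations per (m, n)
    counts = defaultdict(int)   # N_mn: pairs (i, j) containing (m, n)
    for subs, doms, o in pairs:
        p = 1 - prod(1 - theta[(m, n)] for m in subs for n in doms)
        mu = (1 - fn) * p + fp * (1 - p)       # Pr(O_ij = 1 | theta)
        pr_o = mu if o == 1 else 1 - mu        # Pr(O_ij | theta)
        weight = (1 - fn) if o == 1 else fn    # (1 - fn)^O_ij * fn^(1 - O_ij)
        for m in subs:
            for n in doms:
                # E step: theta_mn * weight / Pr(O_ij | theta)
                sums[(m, n)] += theta[(m, n)] * weight / pr_o
                counts[(m, n)] += 1
    # M step: average the expectations over the N_mn pairs.
    return {key: sums[key] / counts[key] for key in sums}

# Sanity check on a hypothetical single pair: with fp = 0, an observed
# interaction through the only substructure-domain pair forces ZD = 1.
new_theta = em_step({("Z1", "D1"): 0.5}, [(["Z1"], ["D1"], 1)], fn=0.4, fp=0.0)
```

In practice the update would be repeated, feeding `new_theta` back in, until the change in θ or in the log-likelihood falls below a chosen tolerance.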
Derivation of EM algorithm in our model
Here we show how to derive the EM algorithm for our model. In general, the EM algorithm involves two steps:
• E step:
\[ Q(\theta, \theta^{(t)}) = E_{Y \mid X, \theta^{(t)}}\bigl[\log \Pr(X, Y \mid \theta)\bigr] \]
• M step:
\[ \theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^{(t)}) \]
Here $X$ represents the observed, incomplete data, $Y$ is the latent data, and $(X, Y)$ is the complete data.
In our model,
\[ Q(\theta, \theta^{(t)}) = E_{ZD \mid O, \theta^{(t)}}\bigl[\log \Pr(ZD, O \mid \theta)\bigr] \]
\[ = E_{ZD \mid O, \theta^{(t)}}\Bigl[\sum_{i,j} \log\Bigl(\Pr\bigl(O_{ij} \mid ZD^{(ij)}\bigr) \Pr\bigl(ZD^{(ij)} \mid \theta\bigr)\Bigr)\Bigr] \]
\[ = \sum_{i,j} E_{ZD \mid O, \theta^{(t)}}\Bigl[\log \Pr\bigl(O_{ij} \mid ZD^{(ij)}\bigr)\Bigr] + \sum_{i,j} E_{ZD \mid O, \theta^{(t)}}\Bigl[\log \Pr\bigl(ZD^{(ij)} \mid \theta\bigr)\Bigr] \]
Note that the first summation does not depend on $\theta$, while the last summation can be rewritten as follows:
\[ \text{last sum} = \sum_{i,j} E_{ZD \mid O, \theta^{(t)}}\Bigl[\log \prod_{m,n} \Pr\bigl(ZD_{mn}^{(ij)} \mid \theta\bigr)\Bigr] = \sum_{i,j} \sum_{m,n} E_{ZD \mid O, \theta^{(t)}}\Bigl[\log\Bigl(\theta_{mn}^{ZD_{mn}^{(ij)}} (1 - \theta_{mn})^{1 - ZD_{mn}^{(ij)}}\Bigr)\Bigr] \]
Then,
\[ \frac{\partial Q(\theta, \theta^{(t)})}{\partial \theta_{mn}} = \sum_{i,j} E_{ZD \mid O, \theta^{(t)}}\Biggl[\frac{\partial \log\Bigl(\theta_{mn}^{ZD_{mn}^{(ij)}} (1 - \theta_{mn})^{1 - ZD_{mn}^{(ij)}}\Bigr)}{\partial \theta_{mn}}\Biggr] = \sum_{i,j} E_{ZD \mid O, \theta^{(t)}}\Biggl[\frac{ZD_{mn}^{(ij)}}{\theta_{mn}} - \frac{1 - ZD_{mn}^{(ij)}}{1 - \theta_{mn}}\Biggr] \]
Setting the expression above to zero, we finally obtain our EM procedure.
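The final step can be spelled out as follows. This is a sketch of the algebra the text leaves implicit, assuming the sum runs over the $N_{mn}$ drug-protein pairs that contain substructure $m$ and domain $n$, and writing $e_{ij}$ (a shorthand introduced here) for $E\bigl(ZD_{mn}^{(ij)} \mid O, \theta^{(t)}\bigr)$:

\[ \sum_{i,j} \Bigl(\frac{e_{ij}}{\theta_{mn}} - \frac{1 - e_{ij}}{1 - \theta_{mn}}\Bigr) = 0 \;\Longrightarrow\; (1 - \theta_{mn}) \sum_{i,j} e_{ij} = \theta_{mn} \sum_{i,j} (1 - e_{ij}) = \theta_{mn} \Bigl(N_{mn} - \sum_{i,j} e_{ij}\Bigr) \]
\[ \Longrightarrow\; \sum_{i,j} e_{ij} = \theta_{mn} N_{mn} \;\Longrightarrow\; \theta_{mn} = \frac{1}{N_{mn}} \sum_{i,j} e_{ij} \]

This matches the M step stated in the EM framework section.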
The Association Method
One problem with the EM algorithm is that it converges to a local maximum of the likelihood, and different initial values usually lead to different local maxima. Instead of randomly choosing the initial values many times, we used the association method to choose them. It is a local way to evaluate the interaction probabilities of drug substructures and protein domains:
\[ \theta_{mn} = \frac{I_{mn}}{N_{mn}} \]
in which $I_{mn}$ is the number of interacting drug-protein pairs that contain chemical substructure $Z_m$ and protein domain $D_n$, and $N_{mn}$ is the total number of drug-protein pairs that contain chemical substructure $Z_m$ and protein domain $D_n$.
This method has two limitations.
• First, it computes the chemical substructure-protein domain interactions locally, which means it ignores other interactions between the chemical substructures and protein domains in the same drug-protein pairs. For example, suppose drug $Y_i$, containing substructures $\{Z_m, Z_y\}$, interacts with both protein $P_j$, containing domains $\{D_n, D_y\}$, and protein $P_k$, containing domains $\{D_n, D_c\}$, and substructure $Z_m$ and protein domain $D_n$ do not appear in any other drugs or proteins, respectively. Then $\theta_{mn} = 2/2 = 1$. This ignores other possible interactions between substructures and protein domains, such as substructure $Z_y$ interacting with protein domain $D_c$. Therefore, to infer drug substructure-protein domain interactions, we should consider all the drug-protein interactions and all the interactions between drug substructures and protein domains.
• Second, this method relies on the accuracy of the observed data. However, current drug-protein interaction data are largely incomplete.
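The association estimate and the worked example from the first limitation can be reproduced with a short sketch. The data layout and the name `association_init` are assumptions for illustration only.

```python
from collections import defaultdict

def association_init(pairs):
    """theta_mn = I_mn / N_mn from (substructures, domains, o_ij) tuples."""
    interacting = defaultdict(int)  # I_mn
    total = defaultdict(int)        # N_mn
    for subs, doms, o in pairs:
        for m in subs:
            for n in doms:
                total[(m, n)] += 1
                interacting[(m, n)] += o
    return {key: interacting[key] / total[key] for key in total}

# The example from the text: drug Y_i = {Zm, Zy} interacts with
# protein P_j = {Dn, Dy} and protein P_k = {Dn, Dc}.
pairs = [
    (["Zm", "Zy"], ["Dn", "Dy"], 1),
    (["Zm", "Zy"], ["Dn", "Dc"], 1),
]
theta0 = association_init(pairs)
print(theta0[("Zm", "Dn")])  # 2 / 2
```

Every substructure-domain pair in this toy data set receives the value 1, which illustrates the point made above: the local estimate cannot tell which pair is actually responsible for the observed interactions.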
Variance Estimation of the EM results.
The natural way to estimate the variance of the maximum likelihood estimate is as follows [9]:
\[ \mathrm{var}(\hat{\theta}) \approx \frac{1}{I(\hat{\theta})} \]
$I(\theta)$ is the observed information,
\[ I(\theta) = -\frac{d^2}{d\theta^2} \log \Pr(x \mid \theta) \]
In our situation, we derive the observed information below.
\[ I(\theta_{mn}) = -\frac{\partial^2}{\partial \theta_{mn}^2} \log \Pr(O \mid \theta) = -\frac{\partial^2}{\partial \theta_{mn}^2} \sum_{i,j} \Bigl( O_{ij}^{(mn)} \log \Pr\bigl(O_{ij}^{(mn)} = 1 \mid \theta\bigr) + \bigl(1 - O_{ij}^{(mn)}\bigr) \log \Pr\bigl(O_{ij}^{(mn)} = 0 \mid \theta\bigr) \Bigr) \]
Since
\[ \frac{\partial}{\partial \theta_{mn}} \sum_{i,j} \Bigl( O_{ij}^{(mn)} \log \Pr\bigl(O_{ij}^{(mn)} = 1 \mid \theta\bigr) + \bigl(1 - O_{ij}^{(mn)}\bigr) \log \Pr\bigl(O_{ij}^{(mn)} = 0 \mid \theta\bigr) \Bigr) \]
\[ = \sum_{i,j} \Biggl( \frac{O_{ij}^{(mn)}}{\Pr\bigl(O_{ij}^{(mn)} = 1 \mid \theta\bigr)} \frac{\partial}{\partial \theta_{mn}} \Pr\bigl(O_{ij}^{(mn)} = 1 \mid \theta\bigr) + \frac{1 - O_{ij}^{(mn)}}{\Pr\bigl(O_{ij}^{(mn)} = 0 \mid \theta\bigr)} \frac{\partial}{\partial \theta_{mn}} \Pr\bigl(O_{ij}^{(mn)} = 0 \mid \theta\bigr) \Biggr) \]
And
\[ \frac{\partial}{\partial \theta_{mn}} \Pr\bigl(O_{ij}^{(mn)} = 1 \mid \theta\bigr) = \frac{\partial}{\partial \theta_{mn}} \Bigl( (1 - \mathrm{fn}) \Pr\bigl(YP_{ij}^{(mn)} = 1 \mid \theta\bigr) + \mathrm{fp}\,\bigl(1 - \Pr\bigl(YP_{ij}^{(mn)} = 1 \mid \theta\bigr)\bigr) \Bigr) \]
Where
\[ \frac{\partial}{\partial \theta_{mn}} \Pr\bigl(YP_{ij}^{(mn)} = 1 \mid \theta\bigr) = \frac{\partial}{\partial \theta_{mn}} \Bigl( 1 - \prod_{ZD_{st}^{(ij)}\ \text{with}\ (mn)} (1 - \theta_{st}) \Bigr) = \prod_{ZD_{st}^{(ij)}\ \text{contain}\ (mn),\ st\ \text{without}\ (mn)} (1 - \theta_{st}) = \prod_{ZD_{st}^{(ij)*} \neq mn} (1 - \theta_{st}) \]
Then,
\[ \delta_{mn}^{(ij)} = \frac{\partial}{\partial \theta_{mn}} \Pr\bigl(O_{ij}^{(mn)} = 1 \mid \theta\bigr) = (1 - \mathrm{fn} - \mathrm{fp}) \prod_{ZD_{st}^{(ij)*} \neq mn} (1 - \theta_{st}) \]
Also, let
\[ \mu_{mn}^{(ij)} = \Pr\bigl(O_{ij}^{(mn)} = 1 \mid \theta\bigr) \]
We can get
\[ \frac{\partial}{\partial \theta_{mn}} \sum_{i,j} \Bigl( O_{ij}^{(mn)} \log \Pr\bigl(O_{ij}^{(mn)} = 1 \mid \theta\bigr) + \bigl(1 - O_{ij}^{(mn)}\bigr) \log \Pr\bigl(O_{ij}^{(mn)} = 0 \mid \theta\bigr) \Bigr) = \sum_{i,j} \Biggl( \frac{O_{ij}^{(mn)}}{\mu_{mn}^{(ij)}} - \frac{1 - O_{ij}^{(mn)}}{1 - \mu_{mn}^{(ij)}} \Biggr) \delta_{mn}^{(ij)} \]
Then
\[ \frac{\partial}{\partial \theta_{mn}} \sum_{i,j} \Biggl( \frac{O_{ij}^{(mn)}}{\mu_{mn}^{(ij)}} - \frac{1 - O_{ij}^{(mn)}}{1 - \mu_{mn}^{(ij)}} \Biggr) \delta_{mn}^{(ij)} = \sum_{i,j} \delta_{mn}^{(ij)} \frac{\partial}{\partial \theta_{mn}} \Biggl( \frac{O_{ij}^{(mn)}}{\mu_{mn}^{(ij)}} - \frac{1 - O_{ij}^{(mn)}}{1 - \mu_{mn}^{(ij)}} \Biggr) \]
Note that,
\[ \frac{\partial}{\partial \theta_{mn}} \delta_{mn}^{(ij)} = 0 \]
Besides,
\[ \frac{\partial}{\partial \theta_{mn}} \mu_{mn}^{(ij)} = \delta_{mn}^{(ij)} \]
\[ \frac{\partial}{\partial \theta_{mn}} \Biggl( \frac{O_{ij}^{(mn)}}{\mu_{mn}^{(ij)}} - \frac{1 - O_{ij}^{(mn)}}{1 - \mu_{mn}^{(ij)}} \Biggr) = -\frac{O_{ij}^{(mn)}}{\bigl(\mu_{mn}^{(ij)}\bigr)^2}\, \delta_{mn}^{(ij)} - \frac{1 - O_{ij}^{(mn)}}{\bigl(1 - \mu_{mn}^{(ij)}\bigr)^2}\, \delta_{mn}^{(ij)} \]
Finally, we have
\[ I(\theta_{mn}) = \sum_{(i,j)^*} \bigl(\delta_{mn}^{(ij)}\bigr)^2 \Biggl( \frac{O_{ij}^{(mn)}}{\bigl(\mu_{mn}^{(ij)}\bigr)^2} + \frac{1 - O_{ij}^{(mn)}}{\bigl(1 - \mu_{mn}^{(ij)}\bigr)^2} \Biggr) \]
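The closed form for the observed information translates directly into code. The sketch below assumes that the sum runs over the drug-protein pairs containing both substructure m and domain n, that `theta` has an entry for every substructure-domain pair in those pairs, and that fn and fp keep μ strictly between 0 and 1. The data layout and function name are illustrative.

```python
from math import prod

def observed_information(theta, pairs, key, fn, fp):
    """I(theta_mn) for key = (m, n); pairs: (substructures, domains, o_ij)."""
    m, n = key
    info = 0.0
    for subs, doms, o in pairs:
        if m not in subs or n not in doms:
            continue  # pair (i, j) does not contain (m, n)
        # Product of (1 - theta_st) over all st in this pair except mn.
        others = prod(
            1 - theta[(s, t)] for s in subs for t in doms if (s, t) != key
        )
        delta = (1 - fn - fp) * others          # d mu / d theta_mn
        p = 1 - (1 - theta[key]) * others       # Pr(YP_ij = 1 | theta)
        mu = (1 - fn) * p + fp * (1 - p)        # Pr(O_ij = 1 | theta)
        info += delta ** 2 * (o / mu ** 2 + (1 - o) / (1 - mu) ** 2)
    return info

# Hypothetical single pair with one substructure and one domain.
info = observed_information(
    {("Z1", "D1"): 0.5}, [(["Z1"], ["D1"], 1)], ("Z1", "D1"), fn=0.4, fp=0.1
)
variance = 1.0 / info  # var(theta_hat) is approximated by 1 / I(theta_hat)
```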
Combinations of drug chemical substructures
Different drug chemical substructures may function as one unit in drug-protein interactions. We estimate the combined behavior of two drug chemical substructures in drug-protein interactions by adding pairs of drug chemical substructures as new "drug chemical substructures" and handling them with our probabilistic model.
Instead of considering all pairs of drug chemical substructures, we first use a filter method to select those pairs that appear significantly often in the drug-protein interactions. The filter has two steps. First, we use the hypergeometric distribution to test whether the number of co-occurrences of two drug chemical substructures is significant. Second, we remove the pairs of drug chemical substructures that also co-occur significantly in randomly selected compounds.
We follow this procedure because we are only interested in combinations of drug chemical substructures that can interact with proteins but do not often co-exist in the compound chemical space. The randomly selected compounds are from the ChEMBL database; in total, over 9,000 compounds represent the compound chemical space. We use the Bonferroni adjustment for multiple-testing correction. Note that, owing to the PubChem substructure definitions, we only consider the pairs with SMARTS records. Finally, we select 1870 pairs of drug chemical substructures for learning.
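The first filter step can be sketched with an upper-tail hypergeometric test and a Bonferroni threshold. The exact parameterization used by the authors is not stated, so the one below is an assumption: N compounds in total, K containing the first substructure, n containing the second, and k containing both. The function names and the `alpha` default are also illustrative.

```python
from math import comb

def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(population N, K marked, n drawn)."""
    upper = min(K, n)
    favorable = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return favorable / comb(N, n)

def significant_pairs(co_counts, counts, N, alpha=0.05):
    """Keep pairs whose p-value passes the Bonferroni-adjusted threshold.

    co_counts: {(a, b): compounds containing both a and b}
    counts:    {a: compounds containing a}
    """
    threshold = alpha / len(co_counts)  # Bonferroni: alpha / number of tests
    return [
        pair for pair, k in co_counts.items()
        if hypergeom_upper_tail(k, N, counts[pair[0]], counts[pair[1]]) <= threshold
    ]

# Small hand-checkable case: 4 compounds, 2 with each substructure,
# both co-occurring twice: C(2,2) * C(2,0) / C(4,2) = 1/6.
p_value = hypergeom_upper_tail(2, 4, 2, 2)
```

The second filter step could reuse the same test on the randomly selected compounds and drop any pair that is significant there as well.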
Estimation of fn and fp.
In our model, two parameters, fn and fp, must be predefined. According to our model,
\[ \mathrm{fn} = \Pr(O_{ij} = 0 \mid YP_{ij} = 1) = 1 - \frac{\Pr(O_{ij} = 1, YP_{ij} = 1)}{\Pr(YP_{ij} = 1)} \geq 1 - \frac{\Pr(O_{ij} = 1)}{\Pr(YP_{ij} = 1)} \approx 1 - \frac{\text{number of observed interactions among real interactions}}{\text{number of the real drug and protein interactions}} \]
It has been shown that the average number of target proteins per drug is about 6.3 [8]. Then we get
\[ \frac{4809}{1863 \times 6.3} \approx 0.41, \qquad \text{so} \quad \mathrm{fn} \geq 1 - 0.41 = 0.59 \]
We can estimate fp, which equals $\Pr(O_{ij} = 1 \mid YP_{ij} = 0)$, in a similar way.
\[ \mathrm{fp} = \Pr(O_{ij} = 1 \mid YP_{ij} = 0) = \frac{\Pr(O_{ij} = 1, YP_{ij} = 0)}{\Pr(YP_{ij} = 0)} \leq \frac{\Pr(O_{ij} = 1)}{\Pr(YP_{ij} = 0)} \approx \frac{\text{number of interaction pairs}}{\text{number of total pairs} - \text{number of interaction pairs}} = \frac{4809}{1863 \times 1554 - 4809} \leq 1.67 \times 10^{-3} \]
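The two bounds can be checked numerically with the counts quoted in this section (4809 interactions, 1863 drugs, 1554 proteins, and about 6.3 targets per drug). The variable names are illustrative.

```python
n_interactions = 4809
n_drugs = 1863
n_proteins = 1554
targets_per_drug = 6.3  # average number of targets per drug, from [8]

# Fraction of the estimated real interactions that are observed.
observed_fraction = n_interactions / (n_drugs * targets_per_drug)
fn_lower_bound = 1 - observed_fraction

# Observed interactions relative to the remaining drug-protein pairs.
fp_upper_bound = n_interactions / (n_drugs * n_proteins - n_interactions)

print(round(observed_fraction, 2))  # 0.41
print(round(fn_lower_bound, 2))     # 0.59
print(fp_upper_bound)               # about 1.66e-3
```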
To analyze the robustness of our model to these parameters, we used five-fold cross-validation to measure the recovery of drug-protein interactions under different combinations of fn and fp. The performance in recovering drug-protein interactions remained stable across the different combinations of fn and fp.
The procedure is as follows: (i) Split the original drug-target interactions equally into five folds. (ii) Each time, select one fold as the test set and use the others as the training set for our model. (iii) After learning, evaluate the test set by the area under the receiver operating characteristic curve (AUC). The curve is generated by plotting the false positive rate on the x-axis against the true positive rate on the y-axis. Note that the negative samples are randomly selected from the drug-protein pairs with no known interaction, since we do not have real negative samples.
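The evaluation loop can be sketched as below. The fold construction and the rank-based AUC are standard techniques chosen for this illustration; the original scripts are not shown in the source, so the function names and the seeding are assumptions. The AUC is computed as the probability that a randomly chosen positive scores higher than a randomly chosen negative, which equals the area under the ROC curve.

```python
import random

def five_fold_splits(items, seed=0):
    """Yield (train, test) lists for five-fold cross-validation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::5] for i in range(5)]
    for i in range(5):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

def auc(pos_scores, neg_scores):
    """Area under the ROC curve; ties between scores count as one half."""
    wins = sum(
        (p > q) + 0.5 * (p == q) for p in pos_scores for q in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical scores: three of the four positive-negative comparisons
# rank the positive higher.
example_auc = auc([0.9, 0.8], [0.1, 0.85])
splits = list(five_fold_splits(range(10)))
```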
Results of the predicted drug-domain interactions
Protein Uniprot ID  Compound PubChem ID  Protein Domain Pfam ID  k value  Prediction by GIFT
ITAL_HUMAN          53232                PF00092                 1.00     FALSE
LKHA4_HUMAN         445154               PF01433                 1.00     FALSE
LKHA4_HUMAN         90334                PF01433                 1.00     FALSE
SRC_HUMAN           311                  PF00017                 1.00     FALSE
SRC_HUMAN           867                  PF00017                 1.00     FALSE
SRC_HUMAN           971                  PF00017                 1.00     FALSE
CATS_HUMAN          5287799              PF00112                 1.00     FALSE
LDHA_HUMAN          974                  PF02866                 0.70     FALSE
LKHA4_HUMAN         72172                PF01433                 0.90     FALSE
PDE5A_HUMAN         110634               PF00233                 1.00     FALSE
SRC_HUMAN           5287544              PF00017                 0.81     FALSE
ANDR_HUMAN          261000               PF00104                 0.91     TRUE
ANDR_HUMAN          3371                 PF00104                 0.90     TRUE
ANDR_HUMAN          5803                 PF00104                 0.80     TRUE
ANDR_HUMAN          5920                 PF00104                 0.83     TRUE
DNMT1_HUMAN         439155               PF00145                 1.00     TRUE
ESR1_HUMAN          5280961              PF00104                 1.00     TRUE
MMP3_HUMAN          1990                 PF00413                 1.00     TRUE
MMP8_HUMAN          1990                 PF00413                 1.00     TRUE
NOS3_HUMAN          2733                 PF02898                 1.00     TRUE
PDE10_HUMAN         4680                 PF00233                 1.00     TRUE
RXRA_HUMAN          82146                PF00104                 1.00     TRUE
THRB_HUMAN          2332                 PF00089                 1.00     TRUE
TPA_HUMAN           2332                 PF00089                 1.00     TRUE
ADH1A_HUMAN         5287890              PF08240                 0.53     TRUE
ADH1B_HUMAN         347402               PF08240                 0.57     TRUE
ADH1B_HUMAN         80654                PF08240                 0.58     TRUE
ANDR_HUMAN          10635                PF00104                 0.91     TRUE
ANDR_HUMAN          56069                PF00104                 0.84     TRUE
ANDR_HUMAN          6013                 PF00104                 0.91     TRUE
DHSO_HUMAN          132302               PF08240                 0.53     TRUE
ESR1_HUMAN          448577               PF00104                 1.00     TRUE
ESR1_HUMAN          449205               PF00104                 1.00     TRUE
ESR1_HUMAN          449207               PF00104                 1.00     TRUE
ESR1_HUMAN          449209               PF00104                 1.00     TRUE
ESR1_HUMAN          5035                 PF00104                 1.00     TRUE
ESR1_HUMAN          5757                 PF00104                 1.00     TRUE
ESR1_HUMAN          5870                 PF00104                 1.00     TRUE
ESR2_HUMAN          5280961              PF00104                 1.00     TRUE
ESR2_HUMAN          5757                 PF00104                 1.00     TRUE
GCR_HUMAN           55245                PF00104                 0.90     TRUE
GCR_HUMAN           5743                 PF00104                 1.00     TRUE
NOS3_HUMAN          1893                 PF02898                 1.00     TRUE
OTC_HUMAN           124992               PF00185                 0.50     TRUE
PDE5A_HUMAN         110635               PF00233                 1.00     TRUE
PDE5A_HUMAN         5212                 PF00233                 1.00     TRUE
PRGR_HUMAN          261000               PF00104                 1.00     TRUE
PRGR_HUMAN          4369524              PF00104                 1.00     TRUE
PRGR_HUMAN          5994                 PF00104                 1.00     TRUE
PRGR_HUMAN          6230                 PF00104                 1.00     TRUE
ROCK1_HUMAN         3064778              PF00069                 0.94     TRUE
RXRA_HUMAN          444795               PF00104                 1.00     TRUE
THRB_HUMAN          5326608              PF00089                 1.00     TRUE
Table 1. Results of the prediction of drug-domain interactions. The k value is the proportion of binding-site residues lying within the protein domain, out of the total number of binding-site residues. When the k value was larger than 0.5, we treated the drug as interacting with the domain. TRUE means that GIFT predicted the drug to interact with the domain.
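The k value defined in the caption can be computed as in the sketch below. Representing the binding site as a list of residue positions and the domain as an inclusive position range is an assumption made for illustration, as are the function name and the example numbers.

```python
def k_value(binding_residues, domain_start, domain_end):
    """Fraction of binding-site residues that fall inside the domain."""
    inside = sum(domain_start <= r <= domain_end for r in binding_residues)
    return inside / len(binding_residues)

# Hypothetical binding site with four residues, three inside the domain.
k = k_value([10, 20, 30, 200], domain_start=1, domain_end=100)
interacts_with_domain = k > 0.5  # threshold stated in the caption
```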
Reference
1. Tabei, Y., Pauwels, E., Stoven, V., Takemoto, K., & Yamanishi, Y. (2012). Identification of
chemogenomic features from drug–target interaction networks using interpretable
classifiers. Bioinformatics, 28(18), i487-i494.
2. Bolton, E., Wang, Y., Thiessen, P. A., & Bryant, S. H. (2008). PubChem: Integrated platform of small molecules and biological activities. In Annual Reports in Computational Chemistry, Volume 4 (Chapter 12, pp. 217-240). Oxford, UK: Elsevier.
3. Punta, M., Coggill, P. C., Eberhardt, R. Y., Mistry, J., Tate, J., Boursnell, C., Pang, N., Forslund, K., Ceric, G., Clements, J., Heger, A., Holm, L., Sonnhammer, E. L. L., Eddy, S. R., Bateman, A., & Finn, R. D. (2012). The Pfam protein families database. Nucleic Acids Research, 40(Database issue), D290-D301.
4. Finn, R. D., Miller, B. L., Clements, J., & Bateman, A. (2014). iPfam: a database of protein
family and domain interactions found in the Protein Data Bank. Nucleic acids research, 42(D1),
D364-D373.
5. Kruger, F. A., Rostom, R., & Overington, J. P. (2012). Mapping small molecule binding data to
structural domains. BMC bioinformatics, 13(Suppl 17), S11.
6. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society, 39(1), 1-38.
7. Deng, M., Mehta, S., Sun, F., & Chen, T. (2002). Inferring domain–domain interactions from
protein–protein interactions. Genome research, 12(10), 1540-1548.
8. Mestres, J., Gregori-Puigjane, E., Valverde, S., & Sole, R. V. (2008). Data completeness—the
Achilles heel of drug-target networks. Nature biotechnology, 26(9), 983-984.
9. Efron, B., & Hinkley, D. V. (1978). Assessing the accuracy of the maximum likelihood
estimator: Observed versus expected Fisher information. Biometrika, 65(3), 457-483.