Global optimization-based inference of chemogenomic features
Zu, S. et al.

Supplementary Materials

Songpeng Zu1,a, Ting Chen2, Shao Li1,*
1 Bioinformatics Division, TNLIST and Department of Automation, Tsinghua University, Beijing 10084, China
2 Molecular and Computational Biology Program, Department of Biological Science, University of Southern California, Los Angeles, California 90089, USA
a Contact: zsp07@mails.tsinghua.edu.cn

Table of Contents
Supplementary Materials
    Data Sources
    Methods
        The EM framework
        Derivation of EM algorithm in our model
        The Association Method
        Variance Estimation of the EM results
        Combinations of drug chemical substructures
    Estimation of fn and fp
    Results of the predicted drug-domain interactions
    Reference
Data Sources

The drug-protein interactions, drug substructures, and protein domains were obtained from Tabei et al. (2012) [1]. A total of 1862 drugs are represented by 881-dimensional binary chemical substructure vectors from the PubChem database [2], and 1554 proteins are represented by 876-dimensional binary protein domain vectors from the Pfam database [3]. There are 4809 interactions between the drugs and the proteins. We deleted the chemical substructures and protein domains that never appeared in any drug or protein, and we merged substructures (or domains) that appeared in exactly the same drugs (or proteins).

The drug-domain interactions were extracted from the PDB database with the script from Kruger et al. (2012) [5]. We kept only proteins that have multiple domains. In the end, 53 drug-protein interaction pairs with recorded drug-domain interactions were used.

Methods

The EM framework

We propose a probabilistic model to infer substructure-domain interactions. It is inspired by the work in [7] and rests on the following assumptions:

1. The interactions between drug chemical substructures and protein domains within a drug-target pair are independent.
2. A drug and a protein interact if and only if at least one pair of a drug chemical substructure and a protein domain interacts.

Let $Y_i$ denote drug $i$, $P_j$ protein $j$, $Z_m$ chemical substructure $m$, and $D_n$ domain $n$. Let $ZD_{mn}^{(ij)}$ indicate whether chemical substructure $m$ from drug $i$ and protein domain $n$ from protein $j$ interact (1 if they do, 0 otherwise). We use $\theta_{mn}$ to denote the probability that chemical substructure $m$ and protein domain $n$ interact, that is,

\[ \theta_{mn} = \Pr\big(ZD_{mn}^{(ij)} = 1\big). \]

Our aim is to estimate the interaction probabilities $\theta = \{\theta_{mn}\}$ of the chemical substructures and the protein domains.
Under Assumptions 1 and 2, we obtain

\[ \Pr(YP_{ij} = 1 \mid \theta) = 1 - \prod_{ZD_{mn}^{(ij)}} (1 - \theta_{mn}), \]

where $YP_{ij}$ indicates whether drug $Y_i$ and protein $P_j$ interact (1 if they interact, 0 otherwise), and the product runs over all substructure-domain pairs $(m, n)$ present in the drug-protein pair $(i, j)$.

Many drug-protein interactions remain unknown, so $YP$ cannot be equated with the observed drug-protein interaction data. We therefore use $O = \{O_{ij}\}$ to represent the given drug-protein interactions, where $O_{ij}$ is 1 if drug $i$ and protein $j$ are recorded as interacting and 0 otherwise. To connect $YP$ with $O$, we introduce two parameters, the false positive rate fp and the false negative rate fn:

\[ \mathrm{fp} = \Pr(O_{ij} = 1 \mid YP_{ij} = 0), \qquad \mathrm{fn} = \Pr(O_{ij} = 0 \mid YP_{ij} = 1). \]

Then

\[
\begin{aligned}
\Pr(O_{ij} = 1 \mid \theta) &= \sum_{t = 0, 1} \Pr(O_{ij} = 1 \mid YP_{ij} = t) \, \Pr(YP_{ij} = t \mid \theta) \\
&= (1 - \mathrm{fn}) \Pr(YP_{ij} = 1 \mid \theta) + \mathrm{fp} \, \big(1 - \Pr(YP_{ij} = 1 \mid \theta)\big).
\end{aligned}
\]

The log-likelihood function, i.e., the log of the total probability of the observed drug-protein interaction data, is

\[ l(\theta) = \log \Pr(O \mid \theta) = \log \prod_{i,j} \Pr(O_{ij} = 1 \mid \theta)^{O_{ij}} \, \Pr(O_{ij} = 0 \mid \theta)^{1 - O_{ij}}, \]

in which $\theta = (\{\theta_{mn}\}, \mathrm{fn}, \mathrm{fp})$, where fn and fp are predefined.

Our aim is to estimate $\theta$ by maximum likelihood estimation (MLE). However, because we do not know whether $ZD_{mn}^{(ij)}$ is 1 or 0 (that is, whether chemical substructure $m$ from drug $i$ interacts with protein domain $n$ from protein $j$), this is a missing data problem. It is natural to solve it with the EM algorithm [6].
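The observation model and the log-likelihood above can be written out directly. The following is a minimal sketch in plain Python, not the authors' implementation: the function names, the dictionary-based data layout, the toy data, and the values fn = 0.4 and fp = 0.001 are all assumptions made for illustration.

```python
import math

def pr_true_interaction(substructures, domains, theta):
    """Pr(YP_ij = 1 | theta): at least one substructure-domain pair interacts."""
    prod = 1.0
    for m in substructures:
        for n in domains:
            prod *= 1.0 - theta[(m, n)]
    return 1.0 - prod

def pr_observed(p_true, fn, fp):
    """Pr(O_ij = 1 | theta), mixing in the false negative and false positive rates."""
    return (1.0 - fn) * p_true + fp * (1.0 - p_true)

def log_likelihood(drugs, proteins, observed, theta, fn, fp):
    """Sum of log Pr(O_ij | theta) over all drug-protein pairs (needs 0 < fn, fp < 1)."""
    total = 0.0
    for i, subs in drugs.items():
        for j, doms in proteins.items():
            p1 = pr_observed(pr_true_interaction(subs, doms, theta), fn, fp)
            total += math.log(p1) if (i, j) in observed else math.log(1.0 - p1)
    return total

# Hypothetical toy data: two drugs, two proteins, one observed interaction.
drugs = {"Y1": ["Z1", "Z2"], "Y2": ["Z2"]}
proteins = {"P1": ["D1"], "P2": ["D2"]}
observed = {("Y1", "P1")}
theta = {(m, n): 0.2 for m in ("Z1", "Z2") for n in ("D1", "D2")}

print(log_likelihood(drugs, proteins, observed, theta, fn=0.4, fp=0.001))
```

Pairs absent from `observed` are treated as O_ij = 0, which matches the model's view that an unobserved pair may still be a true interaction missed with probability fn.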
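The EM iteration is as follows.

• The E step is:

\[
\begin{aligned}
E\big(ZD_{mn}^{(ij)} \mid O, \theta^{(t-1)}\big) &= E\big(ZD_{mn}^{(ij)} \mid O_{ij}, \theta^{(t-1)}\big) \\
&= \Pr\big(ZD_{mn}^{(ij)} = 1 \mid O_{ij}, \theta^{(t-1)}\big) \\
&= \frac{\Pr\big(ZD_{mn}^{(ij)} = 1, O_{ij} \mid \theta^{(t-1)}\big)}{\Pr\big(O_{ij} \mid \theta^{(t-1)}\big)} \\
&= \frac{\theta_{mn}^{(t-1)} \, (1 - \mathrm{fn})^{O_{ij}} \, \mathrm{fn}^{1 - O_{ij}}}{\Pr\big(O_{ij} \mid \theta^{(t-1)}\big)}
\end{aligned}
\]

• The M step is:

\[ \theta_{mn}^{(t)} = \frac{1}{N_{mn}} \sum_{i,j} E\big(ZD_{mn}^{(ij)} \mid O, \theta^{(t-1)}\big) \]

Note that $N_{mn}$ is the total number of drug-protein pairs that contain chemical substructure $m$ and protein domain $n$.

Derivation of EM algorithm in our model

Here we show how the EM algorithm is derived for our model. In general, the EM algorithm involves two steps:

• E step: $Q(\theta, \theta^{(t)}) = E_{Y \mid X, \theta^{(t)}}\big[\log \Pr(X, Y \mid \theta)\big]$
• M step: $\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^{(t)})$

Here $X$ is the observed, incomplete data, $Y$ is the latent data, and $(X, Y)$ is the complete data. In our model,

\[
\begin{aligned}
Q(\theta, \theta^{(t)}) &= E_{ZD \mid O, \theta^{(t)}}\big[\log \Pr(ZD, O \mid \theta)\big] \\
&= E_{ZD \mid O, \theta^{(t)}}\Big[\sum_{i,j} \log\Big(\Pr\big(O_{ij} \mid ZD^{(ij)}\big) \Pr\big(ZD^{(ij)} \mid \theta\big)\Big)\Big] \\
&= \sum_{i,j} E_{ZD \mid O, \theta^{(t)}}\Big[\log \Pr\big(O_{ij} \mid ZD^{(ij)}\big)\Big] + \sum_{i,j} E_{ZD \mid O, \theta^{(t)}}\Big[\log \Pr\big(ZD^{(ij)} \mid \theta\big)\Big].
\end{aligned}
\]

Note that the first summation does not depend on $\theta_{mn}$, while the last summation can be rewritten as follows:

\[
\begin{aligned}
\text{last sum} &= \sum_{i,j} E_{ZD \mid O, \theta^{(t)}}\Big[\log \prod_{m,n} \Pr\big(ZD_{mn}^{(ij)} \mid \theta\big)\Big] \\
&= \sum_{i,j} \sum_{m,n} E_{ZD \mid O, \theta^{(t)}}\Big[\log\Big(\theta_{mn}^{ZD_{mn}^{(ij)}} (1 - \theta_{mn})^{1 - ZD_{mn}^{(ij)}}\Big)\Big].
\end{aligned}
\]

Then,

\[
\begin{aligned}
\frac{\partial Q(\theta, \theta^{(t)})}{\partial \theta_{mn}} &= \sum_{i,j} E_{ZD \mid O, \theta^{(t)}}\left[\frac{\partial \log\Big(\theta_{mn}^{ZD_{mn}^{(ij)}} (1 - \theta_{mn})^{1 - ZD_{mn}^{(ij)}}\Big)}{\partial \theta_{mn}}\right] \\
&= \sum_{i,j} E_{ZD \mid O, \theta^{(t)}}\left[\frac{ZD_{mn}^{(ij)}}{\theta_{mn}} - \frac{1 - ZD_{mn}^{(ij)}}{1 - \theta_{mn}}\right].
\end{aligned}
\]

Setting this derivative to zero and solving for $\theta_{mn}$ gives the M step stated above, which completes the EM procedure.

The Association Method

One problem with the EM algorithm is that it converges to a local optimum of the likelihood, and different initial values usually lead to different local optima.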
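To make the E and M steps and their dependence on the starting point concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions rather than the authors' implementation: the function name `em_fit`, the dictionary-based data layout, the toy data, the fixed iteration count, and the values fn = 0.4 and fp = 0.001 are all hypothetical. The initial values are an explicit argument because the fixed point reached depends on them.

```python
def em_fit(drugs, proteins, observed, theta0, fn, fp, n_iter=50):
    """Iterate the E and M steps for theta_mn, starting from theta0 (0 < fn, fp < 1)."""
    theta = dict(theta0)
    for _ in range(n_iter):
        expected = {key: 0.0 for key in theta}  # running sum of E[ZD_mn^(ij) | O, theta]
        n_pairs = {key: 0 for key in theta}     # N_mn: drug-protein pairs containing (m, n)
        for i, subs in drugs.items():
            for j, doms in proteins.items():
                keys = [(m, n) for m in subs for n in doms]
                prod = 1.0
                for key in keys:
                    prod *= 1.0 - theta[key]
                p_true = 1.0 - prod                                  # Pr(YP_ij = 1 | theta)
                p_obs1 = (1.0 - fn) * p_true + fp * (1.0 - p_true)   # Pr(O_ij = 1 | theta)
                # E step: theta_mn * (1 - fn)^O * fn^(1 - O) / Pr(O_ij | theta)
                if (i, j) in observed:
                    weight = (1.0 - fn) / p_obs1
                else:
                    weight = fn / (1.0 - p_obs1)
                for key in keys:
                    expected[key] += theta[key] * weight
                    n_pairs[key] += 1
        # M step: average the expectations over the N_mn pairs containing (m, n).
        for key in theta:
            if n_pairs[key] > 0:
                # min() only guards against floating-point overshoot above 1.
                theta[key] = min(1.0, expected[key] / n_pairs[key])
    return theta

# Hypothetical toy data: two drugs, two proteins, one observed interaction.
drugs = {"Y1": ["Z1", "Z2"], "Y2": ["Z2"]}
proteins = {"P1": ["D1"], "P2": ["D2"]}
observed = {("Y1", "P1")}
theta0 = {(m, n): 0.2 for m in ("Z1", "Z2") for n in ("D1", "D2")}

theta_hat = em_fit(drugs, proteins, observed, theta0, fn=0.4, fp=0.001)
for key in sorted(theta_hat):
    print(key, round(theta_hat[key], 4))
```

In this toy case the pair (Z1, D1) occurs only in the observed interaction, while (Z2, D1) also occurs in the unobserved pair (Y2, P1), so the estimate for (Z1, D1) ends up higher than the one for (Z2, D1).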
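Instead of choosing the initial values at random many times, we use the association method to set them. It is a local way to estimate the probability that a drug substructure and a protein domain interact:

\[ \theta_{mn} = \frac{I_{mn}}{N_{mn}}, \]

in which $I_{mn}$ is the number of interacting drug-protein pairs that contain both chemical substructure $Z_m$ and protein domain $D_n$, and $N_{mn}$ is the total number of drug-protein pairs that contain both.

This method has two limitations.

• First, it computes the chemical substructure-protein domain interactions locally, which means it ignores the other substructure-domain interactions in the same drug-protein pairs. For example, suppose drug $Y_i$, containing substructures $\{Z_m, Z_y\}$, interacts with both protein $P_j$, containing domains $\{D_n, D_y\}$, and protein $P_k$, containing domains $\{D_n, D_c\}$, and that substructure $Z_m$ and protein domain $D_n$ do not appear in any other drugs and proteins, respectively. Then $\theta_{mn} = 2/2 = 1$. This ignores other possible interactions between substructures and protein domains, such as substructure $Z_y$ interacting with protein domain $D_c$. Therefore, to infer drug substructure and protein domain interactions, we should consider all the drug-protein interactions and all the interactions between drug substructures and protein domains together.
• Second, the method relies on the accuracy of the observed data, yet current drug-protein data are largely incomplete.

Variance Estimation of the EM results

The natural way to estimate the variance of the maximum likelihood estimate is [9]

\[ \mathrm{var}(\hat{\theta}) \approx \frac{1}{I(\hat{\theta})}, \]

where $I(\theta)$ is the observed information,

\[ I(\theta) = -\frac{\partial^2}{\partial \theta^2} \log f(x \mid \theta). \]

In our setting, the observed information is derived as follows. Write $O_{mn}^{(ij)}$ and $YP_{mn}^{(ij)}$ for the observed and the true interaction status of a drug-protein pair $(i, j)$ that contains substructure $m$ and domain $n$. Then

\[
\begin{aligned}
I(\theta_{mn}) &= -\frac{\partial^2}{\partial \theta_{mn}^2} \log f(O \mid \theta) \\
&= -\frac{\partial^2}{\partial \theta_{mn}^2} \sum_{i,j} \Big( O_{mn}^{(ij)} \log \Pr\big(O_{mn}^{(ij)} = 1 \mid \theta\big) + \big(1 - O_{mn}^{(ij)}\big) \log \Pr\big(O_{mn}^{(ij)} = 0 \mid \theta\big) \Big).
\end{aligned}
\]

Since

\[
\begin{aligned}
&\frac{\partial}{\partial \theta_{mn}} \sum_{i,j} \Big( O_{mn}^{(ij)} \log \Pr\big(O_{mn}^{(ij)} = 1 \mid \theta\big) + \big(1 - O_{mn}^{(ij)}\big) \log \Pr\big(O_{mn}^{(ij)} = 0 \mid \theta\big) \Big) \\
&\quad = \sum_{i,j} \left( \frac{O_{mn}^{(ij)}}{\Pr\big(O_{mn}^{(ij)} = 1 \mid \theta\big)} \frac{\partial}{\partial \theta_{mn}} \Pr\big(O_{mn}^{(ij)} = 1 \mid \theta\big) + \frac{1 - O_{mn}^{(ij)}}{\Pr\big(O_{mn}^{(ij)} = 0 \mid \theta\big)} \frac{\partial}{\partial \theta_{mn}} \Pr\big(O_{mn}^{(ij)} = 0 \mid \theta\big) \right)
\end{aligned}
\]

and

\[ \frac{\partial}{\partial \theta_{mn}} \Pr\big(O_{mn}^{(ij)} = 1 \mid \theta\big) = \frac{\partial}{\partial \theta_{mn}} \Big( (1 - \mathrm{fn}) \Pr\big(YP_{mn}^{(ij)} = 1 \mid \theta\big) + \mathrm{fp} \big(1 - \Pr\big(YP_{mn}^{(ij)} = 1 \mid \theta\big)\big) \Big), \]

where

\[
\begin{aligned}
\frac{\partial}{\partial \theta_{mn}} \Pr\big(YP_{mn}^{(ij)} = 1 \mid \theta\big) &= \frac{\partial}{\partial \theta_{mn}} \Big( 1 - \prod_{ZD_{st}^{(ij)} \text{ with } (mn)} (1 - \theta_{st}) \Big) \\
&= \prod_{ZD_{st}^{(ij)} \text{ without } (mn)} (1 - \theta_{st}) \\
&= \prod_{ZD_{st}^{(ij)*} \neq mn} (1 - \theta_{st}),
\end{aligned}
\]

with the last product running over all substructure-domain pairs $(s, t)$ in the drug-protein pair $(i, j)$ other than $(m, n)$. Then define

\[ L_{mn}^{(ij)} = \frac{\partial}{\partial \theta_{mn}} \Pr\big(O_{mn}^{(ij)} = 1 \mid \theta\big) = (1 - \mathrm{fn} - \mathrm{fp}) \prod_{ZD_{st}^{(ij)*} \neq mn} (1 - \theta_{st}). \]

Also, let

\[ p_{mn}^{(ij)} = \Pr\big(O_{mn}^{(ij)} = 1 \mid \theta\big). \]

Because $\Pr(O_{mn}^{(ij)} = 0 \mid \theta) = 1 - p_{mn}^{(ij)}$, we get

\[ \frac{\partial}{\partial \theta_{mn}} \sum_{i,j} \Big( O_{mn}^{(ij)} \log \Pr\big(O_{mn}^{(ij)} = 1 \mid \theta\big) + \big(1 - O_{mn}^{(ij)}\big) \log \Pr\big(O_{mn}^{(ij)} = 0 \mid \theta\big) \Big) = \sum_{i,j} \left( \frac{O_{mn}^{(ij)}}{p_{mn}^{(ij)}} - \frac{1 - O_{mn}^{(ij)}}{1 - p_{mn}^{(ij)}} \right) L_{mn}^{(ij)}. \]

Then

\[ \frac{\partial}{\partial \theta_{mn}} \sum_{i,j} \left( \frac{O_{mn}^{(ij)}}{p_{mn}^{(ij)}} - \frac{1 - O_{mn}^{(ij)}}{1 - p_{mn}^{(ij)}} \right) L_{mn}^{(ij)} = \sum_{i,j} L_{mn}^{(ij)} \frac{\partial}{\partial \theta_{mn}} \left( \frac{O_{mn}^{(ij)}}{p_{mn}^{(ij)}} - \frac{1 - O_{mn}^{(ij)}}{1 - p_{mn}^{(ij)}} \right). \]

Note that $L_{mn}^{(ij)}$ does not contain $\theta_{mn}$, so

\[ \frac{\partial}{\partial \theta_{mn}} L_{mn}^{(ij)} = 0. \]

Besides,

\[ \frac{\partial}{\partial \theta_{mn}} p_{mn}^{(ij)} = L_{mn}^{(ij)}, \]

\[ \frac{\partial}{\partial \theta_{mn}} \left( \frac{O_{mn}^{(ij)}}{p_{mn}^{(ij)}} - \frac{1 - O_{mn}^{(ij)}}{1 - p_{mn}^{(ij)}} \right) = -\frac{O_{mn}^{(ij)}}{\big(p_{mn}^{(ij)}\big)^2} L_{mn}^{(ij)} - \frac{1 - O_{mn}^{(ij)}}{\big(1 - p_{mn}^{(ij)}\big)^2} L_{mn}^{(ij)}. \]

Finally, we have

\[ I(\theta_{mn}) = \sum_{(i,j)^*} \big(L_{mn}^{(ij)}\big)^2 \left( \frac{O_{mn}^{(ij)}}{\big(p_{mn}^{(ij)}\big)^2} + \frac{1 - O_{mn}^{(ij)}}{\big(1 - p_{mn}^{(ij)}\big)^2} \right), \]

where the sum runs over the drug-protein pairs $(i, j)^*$ that contain substructure $m$ and domain $n$.

Combinations of drug chemical substructures

Different drug chemical substructures may act together as a single unit in drug-protein interactions.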
To estimate the combined effect of two drug chemical substructures on drug-protein interactions, we add pairs of drug chemical substructures as new "drug chemical substructures" and handle them with the same probabilistic model. Instead of considering all pairs of drug chemical substructures, we first apply a filter to select the pairs that appear significantly often in the drug-protein interactions. The filter has two steps. First, we use the hypergeometric distribution to test whether the number of co-occurrences of two drug chemical substructures is significant. Second, we remove the pairs that also co-occur significantly in randomly selected compounds. We follow this procedure because we are only interested in combinations of drug chemical substructures that can interact with proteins but do not often co-exist in the compound chemical space. The randomly selected compounds are from the ChEMBL database; in total, over 9,000 compounds represent the compound chemical space. We use the Bonferroni adjustment for multiple testing. Note that, because of the PubChem substructure definitions, we only consider the pairs with SMARTS records. Finally, we select 1870 pairs of drug chemical substructures for learning.

Estimation of fn and fp

In our model, two parameters, fn and fp, must be predefined. According to our model,

\[
\begin{aligned}
\mathrm{fn} &= \Pr(O_{ij} = 0 \mid YP_{ij} = 1) \\
&= 1 - \frac{\Pr(O_{ij} = 1, YP_{ij} = 1)}{\Pr(YP_{ij} = 1)} \\
&\geq 1 - \frac{\Pr(O_{ij} = 1)}{\Pr(YP_{ij} = 1)} \\
&\approx 1 - \frac{\text{number of observed drug-protein interactions}}{\text{number of real drug-protein interactions}}.
\end{aligned}
\]

It has been shown that the average number of target proteins per drug is about 6.3 [8]. Since

\[ \frac{4809}{1863 \times 6.3} \approx 0.41, \]

we get

\[ \mathrm{fn} \geq 1 - \frac{4809}{1863 \times 6.3} \approx 1 - 0.41 = 0.59. \]

We can estimate fp, which equals $\Pr(O_{ij} = 1 \mid YP_{ij} = 0)$, in a similar way:

\[
\begin{aligned}
\mathrm{fp} &= \Pr(O_{ij} = 1 \mid YP_{ij} = 0) \\
&= \frac{\Pr(O_{ij} = 1, YP_{ij} = 0)}{\Pr(YP_{ij} = 0)} \\
&\leq \frac{\Pr(O_{ij} = 1)}{\Pr(YP_{ij} = 0)} \\
&\approx \frac{\text{number of interaction pairs}}{\text{number of total pairs} - \text{number of interaction pairs}} \\
&= \frac{4809}{1863 \times 1554 - 4809} \leq 1.67 \times 10^{-3}.
\end{aligned}
\]

To analyze the robustness of our model to these parameters, we used five-fold cross-validation to measure how well drug-protein interactions are recovered under different combinations of fn and fp. The performance in recovering drug-protein interactions remained stable across the combinations of fn and fp. The procedure is as follows: (i) split the original drug-target interactions equally into five folds; (ii) each time, use one fold as the test set and the others as the training set for our model; (iii) after learning, evaluate the test set by the area under the receiver operating characteristic curve (AUC). The curve is generated by plotting the false positive rate on the x-axis against the true positive rate on the y-axis. Note that the negative samples are randomly selected from the drug-protein pairs with no known interaction, since we do not have real negative samples.

Results of the predicted drug-domain interactions
Protein Uniprot ID | Compound PubChem ID | Protein Domain Pfam ID | k value | Prediction by GIFT
ITAL_HUMAN | 53232 | PF00092 | 1.00 | FALSE
LKHA4_HUMAN | 445154 | PF01433 | 1.00 | FALSE
LKHA4_HUMAN | 90334 | PF01433 | 1.00 | FALSE
SRC_HUMAN | 311 | PF00017 | 1.00 | FALSE
SRC_HUMAN | 867 | PF00017 | 1.00 | FALSE
SRC_HUMAN | 971 | PF00017 | 1.00 | FALSE
CATS_HUMAN | 5287799 | PF00112 | 1.00 | FALSE
LDHA_HUMAN | 974 | PF02866 | 0.70 | FALSE
LKHA4_HUMAN | 72172 | PF01433 | 0.90 | FALSE
PDE5A_HUMAN | 110634 | PF00233 | 1.00 | FALSE
SRC_HUMAN | 5287544 | PF00017 | 0.81 | FALSE
ANDR_HUMAN | 261000 | PF00104 | 0.91 | TRUE
ANDR_HUMAN | 3371 | PF00104 | 0.90 | TRUE
ANDR_HUMAN | 5803 | PF00104 | 0.80 | TRUE
ANDR_HUMAN | 5920 | PF00104 | 0.83 | TRUE
DNMT1_HUMAN | 439155 | PF00145 | 1.00 | TRUE
ESR1_HUMAN | 5280961 | PF00104 | 1.00 | TRUE
MMP3_HUMAN | 1990 | PF00413 | 1.00 | TRUE
MMP8_HUMAN | 1990 | PF00413 | 1.00 | TRUE
NOS3_HUMAN | 2733 | PF02898 | 1.00 | TRUE
PDE10_HUMAN | 4680 | PF00233 | 1.00 | TRUE
RXRA_HUMAN | 82146 | PF00104 | 1.00 | TRUE
THRB_HUMAN | 2332 | PF00089 | 1.00 | TRUE
TPA_HUMAN | 2332 | PF00089 | 1.00 | TRUE
ADH1A_HUMAN | 5287890 | PF08240 | 0.53 | TRUE
ADH1B_HUMAN | 347402 | PF08240 | 0.57 | TRUE
ADH1B_HUMAN | 80654 | PF08240 | 0.58 | TRUE
ANDR_HUMAN | 10635 | PF00104 | 0.91 | TRUE
ANDR_HUMAN | 56069 | PF00104 | 0.84 | TRUE
ANDR_HUMAN | 6013 | PF00104 | 0.91 | TRUE
DHSO_HUMAN | 132302 | PF08240 | 0.53 | TRUE
ESR1_HUMAN | 448577 | PF00104 | 1.00 | TRUE
ESR1_HUMAN | 449205 | PF00104 | 1.00 | TRUE
ESR1_HUMAN | 449207 | PF00104 | 1.00 | TRUE
ESR1_HUMAN | 449209 | PF00104 | 1.00 | TRUE
ESR1_HUMAN | 5035 | PF00104 | 1.00 | TRUE
ESR1_HUMAN | 5757 | PF00104 | 1.00 | TRUE
ESR1_HUMAN | 5870 | PF00104 | 1.00 | TRUE
ESR2_HUMAN | 5280961 | PF00104 | 1.00 | TRUE
ESR2_HUMAN | 5757 | PF00104 | 1.00 | TRUE
GCR_HUMAN | 55245 | PF00104 | 0.90 | TRUE
GCR_HUMAN | 5743 | PF00104 | 1.00 | TRUE
NOS3_HUMAN | 1893 | PF02898 | 1.00 | TRUE
OTC_HUMAN | 124992 | PF00185 | 0.50 | TRUE
PDE5A_HUMAN | 110635 | PF00233 | 1.00 | TRUE
PDE5A_HUMAN | 5212 | PF00233 | 1.00 | TRUE
PRGR_HUMAN | 261000 | PF00104 | 1.00 | TRUE
PRGR_HUMAN | 4369524 | PF00104 | 1.00 | TRUE
PRGR_HUMAN | 5994 | PF00104 | 1.00 | TRUE
PRGR_HUMAN | 6230 | PF00104 | 1.00 | TRUE
ROCK1_HUMAN | 3064778 | PF00069 | 0.94 | TRUE
RXRA_HUMAN | 444795 | PF00104 | 1.00 | TRUE
THRB_HUMAN | 5326608 | PF00089 | 1.00 | TRUE

Table 1.
Results of the prediction on drug-domain interactions. The k value is the proportion of binding site residues lying within the protein domain out of the total number of binding site residues. When the k value was larger than 0.5, we treated the drug as interacting with the domain. TRUE means that GIFT predicted the drug to interact with the domain.

Reference

1. Tabei, Y., Pauwels, E., Stoven, V., Takemoto, K., & Yamanishi, Y. (2012). Identification of chemogenomic features from drug–target interaction networks using interpretable classifiers. Bioinformatics, 28(18), i487-i494.
2. Bolton, E., Wang, Y., Thiessen, P. A., & Bryant, S. H. (2008). PubChem: Integrated platform of small molecules and biological activities. In Annual Reports in Computational Chemistry, Volume 4, Chapter 12, pp. 217-240. Elsevier: Oxford, UK.
3. Punta, M., Coggill, P. C., Eberhardt, R. Y., Mistry, J., Tate, J., Boursnell, C., Pang, N., Forslund, K., Ceric, G., Clements, J., Heger, A., Holm, L., Sonnhammer, E. L. L., Eddy, S. R., Bateman, A., & Finn, R. D. (2012). The Pfam protein families database. Nucleic Acids Research, 40(Database issue), D290-D301.
4. Finn, R. D., Miller, B. L., Clements, J., & Bateman, A. (2014). iPfam: a database of protein family and domain interactions found in the Protein Data Bank. Nucleic Acids Research, 42(D1), D364-D373.
5. Kruger, F. A., Rostom, R., & Overington, J. P. (2012). Mapping small molecule binding data to structural domains. BMC Bioinformatics, 13(Suppl 17), S11.
6. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1), 1-38.
7. Deng, M., Mehta, S., Sun, F., & Chen, T. (2002). Inferring domain–domain interactions from protein–protein interactions. Genome Research, 12(10), 1540-1548.
8. Mestres, J., Gregori-Puigjane, E., Valverde, S., & Sole, R. V. (2008).
Data completeness—the Achilles heel of drug-target networks. Nature Biotechnology, 26(9), 983-984.
9. Efron, B., & Hinkley, D. V. (1978). Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information. Biometrika, 65(3), 457-483.