Predicting Drug-gene and Drug-disease Networks using Functional Flow

Bioinformatics Capstone Project
School of Informatics, Indiana University, Bloomington, Indiana
Ryan Tran Rene 2009

Purpose: Given putative drug associations with genes, find other drugs that may be associated with those genes. The method will be based on the similarity of the molecular fingerprints of drugs. For each unique gene, Functional Flow will be used to determine which unannotated drugs are most likely to interact with that gene.

Methods, Algorithms, Results and Conclusions

Known drug-gene interactions: unique genes (pcid), unique drugs (pcid)
-> Daylight SMILES
-> Molecular fingerprints (gNova; MACCS)
-> Tanimoto scores T(u,v)
-> Edges between nodes E(u,v): 0 or 1
-> For each unique gene: Functional Flow from annotated drugs (R = inf) to unannotated drugs (R = 0)
-> Large functional flows to unannotated drugs may indicate new drug-gene interactions

Goal: To create 2 databases mapping genes to drugs (PubChem ID) and diseases to drugs, and to map PubChem IDs to molecular fingerprints.

Tools for parsing & scripting: perl, awk, sed, UNIX, Excel, MATLAB (Log-Log), eliminate duplicate pairs, …

[Figure: log-log plots; only the axis tick labels (powers of 10) survive.]

Data sources:
Matador (gene name + PubChem ID)
DrugBank (HGNC ID number + PubChem ID)
HGNC database (gene name to HGNC ID)
PDB (PDB ID number + chemical compound name).
UniProt (PDB ID to HGNC ID)
HGNC database (HGNC ID to gene name)
Script (chemical name to PubChem ID)
PharmGKB (disease name to gene name; disease name to drug PubChem ID)
Daylight SMILES (from PubChem ID)
MACCS structural key molecular fingerprints (gNova; from SMILES)
Example: Sucrose, PubChem ID = 1115, SMILES: OC1C(OC(CO)C(O)C1O) OC2(CO)OC(CO)C(O)C2O

Tanimoto coefficient (extended Jaccard coefficient)

$$T(u,v) = \frac{u \cdot v}{\|u\|^2 + \|v\|^2 - u \cdot v}, \qquad 0 \le T(u,v) \le 1$$

Molecular fingerprints (0's and 1's):
$u = (1,0,1,1,0,1,0,0,1) \Rightarrow \|u\|^2 = u \cdot u = 5$
$v = (0,1,1,1,1,0,1,0,1) \Rightarrow \|v\|^2 = v \cdot v = 6$
bits set in both: $(0,0,1,1,0,0,0,0,1) \Rightarrow u \cdot v = 3$
$T(u,v) = 3/(5+6-3) = 3/8$

Random fingerprints (N large):
$u = (1,0,1,0,\ldots,1,0,1,0) \Rightarrow \|u\|^2 \to N/2$
$v = (1,0,0,1,\ldots,1,0,0,1) \Rightarrow \|v\|^2 \to N/2$
bits set in both: $(1,0,0,0,\ldots,1,0,0,0) \Rightarrow u \cdot v \to N/4$
$T(u,v) \to (N/4)/(N/2+N/2-N/4) = 1/3$

Edges between nodes

$$E(u,v) = \begin{cases} 1, & T(u,v) \ge \text{threshold} \\ 0, & T(u,v) < \text{threshold} \end{cases}$$

Iterated Functional Flow

[Figure: network of drugs D1 to D9 with a highlighted flow g5,6 and edges such as E(D2,D5) and E(D5,D8). Legend: annotated drug (R_0 = ∞); drug not annotated (R_0 = 0); 1st-iteration flow; 2nd-iteration flow.]

Example, flow from drug D5 in the 2nd iteration: with u = D5, v = D6 and R_1(u) = 3, E'(u,v) = 1 • 3/6 = 1/2, so the flow is g(u,v) = 1/2.

$$g_t^a(u,v) = \begin{cases} 0, & R_{t-1}^a(v) > R_{t-1}^a(u) \\ \min[E(u,v),\, E'(u,v)], & R_{t-1}^a(u) > R_{t-1}^a(v) \end{cases}$$

$$E'(u,v) = \frac{E(u,v)\, R_{t-1}^a(u)}{\sum_y E(u,y)}, \qquad \sum_y E'(u,y) = R_{t-1}^a(u)$$

Note: Nabieva et al. (2005) accidentally omitted R_{t-1}(u) from their published equation for E'(u,v).

Functional Flow Input and Output

$$R_0^a(u) = \begin{cases} \infty, & \text{node (drug) } u \text{ annotated for gene } a \\ 0, & \text{else} \end{cases}$$

Input: $R_0^a = (\infty, 0, \ldots, 0, \infty, \infty, 0, \ldots, 0)$ and

$$E = \begin{pmatrix} 0 & E_{1,2} & E_{1,3} & \cdots & E_{1,N} \\ E_{2,1} & 0 & E_{2,3} & \cdots & E_{2,N} \\ E_{3,1} & E_{3,2} & 0 & \cdots & E_{3,N} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ E_{N,1} & E_{N,2} & \cdots & E_{N,N-1} & 0 \end{pmatrix}$$

Reservoirs increase by net flow into nodes:

$$R_t^a(u) = R_{t-1}^a(u) + \sum_y g_t^a(y,u) - \sum_y g_t^a(u,y)$$

Output: the functional score is the sum of all flows into a node during all iterations:

$$f^a(u) = \sum_t \sum_y g_t^a(y,u)$$

Functional Flow Algorithm

    % W is not defined on the poster; it is presumably E(u, v) / sum(E(u, :)),
    % so that R(u) * W(u, v) equals E'(u, v) above.
    for t = 2 : d + 1
        f(t, :) = f(t - 1, :);          % carry scores forward from iteration t-1
        for u = 1 : N-1
            for v = u+1 : N
                if E(u, v) ~= 0         % no flow if E(u, v) = 0
                    if R(u) > R(v)
                        % compute flow from u to v : ...
                        g = min(E(u, v), R(u) * W(u, v));
                        S(v) = S(v) + g;
                        S(u) = S(u) - g;
                        f(t, v) = f(t, v) + g;
                    elseif R(v) > R(u)
                        % compute flow from v to u : ...
                        g = min(E(u, v), R(v) * W(v, u));
                        S(u) = S(u) + g;
                        S(v) = S(v) - g;
                        f(t, u) = f(t, u) + g;
                    end
                end
            end
        end
        R(:) = S(:);                    % updated reservoirs for the next iteration
        % ...
    end

Functional Flow - Application and Tests

[Diagram: genes and drugs are reduced to unique genes and unique drugs, then used in three ways: Drug Search (Application), Leave-one-out cross-validation, and Random numbers. Labels: annotated (R = infinity), unannotated (R = 0), test drugs (R = 0), sorted scores, sorted* scores, ranking (1, 3, 4), precision & recall, average over unique genes, precision-recall plot.]

Repeat the process for each gene associated with a minimal number of drugs.
* Not necessary to sort scores for LOOCV.

Leave-one-out cross-validation (LOOCV)

Information Retrieval:
Precision = items found / items retrieved
Recall = items found / items sought

Classification:
Precision = True Pos / (True Pos + False Pos)
Recall = True Pos / (True Pos + False Neg) = True Pos / # Positives
F1 measure = 2 • prec. • recall / (prec. + recall)

Example (omit, then rank): Functional Flow for Drug 1, Drug 2 and Drug 3, ranked at positions k = 1 (higher rank) through 7.

k    Prec.   Recall   F1
1    1/3     1/3      0.33
2    1/6     1/3      0.22
3    2/9     2/3      0.33
4    3/12    3/3      0.40
5    3/15    3/3      0.33
6    3/18    3/3      0.29
7    3/21    3/3      0.25

LOOCV results (Classifications)

[Figure: grids labelling rank positions 1 to 7 as TP, FP, FN or TN for the three omitted drugs, shown for cutoffs k = 1, 2, 3 and 4, in both the information-retrieval and the classification view.]

Precision = items found / items retrieved = TP/(TP+FP)
Recall = items found / items sought = TP/(TP+FN) = TP / (# positives)

LOOCV Results

Precision-Recall Plots: Leave-One-Out cross-validation for rankings k of 1 through 50; averages for genes to which LOOCV was applied.
Baseline for comparison: random rankings.

Parameters:
Minimum number of annotated drugs
Number of functional flow iterations
Tanimoto threshold for non-zero edge

Comparison of 4 vs. 10 iterations for a minimum of 25 annotated drugs/unique gene and a Tanimoto threshold of 80%

[Figure: precision-recall plots (x-axis: Precision, y-axis: Recall; series: test, random). Panels: "threshold 80, annotated 25, intervals 4" and "threshold 80, annotated 25, intervals 10".]

10 iterations is too many (low precision). Note: prec.(1) = recall(1)

Comparison of 4 vs.
8 iterations for a minimum of 50 annotated drugs/unique gene and a Tanimoto threshold of 80%

[Figure: precision-recall plots (x-axis: Precision, y-axis: Recall; series: test, random). Panels: "threshold 80, annotated 50, intervals 4" and "threshold 80, annotated 50, iterations 8".]

8 iterations is too many (low precision). Note again: For the top-ranked LOOCV functional flow scores precision equals recall (k = 1).

Comparison of 25 vs. 50 minimum numbers of annotated drugs/unique gene (for 4 iterations and a Tanimoto threshold of 80%)

[Figure: precision-recall plots (x-axis: Precision, y-axis: Recall; series: test, random). Panels: "threshold 80, annotated 25, intervals 4" and "threshold 80, annotated 50, intervals 4".]

Effects of averaging: KMAX = min(50, #annotated drugs - 1)

Requiring at least 50 annotated drugs increased precision and recall significantly.

Comparison of 60 vs. 80% Tanimoto thresholds (for 4 iterations and a minimum number of 50 annotated drugs/unique gene)

[Figure: precision-recall plots (x-axis: Precision, y-axis: Recall; series: test, random). Panels: "threshold 80, annotated 50, intervals 4" and "threshold 60, annotated 50, intervals 4".]

Increasing the Tanimoto score threshold from 60% to 80% doubled the precision.

Comparison of 25 vs.
50 minimum numbers of annotated drugs/unique gene (for 4 iterations and a Tanimoto threshold of 60%)

[Figure: precision-recall plots (x-axis: Precision, y-axis: Recall; series: test, random). Panels: "threshold 60, annotated 50, intervals 4" and "threshold 60, annotated 25, intervals 4".]

Effects of averaging: KMAX = min(50, #annotated drugs - 1)

For a Tanimoto score threshold of 60% the precision is low. The results are quite variable for k > 28 with fewer annotated drugs.

Comparison of 10 vs. 25 minimum numbers of annotated drugs/unique gene (for 4 iterations and a Tanimoto threshold of 80%)

[Figure: precision-recall plots (x-axis: Precision, y-axis: Recall; series: test, random). Panels: "threshold 80, annotated 25, intervals 4" and "threshold 80, annotated 10, intervals 4".]

Requiring at least 25 annotated drugs increased precision significantly, but predictions using fewer annotated drugs may nevertheless be useful.

Comparison of 70 vs. 80% Tanimoto thresholds (for 4 iterations and a minimum number of 25 annotated drugs/unique gene)

[Figure: precision-recall plots (x-axis: Precision, y-axis: Recall; series: test, random). Panels: "threshold 80, annotated 25, intervals 4" and "threshold 70, annotated 25, intervals 4".]

Effects of averaging: KMAX = min(50, #annotated drugs - 1)

Increasing the Tanimoto score threshold from 70% to 80% decreased the precision for the top ranked scores (k=1).

Using Clustered Drugs: Comparison of 60 vs.
70% Tanimoto thresholds (for 4 iterations and a minimum number of 25 annotated drugs/unique gene; graphconncomp)

[Figure: precision-recall plots (x-axis: Precision, y-axis: Recall; series: test, random). Panels: "cluster threshold 70, annotated 25, iterations 4" and "cluster threshold 60, annotated 25, iterations 4".]

Average precision of > 6% achieved for top-ranked drugs (k=1) using clustered drugs only.

Using Clustered Drugs: 70% Tanimoto thresholds (for 6 iterations and a minimum number of 20 annotated drugs/unique gene)

[Figure: precision-recall plot (x-axis: Precision, y-axis: Recall; series: test, random). Panel: "cluster threshold 70, annotated 20, iterations 6".]

Effects of averaging: KMAX = min(50, #annotated drugs - 1)

Average precision of > 6% achieved for top-ranked drugs (k=1) using clustered drugs only.

Disease to Drugs: 80% Tanimoto threshold (4 iterations and a minimum number of 50 annotated drugs/unique disease)

[Figure: precision-recall plot (x-axis: Precision, y-axis: Recall; series: test, random). Panel: "Disease to Drugs 80% threshold 50 annotations 4 intervals".]

Average precision for top ranks (k=1) is only 2%, but LOOCV precision is double that of the random model for k < 10.

Conclusions

With Tanimoto thresholds of 70-80% and relatively few iterations (~4), Functional Flow may be useful for predicting new drugs that will interact with genes and diseases.

Looking at more rankings (larger k) finds more drugs, but more drugs then have to be tested.

Decisions on parameters will depend on the economics of trading less precision for greater recall (increasing k) and on the performance of Leave-One-Out Cross-Validation (LOOCV) for the genes and diseases that are of most interest.

References

Brown, R. D.; Martin, Y. C., 1996, Use of structure-activity data to compare structure-based clustering methods and descriptors for use in compound selection: J. Chem. Inf. Comput. Sci., 36, 572-584.
Gunther, et al., 2007, SuperTarget and Matador: resources for exploring drug-target relationships: Nucleic Acids Research, 1-4.
Nabieva, et al., 2005, Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps: Bioinformatics, 21, Suppl. 1, i302-i310.
MacCuish, J. D., and MacCuish, N. E., 2003, Mesa Suite Version 1.2: Fingerprint Module: Mesa Analytics & Computing, LLC.

Acknowledgments

Special thanks to Drs. Predrag Radivojac, David Wild, Sun Kim, Mehemet Dalkilic, Rajarshi Guha, Haixu Tang and the faculty of Bioinformatics and Cheminformatics. Also thanks to Jefferson Davis (Math/Stat), Bob Konicek, and of course Linda Hostetter. Thank you all and enjoy the rest of the summer!
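
Supplementary sketch: Tanimoto score and edges. The Tanimoto coefficient and the 0/1 edge rule from the Methods can be illustrated with a few lines of Python. This is only a minimal sketch: the names `tanimoto` and `edge_matrix` are hypothetical (they are not from the project's scripts), and returning 0 for two all-zero fingerprints is an assumption the poster does not address.

```python
from typing import List, Sequence


def tanimoto(u: Sequence[int], v: Sequence[int]) -> float:
    """T(u,v) = u.v / (|u|^2 + |v|^2 - u.v) for 0/1 fingerprints."""
    if len(u) != len(v):
        raise ValueError("fingerprints must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    denom = sum(a * a for a in u) + sum(b * b for b in v) - dot
    # Assumption: two all-zero fingerprints are treated as dissimilar.
    return dot / denom if denom else 0.0


def edge_matrix(fingerprints: List[Sequence[int]], threshold: float) -> List[List[int]]:
    """E(u,v) = 1 if T(u,v) >= threshold else 0, with a zero diagonal."""
    n = len(fingerprints)
    E = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto(fingerprints[i], fingerprints[j]) >= threshold:
                E[i][j] = E[j][i] = 1
    return E


# Worked example from the poster: T(u,v) = 3/(5+6-3) = 3/8.
U = (1, 0, 1, 1, 0, 1, 0, 0, 1)
V = (0, 1, 1, 1, 1, 0, 1, 0, 1)
print(tanimoto(U, V))  # 0.375
```

With a threshold of 0.8, `edge_matrix([U, V, U], 0.8)` links only the two identical fingerprints, since T(U,V) = 0.375 falls below the threshold.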
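
Supplementary sketch: Functional Flow iteration. The reservoir update and functional score described under Functional Flow Input and Output can be sketched in Python as follows. This is a hedged illustration rather than the project's MATLAB code: `functional_flow` and its parameters are hypothetical names, and the capacity term is taken to be E'(u,v) = E(u,v) • R(u) / Σ_y E(u,y), as given on the poster.

```python
import math
from typing import List, Set


def functional_flow(E: List[List[int]], annotated: Set[int], iterations: int) -> List[float]:
    """Return the functional score (total inflow) of every node."""
    n = len(E)
    # Annotated drugs start with an infinite reservoir, all others with 0.
    R = [math.inf if i in annotated else 0.0 for i in range(n)]
    degree = [sum(row) for row in E]      # sum_y E(u, y)
    score = [0.0] * n
    for _ in range(iterations):
        S = list(R)                       # next reservoirs, updated by net flow
        for u in range(n):
            for v in range(n):
                # Flow only runs along an edge, downhill from u to v.
                if E[u][v] == 0 or not R[u] > R[v]:
                    continue
                g = min(E[u][v], E[u][v] * R[u] / degree[u])
                S[v] += g
                S[u] -= g
                score[v] += g
        R = S
    return score


# Path graph 0 - 1 - 2 with node 0 annotated, two iterations.
E_PATH = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(functional_flow(E_PATH, {0}, 2))  # [0.0, 2.0, 0.5]
```

In the second iteration node 1 holds a reservoir of 1 and has two edges, so it passes min(1, 1 • 1/2) = 1/2 to node 2, which mirrors the D5 to D6 example in the figure.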
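
Supplementary sketch: LOOCV precision, recall and F1. The LOOCV example table can be reproduced with the short Python sketch below. Two points are inferred from the table rather than stated on the poster, so treat them as assumptions: the three omitted drugs are ranked 1, 3 and 4, and precision is pooled over all omitted drugs (k items retrieved per drug). The function name `loocv_metrics` is hypothetical.

```python
from fractions import Fraction
from typing import List, Tuple


def loocv_metrics(ranks: List[int], k: int) -> Tuple[Fraction, Fraction, Fraction]:
    """Pooled precision, recall and F1 at cutoff k for held-out items."""
    found = sum(1 for r in ranks if r <= k)   # held-out drugs ranked within top k
    retrieved = k * len(ranks)                # k items retrieved per held-out drug
    precision = Fraction(found, retrieved)
    recall = Fraction(found, len(ranks))
    if found == 0:
        return precision, recall, Fraction(0)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumed ranks of Drug 1, Drug 2 and Drug 3 after omission.
RANKS = [1, 3, 4]
for k in range(1, 8):
    p, r, f1 = loocv_metrics(RANKS, k)
    print(k, p, r, round(float(f1), 2))
```

At k = 1 precision and recall are both 1/3, which is the "prec.(1) = recall(1)" observation noted in the results.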