Presentation @ 3:30pm - Bioinformatics at School of Informatics

Predicting Drug-gene and Drug-disease Networks
using Functional Flow
Bioinformatics Capstone Project
School of Informatics
Indiana University
Bloomington, Indiana
Ryan Tran Rene
2009
Purpose: Given putative drug associations with genes,
find other drugs that may be associated with
those genes.
The method is based on the similarity
of the molecular fingerprints of drugs.
For each unique gene, Functional Flow will be used to
determine which unannotated drugs are most likely to
interact with that gene.
Methods
Algorithms
Results and Conclusions
Known drug-gene interactions
Unique genes (pcid)
Unique drugs (pcid)
Daylight SMILES
molecular fingerprints
gNova; MACCS
Tanimoto Scores T(u,v)
Edges between nodes
E(u,v): 0 or 1
For each unique gene:
Functional Flow
from annotated drugs (R=inf)
To unannotated drugs (R=0)
Large functional flows
to unannotated drugs
may indicate new
drug-gene interactions
Goal: To create 2 databases mapping
genes to drugs (PubChem ID) and diseases to drugs,
and to map PubChem IDs to molecular fingerprints.
Tools for parsing & scripting: perl, awk, sed, UNIX, Excel,
MATLAB (Log-Log), eliminate duplicate pairs, …
Matador (Gene Name + PubChem ID)
DrugBank (HGNC ID number + PubChem ID)
HGNC database (Gene Name to HGNC ID)
PDB (PDB ID number + chemical compound name)
UniProt (PDB ID to HGNC ID)
HGNC database (HGNC ID to Gene Name)
script (chemical name to PubChem ID)
PharmGKB (disease name to gene name; disease name to drug PubChem ID)
Daylight SMILES (from PubChem ID)
MACCS structural key molecular fingerprints (gNova; from SMILES)
Example: Sucrose
PubChem ID = 1115
Daylight SMILES: OC1C(OC(CO)C(O)C1O)OC2(CO)OC(CO)C(O)C2O
Tanimoto coefficient (extended Jaccard coefficient)
$T(u,v) = \dfrac{u \cdot v}{\|u\|^2 + \|v\|^2 - u \cdot v}$
Molecular fingerprints (0’s and 1’s):
u = (1,0,1,1,0,1,0,0,1) -> $\|u\|^2 = u \cdot u = 5$
v = (0,1,1,1,1,0,1,0,1) -> $\|v\|^2 = v \cdot v = 6$
elementwise product (0,0,1,1,0,0,0,0,1) -> $u \cdot v = 3$
$T(u,v) = 3/(5+6-3) = 3/8$
$0 \le T(u,v) \le 1$
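The worked example above can be checked with a few lines of Python. This is a minimal sketch, not the project's own code; the function name `tanimoto` and the choice to return 0 for two all-zero fingerprints are assumptions made for illustration.

```python
def tanimoto(u, v):
    """Tanimoto coefficient u.v / (|u|^2 + |v|^2 - u.v) for equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("fingerprints must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u)
    norm_v = sum(b * b for b in v)
    denom = norm_u + norm_v - dot
    # Assumption: two all-zero fingerprints are given a score of 0.
    return dot / denom if denom else 0.0


# Fingerprints from the slide: |u|^2 = 5, |v|^2 = 6, u.v = 3.
u = (1, 0, 1, 1, 0, 1, 0, 0, 1)
v = (0, 1, 1, 1, 1, 0, 1, 0, 1)
print(tanimoto(u, v))  # 0.375, i.e. 3/8
```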
Random fingerprints (N large):
u = (1, 0, 1, 0, …, 1, 0, 1, 0) -> $\|u\|^2 \to N/2$
v = (1, 0, 0, 1, …, 1, 0, 0, 1) -> $\|v\|^2 \to N/2$
elementwise product (1, 0, 0, 0, …, 1, 0, 0, 0) -> $u \cdot v \to N/4$
$T(u,v) \to \dfrac{N/4}{N/2 + N/2 - N/4} = 1/3$
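The 1/3 limit for random fingerprints can be illustrated numerically. The sketch below assumes each bit is set independently with probability 1/2; the fingerprint length, the seed, and the helper names are hypothetical choices, not values from the project.

```python
import random


def tanimoto_bits(u, v):
    """Tanimoto score for 0/1 vectors: shared set bits / bits set in either vector."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 0.0


def random_fingerprint(n, rng):
    """n independent bits, each 0 or 1 with equal probability (an assumption)."""
    return [rng.randint(0, 1) for _ in range(n)]


rng = random.Random(0)   # fixed seed so the run is repeatable
n = 20000                # hypothetical "large N"
t_random = tanimoto_bits(random_fingerprint(n, rng), random_fingerprint(n, rng))
print(round(t_random, 3))  # close to 1/3 for large n
```

For 0/1 vectors the denominator $\|u\|^2 + \|v\|^2 - u \cdot v$ is the number of bits set in either fingerprint, which is why the helper counts bits directly.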
Edges between nodes
$E(u,v) = \begin{cases} 1; & T(u,v) \ge \text{threshold} \\ 0; & T(u,v) < \text{threshold} \end{cases}$
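As a rough illustration of the thresholding rule, the sketch below builds a symmetric 0/1 edge matrix from a list of fingerprints. The three toy fingerprints and the 0.7 threshold are made-up values, and `edge_matrix` is a hypothetical helper name.

```python
def tanimoto(u, v):
    """Tanimoto coefficient u.v / (|u|^2 + |v|^2 - u.v)."""
    dot = sum(a * b for a, b in zip(u, v))
    denom = sum(a * a for a in u) + sum(b * b for b in v) - dot
    return dot / denom if denom else 0.0


def edge_matrix(fingerprints, threshold):
    """E[i][j] = 1 if T(i, j) >= threshold, else 0; the diagonal stays 0."""
    n = len(fingerprints)
    E = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto(fingerprints[i], fingerprints[j]) >= threshold:
                E[i][j] = E[j][i] = 1
    return E


# Toy data: T(0,1) = 0.75, T(0,2) = 0, T(1,2) = 0.25.
fps = [(1, 1, 1, 0), (1, 1, 1, 1), (0, 0, 0, 1)]
E = edge_matrix(fps, 0.7)
print(E)  # [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
```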
Iterated Functional Flow
[Figure: drug network with nodes D1 to D9 and a flow labeled g5,6. Legend: annotated drug ($R_0 = \infty$); drug not annotated ($R_0 = 0$); 1st-iteration flow; 2nd-iteration flow.]
Flow from Drug D5 (u)
[Figure: D5 shown with drugs D1, D2, D3, D6, D7, and D8; edges labeled E(D2,D5) and E(D5,D8).]
2nd iteration: u = D5, v = D6
$R_1(u) = 3$
$E'(u,v) = 1 \cdot 3/6 = 1/2$
$g(u,v) = 1/2$
$g^a_t(u,v) = \begin{cases} 0; & R_{t-1}(v) > R_{t-1}(u) \\ \min[E(u,v),\, E'(u,v)]; & R_{t-1}(u) > R_{t-1}(v) \end{cases}$
$E'(u,v) = E(u,v) \cdot R_{t-1}(u) / \sum_y E(u,y)$
$\sum_y E'(u,y) = R_{t-1}(u)$
Note: Nabieva et al. (2005) accidentally omitted $R_{t-1}(u)$
from their published equation for $E'(u,v)$.
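The D5 to D6 example can be reproduced with a small function for the flow along one edge. This is a sketch under stated assumptions: D5 is node 0 with six neighbors (nodes 1 to 6, all with E = 1), D6 is node 1, and every neighbor's reservoir is set to 0, a hypothetical value chosen only so that flow runs downhill from D5. Equal reservoirs give no flow here, because the slide defines neither case for a tie.

```python
def edge_flow(E, R_prev, u, v):
    """Flow from u to v in one iteration, given the previous reservoirs R_prev."""
    if E[u][v] == 0 or R_prev[u] <= R_prev[v]:
        return 0.0  # no edge, or flow would not run downhill
    # E'(u,v): the reservoir of u shared out in proportion to edge weights.
    share = E[u][v] * R_prev[u] / sum(E[u])
    return min(E[u][v], share)  # capped by the edge capacity E(u,v)


n = 7
E = [[0] * n for _ in range(n)]
for y in range(1, n):          # node 0 (D5) is linked to six neighbors
    E[0][y] = E[y][0] = 1
R_prev = [3.0] + [0.0] * 6     # R_1(D5) = 3; neighbor reservoirs assumed 0

print(edge_flow(E, R_prev, 0, 1))  # 0.5, matching 1 * 3/6 on the slide
```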
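Functional Flow Input and Output
Input:
$R^a_0(u) = \begin{cases} \infty; & \text{node (drug) annotated for gene “a”} \\ 0; & \text{else} \end{cases}$
$R^a_0 = (\infty, 0, \ldots, 0, \infty, \infty, 0, \ldots, 0)$
$E = \begin{pmatrix} 0 & E_{1,2} & E_{1,3} & \cdots & E_{1,N} \\ E_{2,1} & 0 & E_{2,3} & \cdots & E_{2,N} \\ E_{3,1} & E_{3,2} & 0 & \cdots & E_{3,N} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ E_{N,1} & E_{N,2} & \cdots & E_{N,N-1} & 0 \end{pmatrix}$
Reservoirs increase by net flow into nodes:
$R^a_t(u) = R^a_{t-1}(u) + \sum_y g^a_t(y,u) - \sum_y g^a_t(u,y)$
Output:
functional score = sum of all flows into a node during all iterations:
$f^a(u) = \sum_t \sum_y g^a_t(y,u)$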
Functional Flow Algorithm
% R holds the reservoirs from iteration t-1; S accumulates those for iteration t.
for t = 2 : d + 1
    f(t, :) = f(t - 1, :);
    for u = 1 : N-1
        for v = u+1 : N
            % no flow if E(u, v) = 0.
            if E(u, v) ~= 0.
                if R(u) > R(v)
                    % compute flow from u to v :
                    % ... (elided on the slide)
                    g = min(E(u, v), R(u) * W(u, v) );
                    S(v) = S(v) + g ;
                    S(u) = S(u) - g ;
                    f(t, v) = f(t, v) + g ;
                elseif R(v) > R(u)
                    % compute flow from v to u :
                    % ... (elided on the slide)
                    g = min(E(u, v), R(v) * W(v, u) );
                    S(u) = S(u) + g ;
                    S(v) = S(v) - g ;
                    f(t, u) = f(t, u) + g ;
                end
            end
        end
    end
    R(:) = S(:);
    % ... (elided on the slide)
end
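For readers without MATLAB, the loop above can be sketched in Python. This is a transliteration under assumptions, not the project's code: the slide elides how W is computed, so the sketch takes W(u,v) = E(u,v) / Σy E(u,y), which agrees with the E'(u,v) formula; `functional_flow`, `annotated`, and the three-node path graph are hypothetical names and data.

```python
def functional_flow(E, annotated, iterations):
    """Return the functional score (total inflow) of every node."""
    n = len(E)
    inf = float("inf")
    R = [inf if u in annotated else 0.0 for u in range(n)]  # initial reservoirs
    degree = [sum(row) for row in E]                        # sum over y of E(u, y)
    score = [0.0] * n
    for _ in range(iterations):
        S = list(R)  # next reservoirs; flows below read only the old values in R
        for u in range(n - 1):
            for v in range(u + 1, n):
                if E[u][v] == 0 or R[u] == R[v]:
                    continue  # no edge, or equal reservoirs: no flow
                src, dst = (u, v) if R[u] > R[v] else (v, u)
                # Assumed W(src, dst) = E(src, dst) / degree[src].
                g = min(E[src][dst], R[src] * E[src][dst] / degree[src])
                S[dst] += g
                S[src] -= g
                score[dst] += g
        R = S
    return score


# Path graph 0 - 1 - 2 with node 0 annotated.
E = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
print(functional_flow(E, {0}, 2))  # [0.0, 2.0, 0.5]
```

In the second iteration node 1 holds a reservoir of 1 and has two edges, so it passes 1/2 to node 2 while still receiving 1 from the annotated node.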
Functional Flow - Application and Tests
Input: genes and drugs, reduced to unique genes and unique drugs (annotated or unannotated for each gene).
Drug Search (Application): annotated drugs R = infinity; unannotated test drugs R = 0; output: sorted scores.
Leave-one-out cross-validation: test drug R = 0, other annotated drugs R = infinity; output: sorted* scores and the ranking of each test drug (e.g., 1, 3, 4).
Random numbers: random scores give the comparison rankings.
Repeat process for each gene associated with at least a minimum
number of drugs.
Precision & recall: average over unique genes; precision-recall plot.
* Not necessary to sort scores for LOOCV
Leave-one-out
cross-validation
(LOOCV)
Information Retrieval:
Precision = items found/ items retrieved
Recall = items found/ items sought
Classification:
Precision = True Pos/(True Pos + False Pos)
Recall = True Pos/(True Pos + False Neg)
= True Pos/ # Positives
F1 measure = 2 • prec • recall / (prec. + recall)
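The classification formulas above translate directly into code. The sketch below is illustrative only; the function name and the convention of returning 0 when a denominator is zero are assumptions, and the counts in the example (3 true positives, 9 false positives, 0 false negatives) are chosen for the arithmetic.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


p, r, f1 = precision_recall_f1(tp=3, fp=9, fn=0)
print(p, r, round(f1, 2))  # 0.25 1.0 0.4
```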
Omit then rank Functional Flow for: Drug 1, Drug 2, Drug 3
[Figure: three ranked lists for k = 1 to 7, with an arrow labeled "Higher rank".]
k    Prec.   Recall   F1
1    1/3     1/3      0.33
2    1/6     1/3      0.22
3    2/9     2/3      0.33
4    3/12    3/3      0.40
5    3/15    3/3      0.33
6    3/18    3/3      0.29
7    3/21    3/3      0.25
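The precision, recall, and F1 values for k = 1 to 7 can be regenerated from the ranks of the three left-out drugs. The ranks 1, 3, and 4 are inferred from the recall column (and match the "ranking 1 3 4" shown on the earlier slide), so treat them as an assumption; the function name is hypothetical.

```python
from fractions import Fraction


def loocv_precision_recall(ranks, k):
    """Precision and recall when the top k drugs of every leave-one-out run are retrieved."""
    found = sum(1 for r in ranks if r <= k)   # left-out drugs ranked within the top k
    retrieved = k * len(ranks)                # k drugs retrieved in each run
    return Fraction(found, retrieved), Fraction(found, len(ranks))


ranks = [1, 3, 4]  # assumed rank of each left-out drug
for k in range(1, 8):
    prec, rec = loocv_precision_recall(ranks, k)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else Fraction(0)
    print(k, prec, rec, round(float(f1), 2))
```

`Fraction` prints reduced values, so 3/12 appears as 1/4 and 3/3 as 1.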
LOOCV results (Classifications)
[Figure: panels for k=1, k=2, k=3, and k=4, each labeling the drugs at rankings k = 1 to 7 as TP, FP, TN, or FN.]
Precision = TP/(TP+FP)
Recall = TP/(TP+FN) = TP / (# positives)
Information Retrieval and Classifications
Precision = items found/ items retrieved = TP/(TP+FP)
Recall = items found/ items sought = TP/(TP+FN)
[Figure: panels for k=1, k=2, k=3, and k=4, each labeling the drugs at rankings k = 1 to 7 as TP, FP, TN, or FN.]
LOOCV Results
Precision-Recall Plots:
Leave-One-Out cross-validation for rankings
k of 1 through 50; averages for genes to which
LOOCV was applied
Random Rankings
Parameters:
Minimum number of annotated drugs
Number of functional flow iterations
Tanimoto threshold for non-zero edge
Comparison of 4 vs. 10 iterations
for a minimum of 25 annotated drugs/unique gene
and a Tanimoto threshold of 80%
[Figure: two precision-recall plots (axes: precision, recall) comparing LOOCV ("test") with random rankings ("random"). Panels: threshold 80, annotated 25, iterations 4; threshold 80, annotated 25, iterations 10.]
10 iterations is too many (low precision). Note: prec.(1) = recall(1)
Comparison of 4 vs. 8 iterations
for a minimum of 50 annotated drugs/unique gene
and a Tanimoto threshold of 80%
[Figure: two precision-recall plots (axes: precision, recall) comparing LOOCV ("test") with random rankings ("random"). Panels: threshold 80, annotated 50, iterations 4; threshold 80, annotated 50, iterations 8.]
8 iterations is too many (low precision). Note again: For the top-ranked
LOOCV functional flow scores precision equals recall (k = 1).
Comparison of 25 vs. 50 minimum
numbers of annotated drugs/unique gene
(for 4 iterations and a Tanimoto threshold of 80%)
[Figure: two precision-recall plots (axes: precision, recall) comparing LOOCV ("test") with random rankings ("random"). Panels: threshold 80, annotated 25, iterations 4; threshold 80, annotated 50, iterations 4. Plot annotation: effects of averaging, KMAX = min(50, #annotated drugs - 1).]
Requiring at least 50 annotated drugs increased
precision and recall significantly.
Comparison of 60 vs. 80% Tanimoto thresholds
(for 4 iterations and a minimum number of
50 annotated drugs/unique gene)
[Figure: two precision-recall plots (axes: precision, recall) comparing LOOCV ("test") with random rankings ("random"). Panels: threshold 80, annotated 50, iterations 4; threshold 60, annotated 50, iterations 4.]
Increasing the Tanimoto score threshold from 60% to 80%
doubled the precision.
Comparison of 25 vs. 50 minimum
numbers of annotated drugs/unique gene
(for 4 iterations and a Tanimoto threshold of 60%)
[Figure: two precision-recall plots (axes: precision, recall) comparing LOOCV ("test") with random rankings ("random"). Panels: threshold 60, annotated 50, iterations 4; threshold 60, annotated 25, iterations 4. Plot annotation: effects of averaging, KMAX = min(50, #annotated drugs - 1).]
For Tanimoto score threshold of 60% the precision is low.
The results are quite variable for k > 28 with fewer annotated drugs.
Comparison of 10 vs. 25 minimum
numbers of annotated drugs/unique gene
(for 4 iterations and a Tanimoto threshold of 80%)
[Figure: two precision-recall plots (axes: precision, recall) comparing LOOCV ("test") with random rankings ("random"). Panels: threshold 80, annotated 25, iterations 4; threshold 80, annotated 10, iterations 4.]
Requiring at least 25 annotated drugs increased precision significantly,
but predictions using fewer annotated drugs may nevertheless be useful
Comparison of 70 vs. 80% Tanimoto thresholds
(for 4 iterations and a minimum number of
25 annotated drugs/unique gene)
[Figure: two precision-recall plots (axes: precision, recall) comparing LOOCV ("test") with random rankings ("random"). Panels: threshold 80, annotated 25, iterations 4; threshold 70, annotated 25, iterations 4. Plot annotation: effects of averaging, KMAX = min(50, #annotated drugs - 1).]
Increasing the Tanimoto score threshold from 70% to 80%
decreased the precision for the top-ranked scores (k=1).
Using Clustered Drugs: Comparison of 60 vs. 70% Tanimoto thresholds
(for 4 iterations and a minimum number of
25 annotated drugs/unique gene; graphconncomp)
[Figure: two precision-recall plots (axes: precision, recall) comparing LOOCV ("test") with random rankings ("random"). Panels: cluster threshold 70, annotated 25, iterations 4; cluster threshold 60, annotated 25, iterations 4.]
Average precision of > 6% achieved for top-ranked
drugs (k=1) using clustered drugs only
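The slide title mentions graphconncomp, MATLAB's connected-components routine. One plausible reading, and it is only an assumption, is that "clustered drugs" are drugs lying in a connected component with more than one member of the thresholded Tanimoto graph. A minimal Python sketch of that idea, with hypothetical function names and a toy edge matrix:

```python
def connected_components(E):
    """Label each node with the index of its connected component (depth-first search)."""
    n = len(E)
    label = [-1] * n
    count = 0
    for start in range(n):
        if label[start] != -1:
            continue  # already reached from an earlier start node
        label[start] = count
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(n):
                if E[u][v] and label[v] == -1:
                    label[v] = count
                    stack.append(v)
        count += 1
    return count, label


def clustered_nodes(E):
    """Nodes in components of size > 1 (assumed meaning of 'clustered drugs')."""
    count, label = connected_components(E)
    sizes = [label.count(c) for c in range(count)]
    return [u for u in range(len(E)) if sizes[label[u]] > 1]


# Toy graph: edges 0-1 and 1-2; node 3 is isolated.
E = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 0]]
print(connected_components(E))  # (2, [0, 0, 0, 1])
print(clustered_nodes(E))       # [0, 1, 2]
```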
Using Clustered Drugs: 70% Tanimoto thresholds
(for 6 iterations and a minimum number of
20 annotated drugs/unique gene)
[Figure: precision-recall plot (axes: precision, recall) comparing LOOCV ("test") with random rankings ("random"). Panel: cluster threshold 70, annotated 20, iterations 6. Plot annotation: effects of averaging, KMAX = min(50, #annotated drugs - 1).]
Average Precision of > 6% achieved for top-ranked
drugs (k=1) using clustered drugs only
Disease to Drugs: 80% Tanimoto threshold
(for 4 iterations and a minimum number of
50 annotated drugs/unique disease)
[Figure: precision-recall plot (axes: precision, recall) comparing LOOCV ("test") with random rankings ("random"). Panel: Disease to Drugs, 80% threshold, 50 annotations, 4 iterations.]
Average precision for top ranks (k=1) is only 2%, but
LOOCV precision is double that of the random model for k < 10.
Conclusions
With Tanimoto thresholds of 70-80% and relatively few
iterations (~4), Functional Flow may be useful for predicting
new drugs that will interact with genes and diseases.
If you look at more rankings you find more drugs,
but you have to test more drugs.
Decisions on parameters will depend on the economics of trading
less precision for greater recall (increasing k) and the
performance of Leave-One-Out Cross-Validation (LOOCV)
for the genes and diseases that are of most interest.
References
Brown, R. D., and Martin, Y. C., 1996, Use of structure-activity data
to compare structure-based clustering methods and descriptors
for use in compound selection: J. Chem. Inf. Comput. Sci., 36,
572-584.
Günther, et al., 2007, SuperTarget and Matador: resources for
exploring drug-target relationships: Nucleic Acids Research, 1-4.
Nabieva, et al., 2005, Whole-proteome prediction of protein function
via graph-theoretic analysis of interaction maps: Bioinformatics, 21,
Suppl. 1, i302-i310.
MacCuish, J. D., and MacCuish, N. E., 2003, Mesa Suite Version
1.2: Fingerprint Module: Mesa Analytics & Computing, LLC.
Acknowledgments
Special thanks to Drs. Predrag Radivojac, David Wild,
Sun Kim, Mehmet Dalkilic, Rajarshi Guha, Haixu Tang
and the faculty of Bioinformatics and Cheminformatics.
Also thanks to Jefferson Davis (Math/Stat), Bob Konicek,
and of course Linda Hostetter.
Thank you all and enjoy the rest of the summer!