Supplement
Transforming the global consistency problem into a
minimum cut problem
We are given a functional linkage network G with n nodes and m edges. We apply our
transformation and algorithms once for each function of interest. We associate a state xi with each
node i between 1 and n. For a specific function, we set xi = 1 if protein i is annotated with the
function, xi = -1 if protein i is annotated with a different function, and xi = 0 otherwise. Our goal
is to assign a state of 1 or –1 to every node i with initial state xi = 0 so that the following
“energy” is minimized:
$$E = -\sum_{i} \sum_{j} w_{ij} x_i x_j$$
In this equation, wij is the weight of the edge connecting node i and node j. When the weight of
each edge is positive, minimizing E minimizes the total weight of inconsistent edges in G, i.e., the
total weight of edges connecting nodes with different states.
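The link between the energy and the weight of inconsistent edges can be checked numerically. The following is a minimal sketch, assuming the energy carries a negative sign and that every node already has state 1 or -1. The dictionary layout for `edges` and the function names are illustrative choices, not part of the original method. The sketch counts each undirected edge once; counting it twice, as a double sum over ordered pairs would, only doubles E and does not change which assignment is optimal.

```python
def energy(edges, state):
    """Negative sum of w_ij * x_i * x_j, counting each undirected edge once."""
    return -sum(w * state[i] * state[j] for (i, j), w in edges.items())


def inconsistent_weight(edges, state):
    """Total weight of edges whose endpoints have different states."""
    return sum(w for (i, j), w in edges.items() if state[i] != state[j])


# Hypothetical three-node network with positive weights.
edges = {(1, 2): 1.0, (2, 3): 2.0, (1, 3): 0.5}
total = sum(edges.values())

for state in ({1: 1, 2: 1, 3: 1}, {1: 1, 2: -1, 3: 1}, {1: 1, 2: -1, 3: -1}):
    # A consistent edge contributes -w to E and an inconsistent edge +w,
    # so E = 2 * (inconsistent weight) - (total weight).
    assert energy(edges, state) == 2 * inconsistent_weight(edges, state) - total
```

Because the total weight is a constant, the assignment that minimizes E is the one that minimizes the inconsistent weight.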
We construct a new graph H from G. Each node of H corresponds to a node of G. H also contains
two new nodes: a source node s and a sink node t. Each edge in H is directed. We create the edges
as follows:
1. for each node i in G with xi=1, we connect node s to node i with an edge directed from s
to i.
2. for each node i in G with xi=-1, we connect node i to node t with an edge directed from i
to t.
3. for each edge in G connecting node i and node j such that xi=1 and xj=0, we connect node
i to node j with an edge directed from i to j.
4. for each edge in G connecting node i and node j such that xi=0 and xj=-1, we connect
node i to node j with an edge directed from i to j.
5. for each edge in G connecting node i and node j such that xi=0 and xj=0, we connect node
i to node j with two edges, one directed from i to j and the other directed from j to i.
6. We ignore any edge in G that is not incident on a node in state 0.
We set the weight of the edges in H incident on s and t to be infinity. Every other edge has the
same weight as the corresponding edge in G.
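The six construction rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the graph G is assumed to be a dictionary from unordered node pairs to weights, H is returned as a dictionary from directed edges to capacities, and the labels `"s"` and `"t"` are assumed not to clash with any node name in G.

```python
import math


def build_flow_graph(edges, state, source="s", sink="t"):
    """Return H as a dict mapping directed edges (u, v) to capacities."""
    H = {}
    for node, x in state.items():
        if x == 1:
            H[(source, node)] = math.inf  # rule 1: s -> i, infinite weight
        elif x == -1:
            H[(node, sink)] = math.inf    # rule 2: i -> t, infinite weight
    for (i, j), w in edges.items():
        # G is undirected, so test both orientations of every edge.
        for u, v in ((i, j), (j, i)):
            # Rules 3, 4 and 5: (1 -> 0), (0 -> -1), and both directions of (0, 0).
            if (state[u], state[v]) in ((1, 0), (0, -1), (0, 0)):
                H[(u, v)] = w
            # Rule 6: every other combination has no endpoint in state 0
            # in the required position and is skipped.
    return H


# Hypothetical path a - b - c - d plus one edge between two annotated nodes.
edges = {("a", "b"): 1.0, ("b", "c"): 1.0, ("c", "d"): 1.0, ("a", "d"): 1.0}
state = {"a": 1, "b": 0, "c": 0, "d": -1}
H = build_flow_graph(edges, state)
```

In this example the edge between a and d is dropped by rule 6, and the edge between b and c appears in both directions.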
An s-t cut in the graph H is a partition of the nodes of H into two sets S and T, where S contains
the node s and T contains the node t. An edge (u, v) crosses the cut if u lies in S and v lies in T.
The weight of the cut is the total weight of the edges crossing the cut. We compute the s-t cut C
with smallest weight in H. For each node i that is in S, we set xi=1. We set xi=-1 for every node i
in T. The edges incident on s and t (edges of types 1 and 2) do not participate in C since they have
weight infinity. Hence, the only edges used in C are incident on nodes with state equal to 0. Each
edge in C of type 3 or 4 corresponds to one inconsistent edge in G. At most one edge in every
pair of edges of type 5 can belong to C; by definition, the edge in the pair directed from a node in
T to a node in S does not belong to the cut. Therefore each edge in C corresponds to exactly one
inconsistent edge in G. Conversely, each inconsistent edge in G has exactly one corresponding
edge in C. Therefore, since C has the smallest weight among all s-t cuts, this state assignment
minimizes the total weight of inconsistent edges in G. To compute the minimum cut, we use the
software developed by Cherkassky and Goldberg [1], which implements the Goldberg-Tarjan algorithm
for computing maximum flows and minimum cuts in a graph [2]. This algorithm runs in O(nm
log(n^2/m)) time.
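To make the cut computation concrete, here is a small self-contained sketch. It deliberately swaps in the simpler Edmonds-Karp augmenting-path method rather than the push-relabel software cited above, so its worst-case running time is worse than the bound quoted in the text; it is meant only for small examples. The input layout (a dictionary from directed edges to capacities) and the function name are assumptions made for illustration.

```python
import math
from collections import defaultdict, deque


def min_cut_source_side(capacity, source="s", sink="t"):
    """Return (cut weight, set S) for a minimum s-t cut, via Edmonds-Karp."""
    residual = defaultdict(float)
    neighbors = defaultdict(set)
    for (u, v), c in capacity.items():
        residual[(u, v)] += c
        neighbors[u].add(v)
        neighbors[v].add(u)  # the reverse residual arc may carry flow later

    flow = 0.0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            # No augmenting path: the nodes reached from s form the set S.
            return flow, set(parent)
        bottleneck = math.inf
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[(parent[v], v)])
            v = parent[v]
        if math.isinf(bottleneck):
            raise ValueError("an s-t path of infinite capacity exists")
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
            v = u
        flow += bottleneck


# Hypothetical H for a path a(1) - b(0) - c(0) - d(-1) with weights 3, 1, 2.
H = {("s", "a"): math.inf, ("d", "t"): math.inf,
     ("a", "b"): 3.0, ("b", "c"): 1.0, ("c", "b"): 1.0, ("c", "d"): 2.0}
weight, S = min_cut_source_side(H)
new_state = {node: (1 if node in S else -1) for node in "abcd"}
```

In this example the cheapest cut is the single edge from b to c, so b receives state 1 and c receives state -1.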
The following figure depicts an example of this transformation. The graph on the top is G and the
graph on the bottom is H. All edges in G have weight 1. The minimum cut in H is the edge
connecting the node in state 0 to the node in state –1. Thus, the minimum cut algorithm assigns a
state of 1 to both nodes in state 0.

Figure: Transforming the functional linkage graph into a directed graph suitable for computing a
minimum cut.
Cross Validation Results
We used the following datasets in our study. We obtained protein interactions for Saccharomyces
cerevisiae from the General Repository of Interaction Datasets (GRID). Each protein interaction
corresponds to an edge in the graph G. We obtained functional annotations for S. cerevisiae from
the Gene Ontology (GO). We modified these annotations as in our earlier work [3]:
1. If a function f annotates a gene g, then we annotate g with every ancestor f’ of f in the
directed acyclic graph defined by the parent-child relationships between the functions in
GO. In other words, when we apply our algorithms to predict function f’, the node
corresponding to gene g is in state 1.
2. If a function f is the most specific annotation for a gene g, i.e., no descendant f’ of f
annotates g, then we set the annotation status of g with respect to each descendant f’ of f
as unknown. In other words, when we apply our algorithms to predict function f’, the
node corresponding to gene g is in state 0.
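The two annotation rules above, combined with the state definition used throughout, can be sketched as follows. The ontology representation (a dictionary from each term to its list of parents), the term names, and the function names are hypothetical. The sketch also assumes that a gene with no annotation at all is in state 0 and that an annotated gene is in state -1 for any function covered by neither rule.

```python
from collections import defaultdict


def reachable(start, graph):
    """All nodes reachable from `start` (excluding `start`) along `graph` edges."""
    seen, stack = set(), list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def node_state(gene_terms, function, parents):
    """State of one gene for `function`: 1 annotated, 0 unknown, -1 otherwise."""
    children = defaultdict(list)
    for child, term_parents in parents.items():
        for parent in term_parents:
            children[parent].append(child)
    # Rule 1: a gene annotated with f is also annotated with every ancestor of f.
    annotated = set(gene_terms)
    for term in gene_terms:
        annotated |= reachable(term, parents)
    if function in annotated:
        return 1
    # Rule 2: descendants of a most specific annotation have unknown status.
    for term in annotated:
        below = reachable(term, children)
        if not (below & annotated) and function in below:
            return 0
    # Assumption: unannotated genes are unknown; all other cases are state -1.
    return 0 if not annotated else -1


# Hypothetical ontology: root has children A and B, and A has child A1.
parents = {"A": ["root"], "B": ["root"], "A1": ["A"]}
states = {f: node_state({"A"}, f, parents) for f in ("root", "A", "A1", "B")}
```

For a gene annotated only with A, this gives state 1 for root and A, state 0 for A1, and state -1 for B.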
We report results for 82 functions in the Gene Ontology that yielded high precision and recall in
our previous study [3]; proteins predicted to have these functions are good candidates for
experimental validation. We obtained these functions as follows. In our earlier study [3], we
considered a subset of 1000 protein interactions in the GRID dataset that were reported by at
least two publications. We integrated this dataset with a compendium of gene expression profiles.
We reported predicted annotations only for those functions that annotated no more than 100
proteins and had at least 75% precision and recall during cross-validation. There were 97 such
functions. Since that time, the GO consortium has declared 15 of these functions to be “obsolete.”
In this work, we report results for the remaining 82 functions. Note that in this work, we apply the
min-cut and local rule algorithms to the network formed only by protein interactions; we do not
integrate gene expression data with the protein interactions.
We apply the two algorithms one function at a time. For each function f, we set the state of a node
to 1 if the corresponding protein is annotated with f, to –1 if the protein is annotated with a
function other than f, and to 0 otherwise. We use leave-one-out cross-validation to test the methods.
For each node in state 1, we set its state to 0, execute the min-cut algorithm and the guilt-by-association
algorithm, and separately record whether each algorithm correctly predicted the state
of this node. We repeat this process for an equal number of nodes in state –1, chosen randomly
from the set of all nodes in state –1. For each function, we use the same set of randomly selected
nodes for both algorithms. For each function, we report cross-validation results using two
measures of accuracy: precision, the ratio of the number of nodes correctly predicted to have state
1 to the total number of nodes predicted to have state 1, and recall, the ratio of the number of
nodes correctly predicted to have state 1 to the number of nodes in state 1. The accompanying
tab-delimited file contains these results. For seven functions, we obtain a precision and recall of
0. There are two reasons why our previous work reported high precision and recall for these
functions:
1. we considered only a subset of 1000 protein interactions (those reported by at least two
publications).
2. we integrated gene expression data with these interactions.
In this work, we consider the full set of protein interactions in the GRID dataset, and we do not
integrate gene expression data. Therefore, our algorithms must cope with the many spurious
interactions that genome-scale experiments are known to report. As a result, we obtain poor
cross-validation results for some functions.
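The leave-one-out procedure and the two accuracy measures described above can be sketched as follows. The `predict` callable stands in for either algorithm and is hypothetical, as are the function names and the fixed random seed. The sketch also assumes that precision and recall are reported as 0 when their denominators are 0, and that all negatives are used when fewer negatives than positives are available.

```python
import random


def leave_one_out(state, predict, seed=0):
    """Hide each test node in turn and record the predicted state."""
    positives = [n for n, x in state.items() if x == 1]
    negatives = [n for n, x in state.items() if x == -1]
    rng = random.Random(seed)
    # Test every positive node and an equally sized random sample of negatives.
    held_out = positives + rng.sample(negatives, min(len(positives), len(negatives)))
    predictions = {}
    for node in held_out:
        masked = dict(state)
        masked[node] = 0  # treat the node as unannotated
        predictions[node] = predict(masked, node)
    return predictions


def precision_recall(true_state, predictions):
    """Precision and recall for state 1 over the held-out nodes."""
    true_pos = sum(1 for n, p in predictions.items() if p == 1 and true_state[n] == 1)
    predicted_pos = sum(1 for p in predictions.values() if p == 1)
    actual_pos = sum(1 for n in predictions if true_state[n] == 1)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Hypothetical states and a trivial predictor that always answers 1.
state = {"a": 1, "b": 1, "c": -1, "d": -1, "e": -1, "f": 0}
predictions = leave_one_out(state, lambda masked, node: 1)
precision, recall = precision_recall(state, predictions)
```

With the always-positive predictor, two positives and two sampled negatives are tested, so precision is 0.5 and recall is 1.0.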
Additional information and the data used for this analysis can be found at
http://bioinformatics.cs.vt.edu/~murali/papers/art-of-gene-function-prediction.html.
References
1. Cherkassky BV and Goldberg AV. On Implementing the Push-Relabel Method for the
Maximum Flow Problem. Algorithmica, 19(4): 390-410, 1997. Software available at
http://www.avglab.com/andrew/soft.html.
2. Goldberg AV and Tarjan RE. A New Approach to the Maximum-Flow Problem. Journal of
the ACM, 35(4): 921-940, 1988.
3. Karaoz U, Murali TM, Letovsky S, Zheng Y, Ding C, Cantor CR, and Kasif S. Whole
genome annotation using evidence integration in functional linkage networks.
Proceedings of the National Academy of Sciences, pages 2888-2893, 2004.