Supplement

Transforming the global consistency problem into a minimum cut problem

We are given a functional linkage network G with n nodes and m edges. We apply our transformation and algorithms once for each function of interest. We associate a state x_i with each node i between 1 and n. For a specific function f, we set x_i = 1 if protein i is annotated with f, x_i = -1 if protein i is annotated with a different function, and x_i = 0 otherwise. Our goal is to assign a state of 1 or -1 to every node i with initial state x_i = 0 so that the following "energy" is minimized:

$$E = -\sum_{i} \sum_{j} w_{ij} x_i x_j$$

In this equation, w_ij is the weight of the edge connecting node i and node j. When the weight of each edge is positive, minimizing E minimizes the total weight of inconsistent edges in G, i.e., the total weight of edges connecting nodes with different states.

We construct a new graph H from G. Each node of H corresponds to a node of G. H also contains two new nodes: a source node s and a sink node t. Each edge in H is directed. We create the edges as follows:

1. For each node i in G with x_i = 1, we connect node s to node i with an edge directed from s to i.
2. For each node i in G with x_i = -1, we connect node i to node t with an edge directed from i to t.
3. For each edge in G connecting node i and node j such that x_i = 1 and x_j = 0, we connect node i to node j with an edge directed from i to j.
4. For each edge in G connecting node i and node j such that x_i = 0 and x_j = -1, we connect node i to node j with an edge directed from i to j.
5. For each edge in G connecting node i and node j such that x_i = 0 and x_j = 0, we connect node i to node j with two edges, one directed from i to j and the other directed from j to i.
6. We ignore any edge in G that is not incident on a node in state 0.

We set the weight of the edges in H incident on s and t to be infinity. Every other edge has the same weight as the corresponding edge in G.
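The construction above can be sketched in a few lines of Python. This is only an illustrative sketch, not the software used in this study: the function names (`build_h`, `source_side_of_min_cut`, `assign_states`), the toy network, and the reserved node names "s" and "t" are assumptions made for the example, and the minimum cut is found with a simple breadth-first augmenting-path method rather than a push-relabel implementation. It also assumes that no edge of G joins a state 1 node directly to a state -1 node in H, which the construction guarantees by rule 6.

```python
from collections import defaultdict, deque

INF = float("inf")

def build_h(states, edges):
    """Build the capacities of H. states: node -> 1, -1 or 0; edges: (i, j, w) in G."""
    cap = defaultdict(lambda: defaultdict(float))
    for i, x in states.items():
        if x == 1:
            cap["s"][i] = INF            # rule 1: s -> i
        elif x == -1:
            cap[i]["t"] = INF            # rule 2: i -> t
    for i, j, w in edges:
        if states[i] == 0 and states[j] == 0:
            cap[i][j] += w               # rule 5: one edge in each direction
            cap[j][i] += w
        elif states[i] == 0 or states[j] == 0:
            # rules 3 and 4: direct the edge from the higher state to the lower one
            u, v = (i, j) if states[i] > states[j] else (j, i)
            cap[u][v] += w
        # rule 6: edges with no endpoint in state 0 are ignored
    return cap

def source_side_of_min_cut(cap, s="s", t="t"):
    """Return the set S of a minimum s-t cut (simple augmenting-path max flow)."""
    res = defaultdict(lambda: defaultdict(float))
    for u in list(cap):
        for v, c in cap[u].items():
            res[u][v] += c
            res[v][u] += 0.0             # make sure the reverse residual edge exists
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue:                     # breadth-first search in the residual graph
            u = queue.popleft()
            for v, c in list(res[u].items()):
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return set(parent)           # nodes still reachable from s form S
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[a][b] for a, b in path)
        for a, b in path:
            res[a][b] -= bottleneck
            res[b][a] += bottleneck

def assign_states(states, edges):
    """Give every state-0 node the state 1 if it lies in S and -1 otherwise."""
    side = source_side_of_min_cut(build_h(states, edges))
    return {i: (1 if i in side else -1) for i, x in states.items() if x == 0}

# Hypothetical toy network with unit weights (not the network in the figure).
states = {"a": 1, "b": 1, "u": 0, "v": 0, "c": -1}
edges = [("a", "u", 1.0), ("b", "u", 1.0), ("a", "v", 1.0),
         ("u", "v", 1.0), ("v", "c", 1.0)]
predicted = assign_states(states, edges)
print(predicted)  # {'u': 1, 'v': 1}
```

In this toy network the only minimum cut is the single edge from v to c, so both state 0 nodes receive state 1. A state 0 node with no edges in H is placed in T by this sketch; that choice is arbitrary, since such a node does not affect the weight of the cut.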
An s-t cut in the graph H is a partition of the nodes of H into two sets S and T, where S contains the node s and T contains the node t. An edge (u, v) crosses the cut if u lies in S and v lies in T. The weight of the cut is the total weight of the edges crossing the cut. We compute the s-t cut C with smallest weight in H. For each node i that is in S, we set x_i = 1. We set x_i = -1 for every node i in T.

The edges incident on s and t (edges of types 1 and 2) do not participate in C since they have weight infinity and a cut of finite weight always exists. Hence, the only edges used in C are incident on nodes with state equal to 0. Each edge in C of type 3 or 4 corresponds to one inconsistent edge in G. At most one edge in every pair of edges of type 5 can belong to C; by definition, the edge in the pair directed from a node in T to a node in S does not belong to the cut. Therefore, each edge in C corresponds to exactly one inconsistent edge in G. Conversely, each inconsistent edge in G has exactly one corresponding edge in C. Since C has the smallest weight among all s-t cuts, this state assignment minimizes the total weight of inconsistent edges in G.

To compute the minimum cut, we use the software developed by Cherkassky and Goldberg [1] to implement the Goldberg-Tarjan algorithm for computing maximum flows and minimum cuts in a graph [2]. This algorithm runs in O(nm log(n^2/m)) time.

The following figure depicts an example of this transformation. The graph on the top is G and the graph on the bottom is H. All edges in G have weight 1. The minimum cut in H is the edge connecting the node in state 0 to the node in state -1. Thus, the minimum cut algorithm assigns a state of 1 to both nodes in state 0.

Figure: Transforming the functional linkage graph into a directed graph suitable for computing a minimum cut.

Cross Validation Results

We used the following datasets in our study.
We obtained protein interactions for Saccharomyces cerevisiae from the General Repository of Interaction Datasets (GRID). Each protein interaction corresponds to an edge in the graph G. We obtained functional annotations for S. cerevisiae from the Gene Ontology (GO). We modified these annotations as in our earlier work [3]:

1. If a function f annotates a gene g, then we annotate g with every ancestor f' of f in the directed acyclic graph defined by the parent-child relationships between the functions in GO. In other words, when we apply our algorithms to predict function f', the node corresponding to gene g is in state 1.
2. If a function f is the most specific annotation for a gene g, i.e., no descendant f' of f annotates g, then we set the annotation status of g with respect to each descendant f' of f as unknown. In other words, when we apply our algorithms to predict function f', the node corresponding to gene g is in state 0.

We report results for 82 functions in the Gene Ontology that yielded high precision and recall in our previous study [3]; proteins predicted to have these functions are good candidates for experimental validation. We obtained these functions as follows. In our earlier study [3], we considered a subset of 1000 protein interactions in the GRID dataset that were reported by at least two publications. We integrated this dataset with a compendium of gene expression profiles. We reported predicted annotations only for those functions that annotated no more than 100 proteins and had at least 75% precision and recall during cross-validation. There were 97 such functions. Since that time, the GO consortium has declared 15 of these functions to be "obsolete." In this work, we report results for the remaining 82 functions. Note that in this work, we apply the min-cut and local rule algorithms to the network formed only by protein interactions; we do not integrate gene expression data with the protein interactions.
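The two annotation rules above, together with the definition of the states, can be illustrated with a short Python sketch. It reflects one reading of the rules and is not the code used in this study: the names `ancestors` and `node_state`, the toy hierarchy, and the choice to give state 0 to a gene with no annotations are assumptions made for the example.

```python
def ancestors(term, parents):
    """Return all proper ancestors of term; parents maps a term to its parent terms."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, ()))
    return seen

def node_state(direct, target, parents):
    """State of a gene with direct annotations `direct` when predicting `target`."""
    if not direct:
        return 0                         # assumption: unannotated genes are unknown
    # Rule 1: propagate every annotation to all of its ancestors.
    closure = set(direct)
    for f in direct:
        closure |= ancestors(f, parents)
    if target in closure:
        return 1
    # Rule 2: descendants of a most specific annotation are unknown.
    has_annotated_child = {p for f in closure for p in parents.get(f, ())}
    most_specific = closure - has_annotated_child
    if ancestors(target, parents) & most_specific:
        return 0
    return -1                            # annotated, but only with other functions

# Hypothetical toy hierarchy: root is the parent of A and B; A is the parent of A1.
parents = {"A": {"root"}, "B": {"root"}, "A1": {"A"}}
print(node_state({"A"}, "root", parents))  # 1
print(node_state({"A"}, "A1", parents))    # 0
print(node_state({"A"}, "B", parents))     # -1
```

For a gene annotated only with A, the ancestor root is in state 1, the descendant A1 is unknown, and the unrelated function B is in state -1.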
We apply the two algorithms one function at a time. For each function f, we set the state of a node to 1 if the corresponding protein is annotated with f, to -1 if the protein is annotated with a function other than f, and to 0 otherwise. We use leave-one-out cross-validation to test the methods. For each node in state 1, we set its state to 0, execute the min-cut algorithm and the guilt-by-association algorithm, and separately record whether each algorithm correctly predicted the state of this node. We repeat this process for an equal number of nodes in state -1, chosen randomly from the set of all nodes in state -1. For each function, we use the same set of randomly selected nodes for both algorithms.

For each function, we report cross-validation results using two measures of accuracy: precision, the ratio of the number of nodes correctly predicted to have state 1 to the total number of nodes predicted to have state 1, and recall, the ratio of the number of nodes correctly predicted to have state 1 to the number of nodes in state 1. The accompanying tab-delimited file contains these results.

For seven functions, we obtain a precision and recall of 0. There are two reasons why our previous work reported high precision and recall for these functions:

1. We considered only a subset of 1000 protein interactions (those reported by at least two publications).
2. We integrated gene expression data with these interactions.

In this work, we consider the full set of protein interactions in the GRID dataset, and we do not integrate gene expression data. Therefore, our algorithms must cope with the many spurious interactions that genome-scale experiments are known to report. As a result, we obtain poor cross-validation results for some functions.

Additional information and the data used for this analysis can be found at http://bioinformatics.cs.vt.edu/~murali/papers/art-of-gene-function-prediction.html.

References

1. Cherkassky BV and Goldberg AV.
On Implementing the Push-Relabel Method for the Maximum Flow Problem. Algorithmica, 19(4): 390-410, 1997. Software available at http://www.avglab.com/andrew/soft.html.
2. Goldberg AV and Tarjan RE. A New Approach to the Maximum-Flow Problem. Journal of the ACM, 35(4): 921-940, 1988.
3. Karaoz U, Murali TM, Letovsky S, Zheng Y, Ding C, Cantor CR, and Kasif S. Whole genome annotation using evidence integration in functional linkage networks. Proceedings of the National Academy of Sciences, pages 2888-2893, 2004.