Predicting Protein-Protein Interactions from Protein Domains Using a Set Cover Approach

Chengbang Huang1, Simon P. Kanaan1, Stefan Wuchty2, Danny Z. Chen1, Jesus A. Izaguirre1

1 Computer Science and Engineering Department, University of Notre Dame, 46556 Notre Dame, IN, USA
{CHuang1, SKanaan, Chen, Izaguirr}@nd.edu
2 Physics Department, University of Notre Dame, 46556 Notre Dame, IN, USA
SWuchty@nd.edu

Abstract. This project sets out to predict interactions between proteins with known structures by inferring domain interactions from a protein-protein interaction network. The Minimum Set Cover (MSC) approach predicts protein interactions with a higher specificity than other methods using the same information, while maintaining a high sensitivity. Results show that the MSC method is superior to the Maximum Likelihood Estimator (MLE) and the association method (AM) in terms of average specificity and sensitivity values.

1 Introduction

1.1 The Function of a Protein

Protein interactions serve as the chemical basis for all living organisms. Proteins fold and interact in intricate arrangements that provide functionality to the components of a cell. These components work cooperatively to form whole body systems.

1.2 Significance of Protein-Protein Interactions

The primary sequence of a protein gives it a multitude of polar and/or non-polar characteristics that would ordinarily prevent interactions from occurring between two proteins. This arrangement prevents proteins from aggregating into useless blobs. However, a multitude of proteins do associate with each other in living organisms, which suggests that a specific geometric arrangement allows interaction between two or more proteins. Further, a piece-wise assembly of multiple proteins allows organisms to detect and correct defects in a protein complex effectively and expediently, as opposed to analyzing a single large, awkward protein for a minuscule error.

1.3 Domains as the Possible Cause of the Interaction

Large proteins are composed of different domains. In turn, these domains are composed of distinct peptides and are the key to the intricate arrangements that drive proteins to fold and interact as they do. A single protein molecule can possess multiple domains, which makes it difficult to discover a simple formula that dictates how protein interactions occur. There is no known way to attribute a protein-protein interaction to a specific domain pair. Yet certain affinities exist between protein domains and are frequently seen in living organisms. This drives researchers to infer the mechanism of protein interactions by focusing on domain interactions as a factor.

2 Related Work

InterDom is a database of domain interactions. These interactions are derived using four different methods. The first method looks at protein interactions: a domain interaction is assigned the probability 1/(m*n), where m is the number of domains in the first interacting protein and n is the number of domains in the second interacting protein. The domain interactions derived from each method are assigned a confidence score, and the scores are summed so that a domain interaction obtained by several methods has a higher confidence score. The second method is domain fusion, which looks at interacting proteins in different organisms and observes whether they are separate in one organism while fused into a single protein chain in another. The third method is the scientific literature, which is mined using text mining techniques.
The last method incorporated into InterDom is protein complexes: in a complex of N proteins, a domain interaction has a minimum probability of 1/(m*n*(N choose 2)). To limit false positives, the authors eliminate domain interactions that have a low confidence score, that contain an uncommon ("rare") domain which occurs only once, or that involve domains with more than 50 putative (accepted) interactions. They give 1.5 as the cut-off score.

MLE is short for Maximum Likelihood Estimator. It chooses the domain-pair interaction probabilities which maximize the likelihood of the observed protein interactions. It performs better than the association method (AM) at any given specificity. The association method looks at the probability of a domain pair being the cause of the observed interactions. This probability is Imn/Nmn, where Imn is the number of observed interactions containing the domain pair (m, n), and Nmn is the number of possible interactions, that is, the number of proteins containing one domain of the pair multiplied by the number of proteins containing the other domain.

Another paper looks at protein sequence signatures to predict protein interactions. A sequence signature is called correlated when it appears among interacting proteins more frequently than expected at random. The authors present the method only as an aid that decreases the search space for interacting proteins. One of their tests removes one protein pair at a time and checks whether this pair is still predicted. A pair counts as predicted if at least two other interacting protein pairs contain the same sequence signature, which gave them 94% sensitivity. They also ran the test with a requirement of 1 instead of 2 and obtained 97% sensitivity. They do not, however, calculate specificity (SP), claiming that it would be hard to estimate until all of these interactions are tested experimentally and either validated or rejected.
They conclude that looking at sequence signatures is a valid method and that correlated sequence signatures are a good guide for experimental methods.

Another paper models protein-protein interactions as a graph in which proteins are the vertices and protein interactions are the edges. The authors initialize the graph so that each protein interacts with only one other protein, and then look at different methods to connect the graph. They mention two ways of clustering the graph. The unsupervised way uses no domain knowledge. The supervised way clusters nodes based on their biological properties. They focus on the supervised approach. They can predict protein-protein interactions at a rate of 30% without requiring any detectable sequence similarity between the query protein and a protein of known structure. They mention that a key difficulty is that the characterizations of most biological systems are not complete. For testing they generate three random sets of test nodes for which the SCOP (Structural Classification of Proteins) classification is known. Each set contains roughly 50 nodes, approximately 1% of the overall set of 4556 nodes. The data come from two sources, yeast two-hybrid and a database of interacting domains. For the three test sets they got 25-30% of assignments correct.

3 Methods & Implementation

The Minimum Set Cover Problem (MSC) is the problem of finding the minimum-size set of sets whose union is equal to the union of all the sets. We approximate the minimum number of domain pairs needed to explain all the protein interactions read from the training data. A protein interaction (P1, P2) is explained if a domain pair (D1, D2) is chosen such that P1 includes either D1 or D2 as one of its domains, while P2 includes the other as one of its domains.
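
The definition above can be turned into a small set cover instance. The following Python sketch is illustrative only: the function names (`explains`, `candidate_sets`, `greedy_cover`) and the dictionary-based input format are assumptions, not part of the original implementation. It maps every domain pair to the set of interactions it explains, and then applies a greedy heuristic (repeatedly take the domain pair that explains the most still-unexplained interactions), which is one standard way to approximate a minimum set cover.

```python
from itertools import product


def explains(domain_pair, p1_domains, p2_domains):
    """True if one protein hosts one domain of the pair and the other hosts the other."""
    d1, d2 = domain_pair
    return (d1 in p1_domains and d2 in p2_domains) or (
        d2 in p1_domains and d1 in p2_domains
    )


def candidate_sets(interactions, domains_of):
    """Map each unordered domain pair to the set of interactions it explains."""
    cover = {}
    for p1, p2 in interactions:
        for d1, d2 in product(domains_of[p1], domains_of[p2]):
            cover.setdefault(frozenset((d1, d2)), set()).add((p1, p2))
    return cover


def greedy_cover(cover, interactions):
    """Greedy approximation: pick the pair explaining the most unexplained interactions."""
    unexplained = set(interactions)
    chosen = []
    while unexplained and cover:
        best = max(cover, key=lambda pair: len(cover[pair] & unexplained))
        gained = cover[best] & unexplained
        if not gained:
            break  # the remaining interactions cannot be explained by any domain pair
        chosen.append(best)
        unexplained -= gained
    return chosen
```

In this sketch an interaction between proteins that share no listed domains simply stays unexplained. Ties between equally good domain pairs are broken by dictionary order, which is an arbitrary choice.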
The Minimum Set Cover Problem is NP-complete, which is why we approximate a solution instead of computing the exact solution, which is computationally expensive. There are different algorithms to approximate MSC. One method is to choose the most frequently interacting domain pair, that is, the domain pair most commonly observed in interacting proteins. Once this pair is chosen, it covers the set of interacting protein pairs which contain this domain pair. These protein pairs are assumed to interact because of this domain pair and therefore no longer need to be explained. This process is repeated until there are no more protein pairs to be explained. The union of all the chosen sets is then equal to the training data.

Another method chooses the domain pair with the greatest probability of interacting. The probability of a domain pair (D1, D2) is the number of interacting protein pairs containing (D1, D2) divided by the possible number of protein pairs containing (D1, D2). The possible number of protein pairs containing (D1, D2) is simply the number of proteins containing D1 multiplied by the number of proteins containing D2.

The first method relies on the assumption that the most commonly observed interacting domain pair among the protein interactions is probably the cause of those interactions. The second method relies on the completeness and accuracy of the training data. The closer the training data set is to completeness, the closer the calculated probability is to the actual probability.

3.1 Implementing Minimum Set Cover Method

3.2 Implementing Minimum Set Cover Method Using Probability

Before going into detail about MSC by probability, we describe the input and the data structures used to implement MSC. MSC requires a training data set.
This data set needs to include a number of protein interactions and the structure of every protein involved in an interaction, in other words the domains contained in each such protein. The data set needs to be read in and stored in meaningful data structures. One suitable data structure is a vector of linked lists, and three such structures are recommended.

The size of the first vector is the number of proteins available. Every protein has a unique id, which is its index in the vector. Every protein also has a linked list of all the domains it contains. The second data structure is similar to the first, but instead of a list of domains each protein has a list of the proteins it interacts with. The third data structure helps to find the proteins which host a domain. The vector's size is the number of domains. Every domain has a unique id, which is its index in the vector. Every domain also has a linked list of all the proteins which host it.

A data structure is also needed in order to choose the domain pair according to the weight function. One such data structure is a domain-domain matrix. The matrix's size is n x n, where n is the number of domains available in the training data. Each entry Dij in the matrix represents the probability that the domain pair (Di, Dj) is the cause of the interaction of a protein pair containing (Di, Dj). For a protein pair to contain (Di, Dj), one of the proteins needs to contain either Di or Dj while the other protein contains the other. The probability of (Di, Dj) causing the protein-protein interaction is equal to:

Dij = (# interacting protein pairs containing (Di, Dj)) / (# protein pairs containing (Di, Dj)) . (1)

Instead of storing a double it is simpler to store two integers, the numerator and the denominator, since this number needs to be updated. Observe the pseudo code.
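
As a rough illustration of the structures described above, the following Python sketch builds them from a small training set. It is not the original implementation: Python lists stand in for the vectors of linked lists, the names `build_structures` and `build_domain_matrix` are hypothetical, and storing each unordered domain pair only in the upper triangle (i <= j) is an assumption made here for brevity. The denominator follows the definition given earlier, the number of proteins hosting Di multiplied by the number of proteins hosting Dj.

```python
from itertools import product


def build_structures(domains_of, interactions, n_domains):
    """domains_of[p] lists the domain ids of protein p (index = protein id).

    Returns partners_of (protein id -> interacting proteins) and
    hosts_of (domain id -> proteins hosting that domain).
    """
    partners_of = [[] for _ in domains_of]
    for p, q in interactions:
        partners_of[p].append(q)
        partners_of[q].append(p)
    hosts_of = [[] for _ in range(n_domains)]
    for p, domains in enumerate(domains_of):
        for d in domains:
            hosts_of[d].append(p)
    return partners_of, hosts_of


def build_domain_matrix(domains_of, interactions, hosts_of):
    """Return integer numerator and denominator matrices for the domain pairs."""
    n = len(hosts_of)
    # Denominator: proteins hosting Di times proteins hosting Dj.
    denom = [[len(hosts_of[i]) * len(hosts_of[j]) for j in range(n)] for i in range(n)]
    numer = [[0] * n for _ in range(n)]
    for p, q in interactions:
        # Count each unordered domain pair once per interacting protein pair.
        pairs = {
            (min(a, b), max(a, b))
            for a, b in product(domains_of[p], domains_of[q])
        }
        for i, j in pairs:
            numer[i][j] += 1
    return numer, denom
```

With these two integer matrices the probability of a domain pair (Di, Dj) with i <= j is `numer[i][j] / denom[i][j]` whenever the denominator is non-zero.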

After all the training data is read in and this matrix is built, it is time to approximate MSC. Recall that for MSC by probability the domain pair with the greatest probability of interacting is chosen and used to explain a set of protein pairs. This is repeated until there are no more protein pairs to explain. Pseudo code highlighting these steps follows:

MSC by probability (Training Data)
begin
    Maximum := largest number in Domain matrix;
    while Maximum ≠ 0
    begin
        Di := row index of largest number in Domain matrix;
        Dj := column index of largest number in Domain matrix;
        Add (Di, Dj) to the chosen domain pairs;
        For every protein Pi which hosts Di
            For every protein Pj which hosts Dj and interacts with Pi
                If (Pi, Pj) is not yet explained
                    Mark (Pi, Pj) as explained;
                    Remove effect of (Pi, Pj) from Domain matrix;
        Maximum := largest number in Domain matrix;
    end
end.

Remove effect of (Pi, Pj) from Domain matrix
begin
    For every domain Dx contained in Pi
        For every domain Dy contained in Pj
            Decrement Dxy numerator by one;
            Decrement Dxy denominator by one;
end.

3.3 Method's Significance

MSC minimizes the number of domain pairs used for predicting protein interactions by choosing the minimum set of domain pairs which explain all the protein interactions in the training data. Minimizing the number of chosen domain pairs decreases the number of predicted protein pairs, which in turn decreases the number of false positive interactions. False positive interactions are protein interactions which are predicted but not contained in the test data set. However, enough domain pairs are chosen to explain the training data set, and therefore these domain pairs can predict at least all the protein pairs in the training data set. Any additional protein pairs which are predicted are less likely to be false positives, as will be seen through experimentation.

Two metrics are used to measure how good the protein predictions are, specificity and sensitivity.

Specificity = # matches / # predicted protein interactions . (2)

Sensitivity = # matches / # protein interactions in test data set . (3)

The number of matches is the number of predicted protein interactions which are included in the testing data set. MSC aims to maximize specificity while maintaining a high sensitivity.

3.5 Prediction

Once MSC has chosen all the domain pairs needed to explain the training data, these domain pairs are used to predict protein interactions. The probability of each observed protein pair is calculated. If the probability is greater than some threshold, the protein pair is predicted to interact, and the number of matches between the predicted data set and the observed data set is increased by one. If the probability does not exceed the threshold, the protein pair is predicted not to interact, which increases the number of false negatives by one. The following pseudo code tests whether an observed interacting protein pair is predicted to interact or not:

Prediction (observed data, Domain matrix)
begin
    For every observed protein Pi
        For every observed protein Pj interacting with Pi
            probability := Predict the probability(Pi, Pj);
            If probability > threshold
                (Pi, Pj) is predicted to interact;
                increment number of matches by one;
            Else
                (Pi, Pj) is not predicted to interact;
                increment number of false_negatives by one;
end.

Predict the probability(Pi, Pj)
begin
    non_interaction := 1.0;
    For every domain Di contained in Pi
        For every domain Dj contained in Pj
            Dij := probability that (Di, Dj) causes an interaction (domain matrix entry);
            non_interaction := non_interaction * (1 - Dij);
    return 1 - non_interaction;
end.

This algorithm can easily be modified to predict all protein interactions given a set of proteins and their structure. Instead of looping over the observed protein interactions, as in the first for loop, loop over all possible protein pairs. In the "if statement", increment the number of predicted protein interactions instead of the number of matches.
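
The prediction step and the two metrics can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the original implementation: the names `interaction_probability` and `evaluate` are hypothetical, the domain-pair probabilities are assumed to be given as a dictionary keyed by sorted domain pairs, and a domain pair missing from the dictionary is treated as having probability 0. A protein pair is taken to interact unless every one of its domain pairs fails to interact, so the non-interaction probabilities (1 - Dij) are multiplied together.

```python
def interaction_probability(p_domains, q_domains, domain_prob):
    """Probability that two proteins interact: 1 - product of (1 - Dij)."""
    non_interaction = 1.0
    for di in p_domains:
        for dj in q_domains:
            key = (min(di, dj), max(di, dj))  # unordered domain pair
            non_interaction *= 1.0 - domain_prob.get(key, 0.0)
    return 1.0 - non_interaction


def evaluate(domains_of, candidate_pairs, test_interactions, domain_prob, threshold):
    """Return (specificity, sensitivity) as defined in equations (2) and (3)."""
    test = {frozenset(pair) for pair in test_interactions}
    predicted = {
        frozenset((p, q))
        for p, q in candidate_pairs
        if interaction_probability(domains_of[p], domains_of[q], domain_prob) > threshold
    }
    matches = len(predicted & test)
    specificity = matches / len(predicted) if predicted else 0.0
    sensitivity = matches / len(test) if test else 0.0
    return specificity, sensitivity
```

Passing only the observed interactions as `candidate_pairs` corresponds to the test described above, while passing all possible protein pairs corresponds to the modified algorithm that predicts every interaction.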

[Figure: workflow diagram with the elements Protein-Protein Interactions (Test Set); Protein (Test Set); Protein-Protein Interactions (Training); Domain-Domain Interactions; MSC; Protein Structure (Swiss PFAM); Protein-Protein Interactions Prediction; Metrics (Specificity, Sensitivity, etc.); Analysis; Topological Analysis]

4 Results

4.1 Comparison between AM, MLE, MSC, and MSC by Probability

4.2 Comparison with Other Methods

5 Future Work

Any number of different weight functions could be implemented, and some weight functions could be more meaningful than others for the data used. Some weight functions could rely more on the presence of protein interactions, others on the absence of protein interactions, while others could take both into consideration, as MSC with probability does. MSC can also incorporate other methods which rely on domain interactions by using them as weight functions. It can be used to aid methods which use multiple sources to determine domain interactions, such as InterDom.

Different assumptions could also be made. Instead of looking at a single domain pair as the cause of an interaction, one could consider groups of domains interacting with other groups, or interactions which depend on previous interactions, since protein interactions are not independent.

Some work could also be done on the optimization of MSC. It currently uses a matrix which contains a significant amount of wasted space, and it is costly to find the largest number in the matrix every time a domain pair is selected. A heap data structure could be implemented to reduce these costs.

6 References