Chengbang Huang 1, Simon P. Kanaan 1, Stefan Wuchty 2, Danny Z. Chen 1, Jesus A. Izaguirre 1
1 Computer Science and Engineering Department,
University of Notre Dame,
46556 Notre Dame, IN, USA
{CHuang1, SKanaan, Chen, Izaguirr}@nd.edu
2 Physics Department,
University of Notre Dame,
46556 Notre Dame, IN, USA
SWuchty@nd.edu
Abstract.
Protein interactions serve as the chemical basis for all living organisms. Proteins fold and interact in intricate arrangements that provide functionality to the components of a cell, and these components work cooperatively to form whole body systems. Proteins are composed of different domains. Domains are composed of distinct peptides and are the key to the intricate arrangements that drive proteins to fold and interact as they do. A single protein molecule can possess multiple domains, which makes it difficult to discover a simple formula that dictates how protein interactions occur. There is no known way to attribute a protein-protein interaction to a specific domain pair. Yet certain affinities exist between protein domains and are frequently seen in living organisms. This drives researchers to infer the mechanism of protein interactions by focusing on domain interactions as a factor. A method capable of accurately predicting protein interactions suggests functions for proteins that have not yet been tested and helps researchers understand the underlying biological network. The Minimum Set Cover (MSC) approach is able to predict protein interactions with a higher specificity than other methods using the same information while maintaining a high sensitivity. MSC could also aid any existing domain-based method by removing unnecessary domain interactions and thereby cutting down the number of predicted protein interactions. This allows these methods to increase their specificity while maintaining a high sensitivity.
brief intro/Goal/assumptions
The model system used for these proceedings is the yeast cell, with several of its proteins serving as the test cases. The data used for training and testing our methods come from the PFAMA and PFAMB databases, which contain protein-protein interactions and protein structures, that is, the domains contained in each protein.
1.1 The Function of a Protein
A protein is one of the basic building blocks of all living organisms. Proteins are typically composed of an intricate sequence drawn from twenty common amino acids, which are molecules that, in turn, consist of a central carbon atom flanked by four attached groups. Proteins provide evidence for the occurrence of evolution and function to maintain the processes that keep organisms alive, often playing major roles in the catalysis of chemical reactions that control the delicate homeostatic balance. Fundamental life processes such as oxygen transport throughout the body of an organism, filtration of molecules seeking passage through a cell membrane, and duplication of nucleic acids during reproduction all involve proteins.
1.2 Significance of Protein-Protein Interactions
The amino acids of proteins form extremely durable peptide bonds between the carbonyl carbon of one residue and the amide nitrogen of the next to self-assemble into polypeptide chains. These chains serve as the primary protein structure and drive the way a protein folds into its secondary and tertiary structures through interactions between the attached functional groups.
1.3 Domains as the Possible Cause of the Interaction
The primary sequence of a protein gives it a multitude of polar and/or non-polar characteristics that would otherwise prevent interactions from occurring between two proteins. Such an arrangement prevents proteins from aggregating into useless blobs. However, a multitude of proteins associate with each other in living organisms, which suggests that a specific geometric arrangement allows interaction between two or more proteins. Further, a piecewise assembly of multiple proteins allows organisms to detect and correct defects in a protein complex effectively and expediently, as opposed to analyzing a single large, awkward protein for a minuscule error.
AM, MLE, InterDom etc…
The Minimum Set Cover Problem, MSC, is the problem of finding the minimum-size set of sets whose union is equal to the union of all the sets. We approximate the minimum number of domain pairs needed to explain all the protein interactions read from the training data. A protein interaction, (P_1, P_2), is explained if a domain pair, (D_1, D_2), is chosen such that P_1 includes either D_1 or D_2 as one of its domains, while P_2 includes the other as one of its domains. The Minimum Set Cover Problem is an NP-complete problem, which is why we approximate a solution instead of finding the exact solution, which is computationally expensive.
There are different methods to approximate MSC. One method is to choose the most frequently interacting domain pair, that is, the domain pair most commonly observed in interacting proteins. Once this pair is chosen, it covers the set of interacting protein pairs which contain this domain pair. This set of protein pairs is assumed to interact due to this domain pair and therefore does not need to be explained anymore. This process is repeated until there are no more protein pairs to be explained. The union of all the chosen sets is then equal to the training data.
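The count-based greedy choice described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `greedy_msc_by_count`, the dictionary `domains_of` (protein to list of domains), and the alphabetical tie-breaking rule are all hypothetical choices made for the example.

```python
from collections import defaultdict


def greedy_msc_by_count(interactions, domains_of):
    """Greedy set cover: repeatedly pick the domain pair that explains the
    most still-unexplained interacting protein pairs.

    interactions: list of (protein, protein) tuples (assumed input format).
    domains_of:   dict mapping each protein to its domains (assumed format).
    """
    # Map each unordered domain pair to the interactions it could explain.
    covers = defaultdict(set)
    for p1, p2 in interactions:
        for d1 in domains_of[p1]:
            for d2 in domains_of[p2]:
                covers[tuple(sorted((d1, d2)))].add((p1, p2))

    unexplained = set(interactions)
    chosen = []
    while unexplained:
        # Ties are broken alphabetically (an arbitrary choice for this sketch).
        best = max(sorted(covers),
                   key=lambda dp: len(covers[dp] & unexplained),
                   default=None)
        if best is None or not covers[best] & unexplained:
            break  # the remaining interactions share no domain pair
        chosen.append(best)
        unexplained -= covers[best]
    return chosen
```

With proteins A and C both hosting d1 and proteins B and D both hosting d2, the single pair (d1, d2) explains both A-B and C-D, so it is chosen first.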
Another method chooses the domain pair with the greatest probability of interacting. This probability of a domain pair (D_1, D_2) is the number of interacting protein pairs containing (D_1, D_2) divided by the possible number of protein pairs containing (D_1, D_2). The possible number of protein pairs containing (D_1, D_2) is simply the number of proteins containing D_1 multiplied by the number of proteins containing D_2.
The first method relies on the assumption that the most commonly observed interacting domain pair among the protein interactions is probably the cause of those interactions. The second method relies on the completeness and accuracy of the training data. The closer the training data set is to completeness, the closer the calculated probability is to the actual probability.
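As a small worked illustration of the probability used by the second method, the sketch below computes the ratio for one domain pair. The function names and the input format (a dictionary from protein to a set of domains, and a list of interacting protein pairs) are assumptions made for this example; the denominator follows the product rule stated in the text.

```python
from fractions import Fraction


def contains_pair(p1, p2, d1, d2, domains_of):
    """True if one protein hosts d1 while the other hosts d2."""
    a, b = domains_of[p1], domains_of[p2]
    return (d1 in a and d2 in b) or (d2 in a and d1 in b)


def domain_pair_probability(d1, d2, interactions, domains_of):
    """Interacting protein pairs containing (d1, d2), divided by
    (# proteins hosting d1) * (# proteins hosting d2)."""
    interacting = sum(contains_pair(p1, p2, d1, d2, domains_of)
                      for p1, p2 in interactions)
    hosts_d1 = sum(d1 in ds for ds in domains_of.values())
    hosts_d2 = sum(d2 in ds for ds in domains_of.values())
    possible = hosts_d1 * hosts_d2
    # A domain pair that no protein pair can contain gets probability 0.
    return Fraction(interacting, possible) if possible else Fraction(0)
```

For example, if two proteins host d1, two host d2, and only one of the four possible protein pairs is observed to interact, the probability is 1/4.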
3.1 Implementing Minimum Set Cover Method
Implementing MSC with the number of observed interactions as the weight function (the criterion used to choose each set in the approximation) is similar to implementing MSC with the domain-pair probability. Instead of tracking both the numerator and the denominator, as in MSC by probability, only the numerator, the number of interacting protein pairs, is tracked.
3.2 Implementing Minimum Set Cover Method Using Probability
Before going into detail about MSC by probability, let us take care of the input and the data structures used to implement MSC. MSC requires a training data set. This data set needs to include several protein interactions and the protein structure of every protein involved in an interaction, in other words, the domains contained in each such protein. The data set needs to be read in and stored in meaningful data structures. One such data structure is a vector of linked lists, and three of them are recommended. The size of the first vector is the number of proteins available. Every protein has a unique id, which is its index in the vector, and a linked list containing all the domains of that protein. The second data structure is similar to the first, but instead of a list of domains, each protein holds a list of the proteins it interacts with. The third data structure aids in finding the proteins which host a domain. Its size is the number of domains. Every domain has a unique id, which is its index in the vector, and a linked list containing all the proteins which host that domain.
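The three index structures described above might be built as in the following sketch. It is an illustration only: Python lists stand in for the vectors of linked lists, and the function name `build_indexes` and the input format (a list of domain-id lists indexed by protein id, plus a list of interacting protein-id pairs) are assumptions.

```python
def build_indexes(protein_domains, interactions):
    """Build the three lookup structures.

    protein_domains: list where index = protein id and value = list of domain ids.
    interactions:    iterable of (protein id, protein id) pairs.
    """
    n_proteins = len(protein_domains)
    n_domains = 1 + max((d for ds in protein_domains for d in ds), default=-1)

    # 1) protein id -> domains contained in that protein
    domains_of = [list(ds) for ds in protein_domains]

    # 2) protein id -> proteins it interacts with (stored in both directions)
    partners_of = [[] for _ in range(n_proteins)]
    for p, q in interactions:
        partners_of[p].append(q)
        if p != q:
            partners_of[q].append(p)

    # 3) domain id -> proteins which host that domain
    hosts_of = [[] for _ in range(n_domains)]
    for p, ds in enumerate(domains_of):
        for d in ds:
            hosts_of[d].append(p)

    return domains_of, partners_of, hosts_of
```

Using integer ids as indices keeps every lookup a constant-time list access, which is the property the vector-based layout in the text relies on.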
A data structure is also needed in order to choose the domain pair depending on your weight function. One such data structure is a domain-domain matrix. The matrix’s size is n x n , where n is the number of domains available in the training data.
Each node, D_ij, in the matrix represents the probability that the domain pair, (D_i, D_j), is the cause of a protein pair containing (D_i, D_j) interacting. For a protein pair to contain (D_i, D_j), one of the proteins needs to contain either D_i or D_j while the other protein contains the other. The probability of (D_i, D_j) causing the protein-protein interaction is equal to:
(# interacting protein pairs containing (D_i, D_j)) / (# protein pairs containing (D_i, D_j)) . ( 1 )
Instead of storing a double, it is simpler to store two integers, a numerator and a denominator, since this number needs to be updated. Observe the pseudo code.
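One possible way to fill the matrix with integer numerators and denominators is sketched below. This is a hypothetical construction, not the authors' pseudo code: the name `build_domain_matrix` is invented, only the upper triangle (i <= j) is used, and each unordered pair of distinct proteins is counted once, which can differ from the simple product of host counts when a protein hosts both domains.

```python
def build_domain_matrix(domains_of, partners_of, n_domains):
    """Return (num, den), two n_domains x n_domains integer matrices.

    num[i][j]: interacting protein pairs containing domain pair (i, j)
    den[i][j]: all protein pairs containing domain pair (i, j)
    Only entries with i <= j are filled (assumption of this sketch).
    """
    num = [[0] * n_domains for _ in range(n_domains)]
    den = [[0] * n_domains for _ in range(n_domains)]
    n_proteins = len(domains_of)
    for p in range(n_proteins):
        for q in range(p + 1, n_proteins):  # each unordered protein pair once
            # Distinct domain pairs contained in the protein pair (p, q).
            pairs = {(min(x, y), max(x, y))
                     for x in domains_of[p] for y in domains_of[q]}
            interacting = q in partners_of[p]
            for i, j in pairs:
                den[i][j] += 1
                if interacting:
                    num[i][j] += 1
    return num, den
```

Keeping the two integers separate means that removing a protein pair later only requires decrementing counters, with no floating-point error accumulating in the stored ratio.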
After all the training data is read in and this matrix is built, it is time to approximate MSC. Recall that for MSC by probability the domain pair with the greatest probability of interacting is chosen and used to explain a set of protein pairs. This is repeated until there are no more protein pairs to explain. Pseudo code highlighting these steps follows:
MSC by probability (Training Data)
begin
  Maximum := largest probability in Domain matrix;
  D_i := row index of largest probability in Domain matrix;
  D_j := column index of largest probability in Domain matrix;
  while Maximum ≠ 0
  begin
    Add (D_i, D_j) to the chosen domain pairs;
    For every protein P_i which hosts D_i
      For every protein P_j which hosts D_j
        If (P_i, P_j) has not been removed yet
          Remove effect of (P_i, P_j) from Domain matrix;
      end
    end
    Recompute Maximum, D_i and D_j from Domain matrix;
  end
end

Remove effect of (P_i, P_j) from Domain matrix
begin
  For every domain D_x contained in P_i
    For every domain D_y contained in P_j
      If P_i interacts with P_j
        Decrement D_xy numerator by one;
      Decrement D_xy denominator by one;
    end
  end
end.
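The selection loop above can be sketched in runnable form as follows. This is an illustration under assumptions rather than the authors' code: dictionaries replace the domain matrix, the names `msc_by_probability` and `domain_pairs` are hypothetical, ties are broken alphabetically, a 0/0 entry is treated as probability 0, and the probability recorded for each chosen pair is its value at the moment of selection.

```python
from fractions import Fraction
from itertools import combinations


def domain_pairs(p, q, domains_of):
    """Distinct unordered domain pairs contained in the protein pair (p, q)."""
    return {tuple(sorted((x, y))) for x in domains_of[p] for y in domains_of[q]}


def msc_by_probability(domains_of, interactions):
    interacting = {tuple(sorted(pq)) for pq in interactions}
    num, den, members = {}, {}, {}
    for pq in combinations(sorted(domains_of), 2):
        for dp in domain_pairs(*pq, domains_of):
            den[dp] = den.get(dp, 0) + 1
            num[dp] = num.get(dp, 0) + (pq in interacting)
            members.setdefault(dp, []).append(pq)

    def prob(dp):
        return Fraction(num[dp], den[dp]) if den[dp] else Fraction(0)

    removed, chosen = set(), []
    while True:
        best = max(sorted(den), key=prob, default=None)
        if best is None or num[best] == 0:
            break  # nothing left to explain
        chosen.append((best, prob(best)))
        for pq in members[best]:
            if pq in removed:
                continue  # never remove the same protein pair twice
            removed.add(pq)
            for dp in domain_pairs(*pq, domains_of):
                den[dp] -= 1
                num[dp] -= pq in interacting
    return chosen
```

Each pass drives the numerator of the chosen pair to zero, so the loop ends after at most one iteration per domain pair.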
3.3 Method’s Significance
MSC minimizes the number of domain pairs used for predicting protein interactions by choosing the minimum set of domain pairs that explains all the protein interactions in the training data. Minimizing the number of chosen domain pairs decreases the number of predicted protein pairs, which in turn decreases the number of false positive interactions. False positive interactions are protein interactions which are predicted but not contained in the test data set. However, enough domain pairs are chosen to explain the training data set, and therefore these domain pairs can predict at least all the protein pairs in the training data set. Any additional protein pairs which are predicted are less likely to be false positives, as will be seen through experimentation. Two metrics are used to measure how good the protein predictions are: specificity and sensitivity.
Specificity = # matches / # predicted protein interactions . ( 2 )
Sensitivity = # matches / # protein interactions in test data set . ( 3 )
The number of matches is the number of predicted protein interactions which are included in the testing data set. MSC aims to maximize specificity while maintaining a high sensitivity.
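Equations 2 and 3 can be computed directly from two sets of protein pairs, as in the sketch below. The function name and the input format (iterables of two-element protein pairs, treated as unordered) are assumptions made for illustration.

```python
def specificity_and_sensitivity(predicted, test_interactions):
    """Return (specificity, sensitivity) as defined in equations 2 and 3."""
    # frozenset makes (A, B) and (B, A) the same unordered pair.
    predicted = {frozenset(pair) for pair in predicted}
    observed = {frozenset(pair) for pair in test_interactions}
    matches = len(predicted & observed)
    specificity = matches / len(predicted) if predicted else 0.0
    sensitivity = matches / len(observed) if observed else 0.0
    return specificity, sensitivity
```

For instance, four predictions of which two appear in a test set of three interactions give a specificity of 2/4 and a sensitivity of 2/3.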
3.5 Prediction
Once MSC is finished choosing all the domain pairs needed to explain the training data, these domain pairs are used to predict protein interactions. The probability of each observed protein pair is calculated. If the probability is greater than some threshold, the protein pair is predicted to interact, and the number of matches between the predicted data set and the observed data set is increased by one. If instead the probability does not exceed the threshold, the protein pair is predicted not to interact, which increases the number of false negatives by one.
The following pseudo code tests whether an observed interacting protein pair is predicted to interact or not:
Prediction (observed data, Domain matrix)
begin
  For every observed protein P_i
    For every observed protein P_j interacting with P_i
      Predict the probability (P_i, P_j);
      If probability (P_i, P_j) > threshold
        (P_i, P_j) is predicted to interact;
        increment number of matches by one;
      Else
        (P_i, P_j) is not predicted to interact;
        increment number of false_negatives by one;
    end
  end
end

Predict the probability (P_i, P_j)
begin
  non_interaction := 1.0;
  For every domain D_i contained in P_i
    For every domain D_j contained in P_j
      D_ij := see equation 1;
      non_interaction := non_interaction * (1 - D_ij);
    end
  end
  return 1 - non_interaction;
end.
This algorithm can easily be modified to predict all protein interactions given a set of proteins and their structure. Instead of looking at observed protein interactions as in the first for loop, look at all possible protein interactions. As for the “if statement” instead of incrementing the number of matches you increment the number of predicted protein interactions.
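The probability calculation and the all-pairs variant described above might look like the following sketch. The names `interaction_probability` and `predict_all`, the dictionary `pair_prob` (chosen domain pair to probability), and the convention that unchosen domain pairs contribute probability 0 are assumptions made for this example.

```python
from itertools import combinations


def interaction_probability(p, q, domains_of, pair_prob):
    """1 minus the probability that no domain pair of (p, q) interacts."""
    non_interaction = 1.0
    for x in domains_of[p]:
        for y in domains_of[q]:
            # Domain pairs not chosen by MSC are assumed to contribute 0.
            d_xy = pair_prob.get(tuple(sorted((x, y))), 0.0)
            non_interaction *= 1.0 - d_xy
    return 1.0 - non_interaction


def predict_all(domains_of, pair_prob, threshold):
    """Predict over all possible protein pairs, not only the observed ones."""
    predicted = []
    for p, q in combinations(sorted(domains_of), 2):
        if interaction_probability(p, q, domains_of, pair_prob) > threshold:
            predicted.append((p, q))
    return predicted
```

With two candidate domain pairs of probability 0.5 each, a protein pair containing both gets 1 - 0.5 * 0.5 = 0.75, so it passes a threshold of 0.6 while a pair containing neither gets 0.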
4.1 Comparison between AM, MLE, MSC, and MSC by Probability
4.2 Comparison with Other Methods
Any number of different weight functions could be implemented where different weight functions could be more meaningful given the data used. Some weight functions could rely more on the presence of protein interactions, others on the absence of protein interactions, while others could take both into consideration, for example MSC with probability. MSC can also incorporate different methods which rely on domain interaction by using these different methods as weight functions. It can be used to aid the different methods which use multiple sources to determine domain interaction such as InterDom.
Different assumptions could also be made. Instead of looking at a single domain pair as the cause of interactions, one could consider groups of domains interacting with other groups, or interactions which depend on previous interactions, since protein interactions are not independent.
Some work could also be done on the optimization of MSC. It currently uses a matrix which contains a significant amount of wasted space, and it is costly to find the largest number in the matrix every time a domain pair is selected. A heap data structure could be implemented to reduce these expenses.
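One way the heap idea could be realized is a max-heap with lazy invalidation, sketched below with Python's `heapq` module. The class name `LazyMaxHeap` and its interface are hypothetical; this is one possible design, not something the method prescribes. Only domain pairs that actually occur are stored, and stale entries left behind by weight updates are skipped when popped.

```python
import heapq


class LazyMaxHeap:
    """Max-heap over (weight, key) pairs; outdated entries are skipped lazily."""

    def __init__(self, weights):
        self.current = dict(weights)  # key -> latest weight
        # heapq is a min-heap, so weights are negated.
        self.heap = [(-w, k) for k, w in self.current.items()]
        heapq.heapify(self.heap)

    def update(self, key, weight):
        """Record a new weight; the old heap entry becomes stale."""
        self.current[key] = weight
        heapq.heappush(self.heap, (-weight, key))

    def pop_max(self):
        """Return (key, weight) with the largest current weight, or None."""
        while self.heap:
            neg_weight, key = heapq.heappop(self.heap)
            if self.current.get(key) == -neg_weight:
                del self.current[key]
                return key, -neg_weight
        return None
```

Each update and each pop costs logarithmic time in the number of stored entries, instead of a full scan of the matrix for the maximum.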