Predicting Protein-Protein Interactions from Protein
Domains Using a Set Cover Approach
Chengbang Huang¹, Simon P. Kanaan¹, Stefan Wuchty², Danny Z. Chen¹, Jesus A.
Izaguirre¹
¹ Computer Science and Engineering Department,
University of Notre Dame,
46556 Notre Dame, IN, USA
{CHuang1, SKanaan, Chen, Izaguirr}@nd.edu
² Physics Department,
University of Notre Dame,
46556 Notre Dame, IN, USA
SWuchty@nd.edu
Abstract. This project aims to predict interactions between proteins with
known structures by inferring domain interactions from a protein-protein
interaction network. The Minimum Set Cover (MSC) approach predicts protein
interactions with higher specificity than other methods that use the same
information, while maintaining high sensitivity. Our results show that the
MSC method is superior to the Maximum Likelihood Estimator (MLE) and the
Association Method (AM) in terms of average specificity and sensitivity.
1 Introduction
1.1 The Function of a Protein
Protein interactions serve as the chemical basis for all living organisms. Proteins fold
and interact in intricate arrangements that provide functionality to the components of
a cell. These components work cooperatively to form whole body systems.
1.2 Significance of Protein-Protein Interactions
The primary sequence of a protein gives it a multitude of polar and non-polar
characteristics that would otherwise prevent two proteins from interacting. This
arrangement keeps proteins from aggregating into useless blobs. Nevertheless, a
multitude of proteins associate with each other in living organisms, which suggests
that a specific geometric arrangement allows two or more proteins to interact.
Further, the piece-wise assembly of multiple proteins allows organisms to detect and
correct defects in a protein complex effectively and expediently, as opposed to
analyzing a single large, awkward protein for a minuscule error.
1.3 Domains as the Possible Cause of the Interaction
Large proteins are composed of different domains. In turn, these domains are
composed of distinct peptides and are the key to the intricate arrangements that drive
proteins to fold and interact as they do. A single protein molecule can possess
multiple domains, which makes it difficult to discover a simple formula that dictates
how protein interactions occur. There is no known way to attribute a protein-protein
interaction to a specific domain pair. Yet certain affinities exist between protein
domains and are frequently seen in living organisms. This drives researchers to
infer the mechanism of protein interactions by focusing on domain interactions as a
factor.
2 Related Work
InterDom is a database of domain interactions derived using four different methods.
The first method looks at protein interactions: a domain interaction is assigned the
probability 1/(m*n), where m is the number of domains in the first interacting protein
and n is the number of domains in the second interacting protein. The domain
interactions derived from each method are assigned a confidence score, and the scores
are summed so that a domain interaction obtained using several methods has a higher
confidence score. The other three methods are as follows. Domain fusion looks at
interacting proteins in different organisms and observes whether they are separate in
one organism while fused into a single protein chain in another. Scientific literature
is searched using text mining techniques. The last method incorporated into InterDom
is protein complexes, which looks at a complex of N proteins, where a domain
interaction has a minimum probability of 1/(m*n*(N choose 2)).
The InterDom authors are also concerned with false positives. They eliminate domain
interactions that have a low confidence score, that contain an uncommon "rare" domain
which occurs only once, or that involve domains with more than 50 putative, accepted
interactions. They give 1.5 as the cut-off score.
MLE is short for Maximum Likelihood Estimator. This method chooses the domain
pairs which maximize the likelihood of the observed protein interactions. It performs
better than the association method (AM) at any given specificity. The association
method looks at the probability of a domain pair being the cause of interacting
protein pairs. The probability is Imn/Nmn, where Imn is the number of observed
interactions containing the domain pair (m, n) and Nmn is the number of possible
interactions, that is, the number of proteins containing one domain of the pair
multiplied by the number of proteins containing the other domain.
Another paper uses protein sequence signatures to predict protein interactions. A
sequence signature is called correlated when it appears among interacting proteins
more frequently than expected at random. The authors present the method as an aid
for reducing the search space for interacting proteins. One of their tests removes one
protein pair at a time and checks whether this pair is still predicted. A pair was
considered predicted if there were still two other interacting protein pairs containing
the same sequence signature, which gave 94% sensitivity. Requiring only one other
pair gave 97% sensitivity. They do not, however, calculate specificity, claiming that
it would be hard to estimate until all of these interactions are tested experimentally
and validated or rejected. They conclude that looking at sequence signatures is a
valid method and that correlated sequence signatures are a good guide for
experimental methods.
One paper models protein-protein interactions using a graph in which proteins are
the vertices and protein interactions are the edges. The graph is initialized with each
protein interacting with only one other protein, and different methods are then
considered to connect the graph. The authors mention two ways of clustering the
graph. The unsupervised way uses no domain knowledge. The supervised way clusters
nodes based on their biological properties. They focus on the second approach. They
can predict protein-protein interactions with about 30% accuracy without requiring
any detectable sequence similarity between the query protein and a protein of known
structure. They mention that a key difficulty is that the characterizations of most
biological systems are not complete. For testing, they generate three random sets of
test nodes for which the SCOP (Structural Classification of Proteins) classification is
known. Each set contains roughly 50 nodes, which is approximately 1% of the overall
set of 4556 nodes. The data come from two sources: yeast two-hybrid experiments and
a database of interacting domains. For the three test sets, 25-30% of the assignments
were correct.
3 Methods & Implementation
The Minimum Set Cover (MSC) problem is the problem of finding the smallest
sub-collection of a given collection of sets whose union is equal to the union of all
the sets in the collection. We approximate the minimum number of domain pairs
needed to explain all the protein interactions read from the training data. A protein
interaction (P1, P2) is explained if a domain pair (D1, D2) is chosen such that P1
includes either D1 or D2 as one of its domains, while P2 includes the other as one of
its domains. The Minimum Set Cover problem is NP-hard (its decision version is
NP-complete), which is why we approximate a solution instead of finding the exact
solution, which is computationally expensive.
There are different algorithms to approximate MSC. One method is to choose the
most frequently interacting domain pair, that is, the most common domain pair
observed in interacting proteins. Once this pair is chosen, it covers the set of
interacting protein pairs which contain this domain pair. These protein pairs are
assumed to interact due to this domain pair and therefore do not need to be explained
any further. This process is repeated until there are no more protein pairs to be
explained. The union of all the chosen sets is then equal to the training data.
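The first method can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the function names, the dictionary-based input format, and the toy proteins and domains are all assumptions made for the example, and ties are broken alphabetically, which is an arbitrary choice.

```python
from collections import defaultdict
from itertools import product


def pair(a, b):
    """Order-independent key for an unordered pair."""
    return (a, b) if a <= b else (b, a)


def greedy_msc(interactions, domains):
    """Greedy set cover: repeatedly pick the most frequent domain pair.

    interactions: list of (protein, protein) pairs from the training data.
    domains: dict mapping every protein to the set of domains it hosts.
    """
    # covers[domain pair] = training interactions that this pair could explain
    covers = defaultdict(set)
    for p1, p2 in interactions:
        for d1, d2 in product(domains[p1], domains[p2]):
            covers[pair(d1, d2)].add(pair(p1, p2))

    unexplained = set()
    for explained in covers.values():
        unexplained |= explained

    chosen = []
    while unexplained:
        # Domain pair seen in the most still-unexplained interactions;
        # max() keeps the first maximum, so ties go to the alphabetically first pair.
        best = max(sorted(covers), key=lambda dp: len(covers[dp] & unexplained))
        chosen.append(best)
        unexplained -= covers[best]
    return chosen


# Hypothetical toy data: five proteins, four domains, three interactions.
toy_domains = {"P1": {"A"}, "P2": {"B"}, "P3": {"A", "C"}, "P4": {"B"}, "P5": {"D"}}
toy_interactions = [("P1", "P2"), ("P3", "P4"), ("P3", "P5")]
print(greedy_msc(toy_interactions, toy_domains))  # [('A', 'B'), ('A', 'D')]
```

In the toy data, the pair (A, B) explains two interactions and is chosen first. The remaining interaction (P3, P5) can be explained by either (A, D) or (C, D), so the result depends on the tie-breaking rule.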
Another method chooses the domain pair with the greatest probability of
interacting. This probability of a domain pair (D1, D2) is the number of interacting
protein pairs containing (D1, D2) divided by the possible number of protein pairs
containing (D1, D2). The possible number of protein pairs containing (D1, D2) is
simply the number of proteins containing D1 multiplied by the number of proteins
containing D2.
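As a small worked example of this weight, the sketch below computes the probability for a few domain pairs. The function name, the input format, and the toy data are assumptions made for illustration; the denominator follows the description above (hosts of D1 multiplied by hosts of D2).

```python
def domain_pair_probability(d1, d2, interactions, domains):
    """Observed interacting protein pairs containing (d1, d2), divided by
    the number of possible protein pairs containing (d1, d2)."""
    hosts1 = {p for p, ds in domains.items() if d1 in ds}
    hosts2 = {p for p, ds in domains.items() if d2 in ds}
    possible = len(hosts1) * len(hosts2)
    observed = sum(
        1
        for a, b in interactions
        # One protein hosts d1 and the other hosts d2, in either order.
        if (a in hosts1 and b in hosts2) or (a in hosts2 and b in hosts1)
    )
    return observed / possible if possible else 0.0


# Hypothetical toy data: five proteins, four domains, three interactions.
toy_domains = {"P1": {"A"}, "P2": {"B"}, "P3": {"A", "C"}, "P4": {"B"}, "P5": {"D"}}
toy_interactions = [("P1", "P2"), ("P3", "P4"), ("P3", "P5")]

print(domain_pair_probability("A", "B", toy_interactions, toy_domains))  # 0.5
print(domain_pair_probability("C", "D", toy_interactions, toy_domains))  # 1.0
```

The pair (C, D) is seen only once but scores 1.0, because the single protein pair that could contain it does interact. This shows how strongly the weight depends on the training data being complete.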
The first method relies on the assumption that the most common observed
interacting domain pair among the protein interactions is probably the cause of the
protein interactions. The second method relies on the completeness and accuracy of
the training data. The closer the training data set is to completeness the closer the
calculated probability is to the actual probability.
3.1 Implementing Minimum Set Cover Method
3.2 Implementing Minimum Set Cover Method Using Probability
Before going into detail about MSC by probability, we describe the input and the
data structures used to implement MSC. MSC requires a training data set. This data
set needs to include several protein interactions and the structure of every protein
involved in an interaction, in other words the domains contained in each such protein.
The data set is read in and stored in meaningful data structures. One suitable data
structure is a vector of linked lists, and three such structures are recommended. The
size of the first vector is the number of proteins available. Every protein has a unique
id, which is its index in the vector, and a linked list of all the domains it contains.
The second data structure is similar to the first, but instead of a list of domains each
protein has a list of the proteins it interacts with. The third data structure helps to
find the proteins which host a given domain. The size of this vector is the number of
domains. Every domain has a unique id, which is its index in the vector, and a linked
list of all the proteins which host this domain.
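The three index structures can be sketched as follows. Python lists stand in for the vectors and linked lists described above, and the function name, the input format, and the integer ids are assumptions made for this example.

```python
def build_indexes(protein_domains, interactions):
    """Build the three lookup structures from training data.

    protein_domains: list where entry i holds the domain ids of protein i.
    interactions: list of (i, j) protein id pairs that interact.
    """
    n_proteins = len(protein_domains)
    n_domains = 1 + max((d for ds in protein_domains for d in ds), default=-1)

    # 1. protein id -> domains contained in that protein
    domains_of = [sorted(set(ds)) for ds in protein_domains]

    # 2. protein id -> proteins it interacts with (stored in both directions)
    partners_of = [[] for _ in range(n_proteins)]
    for i, j in interactions:
        partners_of[i].append(j)
        if i != j:
            partners_of[j].append(i)

    # 3. domain id -> proteins which host that domain
    hosts_of = [[] for _ in range(n_domains)]
    for protein, ds in enumerate(domains_of):
        for d in ds:
            hosts_of[d].append(protein)

    return domains_of, partners_of, hosts_of


# Hypothetical toy data: proteins 0..4, domains 0..3.
domains_of, partners_of, hosts_of = build_indexes(
    [[0], [1], [0, 2], [1], [3]], [(0, 1), (2, 3), (2, 4)]
)
print(partners_of)  # [[1], [0], [3, 4], [2], [2]]
print(hosts_of)     # [[0, 2], [1, 3], [2], [4]]
```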
A data structure is also needed in order to choose the domain pair according to the
weight function. One such data structure is a domain-domain matrix. The matrix's
size is n x n, where n is the number of domains available in the training data. Each
entry Dij of the matrix represents the probability that the domain pair (Di, Dj) is the
cause of the interaction of a protein pair containing (Di, Dj). For a protein pair to
contain (Di, Dj), one of the proteins needs to contain either Di or Dj while the other
protein contains the other. The probability of (Di, Dj) causing the protein-protein
interaction is equal to:
Dij = (# interacting protein pairs containing (Di, Dj)) / (# protein pairs containing (Di, Dj)) .
(1)
Instead of storing a double, it is simpler to store two integers, a numerator and a
denominator, since this number needs to be updated, as the pseudo code below shows.
After all the training data is read in and this matrix is built, it is time to approximate
MSC. Recall that for MSC by probability the domain pair with the greatest probability
of interacting is chosen and used to explain a set of protein pairs. This is repeated
until there are no more protein pairs to explain. Pseudo code highlighting these steps
follows:
MSC by probability (Training Data)
begin
  Chosen := empty set of domain pairs;
  Maximum := largest number in Domain matrix;
  Di := row index of largest number in Domain matrix;
  Dj := column index of largest number in Domain matrix;
  while Maximum ≠ 0
  begin
    Add (Di, Dj) to Chosen;
    For every protein Pi which hosts Di
      For every protein Pj which hosts Dj
        If Pi interacts with Pj and (Pi, Pj) is not yet explained
          Mark (Pi, Pj) as explained;
          Remove effect of (Pi, Pj) from Domain matrix;
      end
    end
    Recompute Maximum, Di and Dj from the updated Domain matrix;
  end
  return Chosen;
end.
Remove effect of (Pi, Pj) from Domain matrix
begin
  For every domain Dx contained in Pi
    For every domain Dy contained in Pj
      Decrement Dxy numerator by one;
      Decrement Dxy denominator by one;
    end
  end
end.
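The following Python sketch is one possible reading of this pseudo code, under several assumptions that are not stated in the text: the maximum is recomputed after every choice, only interacting protein pairs that are not yet explained are removed, a probability of 0/0 is treated as 0, self-interactions are ignored, and the denominator counts distinct protein pairs. All names and the toy data are hypothetical.

```python
from fractions import Fraction
from itertools import product


def pair(a, b):
    """Order-independent key for an unordered pair."""
    return (a, b) if a <= b else (b, a)


def domain_pairs(a, b, domains):
    """All distinct domain pairs contained in the protein pair (a, b)."""
    return {pair(x, y) for x, y in product(domains[a], domains[b])}


def msc_by_probability(interactions, domains):
    """Return the chosen domain pairs with the probability each had when chosen.

    domains must list every protein that appears in interactions.
    """
    proteins = sorted(domains)
    unexplained = {pair(a, b) for a, b in interactions if a != b}

    # Two integers per domain pair instead of one floating-point number.
    numerator, denominator = {}, {}
    for i, a in enumerate(proteins):
        for b in proteins[i + 1:]:
            for dp in domain_pairs(a, b, domains):
                denominator[dp] = denominator.get(dp, 0) + 1
                if (a, b) in unexplained:
                    numerator[dp] = numerator.get(dp, 0) + 1

    def probability(dp):
        d = denominator[dp]
        return Fraction(numerator.get(dp, 0), d) if d else Fraction(0)

    chosen = []
    while denominator:
        # Ties go to the alphabetically first pair (an arbitrary choice).
        best = max(sorted(denominator), key=probability)
        score = probability(best)
        if score == 0:
            break
        chosen.append((best, score))
        for a, b in sorted(unexplained):
            dps = domain_pairs(a, b, domains)
            if best in dps:
                unexplained.remove((a, b))
                for dp in dps:  # remove the effect of (a, b) from the matrix
                    numerator[dp] -= 1
                    denominator[dp] -= 1
    return chosen


# Hypothetical toy data: five proteins, four domains, three interactions.
toy_domains = {"P1": {"A"}, "P2": {"B"}, "P3": {"A", "C"}, "P4": {"B"}, "P5": {"D"}}
toy_interactions = [("P1", "P2"), ("P3", "P4"), ("P3", "P5")]
for dp, p in msc_by_probability(toy_interactions, toy_domains):
    print(dp, p)  # ('C', 'D') 1  then  ('A', 'B') 1/2
```

On the same toy data, the frequency-based method would pick (A, B) first, while the probability-based method starts with the rare pair (C, D). The scan over all domain pairs in each round is kept for clarity and is the costly step that Section 5 discusses.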
3.3 Method’s Significance
MSC minimizes the number of domain pairs used for predicting protein interactions
by choosing the minimum set of domain pairs which explains all the protein
interactions in the training data. Minimizing the number of chosen domain pairs
decreases the number of predicted protein pairs, and this in turn decreases the
number of false positive interactions. False positive interactions are protein
interactions which are predicted but not contained in the test data set. However,
enough domain pairs are chosen to explain the training data set, and therefore these
domain pairs can predict at least all the protein pairs in the training data set. Any
additional protein pairs which are predicted are less likely to be false positives, as
will be seen through experimentation. Two metrics are used to measure how good the
protein predictions are: specificity and sensitivity.
Specificity = # matches / # predicted protein interactions .
(2)
Sensitivity = # matches / # protein interactions in test data set .
(3)
The number of matches is the number of predicted protein interactions which are
included in the testing data set. MSC aims to maximize specificity while maintaining
a high sensitivity.
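Equations (2) and (3) translate directly into code. The sketch below uses hypothetical names and toy data, and treats protein pairs as unordered. Note that the quantity called specificity here, matches over predictions, is often called precision elsewhere.

```python
def pair(a, b):
    """Order-independent key for an unordered pair."""
    return (a, b) if a <= b else (b, a)


def specificity_and_sensitivity(predicted, test_interactions):
    """Equations (2) and (3), with protein pairs treated as unordered."""
    predicted = {pair(a, b) for a, b in predicted}
    actual = {pair(a, b) for a, b in test_interactions}
    matches = len(predicted & actual)
    specificity = matches / len(predicted) if predicted else 0.0
    sensitivity = matches / len(actual) if actual else 0.0
    return specificity, sensitivity


# Hypothetical example: 4 predictions, 3 test interactions, 2 matches.
predicted = [("P1", "P2"), ("P3", "P4"), ("P1", "P4"), ("P2", "P3")]
test_set = [("P2", "P1"), ("P3", "P4"), ("P3", "P5")]
specificity, sensitivity = specificity_and_sensitivity(predicted, test_set)
print(specificity)            # 0.5
print(round(sensitivity, 3))  # 0.667
```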
3.5 Prediction
Once MSC has chosen all the domain pairs needed to explain the training data, these
domain pairs are used to predict protein interactions. The probability of each
observed protein pair is calculated. If the probability is greater than a chosen
threshold, this protein pair is predicted to interact, and the number of matches
between the predicted data set and the observed data set is increased by one.
Otherwise, the protein pair is predicted not to interact, which increases the number
of false negatives by one.
The following pseudo code tests whether an observed interacting protein pair is
predicted to interact or not:
Prediction (observed data, Domain matrix)
begin
  For every observed protein Pi
    For every observed protein Pj interacting with Pi
      probability := Predict the probability (Pi, Pj);
      If probability > threshold
        (Pi, Pj) is predicted to interact;
        increment number of matches by one;
      Else
        (Pi, Pj) is not predicted to interact;
        increment number of false_negatives by one;
    end
  end
end.
Predict the probability (Pi, Pj)
begin
  non_interaction := 1.0;
  For every domain Di contained in Pi
    For every domain Dj contained in Pj
      If (Di, Dj) was chosen by MSC
        Dij := see equation 1;
        non_interaction := non_interaction * (1 - Dij);
    end
  end
  return 1 - non_interaction;
end.
This algorithm can easily be modified to predict all protein interactions given a set
of proteins and their structure. Instead of looping over the observed protein
interactions, as in the first for loop, loop over all possible protein pairs. In the
if statement, instead of incrementing the number of matches, increment the number
of predicted protein interactions.
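The all-pairs variant can be sketched as follows. The sketch assumes that a protein pair interacts unless every chosen domain pair it contains fails to interact, so the non-interaction probability is the product of (1 - Dij) over those pairs. The function names, the dictionary of chosen domain pairs, and the threshold value are assumptions made for the example, and each domain pair is counted once per protein pair.

```python
from itertools import combinations, product


def pair(a, b):
    """Order-independent key for an unordered pair."""
    return (a, b) if a <= b else (b, a)


def interaction_probability(p1, p2, domains, pair_probability):
    """1 minus the probability that no chosen domain pair in (p1, p2) interacts."""
    non_interaction = 1.0
    for dp in {pair(x, y) for x, y in product(domains[p1], domains[p2])}:
        # Domain pairs not chosen by MSC contribute probability 0.
        non_interaction *= 1.0 - pair_probability.get(dp, 0.0)
    return 1.0 - non_interaction


def predict_all(domains, pair_probability, threshold):
    """Predict interactions among all possible protein pairs."""
    return [
        (a, b)
        for a, b in combinations(sorted(domains), 2)
        if interaction_probability(a, b, domains, pair_probability) > threshold
    ]


# Hypothetical toy data and chosen domain pairs with their probabilities.
toy_domains = {"P1": {"A"}, "P2": {"B"}, "P3": {"A", "C"}, "P4": {"B"}, "P5": {"D"}}
chosen = {("A", "B"): 0.5, ("C", "D"): 1.0}
print(predict_all(toy_domains, chosen, threshold=0.4))
# [('P1', 'P2'), ('P1', 'P4'), ('P2', 'P3'), ('P3', 'P4'), ('P3', 'P5')]
```

Two of the five predictions, (P1, P4) and (P2, P3), would count as false positives against a test set containing only the three interactions used in the earlier toy examples.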
[Figure: flow diagram with the boxes Protein-Protein Interactions (Test Set); Protein
(Test Set); Protein-Protein Interactions (Training); Protein Structure (Swiss PFAM);
MSC; Domain-Domain Interactions; Protein-Protein Interactions Prediction; Metrics
(Specificity, Sensitivity, etc.) Analysis; Topological Analysis.]
4 Results
4.1 Comparison between AM, MLE, MSC, and MSC by Probability
4.2 Comparison with Other Methods
5 Future Work
Any number of different weight functions could be implemented, and some weight
functions could be more meaningful than others given the data used. Some weight
functions could rely more on the presence of protein interactions, others on the
absence of protein interactions, while others could take both into consideration, as
MSC by probability does. MSC can also incorporate other methods which rely on
domain interactions by using these methods as weight functions. It can be used to aid
methods which use multiple sources to determine domain interactions, such as
InterDom.
Different assumptions could also be made. Instead of looking at a single domain pair
as the cause of an interaction, one could consider groups of domains interacting with
other groups, or consider interactions which depend on previous interactions, since
protein interactions are not independent.
Some work could also be done on the optimization of MSC. It currently uses a
matrix which contains a significant amount of wasted space, and it is costly to find
the largest number in the matrix every time a domain pair is chosen. A heap data
structure could be implemented to reduce these costs.
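As one possible direction, the sketch below uses Python's standard heapq module to build a max-priority queue with lazy invalidation: when a domain pair's probability changes, a new entry is pushed, and outdated entries are skipped when they reach the top. The class and method names are hypothetical, and this is an illustration of the idea rather than a tested replacement for the matrix.

```python
import heapq


class LazyMaxHeap:
    """Max-priority queue over domain pairs with lazy invalidation."""

    def __init__(self, scores):
        self.current = dict(scores)  # latest score of every live domain pair
        # heapq is a min-heap, so scores are stored negated.
        self.heap = [(-score, dp) for dp, score in self.current.items()]
        heapq.heapify(self.heap)

    def update(self, dp, score):
        """Record a new score; the old heap entry becomes stale."""
        self.current[dp] = score
        heapq.heappush(self.heap, (-score, dp))

    def pop_max(self):
        """Return (domain pair, score) with the largest score, or None."""
        while self.heap:
            negated, dp = heapq.heappop(self.heap)
            if self.current.get(dp) == -negated:  # skip stale entries
                del self.current[dp]
                return dp, -negated
        return None


# Hypothetical probabilities for three domain pairs.
heap = LazyMaxHeap({("A", "B"): 0.5, ("A", "D"): 0.5, ("C", "D"): 1.0})
print(heap.pop_max())          # (('C', 'D'), 1.0)
heap.update(("A", "D"), 0.0)   # explaining (P3, P5) lowers the score of (A, D)
print(heap.pop_max())          # (('A', 'B'), 0.5)
```

Only domain pairs that actually occur in the training data need an entry, which also avoids the wasted space of a full n x n matrix.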
6 References