Protein-Protein Inte..

Introduction/Biology:
Protein-protein interactions serve as the chemical basis of all living organisms.
Encoded by the genetic material within a cell, proteins fold and interact in intricate
arrangements that provide functionality to the components of a cell, which in turn work
cooperatively to form whole body systems. Protein domains, which are significant
portions of proteins composed of distinct peptide segments, are the key to such intricate
arrangements and drive the proteins to fold and interact as they do. A single protein
molecule can possess multiple domains, which makes it difficult to find a simple
rule that dictates how protein-protein interactions occur. Yet certain protein domains
show affinities for one another that are frequently seen in living organisms. This
motivates our research, which seeks to explain the mechanism of protein-protein
interactions by focusing on domain-domain interactions as a factor. The model
system used for this work is the yeast cell, with several of its proteins serving as
the test cases.
Protein family data banks are available that contain information about
certain protein structures, i.e. which domains are in a given protein and which proteins
have been found experimentally to interact.
We approximate the minimum number of domain pairs needed to
explain all the protein interactions read from the family data bank. A protein interaction,
(P1, P2), is explained if a domain pair, (D1, D2), is chosen such that P1 includes either
D1 or D2 as one of its domains, while P2 includes the other as one of its domains. This is
an instance of the "Minimum Set Cover Problem": given a collection of sets, find the
smallest sub-collection whose union is equal to the union of all the sets. Here each
domain pair corresponds to the set of protein interactions it can explain. The
"Minimum Set Cover Problem" is an NP-complete problem. Later in the paper we will
prove that we are dealing with the "Minimum Set Cover Problem".
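The definition of an explained interaction can be written down as a short check. This is a minimal sketch, not the project's code: the function names explains and is_cover, the protein names, and the domain names are all illustrative assumptions.

```python
def explains(domain_pair, domains_p1, domains_p2):
    """True if one domain of the pair is in P1 and the other is in P2."""
    d1, d2 = domain_pair
    return (d1 in domains_p1 and d2 in domains_p2) or (
        d2 in domains_p1 and d1 in domains_p2
    )


def is_cover(chosen_pairs, protein_domains, interactions):
    """True if every interaction is explained by at least one chosen pair."""
    return all(
        any(explains(pair, protein_domains[p], protein_domains[q])
            for pair in chosen_pairs)
        for p, q in interactions
    )


# Illustrative data (hypothetical proteins and domains).
protein_domains = {"P1": {"D1", "D2", "D3"}, "P2": {"D1", "D5"}}
interactions = [("P1", "P2")]

covered = is_cover([("D5", "D3")], protein_domains, interactions)      # True
not_covered = is_cover([("D2", "D3")], protein_domains, interactions)  # False
```

The goal of the approximation is then the smallest list of chosen pairs for which is_cover returns True.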
Explain why our method is a good predictor.
Show how our algorithm is different, show in results section comparison with
other algorithms out there.
Algorithm:
An algorithm has been implemented to find an approximation to this
minimum set. For the algorithm to work efficiently and produce a good approximation,
it must choose the domain pairs in an informed rather than a randomized manner.
This is done using weight functions: each domain pair is given a weight, and the pair
with the largest weight is chosen. Details on the distribution and calculation
of the weights follow later in the paper.
Four different weight functions have been developed and implemented for the
purposes of this project. They share a base algorithm, which consists of functions that read the
protein structure and interaction information and store them in different data structures.
It also builds a domain-domain matrix, which holds information about interacting
domains. Each entry in the matrix represents the number of protein-protein interactions
in which domains Di and Dj were observed as a possible cause. For
example, let's assume proteins P1 and P2 interact. P1 contains domains D1, D2 and D3,
while P2 contains domains D1 and D5. This interaction causes six entries in the
matrix to be incremented by one each. These entries correspond to domain pairs (D1,
D1), (D1, D5), (D2, D1), (D2, D5), (D3, D1) and (D3, D5). This process continues
until there are no more protein-protein interactions to be read. Then the algorithm
approximates the minimum set of domain pairs that explains all the protein
interactions, using one of the four weight functions.
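The matrix-building step can be sketched as follows, using the P1/P2 example above. This is only an illustration under stated assumptions: the matrix is stored as a Counter keyed by domain pair instead of a dense matrix, each pair is treated as unordered (so (D2, D1) is stored as (D1, D2)), and the names pair_key, candidate_pairs, and build_matrix are hypothetical.

```python
from collections import Counter
from itertools import product


def pair_key(a, b):
    """Canonical key for an unordered domain pair (an assumption of this sketch)."""
    return (a, b) if a <= b else (b, a)


def candidate_pairs(domains_p, domains_q):
    """All distinct domain pairs that could explain an interaction of two proteins."""
    return {pair_key(a, b) for a, b in product(domains_p, domains_q)}


def build_matrix(protein_domains, interactions):
    """Count, for every domain pair, the interactions it could have caused."""
    counts = Counter()
    for p, q in interactions:
        counts.update(candidate_pairs(protein_domains[p], protein_domains[q]))
    return counts


protein_domains = {"P1": {"D1", "D2", "D3"}, "P2": {"D1", "D5"}}
matrix = build_matrix(protein_domains, [("P1", "P2")])
# Six entries are now equal to one, matching the example in the text.
```

Keying by pair means that pairs never seen together take no memory, at the cost of a hash lookup per access.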
First Weight Function:
The first weight function relies on the assumption that the most commonly observed
interacting domain pair among the protein interactions is probably the cause of those
protein interactions.
This weight function explains all the p-p interactions by finding the most common
pair of domains present among the p-p interactions, removing this d-d pair
and all the p-p interactions it explains from the data being observed, and then going through
the whole process again until there are no more p-p interactions left to explain.
It initializes the domain-domain matrix by reading in the information from all the
protein interactions available to it, as explained in the algorithm. It chooses the greatest
element in the matrix, which represents the most common domain pair, (Di, Dj), present
among interacting proteins. Once the domain pair (Di, Dj) is chosen, the matrix is
modified. The algorithm goes through all the protein interactions that could have been
caused by (Di, Dj) and undoes any influence these protein interactions had on
our domain matrix by going through each possible domain pair combination in each
protein pair and decrementing its corresponding element in the matrix by one.
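The greedy loop described for the first weight function might look like the sketch below. It is an illustration under assumptions, not the project's implementation: pairs are treated as unordered, ties are broken by taking the lexicographically smallest pair, interactions whose proteins have no domains are skipped, and the names candidate_pairs and greedy_cover are hypothetical.

```python
from collections import Counter
from itertools import product


def candidate_pairs(domains_p, domains_q):
    """Unordered domain pairs spanning two proteins (ordering is an assumption)."""
    return {tuple(sorted(pair)) for pair in product(domains_p, domains_q)}


def greedy_cover(protein_domains, interactions):
    """Pick the most frequent domain pair until every interaction is explained."""
    remaining = {}
    for p, q in interactions:
        pairs = candidate_pairs(protein_domains[p], protein_domains[q])
        if pairs:  # an interaction with no candidate pair can never be explained
            remaining[(p, q)] = pairs

    counts = Counter()
    for pairs in remaining.values():
        counts.update(pairs)

    chosen = []
    while remaining:
        # Greatest matrix entry; ties go to the lexicographically smallest pair.
        best = max(sorted(counts), key=counts.__getitem__)
        chosen.append(best)
        explained = [edge for edge, pairs in remaining.items() if best in pairs]
        for edge in explained:
            # Undo this interaction's contribution to every pair it touched.
            counts.subtract(remaining.pop(edge))
    return chosen


protein_domains = {
    "P1": {"D1", "D2"}, "P2": {"D3"}, "P3": {"D1"},
    "P4": {"D3"}, "P5": {"D4"}, "P6": {"D5"},
}
interactions = [("P1", "P2"), ("P3", "P4"), ("P5", "P6")]
cover = greedy_cover(protein_domains, interactions)
# cover == [("D1", "D3"), ("D4", "D5")]
```

In this toy data (D1, D3) is picked first because it could explain two interactions; removing those leaves only (D4, D5) with a nonzero count.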
Second Weight Function:
The second weight function is very similar to the first, but instead of finding the
most interacting d-d pair it first finds the most interacting domain present among the p-p
interactions and then performs the same steps as the first weight function.
The second weight function initializes the matrix by reading in the information
from all the protein interactions available to it, as explained in the algorithm. It also
creates a vector, sum_vector, whose size is the number of rows in the matrix. It then
sums each row of the matrix and stores the result in the
corresponding entry of sum_vector. For example, after the weight function sums
row i of the domain-domain matrix, it stores that value in the ith entry of the vector. The
weight function then finds the maximum element of sum_vector and returns the
maximum element in the corresponding row of the domain-domain matrix. Just as in
the first weight function, once the domain pair (Di, Dj) is chosen the matrix is
modified. The algorithm goes through all the protein interactions that could have been
caused by (Di, Dj) and undoes any influence these protein interactions had on
our domain matrix by going through each possible domain pair combination in each
protein pair and decrementing its corresponding element in the matrix by one.
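The selection step of the second weight function can be sketched as below. Assumptions of this illustration: the matrix is held as a Counter keyed by unordered domain pair, a dictionary plays the role of sum_vector, ties go to the lexicographically smallest name, and the function name pick_pair_by_row_sum is hypothetical. The input must contain at least one pair.

```python
from collections import Counter, defaultdict


def pick_pair_by_row_sum(counts):
    """Pick the busiest domain by row sum, then the largest entry in its row."""
    row_sums = defaultdict(int)  # plays the role of sum_vector
    for (a, b), n in counts.items():
        row_sums[a] += n
        if b != a:  # a diagonal entry belongs to one row only
            row_sums[b] += n

    top_domain = max(sorted(row_sums), key=row_sums.__getitem__)
    row = {pair: n for pair, n in counts.items() if top_domain in pair}
    return max(sorted(row), key=row.__getitem__)


counts = Counter({
    ("D1", "D3"): 2,
    ("D2", "D3"): 2,
    ("D1", "D4"): 1,
    ("D5", "D6"): 3,
})
picked = pick_pair_by_row_sum(counts)
# picked == ("D1", "D3"): D3 has the largest row sum (4), even though
# ("D5", "D6") is the largest single entry in the matrix.
```

The toy data shows how the two weight functions can disagree: the first would choose (D5, D6), while this one prefers a pair involving the most frequently implicated domain.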
Third Weight Function:
The third weight function tries to incorporate the absence of p-p interactions. For
example, the domain pair (Di, Dj) may be observed to be the most interacting only
because Di and Dj are the most common domains present. There might be many proteins
that contain these domains but do not interact, and we should take this into account. To
incorporate this idea we initialize our matrix just as in the first weight function. Then
we go through every element in the matrix and divide the number there, which represents
the number of d-d interactions observed among p-p interactions,
by the number of proteins that contain the first domain times the number of proteins
that contain the second domain. After this step each element represents an estimate of the
probability that domains i and j interact. The weight function then chooses
the highest probability in the matrix, determines which proteins this domain pair explains,
removes these proteins' influence from the data, and performs the same tasks again.
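The normalization used by the third weight function can be sketched as follows. This is an illustration only: the counts are assumed to be held as a Counter keyed by unordered domain pair, and the names domain_frequencies and normalized_weights are hypothetical.

```python
from collections import Counter


def domain_frequencies(protein_domains):
    """Number of proteins that contain each domain."""
    freq = Counter()
    for domains in protein_domains.values():
        freq.update(set(domains))
    return freq


def normalized_weights(counts, freq):
    """Divide each pair count by (#proteins with Di) * (#proteins with Dj)."""
    return {(a, b): n / (freq[a] * freq[b]) for (a, b), n in counts.items()}


protein_domains = {
    "P1": {"D1"}, "P2": {"D2"}, "P3": {"D1"},
    "P4": {"D2"}, "P5": {"D3"}, "P6": {"D4"},
}
# Pair counts as they would result from interactions (P1, P2) and (P5, P6).
counts = Counter({("D1", "D2"): 1, ("D3", "D4"): 1})

weights = normalized_weights(counts, domain_frequencies(protein_domains))
# weights == {("D1", "D2"): 0.25, ("D3", "D4"): 1.0}
```

Both pairs were observed once, but (D1, D2) had four protein pairs in which it could have appeared and (D3, D4) only one, so the rarer pair receives the higher weight.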
Fourth Weight Function:
The fourth weight function assumes that the absence of p-p edges is because the
data is not complete, not because these proteins do not interact. So even if a domain pair is
present among a lot of the proteins observed, only the interacting proteins are taken
into account. In the d-d matrix each entry represents the probability of domain i
interacting with domain j, just as in the third weight function. But the probability is now
calculated as one minus the probability that the domain pair (i, j) is not the cause of the
interaction.
Complexity:
These algorithms are currently implemented using matrices composed of vectors. This
data structure is costly in terms of memory. It stores a lot of zeroes,
from which the four weight functions gain no additional knowledge. It is also not
efficient when it comes to finding the maximum element. However, it can find or
modify any element very efficiently.
Some other possibilities:
A- Matrix (vector of adjacency lists)
PROS:
fast to find domain[i], but a little slower to find a max combination with domain[j];
faster than the heap of adjacency lists but slower than the current matrix
easy to update an adjacency list
CONS:
still stores some zeros, more than the heap of adjacency lists but fewer than the current matrix
unless we sort the lists, finding the max will be costly
B- Matrix (vector of heaps)
PROS:
also fast to find domain[i], and even fast to find the max
CONS:
will need to update the heap every time a max is chosen
will also store some zeros, like the vector of adjacency lists
will be a little more memory costly than the vector of adjacency lists but less than the current matrix
C- Matrix (a heap of adjacency lists)
PROS:
very memory efficient
CONS:
will be more costly to find a certain domain
We will also be implementing our algorithm using a matrix composed of a heap
of adjacency lists, since we believe it will be preferable in certain
cases where speed is a factor.
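To make the trade-off between memory and the cost of finding the maximum concrete, here is one possible sparse layout. It is not the heap of adjacency lists planned above, whose exact design is not specified here. It is a swapped-in illustration: a dictionary that stores only nonzero pair counts, plus a binary heap that is cleaned lazily when the maximum is requested. The class name SparsePairCounts and its methods are hypothetical.

```python
import heapq


class SparsePairCounts:
    """Nonzero pair counts with a lazily cleaned max-heap (illustrative only)."""

    def __init__(self):
        self.counts = {}  # only pairs with a positive count are stored
        self.heap = []    # (-count, pair) entries; some may be out of date

    def add(self, pair, delta):
        """Change a pair's count by delta, dropping it when it reaches zero."""
        new = self.counts.get(pair, 0) + delta
        if new > 0:
            self.counts[pair] = new
            heapq.heappush(self.heap, (-new, pair))
        else:
            self.counts.pop(pair, None)

    def max_pair(self):
        """Return (pair, count) for the largest count, or None if empty."""
        while self.heap:
            neg_count, pair = self.heap[0]
            if self.counts.get(pair) == -neg_count:
                return pair, -neg_count
            heapq.heappop(self.heap)  # discard an out-of-date entry
        return None


store = SparsePairCounts()
store.add(("D1", "D3"), 1)
store.add(("D1", "D3"), 1)
store.add(("D2", "D3"), 1)
first = store.max_pair()    # (("D1", "D3"), 2)
store.add(("D1", "D3"), -2)
second = store.max_pair()   # (("D2", "D3"), 1)
```

Zeros are never stored, and the maximum is found without scanning every entry, but each update pushes a new heap entry, so the heap can temporarily hold more entries than there are pairs.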
Testing:
Five different test cases were created to make sure our algorithm is
doing what we think it is. (More will be added once testing is finished.)
Work in Progress:
As of now we are looking at other weight functions and at different methods
for comparing them. One comparison is to see which weight function needs the
smallest number of domain pairs to explain the data. Another comparison can be made
by counting how many true positive, true negative, false positive, and false negative p-p
interactions are predicted on a set of data other than our training data and comparing these
numbers. This new data set could include our training data or be disjoint from it, and it
could be related to our training data or not.
-A composed data set implies that the training data set is included among other
observed protein interactions.
-A disjoint data set implies that none of the protein interactions used in the training data
are present in the new data set.
-A related data set could be p-p interactions observed in the same
type of biological organism, same organ, etc.
-An unrelated data set could be a different data set composed of p-p observations
made from a different type of organism, different organ, etc.
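One way to tally the prediction comparison is sketched below. It rests on assumptions that are not fixed above: an interaction is predicted whenever any chosen domain pair spans the two proteins, every protein pair absent from the observed set is counted as a negative, self-interactions are ignored, and the names predicts and confusion_counts are hypothetical.

```python
from itertools import combinations, product


def predicts(chosen_pairs, domains_p, domains_q):
    """Predict an interaction when any chosen domain pair spans the two proteins."""
    return any(
        tuple(sorted(pair)) in chosen_pairs
        for pair in product(domains_p, domains_q)
    )


def confusion_counts(chosen_pairs, protein_domains, observed):
    """Tally TP, FP, FN, TN over all unordered pairs of distinct proteins."""
    observed = {frozenset(edge) for edge in observed}
    tally = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for p, q in combinations(sorted(protein_domains), 2):
        predicted = predicts(chosen_pairs, protein_domains[p], protein_domains[q])
        actual = frozenset((p, q)) in observed
        if predicted:
            tally["TP" if actual else "FP"] += 1
        else:
            tally["FN" if actual else "TN"] += 1
    return tally


# Hypothetical test data: one chosen domain pair and two observed interactions.
chosen = {("D1", "D2")}
test_proteins = {"P1": {"D1"}, "P2": {"D2"}, "P3": {"D1"}, "P4": {"D3"}}
test_edges = [("P1", "P2"), ("P1", "P4")]

tally = confusion_counts(chosen, test_proteins, test_edges)
# tally == {"TP": 1, "FP": 1, "FN": 1, "TN": 3}
```

If the data are believed to be incomplete, as the fourth weight function assumes, the FP count here overstates the error, since some unobserved pairs may in fact interact.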
We also plan to compare our results with those of existing programs
by replicating their evaluation methods using our algorithm. Some of the papers we plan
to compare with are:
(Papers to be listed; we still need to figure out which papers we are comparing with.)
The p-p database on which we have been testing our different weight functions is
composed of approximately 2300 p-p interactions. There exists a more up-to-date version
of this database, which consists of approximately 12000 p-p interactions. We will
test on this database pending an increase in the size of my AFS space and, if the present
algorithm takes too much time to run, some optimization of the data structures in the
algorithm.
Some other weight functions that will be implemented:
Weight function 5: As of now there are a lot of zeros present in our matrix. This
is due to the fact that there are a lot of different domains but most proteins only contain a
few of these domains. I propose we take into account the absence of p-p interactions by
introducing a negative weight. This is similar to weight functions 3 and 4, but those
weight functions calculate probabilities. With plain addition and subtraction I believe the
algorithm will run much faster, and the enormous number of zeroes in our matrix will
come into play by becoming negative numbers. Besides the benefit of speed over the
third and fourth weight functions, one would be able to give either the existence or the
absence of a p-p edge more or less weight, which might be desirable depending on whether
you believe your data set contains more false positives or is simply incomplete.
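A possible reading of weight function 5 is sketched below; the proposal above does not fix the details, so everything here is an assumption. Each observed p-p edge adds edge_weight to every domain pair it spans, each absent p-p edge subtracts non_edge_weight, self-interactions are ignored, and the names signed_scores, edge_weight, and non_edge_weight are hypothetical.

```python
from collections import Counter
from itertools import combinations, product


def signed_scores(protein_domains, interactions, edge_weight=1, non_edge_weight=1):
    """Add for observed p-p edges, subtract for absent ones, per domain pair."""
    observed = {frozenset(edge) for edge in interactions}
    scores = Counter()
    for p, q in combinations(sorted(protein_domains), 2):
        interacts = frozenset((p, q)) in observed
        delta = edge_weight if interacts else -non_edge_weight
        spanned = {
            tuple(sorted(pair))
            for pair in product(protein_domains[p], protein_domains[q])
        }
        for pair in spanned:
            scores[pair] += delta
    return scores


protein_domains = {"P1": {"D1"}, "P2": {"D2"}, "P3": {"D1"}, "P4": {"D2"}}
scores = signed_scores(protein_domains, [("P1", "P2")])
# scores[("D1", "D2")] == -2: one observed edge (+1) and three absent ones (-3).
```

Raising edge_weight favors the view that the data are incomplete, while raising non_edge_weight treats missing edges as stronger evidence against a domain pair. Note that this sketch visits every pair of proteins, so its running time grows quadratically with the number of proteins.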