Introduction/Biology: Protein-protein interactions are central to the chemistry of all living organisms. Encoded by the genetic material within a cell, proteins fold and interact in intricate arrangements that provide functionality to the components of a cell, which in turn work cooperatively to form whole body systems. Protein domains, which are significant portions of proteins composed of distinct peptides, are the key to such intricate arrangements and drive the proteins to fold and interact as they do. A single protein molecule can possess multiple domains, which makes it difficult to find a simple formula that dictates how protein-protein interactions occur. Yet certain protein domains show affinities for one another, and these are frequently seen in living organisms. This motivates our research, which seeks to explain the mechanism of protein-protein interactions by focusing on domain-domain interactions as a factor. The model system used is the yeast cell, with several of its proteins serving as the test cases.

Protein family data banks are available that contain information about protein structures, i.e. which domains are in a given protein, and about which proteins have been found experimentally to interact. We approximate the minimum number of domain pairs needed to explain all the protein interactions read from the family data bank. A protein interaction (P1, P2) is explained if a domain pair (D1, D2) is chosen such that P1 includes either D1 or D2 as one of its domains, while P2 includes the other as one of its domains. This is an instance of the "Minimum Set Cover Problem", which is the problem of finding the smallest collection of sets, drawn from a given family of sets, whose union is equal to the union of all the sets in the family. The "Minimum Set Cover Problem" is an NP-complete problem. Later in the paper we will give a proof that we are dealing with the "Minimum Set Cover Problem". Explain why our method is a good predictor.
Show how our algorithm is different; show a comparison with other existing algorithms in the results section.

Algorithm: An algorithm has been implemented to find an approximation to this minimum set. For the algorithm to work efficiently and produce a good approximation, it needs to choose the domain pairs in an informed rather than a random manner. This is done using weight functions: each domain pair is given a weight, and the pair with the largest weight is chosen. Details on the calculation of the weights follow later in the paper. Four different weight functions have been developed and implemented for this project. The base algorithm consists of functions that read the protein structure and interaction information and store them in different data structures. It also builds a domain-domain matrix, which holds information about interacting domains. Each entry in the matrix represents the number of times domains Di and Dj were observed as a possible cause of a protein-protein interaction. For example, let us assume proteins P1 and P2 interact. P1 contains domains D1, D2 and D3, while P2 contains domains D1 and D5. This interaction would cause six entries in the matrix to be incremented by one each. These entries correspond to the domain pairs (D1, D1), (D1, D5), (D2, D1), (D2, D5), (D3, D1) and (D3, D5). This process continues until there are no more protein-protein interactions to be read. The algorithm then approximates the minimum set of domain pairs that explains all the protein interactions, using one of the four weight functions.

First Weight Function: The first weight function relies on the assumption that the domain pair most commonly observed among the protein interactions is probably the cause of those interactions.
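The construction of the domain-domain matrix described in the Algorithm section can be sketched as follows. This is only an illustration, not the implementation used in the project: the function name build_pair_counts and the dictionary layout are hypothetical, the matrix is stored sparsely as a counter, and domain pairs are assumed to be unordered, so that (D2, D1) and (D1, D2) refer to the same entry.

```python
from collections import Counter
from itertools import product

def build_pair_counts(protein_domains, interactions):
    """Count, for each unordered domain pair, the interactions it could explain."""
    counts = Counter()
    for p, q in interactions:
        # Every pairing of a domain of p with a domain of q is a candidate cause.
        # Assumption: pairs are unordered, so each one is stored in sorted order.
        candidates = {tuple(sorted(pair))
                      for pair in product(protein_domains[p], protein_domains[q])}
        # Each candidate pair is incremented by one for this interaction.
        counts.update(candidates)
    return counts

# The example from the text: P1 = {D1, D2, D3} interacts with P2 = {D1, D5}.
protein_domains = {"P1": {"D1", "D2", "D3"}, "P2": {"D1", "D5"}}
counts = build_pair_counts(protein_domains, [("P1", "P2")])
for pair in sorted(counts):
    print(pair, counts[pair])
```

For the single interaction in the example, six entries are incremented, matching the six domain pairs listed in the text.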
This weight function explains all the p-p interactions by finding the most common pair of domains present among the p-p interactions, removing this d-d pair and all the p-p interactions it explains from the data being observed, and then repeating the whole process until there are no more p-p interactions to be explained. It initializes the domain-domain matrix by reading in the information from all the protein interactions available to it, as explained in the algorithm. It chooses the greatest element in the matrix, which represents the most common domain pair (Di, Dj) present among interacting proteins. Once the domain pair (Di, Dj) is chosen, the matrix is modified. The algorithm goes through all the protein interactions that could have been caused by (Di, Dj) and undoes any influence these protein interactions had on the domain matrix: for each such protein pair, it goes through each possible domain pair combination and decrements the corresponding element in the matrix by one.

Second Weight Function: The second weight function is very similar to the first, but instead of finding the most frequently interacting d-d pair it finds the most frequently interacting domain present among the p-p interactions and then performs the same process as the first weight function. The second weight function initializes the matrix by reading in the information from all the protein interactions available to it, as explained in the algorithm. It also creates a vector, sum_vector, whose size is the number of rows in the matrix. It then sums each row of the matrix and stores the result in the corresponding entry of sum_vector. For example, after the weight function sums row i of the domain-domain matrix, it stores that value in the ith entry of the vector. The weight function then finds the maximum element in sum_vector and returns the maximum element in the corresponding row of the domain-domain matrix.
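The difference between the selection steps of the first and second weight functions can be illustrated with a small sketch. The counts below are made up for illustration only, the function names are hypothetical, and the matrix is assumed to be symmetric and stored as a dictionary keyed by unordered domain pairs; the row sums play the role of sum_vector.

```python
from collections import defaultdict

def select_most_common_pair(counts):
    """First weight function: the pair with the largest count."""
    return max(counts, key=counts.get)

def select_by_busiest_domain(counts):
    """Second weight function: the domain with the largest row sum,
    then the largest entry in that domain's row."""
    row_sum = defaultdict(int)            # plays the role of sum_vector
    for (a, b), n in counts.items():
        for d in {a, b}:                  # a diagonal pair (a == b) is added once
            row_sum[d] += n
    busiest = max(row_sum, key=row_sum.get)
    row = {pair: n for pair, n in counts.items() if busiest in pair}
    return max(row, key=row.get)

# Hypothetical counts, chosen so that the two weight functions disagree.
counts = {("D1", "D2"): 5, ("D3", "D4"): 4, ("D3", "D5"): 3, ("D3", "D6"): 2}
first_choice = select_most_common_pair(counts)
second_choice = select_by_busiest_domain(counts)
print(first_choice, second_choice)
```

Here (D1, D2) has the largest single count, while D3 has the largest row sum (9), so the second weight function returns the largest entry in the D3 row instead.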
As with the first weight function, once the domain pair (Di, Dj) is chosen, the matrix is modified. The algorithm goes through all the protein interactions that could have been caused by (Di, Dj) and undoes any influence these protein interactions had on the domain matrix: for each such protein pair, it goes through each possible domain pair combination and decrements the corresponding element in the matrix by one.

Third Weight Function: The third weight function tries to incorporate the absence of p-p interactions. For example, the domain pair (Di, Dj) may be observed to be the most frequently interacting only because Di and Dj are the most common domains present. There might be many proteins that contain these domains but do not interact. We should take this into account. To incorporate this idea, we initialize the matrix just as for the first weight function. We then go through every element in the matrix, which holds the number of d-d interactions observed among the p-p interactions, and divide it by the number of proteins that contain the first domain times the number of proteins that contain the second domain. Each element then represents an estimate of the probability that domains i and j interact. The weight function then chooses the highest probability in the matrix, determines which proteins this domain pair explains, removes these proteins' influence from the data, and performs the same tasks again.

Fourth Weight Function: The fourth weight function assumes that the absence of p-p edges is due to the data being incomplete, not to these proteins failing to interact. So even if a domain pair is present among many of the proteins observed, only the interacting proteins are taken into account. In the d-d matrix, each entry represents the probability of domain i interacting with domain j, just as in the third weight function.
Here, however, the probability is calculated by subtracting from one the probability that the domain pair (i, j) is not the cause of the interaction.

Complexity: These algorithms are currently implemented using matrices composed of vectors. This data structure is costly in memory. It stores a lot of zeroes, from which the first four weight functions gain no additional knowledge. It is also inefficient for finding the maximum element. However, it can find or modify any given element very efficiently. Some other possibilities:

A- Matrix (vector of adjacency lists)
PROS:
- fast to find a domain[i], but a little slower to find a max combination with domain[j]; faster than C but slower than A
- easy to update adjacency list
CONS:
- still stores some zeros, more than C but less than A
- unless we sort the lists, finding the max will be costly

B- Matrix (vector of heaps)
PROS:
- also fast to find domain[i], and even fast to find the max
CONS:
- will need to update the heap every time a max is chosen
- will also store some zeros, like B
- will be a little more memory costly than B but less than A

C- Matrix (a heap of adjacency lists)
PROS:
- very memory efficient
CONS:
- finding a particular domain will be harder, or at least more costly

We will also be implementing our algorithm using a matrix composed of a heap of adjacency lists, since we believe it will be preferable in certain cases where speed is a factor.

Testing: Five different test cases were created to make sure our algorithm is doing what we think it is. (more added once testing is finished)

Work in Progress: As of now we are looking at other weight functions and at different methods of comparing them. One comparison to be made is to see which weight function needs the smallest number of domain pairs to explain the data.
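A minimal sketch of that comparison is given below for the first and third weight functions, under several assumptions. All names are hypothetical, the data is a toy example, domain pairs are treated as unordered, and every protein is assumed to have at least one domain. Instead of decrementing matrix entries, the sketch recounts the candidate pairs over the interactions still unexplained, which has the same effect and keeps the code short.

```python
from collections import Counter
from itertools import product

def candidate_pairs(domains, p, q):
    """Unordered domain pairs that could explain the interaction (p, q)."""
    return {tuple(sorted(pair)) for pair in product(domains[p], domains[q])}

def count_weight(pair, counts, domains):
    """First weight function: raw count of interactions the pair could explain."""
    return counts[pair]

def ratio_weight(pair, counts, domains):
    """Third weight function: count divided by (proteins with the first domain)
    times (proteins with the second domain)."""
    a, b = pair
    n_a = sum(1 for ds in domains.values() if a in ds)
    n_b = sum(1 for ds in domains.values() if b in ds)
    return counts[pair] / (n_a * n_b)

def greedy_cover(domains, interactions, weight):
    """Repeatedly choose the heaviest pair until every interaction is explained."""
    remaining = set(interactions)
    chosen = []
    while remaining:
        counts = Counter()
        for p, q in remaining:
            counts.update(candidate_pairs(domains, p, q))
        # sorted() makes tie-breaking deterministic (smallest pair wins a tie).
        best = max(sorted(counts), key=lambda pair: weight(pair, counts, domains))
        chosen.append(best)
        # Drop every interaction that the chosen pair explains.
        remaining = {(p, q) for p, q in remaining
                     if best not in candidate_pairs(domains, p, q)}
    return chosen

# Toy data, for illustration only.
domains = {"P1": {"D1", "D2"}, "P2": {"D1", "D3"}, "P3": {"D4"}, "P4": {"D1"}}
interactions = [("P1", "P2"), ("P1", "P3"), ("P2", "P4")]

by_count = greedy_cover(domains, interactions, count_weight)
by_ratio = greedy_cover(domains, interactions, ratio_weight)
print(len(by_count), by_count)
print(len(by_ratio), by_ratio)
```

The lengths of the two returned lists are the quantities being compared. On real data the ranking of the weight functions may of course differ from what this toy example shows.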
Another comparison can be made by counting how many true positive, true negative, false positive, and false negative p-p interactions are predicted in a set of data other than our training data and comparing these numbers. This new data set could be composed of our training data, disjoint from it, related to it, or unrelated to it.
- A composed data set would imply that the learning data set is included among other observed protein interactions.
- A disjoint data set implies that none of the protein interactions used in the training data are present in this new data set.
- A data set related to the training data set could consist of p-p interactions observed in the same type of biological organism, the same organ, etc.
- An unrelated data set could just be a different data set composed of p-p observations made from a different type of organism, a different organ, etc.

We also plan on comparing our results with the results of existing programs by replicating their evaluation method using our algorithm. Some of the papers we plan on comparing with are:
- Paper
- Paper
- Paper
- etc.
("Need to figure out which papers we are comparing with")

The p-p database on which we have been testing our different weight functions is composed of approximately 2300 p-p interactions. There exists a more up-to-date version of this database, which consists of approximately 12000 p-p interactions. We will be testing on this database pending an increase in the size of my AFS space and, if the present algorithm takes too much time to run, some optimization of the data structures in the algorithm.

Some other weight functions that will be implemented:

Weight function 5: As of now there are a lot of zeros present in our matrix. This is due to the fact that there are a lot of different domains but most proteins contain only a few of them. I propose we take into account the absence of p-p interactions by introducing a negative weight.
Yes, this is similar to weight functions 3 and 4, but those weight functions calculate probabilities. With plain addition and subtraction I believe the algorithm will run much faster, and the enormous number of zeroes in our matrix will come into play by becoming negative numbers. Besides the benefit of speed over the third and fourth weight functions, you would be able to give either the existence or the absence of a p-p edge more or less weight, which might be desirable depending on whether you believe your data set contains more false positives or is simply incomplete.
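One possible reading of weight function 5 is sketched below. The scoring rule is an assumption, since the proposal above does not fix it: every observed p-p edge adds a reward to each domain pair it contains, and every absent p-p edge subtracts a penalty. The names signed_pair_scores, reward and penalty are hypothetical, domain pairs are treated as unordered, and self-interactions are ignored.

```python
from collections import defaultdict
from itertools import combinations, product

def signed_pair_scores(domains, interactions, reward=1, penalty=1):
    """Score each domain pair by addition and subtraction only.

    Assumption: an observed p-p edge adds `reward` to every domain pair it
    contains, and an absent p-p edge subtracts `penalty` from every domain
    pair it contains.
    """
    observed = {frozenset(edge) for edge in interactions}
    scores = defaultdict(int)
    # Visit every pair of distinct proteins, interacting or not.
    for p, q in combinations(sorted(domains), 2):
        delta = reward if frozenset((p, q)) in observed else -penalty
        for pair in {tuple(sorted(x)) for x in product(domains[p], domains[q])}:
            scores[pair] += delta
    return dict(scores)

# Toy data, for illustration only: a single observed interaction.
domains = {"P1": {"D1", "D2"}, "P2": {"D1"}, "P3": {"D2"}}
interactions = [("P1", "P2")]

scores = signed_pair_scores(domains, interactions)
harsher = signed_pair_scores(domains, interactions, penalty=2)
print(scores)
print(harsher)
```

Raising penalty relative to reward expresses the belief that a missing edge reflects a true absence of interaction, while lowering it treats the data set as merely incomplete.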