Jesse Walsh
BCB 569
December 18, 2009
Final Project

Improving PPI Networks with Correlated Gene Expression Data

Background: Protein-protein interaction (PPI) networks attempt to model the vast number of interactions between proteins in a cell. These networks are crucial to our understanding of problems such as functional annotation, protein complex identification, and signaling cascade determination. As such, the quality of the data is very important. In practice, many datasets, such as the Database of Interacting Proteins (DIP) [3], consist of some interactions deduced from small-scale experiments and a much larger number obtained from high-throughput methods such as yeast two-hybrid. The quality of the small-scale experiments is generally considered to be high because of the variety of experimental methods used and the close expert curation of the data. It is well documented, however, that data obtained from high-throughput methods tend to contain a large number of false positives and false negatives. Additionally, there is commonly little overlap between the results reported by different experiments. For example, two major published yeast PPI experiments shared only ≈150 interactions out of thousands [1]. For this reason, methods to evaluate and improve the quality of PPI data are needed.

Hypothesis: One less commonly used method of assessing the quality of a set of high-throughput PPI data is the Expression Profile Reliability (EPR) index proposed by Deane et al. [2]. This method extracts gene expression data under several conditions for the members of a PPI network, and then compares the density distribution of the expression distances to the density distribution of a known high-quality set. I propose using the EPR index to assess the quality of the E. coli PPI data within DIP [3], using expression data from the Many Microbe Microarray Database (M3D) [4].

Merit: My original proposal was to use expression data to assess the quality of PPI networks.
After discovering the Deane et al. paper on the EPR index [2], I found that this method was a solid implementation of my original hypothesis. I believe this project still has merit because I apply the EPR method to the E. coli dataset, whereas Deane et al. [2] applied it to the yeast dataset. In addition, their paper was based on PPI data published in 2001. There have been significant contributions to PPI data since the EPR analysis was done, so the current dataset can be expected to contain more complete protein interaction data.

Overview of the Data:

                 Genome Size       Interactions   Proteins
E. coli (2008)   4.6 million bp    6991           1863
Yeast (2001)     --                8063           4150
Yeast (2008)     12.5 million bp   18440          4943

Figure 1: Statistics for DIP (Database of Interacting Proteins)

Figure 1 shows that DIP [3] contains ~7000 E. coli protein interactions among 1863 unique proteins. The yeast (2001) data are provided to give a sense of the completeness of the E. coli dataset compared to the set used by the authors of the EPR index. The latest yeast (2008) statistics are provided to give a sense of how the size of the data has grown. Both of these points are meant to lend confidence to the usefulness of performing an EPR analysis of the current E. coli data.

The expression data used in this project were retrieved from the M3D database [4]. These data consist of a uniformly normalized Affymetrix compendium of E. coli expression data. There are 907 chips from 466 experiments available. For use with the EPR index, it is important to have enough variation in the conditions of the expression data to obtain a reasonable sampling of gene expression relationships. However, many of the conditions in the M3D data consist of knockouts, gene perturbations, and foreign gene expression. In order to avoid adding confounding effects from these artificial gene linkages, a smaller set of 20 experimental conditions was chosen.
While there is no mathematical basis for this choice, it was influenced by the fact that the authors of the EPR index used a set of 12 conditions for their work. These 20 conditions were all part of a single publication by Allen et al. [5].

Method:

Equation 1: Expression distance between two genes [2]

Equation 1 was used to calculate the expression distance between genes in the M3D dataset. The value $e_i$ refers to the Affymetrix expression value for gene A or B in condition i. The reference (ref) expression value is meant to be the control for the expression data. Since there was no designated control condition, the condition that most resembled a control was chosen: wild type, MOPS media, 5.55 mM glucose, log phase, time point 0. Averaging over the available conditions to create a reference was also considered, but did not seem to offer any considerable advantage over the chosen condition.

Three PPI datasets were considered in this analysis. Interactions that had been verified by small-scale experiments or identified in several experiments were collected into the CORE set. Interactions that were inferred only from high-throughput methods were collected into the NON set. For the RAND set, 100,000 protein pairs were selected at random from the genes available in the M3D data. This set has a low overlap with the CORE and NON sets, and can thus be considered a random sample of primarily non-interacting pairs of proteins.

                        Identifier   Number of Interactions   Mapping Available
Core Interactions       CORE         991                      220
Non-core Interactions   NON          5999                     903
Randomized Set          RAND         100,000                  100,000

Figure 2: Datasets used in analysis

In order to extract expression distances for the interactions in the PPI datasets (CORE, NON, and RAND), the UniProt protein identifiers in the interaction data had to be mapped to the gene identifiers in the expression data.
This was done by running BLAST with the protein sequences against the translated E. coli genome. As a result, 220 interactions from the CORE set and 903 interactions from the NON set were mapped to expression values. The RAND set was selected directly from the gene list, and as such did not need to be mapped. The low success rate of this mapping is addressed in the Discussion section.

Results:

Figure 3: Density distribution $p(d^2_{AB})$ of the CORE, NON, and RAND sets. Bin size = 0.05

Figure 4: Results adapted from Deane et al. [2]. Bin size = 1.25

Figure 4 shows the results from the Deane et al. [2] paper as a point of comparison. Results from this analysis are shown in figure 3, which plots the density distribution $p(d^2_{AB})$ against the expression distance $d^2_{AB}$. In this figure, the CORE set is considered representative of the distribution of a highly accurate set of interactions (accurate in the sense of few false positives or false negatives). The RAND set is considered representative of the distribution of all non-interacting pairs. The NON set is a set of high-throughput interactions. The CORE set is equivalent to the INT set of figure 4, the NON set is equivalent to the Experimental set of figure 4, and the RAND set is equivalent to the RND1 set of figure 4.

The key observation is that the NON set appears to be intermediate between the CORE and RAND sets (this is easier to see in figure 4, where the Experimental set is intermediate between the INT and RND1 sets). This suggests that the NON set can be modeled as a mixture of interactions drawn at random from the CORE and RAND sets. Deane et al. proposed Equation 2, where solving the linear least-squares problem yields a value for $\alpha_{epr}$, which represents the fraction of NON interactions that are true interactions.

Equation 2: Solving the linear least-squares problem yields $\alpha_{epr}$, the fraction of true interactions in $\rho_{exp}$.
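As a sketch consistent with the mixture description above, and assuming the three distance distributions are binned on a common grid (the bin index $k$ is introduced here only for illustration), the model and its closed-form least-squares solution can be written as follows. This is a reading of the text, not a transcription of the original equation.

```latex
% Mixture model: the experimental distribution is a blend of the
% interacting and non-interacting distributions.
\rho_{exp}(d^2) = \alpha_{epr}\,\rho_{i}(d^2) + (1 - \alpha_{epr})\,\rho_{n}(d^2)

% Least-squares objective over histogram bins k:
\min_{\alpha_{epr}} \sum_{k} \Bigl[ \rho_{exp,k} - \alpha_{epr}\,\rho_{i,k} - (1 - \alpha_{epr})\,\rho_{n,k} \Bigr]^2

% Setting the derivative with respect to \alpha_{epr} to zero gives:
\alpha_{epr} = \frac{\sum_{k} (\rho_{exp,k} - \rho_{n,k})(\rho_{i,k} - \rho_{n,k})}
                    {\sum_{k} (\rho_{i,k} - \rho_{n,k})^2}
```

With this form, $\alpha_{epr} = 1$ when the experimental distribution coincides with the interacting one, and $\alpha_{epr} = 0$ when it coincides with the non-interacting one.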
Here $\rho_{exp}$ is the distance distribution for the experimental dataset (NON), $\rho_{i}$ is the distance distribution for the interacting dataset (CORE), and $\rho_{n}$ is the distance distribution for the non-interacting dataset (RAND). Regrettably, $\alpha_{epr}$ was not calculated for the data used in this analysis. Technical difficulties prevented me from solving for $\alpha_{epr}$ before this project was due.

Discussion: An unfortunate complication in the protein-to-gene mapping step forced the removal of 85% of the original interaction data. The issue arose from the inclusion of outdated protein identifiers in the DIP dataset. When queried for a sequence on the UniProt website [6], 780 of the 1863 identifiers returned a message stating that the identifier was no longer valid and that the protein had been demerged into several new protein identifiers. As a case study, one protein identifier, P22885, was demerged into P0A8P6 and P0A8P7. When the sequences for these were retrieved and aligned, all three identifiers had identical sequences, and the two new identifiers had identical annotations. This undermined my confidence in the core interactions that involved demerged proteins. While it would be possible to map each new identifier of a demerged protein against each new identifier of its interaction partner, there would be no way to show which of the resulting m × n interactions were true. As such, these interactions had to be removed to preserve the integrity of the CORE set. For the same reason, the demerged proteins were also removed from the NON set. The number of mappable interactions is given in figure 2. There exists an unexplored possibility that the one-to-many relationship of the old demerged proteins to their new identifiers could in fact map many-to-one at the gene level, effectively undoing the demerge and allowing the interactions to be used. This represents a technical challenge that should be pursued in the future.
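The gene-level collapse suggested above can be sketched as follows. This is a minimal illustration, not the procedure used in this project: the function names, the dictionary-based lookups, and every identifier other than P22885, P0A8P6, and P0A8P7 are hypothetical, and the assumption that P0A8P6 and P0A8P7 map to the same gene is made only for the example.

```python
def genes_for(protein_id, demerge_map, protein_to_gene):
    """Return the set of genes reached from one (possibly outdated) protein ID."""
    # An ID that is absent from demerge_map is treated as still current.
    current_ids = demerge_map.get(protein_id, [protein_id])
    return {protein_to_gene[p] for p in current_ids if p in protein_to_gene}


def rescue_interactions(interactions, demerge_map, protein_to_gene):
    """Split interactions into gene-level pairs and unresolved protein pairs."""
    rescued, unresolved = [], []
    for a, b in interactions:
        genes_a = genes_for(a, demerge_map, protein_to_gene)
        genes_b = genes_for(b, demerge_map, protein_to_gene)
        # Keep the interaction only if each partner collapses to exactly one gene.
        if len(genes_a) == 1 and len(genes_b) == 1:
            rescued.append((genes_a.pop(), genes_b.pop()))
        else:
            unresolved.append((a, b))
    return rescued, unresolved


# Toy data. Gene names and the OLD_X/NEW_X*/PARTNER identifiers are placeholders.
demerge_map = {
    "P22885": ["P0A8P6", "P0A8P7"],  # demerge reported in the text
    "OLD_X": ["NEW_X1", "NEW_X2"],   # hypothetical demerge
}
protein_to_gene = {
    "P0A8P6": "gene_1",  # ASSUMPTION: both new IDs map to one gene
    "P0A8P7": "gene_1",
    "NEW_X1": "gene_2",  # hypothetical: new IDs map to different genes
    "NEW_X2": "gene_3",
    "PARTNER": "gene_4",
}
interactions = [("P22885", "PARTNER"), ("OLD_X", "PARTNER")]

rescued, unresolved = rescue_interactions(interactions, demerge_map, protein_to_gene)
print(rescued)     # [('gene_1', 'gene_4')]
print(unresolved)  # [('OLD_X', 'PARTNER')]
```

Interactions whose partners each collapse to a single gene could be returned to the CORE or NON set, while the unresolved pairs would still face the m × n ambiguity described above.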

The second major point I would like to discuss arises from the density plot in figure 3. When comparing figure 3 to figure 4, it is evident that the CORE and NON sets follow a different distribution from the RAND set or the sets in figure 4. This stems from the very high percentage of interactions between a protein and another copy of the same protein (identity interactions). If a protein interacts with another of its type, the expression distance calculation yields a value of 0. 66.6% of the CORE interactions were of this type, as were 43% of the NON interactions. This presents a stark contrast to the values seen in figure 4, where the lowest bin shown represents 5% or less of the dataset. The authors did not comment on the number of distance values equal to zero, or on any special treatment of them. I would also like to point out that, while the mapping issue described above inflated the percentage of the NON and CORE datasets represented by identity interactions, 15% of the original 6991 interactions were identity interactions. This is substantially more than in the yeast data (5%). It would be an interesting line of inquiry to find out whether there is any biological significance to this large representation of identity interactions in E. coli compared to yeast.

Conclusion: The conclusions that can be drawn from this experiment are limited. The original questions I wanted to answer when I wrote my proposal included using expression data to predict novel interactions, and using expression data to assign confidence values to individual interactions. This analysis makes use of the distribution of expression distances between interacting proteins, and as such can only be used to assess the overall quality of a dataset, rather than that of individual interactions.
While a confidence value was not obtained, this project raised several new questions to be answered, and served as a good opportunity to learn about the challenges of handling large quantities of data of different types.

References:

[1] Deeds EJ, Ashenberg O, Shakhnovich EI. "A simple physical model for scaling in protein-protein interaction networks." Proc. Natl Acad. Sci. USA (2006) 103:311-316.
[2] Deane CM et al. "Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations." Molecular & Cellular Proteomics (2002) 1.5:349-356.
[3] Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. "The Database of Interacting Proteins: 2004 update." Nucleic Acids Res. (2004) 32 Database issue:D449-51.
[4] Faith JJ, Driscoll ME, Fusaro VA, Cosgrove EJ, Hayete B, Juhn FS, Schneider SJ, Gardner TS. "Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured experimental metadata." Nucleic Acids Res.
[5] Allen TE et al. "Genome-Scale Analysis of the Uses of the Escherichia coli Genome: Model-Driven Analysis of Heterogeneous Data Sets." J Bacteriol. (2003) 185(21):6392-6399.
[6] The UniProt Consortium. "The Universal Protein Resource (UniProt)." Nucleic Acids Res. (2009) 37:D169-D174; Jain E, Bairoch A, Duvaud S, Phan I, Redaschi N, Suzek BE, Martin MJ, McGarvey P, Gasteiger E. "Infrastructure for the life sciences: design and implementation of the UniProt website." BMC Bioinformatics (2009) 10:136.