Improving PPI Networks with Correlated Gene Expression Data
Jesse Walsh

Background
• PPI networks are currently derived either computationally or experimentally
• It is well known that computationally derived networks contain a great number of false positives and false negatives
• High-throughput experiments
  – Two major published yeast PPI experiments shared only ≈150 interactions out of thousands [1]

Goal: Improve the Data Quality
• The goal of this study is to improve the quality of computationally predicted protein-protein interactions
• Hypothesis:
  – Proteins that interact may also have similar expression patterns
  – Gene coexpression is correlated with PPIs

Previous Work
• Deane et al. (2001) [2]
  – Proposed the EPR (Expression Profile Reliability) metric: use gene expression profiles to assess the quality of computationally predicted protein-protein interactions
Figure adapted from Deane et al. [2]

Glimpse at the Data
Interaction Data: DIP dataset statistics

                  Genome Size       Interactions   Proteins
  E. coli (2008)  4.6 million bp    7447           1863
  Yeast (2001)    --                8063           4150
  Yeast (2008)    12.5 million bp   18440          4943

DIP (Database of Interacting Proteins) http://dip.doe-mbi.ucla.edu/dip/Main.cgi [3]

Affymetrix Expression Data: M3D (Many Microbe Microarrays Database)
  Number of Experiments: 466
  Number of Chips: 907
  Genes: 4298
M3D (Many Microbe Microarrays Database) http://m3d.bu.edu/cgi-bin/web/array/index.pl?section=home [4]

Expression Data Selection
• Concerned about complications from adding too many expression conditions
  – Knockouts, over-expression, foreign genes
• Selected a group of 20 conditions that were published as part of the same experiment
  – Hope for more homogeneous data
  – Allen et al. [5]

Method: Gene Expression Distance
• Expression distance given by:
• Summed over 20 conditions
  – Treated the first condition (wild-type anaerobic) as the reference condition for lack of a true control
Expression Distance equation from Deane et al.
[2]

Method: DIP Data
• DIP labels interactions as ‘core’ or ‘non-core’
  – Corresponds roughly to small-scale experiments and high-throughput experiments
• 3 Interaction Datasets
  – Core
    • Core interactions
  – Non
    • Non-core interactions
  – Rand
    • 100,000 random interactions were created

Method: Mapping
• DIP interaction set used UniProt protein identifiers, while M3D used gene IDs
• Ran a BLAST search of protein sequences against the translated E. coli genome to map the datasets together
• Lost most of my data in this step

Number of Interactions

          Mapping    Available
  CORE    991        220
  NON     5999       903
  RAND    100,000    100,000

Results
Density distribution of squared distances
Bin size = 0.05

Results from Deane et al. [2]
Bin size = 1.25
Figure adapted from Deane et al. [2]

Results from Deane et al. [2]
Least Squares Factorization
Figure adapted from Deane et al. [2]

Discussion
• Mapping/Demerging problem
  – Kept 1044 of my 6991 interactions (15%)
• Case study P22885
  – Obsolete since 2005
  – Demerged to P0A8P6 and P0A8P7
    • Tyrosine recombinase xerC
    • All three have a perfect ClustalW match

Discussion
• Shape of curve
  – Multimers and proteins that link to themselves in the PPI network (the zeros problem)
  – 66.6% of Core, 43.6% of Non, <0.1% of Rand

Conclusion
• Cannot predict novel interactions
• Cannot assign confidence values to individual interactions
• Can provide some measure of the overall quality of a PPI dataset

Thank You

References:
[1] Deeds EJ, Ashenberg O, Shakhnovich EI. “A simple physical model for scaling in protein-protein interaction networks.” Proc. Natl Acad. Sci. USA (2006) 103:311–316
[2] Charlotte M. Deane et al. “Protein Interactions: Two Methods for Assessment of the Reliability of High Throughput Observations.” Molecular & Cellular Proteomics 1.5 349-356
[3] Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D (2004) The Database of Interacting Proteins: 2004 update.
NAR 32 Database issue: D449-51
[4] Faith JJ, Driscoll ME, Fusaro VA, Cosgrove EJ, Hayete B, Juhn FS, Schneider SJ, and Gardner TS. Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured experimental metadata. Nucleic Acids Research
[5] Timothy E. Allen et al. “Genome-Scale Analysis of the Uses of the Escherichia coli Genome: Model-Driven Analysis of Heterogeneous Data Sets.” J Bacteriol. 2003 November, 185(21): 6392-6399
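
Supplementary Sketch: Expression Distance and Least-Squares Fit

The Method and Results slides describe two computations: a squared expression distance summed over conditions relative to a reference condition, and a least-squares factorization of the resulting density distributions. The Python below is a minimal sketch of how these might look, not the code used in the study. It assumes that the distance is the sum of squared differences between the two genes' log expression ratios (each condition divided by the reference condition), and that the factorization models an observed density as alpha * true + (1 - alpha) * random. Both formulas are assumptions based on a reading of Deane et al. [2], and all function and parameter names are hypothetical.

```python
import math


def log_ratio_profile(expr, ref_index=0):
    """Log ratio of every condition against the reference condition.

    Assumption: expression values are positive intensities. The reference
    condition itself maps to log(1) = 0 and so adds nothing to the distance.
    """
    ref = expr[ref_index]
    return [math.log(e / ref) for e in expr]


def squared_expression_distance(expr_a, expr_b, ref_index=0):
    """Sum over conditions of squared log-ratio differences (assumed form)."""
    profile_a = log_ratio_profile(expr_a, ref_index)
    profile_b = log_ratio_profile(expr_b, ref_index)
    return sum((x - y) ** 2 for x, y in zip(profile_a, profile_b))


def density(values, bin_size, n_bins):
    """Normalized histogram; values past the last bin are counted in it."""
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v / bin_size), n_bins - 1)] += 1
    return [c / len(values) for c in counts]


def fit_true_fraction(observed, ref_true, ref_random):
    """Least-squares alpha for observed ~ alpha*ref_true + (1-alpha)*ref_random.

    Minimizing sum((o - r - alpha*(t - r))**2) over alpha gives the closed
    form below. The result is not clamped to the range [0, 1].
    """
    num = sum((o - r) * (t - r)
              for o, t, r in zip(observed, ref_true, ref_random))
    den = sum((t - r) ** 2 for t, r in zip(ref_true, ref_random))
    if den == 0:
        raise ValueError("reference densities are identical; alpha is undefined")
    return num / den


if __name__ == "__main__":
    # Toy values only, not data from the study.
    gene_a = [2.0, 4.0, 8.0]   # first entry plays the reference condition
    gene_b = [1.0, 2.0, 4.0]   # same fold changes as gene_a
    gene_c = [1.0, 1.0, 1.0]   # flat profile
    print(squared_expression_distance(gene_a, gene_b))
    print(squared_expression_distance(gene_a, gene_c))
    print(fit_true_fraction([0.5, 0.5], [0.8, 0.2], [0.2, 0.8]))
```

One way to apply this to the three datasets would be to pass the Core density as ref_true, the Rand density as ref_random, and the Non density as observed. That pairing is an assumption for illustration; the slides do not state which distributions were fitted.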