Presentation for Shamir group meeting
Interactome under construction: protein-protein interaction and pathway databases
5/1/2011

Based on the papers:
Protein-protein interactions: Interactome under construction. Bonetta L. Nature. (PMID: 21150998, Dec 2010)
Protein-protein interaction and pathway databases, a graphical review. Klingström T, Plewczynski D. Brief Bioinform. (PMID: 20851835, Sep 2010)

Protein-protein interaction (PPI) and biological pathway (BP) databases
There are two kinds of databases, each concentrating on one of two aspects of the biochemical/biological data:
(i) Protein-protein interaction (PPI) databases gather data on the physical interactions between proteins.
(ii) Biological pathway (BP) databases, covering metabolic and transport pathways, signaling cascades and regulation networks, gather data on the biological meaning of PPIs and of other possible interactions between gene products.
Most databases of both kinds can visualize and produce maps showing a selected group of interactions.
In this presentation I will concentrate mainly on PPI databases, and on BP databases that use PPIs.

[Figure: STRING, ATM, 73 interactors (confidence level > 0.700)]
[Figure: STRING, ATM-100]
[Figure: ATM signaling network]
[Figure: MAPK cascade]
[Figure: KEGG map for the development of melanoma]

Methods for detecting protein-protein interactions
There are two main approaches for detecting interacting proteins: binary approaches, which measure direct physical interactions between protein pairs, and co-complex methods, which measure interactions among groups of proteins that may not form direct physical contacts.
The two main binary methods for measuring direct physical interactions between protein pairs are:
Yeast two-hybrid (Y2H)
Luminescence-based mammalian interactome mapping (LUMIER)
The most common co-complex method is co-immunoprecipitation (co-IP) coupled with mass spectrometry (MS).
In addition to these empirical methods, researchers have used computational techniques to predict interactions on the basis of factors such as amino-acid sequence and structural information.

The most frequently used binary method is the yeast two-hybrid (Y2H) system. It has variations involving different reagents, and it has been adapted to high-throughput screening. The strategy interrogates two proteins, called bait and prey, each coupled to one half of a transcription factor and expressed in yeast. If the two proteins make contact, they reconstitute a transcription factor that activates a reporter gene.

LUMIER (luminescence-based mammalian interactome mapping) is a second method for identifying binary interactions. It fuses the enzyme Renilla luciferase (RL), which catalyses light-emitting reactions, to a bait protein. The bait is expressed in a mammalian cell along with candidate protein partners tagged with a polypeptide called Flag. Researchers use a Flag antibody to immunoprecipitate all proteins carrying the Flag tag, along with any proteins that interact with them. An interaction between the RL-fused bait and a Flag-tagged prey is detected when light is emitted.

In co-IP coupled with MS, a protein bait is tagged with a molecular marker. Techniques that recognize the tag are used to fish the bait protein out of the cell lysate, bringing with it any interacting proteins. These proteins are then identified by mass spectrometry.
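The difference between binary and co-complex data can be made concrete with a small sketch. It assumes a single hypothetical pull-down (the protein names A to D are placeholders, not real data) and shows two common ways of turning a co-complex result into pairs: pairing only the bait with each co-purified protein (often called the spoke model), or pairing every member of the complex with every other (the matrix model). Neither expansion is described in the presentation itself; both are included only to illustrate why co-complex results contain pairs that may not be in direct physical contact.

```python
from itertools import combinations


def spoke_pairs(bait, preys):
    """Pair the bait with each co-purified protein (spoke expansion)."""
    return {frozenset((bait, prey)) for prey in preys if prey != bait}


def matrix_pairs(bait, preys):
    """Pair every member of the purified complex with every other member."""
    members = {bait, *preys}
    return {frozenset(pair) for pair in combinations(sorted(members), 2)}


# Hypothetical co-IP/MS result: bait A pulled down together with B, C and D.
bait, preys = "A", ["B", "C", "D"]
spoke = spoke_pairs(bait, preys)
matrix = matrix_pairs(bait, preys)

# Pairs that appear only in the matrix expansion involve no bait at all,
# so the experiment gives no direct evidence that they touch each other.
prey_only = matrix - spoke
print(len(spoke), len(matrix), len(prey_only))
```

A binary method such as Y2H would report at most the pairs it actually tested, whereas either expansion above can include proteins that sit in the same complex without touching.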
The binary methods for measuring direct physical interactions: the yeast two-hybrid system
A plasmid is inserted into a yeast cell of mating type a. It carries the DNA encoding the DNA-binding domain of a transcription factor, coupled to the DNA encoding the "bait" protein (the protein whose possible partners we wish to identify). The transcription factor is needed to turn on expression of a reporter gene such as lacZ, which encodes the enzyme β-galactosidase.
A second plasmid is inserted into yeast cells of mating type α. It carries the DNA encoding the activation domain of the transcription factor, coupled to the DNA encoding a possible partner ("prey") protein.
The α cells are then mated with the a cells. If the fusion protein produced from a prey-containing plasmid can bind to the fusion protein containing the bait, the two domains of the transcription factor are brought together and turn on expression of the reporter gene (lacZ in our case). Grown on an indicator substrate, these colonies turn blue. The DNA in these colonies can then be isolated and sequenced.
The result: identification of the proteins that can associate.

LUMIER (luminescence-based mammalian interactome mapping)
LUMIER is a method for identifying binary interactions that has been developed into a high-throughput approach. It fuses the enzyme Renilla luciferase (RL), which catalyses light-emitting reactions, to a bait protein (A in the picture). The bait is expressed in a mammalian cell along with candidate protein partners tagged with a polypeptide called Flag. Researchers use a Flag antibody to immunoprecipitate all proteins carrying the Flag tag, along with any proteins that interact with them. An interaction between the RL-fused bait and a Flag-tagged prey is detected when light is emitted.

The problem of false positive PPI reports
The reliability of the results of Y2H experiments is relatively low.
The reliability of co-IP experiments is also low, partly because some non-specific partners are included in a reported PPI, but mainly because the method identifies proteins that belong to the same complex rather than direct partners.
Possible solutions:
1) Using at least two different methods when analyzing specific PPIs.
2) Combining the interaction data obtained in an experiment with data available in public databases, which gives a more complete picture (for example, known PPI networks of other organisms, co-expression data, and bioinformatics tools that identify sequences in the proteins that promote specific interactions).

The false negative problem
One challenge in defining protein-protein interaction networks is that, unlike the genome, the interactome is dynamic. Many interactions are transient, and others occur only in certain cellular contexts or at particular times in development. Interactions vary depending on the type of cell and the cellular environment.
The paper "Interactome under construction" gives the protein-protein interaction network of TGF-β, a growth factor that regulates cell functions, as an example of this complexity. Two proteins that pass on the signal from the factor inside the cell, Smad2 and Smad4, interact with one another only when the cells are stimulated with TGF-β. If the cells are not stimulated, the two proteins do not come into contact. It seems that after stimulation the contact can form because of a change in the environment of the proteins, because of specific post-translational modifications of these proteins, or both.

New methods for identifying interactome changes in disease
AQUA (multiplex absolute quantification) is a new method that aims to capture dynamic changes in protein interaction networks.
AQUA uses synthetic peptides that contain stable isotopes as internal standards for the native peptides produced when proteins from a cell lysate are digested. Using tandem MS, researchers can compare the levels of native and synthetic peptides in a cell to obtain a measure of the amount of native protein present. Synthetic peptides can also be prepared with modifications. This method can provide an accurate and sensitive measure of how the stoichiometry of the components within the complexes that make up a network is altered in response to a stimulus.

KAYAK (kinase activity assay for kinome profiling) is another approach, aimed at developing diagnostic tools for cancer on the basis of the functional consequences of the interaction between a protein, in this case a kinase, and its substrate. In this method, up to 90 peptide substrates for kinases are used to measure simultaneously the addition of phosphate groups to proteins in a cell lysate, in essence providing a "phosphorylation signature" for that particular cell.

Some examples of PPI and biological pathway databases
PPI databases
BIOGRID (http://thebiogrid.org)
STRING (http://stringdb.org)
DIP (http://dip.doe-mbi.ucla.edu/)
MINT (http://mint.bio.uniroma2.it/mint/Welcome.do)
IntAct (http://www.ebi.ac.uk/intact/main.xhtml)
HPRD (http://hprd.org/)
BIND (http://bind.ca/)

Biological pathway databases
SPIKE (http://www.cs.tau.ac.il/~spike/ and http://spike.cs.tau.ac.il/spike2/)
REACTOME (http://www.reactome.org/)
KEGG (http://www.genome.jp/kegg/)
GeneMANIA (http://genemania.org/)

CYTOSCAPE (http://www.cytoscape.org/) is open-source software for network visualization. It is the most important tool for visualizing PPI and biological pathway data.
NCBI_GENE (http://www.ncbi.nlm.nih.gov/gene/) is the data source for human genes, and it also gathers relevant data on the interactions and regulation of genes and gene products.
SPIKE imported PPI data from the IntAct PPI database and from the REACTOME and KEGG biological pathway databases.

Categories and characteristics of PPI databases
Stand-alone databases: BIND, DIP, HPRD, IntAct and MINT do not incorporate data from other databases. BioGRID imported the HPRD and FlyBase databases in 2006, but has not added data from other databases since then.
Topical databases: DroID (PPIs in Drosophila melanogaster), MatrixDB (extracellular PPIs), InnateDB (PPIs in the immune system) and MPIDB (PPIs in microbes) combine data mining of other source databases with their own curation efforts.
Metamining databases: APID, MiMI and UniHI aim to unify the source databases into a single comprehensive meta-database.
Predictive interaction databases: HAPPI, STRING, STITCH and Scansite. STRING combines known interaction data from the interaction databases BIND, BioGRID, DIP, IntAct, MINT and HPRD with interactions from the pathway databases PID, Reactome, KEGG and EcoCyc.

Inconsistencies in the definition of a protein "interaction"
Databases use three different classes of protein interaction, sometimes without separating them: binary physical interactions, membership of the same complex (indirect interactions), and non-physical functional interactions.
Because of these inconsistencies in the definition of "interaction", there is confusion about the size of the human interactome: Venkatesan et al. estimate about 130,000 interactions, Hart et al. 154,000-369,000 interactions, and Stumpf et al. about 650,000 interactions.
Closer inspection reveals that each team has defined its own search space as the human interactome. Venkatesan et al. use the most restrictive definition and include only binary physical interactions; Hart et al.
use in-house experimental data obtained by IP-MS to create their source networks, which means that proteins belonging to the same protein complex are also considered to be interacting, and this increases the size of their defined interactome. Stumpf et al. rely on a combination of yeast two-hybrid (Y2H) derived data sets and literature-curated data from DIP and IntAct. Some literature-curated databases use a more flexible definition of "interaction": some of the source papers also count non-physical functional interactions as interactions. This definition significantly enlarges the number of interactions.
With current technologies, about 35,000 human PPIs are known, only about a quarter of even the lowest estimate above (130,000). The central problem in the construction of the interactome is therefore the false negative problem: the known interactions are just the tip of the iceberg, and a huge number of PPIs still has to be identified.

Organism coverage and size of some PPI and BP databases
BIOGRID (http://thebiogrid.org/)
50 model organism species. Version 3.1.71 of this online interaction repository includes 362,355 raw protein and genetic interactions from major model organism species.
STRING (http://stringdb.org)
STRING is a database of known and predicted protein interactions. The interactions include direct (physical) and indirect (functional) associations. STRING quantitatively integrates interaction data from several sources for a large number of organisms, and transfers information between these organisms where applicable. The database currently covers 2,590,259 proteins from 630 organisms.
GeneMANIA (http://genemania.org/)
Indexes 817 association networks containing 185,324,281 interactions mapped to 135,148 genes from 6 organisms.
HPRD (http://hprd.org/)
39,194 protein-protein interactions (human)
DIP (http://dip.doe-mbi.ucla.edu/)
More than 80 genomes, but for human only 2,529 proteins and 3,376 interactions
IntAct (http://www.ebi.ac.uk/intact/main.xhtml)
234,147 binary interactions, 69,669 proteins
MINT (http://mint.bio.uniroma2.it/mint/Welcome.do)
30 organisms, 90,503 interactions (21,938 human)
SPIKE (http://www.cs.tau.ac.il/~spike/ and http://spike.cs.tau.ac.il/spike2/)
34,266 interactions (human only)

[Figure: STRING ATM interactors]

Interaction databases: metamining and predictive PPI databases
The PPI community has been characterized by a wide and open distribution of proteomic data through the collection of PPI and pathway databases. The ability to distribute and share data between research groups has resulted in a large number of different source databases. However, the general overlap between PPI databases is limited, so researchers commonly have to unify these diverse data sets to support their own work. Several metamining databases have been created to perform such unification. This has led to the spontaneous development of a network of data exchange between literature-curated databases, metamining databases and databases that generate predicted PPIs. The exchange of information is supported by three major data exchange formats: BioPAX, PSI-MI and SBML.

[Figure legend: predictive interaction databases, metamining databases, pathway databases]
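The unification step that metamining databases perform can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure of any particular database: the source names (db_one, db_two) and protein identifiers (P1 to P6) are hypothetical, interactions are treated as unordered pairs that already share one identifier scheme, and overlap is measured with the Jaccard index, which is one choice among several.

```python
def normalize(pairs):
    """Turn (a, b) tuples into unordered pairs and drop self-interactions."""
    return {frozenset(pair) for pair in pairs if pair[0] != pair[1]}


def unify(sources):
    """Map each interaction to the set of source names that report it."""
    merged = {}
    for name, pairs in sources.items():
        for edge in normalize(pairs):
            merged.setdefault(edge, set()).add(name)
    return merged


def jaccard(pairs_a, pairs_b):
    """Shared interactions divided by all interactions in either source."""
    a, b = normalize(pairs_a), normalize(pairs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical source databases with placeholder protein identifiers.
sources = {
    "db_one": [("P1", "P2"), ("P2", "P3"), ("P4", "P5")],
    "db_two": [("P2", "P1"), ("P5", "P6")],
}

merged = unify(sources)
# Interactions reported by more than one source have independent support.
shared = [edge for edge, names in merged.items() if len(names) > 1]
overlap = jaccard(sources["db_one"], sources["db_two"])
print(len(merged), len(shared), overlap)
```

Keeping the set of supporting sources for every interaction, rather than a plain union, also makes it possible to require at least two independent reports, in the spirit of the first solution suggested for the false positive problem. In practice the identifier mapping between databases is the hard part, and this sketch assumes it has already been done.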