Advances in Environment, Computational Chemistry and Bioscience

Protein Complexes Enriched with Cancer Proteins

Chien-Hung Huang
Department of Computer Science and Information Engineering, National Formosa University, Yun-Lin, 632, Taiwan
chhuang@nfu.edu.tw

Szu-Yu Chou
Department of Computer Science and Information Engineering, National Formosa University, Yun-Lin, 632, Taiwan
19966157@gm.nfu.edu.tw

Ka-Lok Ng*
Department of Biomedical Informatics, Asia University, Taichung, 413, Taiwan
ppiddi@gmail.com

Abstract: Proteins participate in many aspects of biological processes within an organism, but they rarely function in isolation. More specifically, most proteins achieve a particular function by interacting with other proteins to form protein complexes. In this work, enrichment analysis is adopted to study protein complexes enriched with cancer proteins, which include oncoproteins (OCP), tumor suppressor proteins (TSP), and cancer proteins that act as both OCP and TSP. Furthermore, we constructed the protein complex interaction network and studied its pattern of interaction. A total of 1818 human protein complexes are retrieved from the MIPS database. Cancer protein data are obtained from the Tumor Associated Gene (TAG) database and Memorial Sloan-Kettering Cancer Center (MSKCC). Protein-protein interaction (PPI) data are derived from the BioGRID database. We used MATLAB to pre-process the data, identify PPIs, conduct the statistical analysis and present the visualization by animation. Hypergeometric analysis indicated that 248 and 36 protein complexes are enriched with cancer proteins, at statistical significance levels of 0.05 and 0.01 respectively. It is found that complexes consisting of cancer proteins tend not to interact with each other. The ratio of the interaction probability between a pair of cancer-related protein complexes (CRPC) to that between a CRPC and a non-CRPC is 0.124.
This suggests that CRPCs tend to interact with non-CRPCs. Understanding the biological significance of the complexes enriched with known cancer proteins and of their interactions may play a crucial role in studying the causes of cancer and in biomedicine. It is expected that the present work can lead to a better understanding of cell growth, differentiation and apoptosis. A web service related to this work has been set up and can be accessed at http://bioinfo.csie.nfu.edu.tw:8080/ProteinComplex/Default.aspx.

Key-Words: protein complex; cancer; oncoprotein; tumor suppressor protein; protein-protein interaction; enrichment analysis

1 Background

The cause of cancer is closely related to the gain of function of an oncoprotein (OCP) or the loss of function of a tumor suppressor protein (TSP), but the relationship among these cancer-related proteins is still largely unknown and uninvestigated. The cause of a disease is often associated with many proteins, and there is a good chance that these proteins are mutually regulated in their biological functions. Several studies have suggested that two proteins participating in the same protein-protein interaction (PPI) are highly similar in their biological function; therefore, if a protein is related to a disease, then its partners in PPI are also likely connected to the same disease [2, 6, 10].

Recent experimental studies showed that most proteins do not work alone in biological processes within an organism, but have temporary or stable interactions with other proteins. Cellular functions are performed by protein complexes, which are generally composed of many proteins. Protein complexes play critical roles in integrating individual gene products to conduct many cellular functions.
For example, the α3β1-tetraspanin protein complex is of vital importance in regulating protruding activity in tumor cells [13], and the complexes consisting of PDZ proteins are critical in constructing cell-cell adhesions and in epithelial cell polarity processes [11]. Hence, identifying the members of a protein complex is an elementary step toward understanding various biological processes.

To further analyze the cancer-related protein complexes (CRPC) and the interactions between them, this article employed PPI data to construct a protein complex interaction network. The findings will provide useful information to cancer researchers, such as: (1) identifying the CRPCs that contain a significantly higher number of OCPs or TSPs, and (2) characterizing the interaction pattern of the CRPC network. This study calculated the probability of a protein complex having at least one cancer protein; this result could be used to predict new CRPCs. Furthermore, the biological functions in which CRPCs participate are highly related to cell growth, differentiation and activity. Therefore, building a model to quantify the corresponding correlation factor will provide useful information for annotating CRPCs of unknown function.

Due to the availability of massive PPI data in recent years, a large-scale investigation has become possible. In this research, we propose to integrate the PPI data and the cancer protein data to determine which cancer proteins often appear in a CRPC.

In a previous work, Oti et al. applied PPI data to predict disease genes [10]. They collected 10894 human proteins (only 6005 of them were true human proteins; the others were inferred from orthologs) and 72940 PPIs. In addition, the authors collected 432 loci of 383 diseases. A protein whose interaction partner lies near one of these loci is predicted to be a possible disease-causative protein. However, this method has some drawbacks. First, PPI data from yeast two-hybrid (Y2H) experiments suffer from high false positive and false negative rates; second, the accuracy of PPI data inferred from cross-species orthologs is questionable; third, the disease gene loci from the OMIM Morbid Map differ from those from Ensembl.

Kar et al. used PPI data and protein structure information to analyze cancer proteins [6]. The authors integrated the human cancer PPI network and protein structure data to derive some properties of cancer-related proteins. They found that, when compared with non-cancer proteins, cancer-related proteins have four features: smaller binding sites, flatter surfaces, higher electric charge and lower hydrophobicity.

Goh et al. discussed the relation between the human disease network and PPI data [2]. In their work, the authors constructed a human disease network and found that the disease-causative proteins of the same disease have a higher probability of interacting with one another and have higher transcript expression, which suggests the existence of disease-specific functional modules. They also indicated that the majority of human essential genes encode hub proteins expressed in many organs, whereas most disease-causative genes are not essential genes and are located in the functional periphery of the network.

A protein is composed of many functional domains; therefore, domain-domain interaction (DDI) is used in many studies to examine PPI. Jonsson and Bates employed PPI data to analyze cancer-related proteins and domains [5]. They used graph theory to analyze the cancer-related PPI network, and found that cancer-related proteins are involved in more interactions than non-cancer-related proteins. The authors also listed the twenty most frequent domains, many of which are significantly related to DNA manipulation and repair, such as the Zinc-finger, PHD-finger, BRCT and paired-box domains.

Chan established a weighting scheme for the functional regions of cancer-related proteins [1]. Chan's study suggested that some domains have a higher tumor-stimulating weight, such as the protein kinase domain, tyrosine protein kinase, SH2, SH3 and the pleckstrin-like domain. Other domains have a higher tumor-suppressing weight, such as armadillo-like helical, ankyrin, cullin and exostosin-like. By applying the weighted score to the complete cDNA sequences in a human database, the author identified some novel cancer-related proteins.

Schuster-Böckler and Bateman applied PPI data to analyze DDI [12]. They computed the distribution of DDIs, obtained from iPfam, in several PPI databases, such as HPRD [9], MPact [4], BioGRID, DIP [14] and IntAct. The results indicated that the majority of PPIs can be explained by a few DDI combinations. In addition, the paper indicated that quite a few DDI combinations also exist across species, which shows that DDIs are quite conserved.

Lee et al. employed integrated heterogeneous data to predict DDI [8]. In their paper, the authors integrated the PPI data of yeast, C. elegans, D. melanogaster and H. sapiens from the DIP database with domain fusion and domain function data. A Bayesian method was used to integrate these data, and a scoring function was then constructed to derive the highly correlated DDIs. The authors listed ten common sets of DDIs, and the predicted results were finally compared with the iPfam database; the comparison showed that the method has decent accuracy.

Guimarães et al. proposed a scheme based on PPI data to predict DDI [3]. In their work, the authors employed the PPI network and the parsimony principle to predict DDIs, and then used linear
programming optimization method to estimate the reliability of the predicted DDIs.

Krycer et al. employed PPI and DDI data to infer the protein complex network [7]. In their work, the authors showed that PPI and DDI data can be used to interpret the core-module mechanism in protein complexes.

2 Method

2.1 Data source

The OCP and TSP data are derived from the following three databases: (1) the Tumor Associated Gene database of National Cheng Kung University, Taiwan (http://www.binfo.ncku.edu.tw/TAG/), (2) Memorial Sloan-Kettering Cancer Center and (3) National Yang Ming University. The PPI and protein domain data are obtained from BioGRID and Pfam. This research collected 536/139 oncogenes (OCG)/OCP and 900/422 tumor suppressor genes (TSG)/TSP. The numbers of OCP and TSP are smaller than those of OCG and TSG, respectively, because some genes have no UniProt accession numbers. The above data are integrated with the PPI data to derive the PPI data for OCP and TSP. A total of 1818 protein complexes are analyzed in the present study.

2.2 Measure the probability of cancer protein appearing in protein complex

The probability of a protein complex containing at least one cancer protein can be computed by hypergeometric distribution analysis. Let p(x,y) denote the probability that a protein complex consists of x cancer proteins and y non-cancer proteins; then the probability is given by:

$$p(x,y) = \frac{C^{n}_{x}\, C^{N-n}_{y}}{C^{N}_{x+y}} \tag{1}$$

where $C^{N}_{x} = \frac{N!}{(N-x)!\,x!}$, and N and n represent the total number of proteins and the number of cancer proteins in the protein complexes, respectively. The sum x+y is the total number of proteins in a given protein complex.

2.3 Measure the interaction probability between CRPC and non-CRPC

A subunit (protein) in one protein complex may interact with a subunit in another protein complex. We measure the interaction probability between a CRPC pair, denoted by p(C-C), and that between a CRPC and a non-CRPC, denoted by p(C-X) (this number comprises both p(C-X) and p(X-C)). Then:

$$R_{exp} = \frac{p(C-C)}{p(C-X)} = \frac{C^{a}_{2}}{C^{a}_{1}\, C^{b}_{1}} \tag{2}$$

where R_exp, a and b are the expected ratio of p(C-C) to p(C-X), the total number of CRPCs and the total number of non-CRPCs, respectively. We define the observed ratio, R_obs, as the ratio of the observed numbers of C-C and C-X interactions in the protein complex interaction network. Thus, if the ratio of R_obs to R_exp is less than one, the observed pattern of interaction is suppressed relative to the expected value.

2.4 Protein complex interaction network

A protein interaction network can be treated as a simple undirected graph, where each protein is mapped to a node and the interaction between two proteins is mapped to an edge. The visualization graph proposed in this work clearly shows the interaction between each protein pair and the protein attributes, and it can be further combined with various graph clustering algorithms to predict protein attributes and protein complexes.

3 Results

We performed statistical analyses of the cancer-related proteins in protein complexes, and the calculations were carried out using MATLAB.

3.1 Measure the probability of a cancer protein appearing in protein complex

Hypergeometric distribution analysis indicated that 248 and 36 protein complexes are enriched with cancer proteins, i.e. more than 50% of the complex's subunits are cancer proteins, at statistical significance levels of 0.05 and 0.01 respectively. Exploring the relationship between these cancer-related protein complexes and the formation of cancers is an interesting issue for future work.

3.2 Measure the interacted probability among cancer-related protein complexes

The computed results of p(C-C) and p(C-X) are 0.021 and 0.166, respectively. Therefore, the ratio of p(C-C) to p(C-X) is estimated to be 0.124, which means that complexes consisting of cancer proteins tend to interact with non-CRPCs.
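As a hedged illustration of the calculations in Eq. (1) and Eq. (2), the following Python sketch evaluates the hypergeometric probability, an upper-tail enrichment probability built from it, the expected ratio, and an observed ratio counted from a list of PPIs. Python and all function names are illustrative choices and are not part of the original MATLAB analysis. Two details are assumptions, because the text does not spell them out: the enrichment p-value is taken as the upper tail of the hypergeometric distribution, and two complexes are counted as interacting whenever any subunit of one has a PPI with a subunit of the other.

```python
from itertools import combinations
from math import comb


def hypergeom_pmf(x, y, n_cancer, n_total):
    """Eq. (1): probability that a complex of size x + y holds x cancer proteins."""
    return comb(n_cancer, x) * comb(n_total - n_cancer, y) / comb(n_total, x + y)


def enrichment_p_value(x_obs, size, n_cancer, n_total):
    """Assumed upper tail: P(at least x_obs cancer proteins in a complex of `size`)."""
    upper = min(size, n_cancer)
    return sum(
        hypergeom_pmf(x, size - x, n_cancer, n_total)
        for x in range(x_obs, upper + 1)
    )


def expected_ratio(a, b):
    """Eq. (2): C(a, 2) / (C(a, 1) * C(b, 1)) for a CRPCs and b non-CRPCs."""
    return comb(a, 2) / (comb(a, 1) * comb(b, 1))


def observed_ratio(complexes, is_crpc, ppis):
    """Ratio of interacting C-C complex pairs to interacting C-X complex pairs.

    complexes: dict mapping a complex id to a set of protein ids
    is_crpc:   dict mapping a complex id to True (CRPC) or False (non-CRPC)
    ppis:      iterable of (protein, protein) pairs, treated as undirected
    """
    ppi_set = {frozenset(pair) for pair in ppis}
    cc = cx = 0
    for c1, c2 in combinations(sorted(complexes), 2):
        # Assumption: one PPI between subunits is enough to link two complexes.
        linked = any(
            frozenset((p, q)) in ppi_set
            for p in complexes[c1]
            for q in complexes[c2]
            if p != q
        )
        if not linked:
            continue
        if is_crpc[c1] and is_crpc[c2]:
            cc += 1
        elif is_crpc[c1] or is_crpc[c2]:
            cx += 1
    return cc / cx if cx else float("nan")


if __name__ == "__main__":
    # Hypothetical toy numbers, not taken from the data set of this work.
    print(enrichment_p_value(x_obs=2, size=3, n_cancer=4, n_total=10))
    print(expected_ratio(a=5, b=10))
```

For a CRPCs and b non-CRPCs, Eq. (2) reduces to (a − 1)/(2b), so the value returned by `expected_ratio` can be compared directly with `observed_ratio` to check whether R_obs/R_exp is below one.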
3.3 Protein complex interaction network

If we treat a protein as a node and the interaction between a pair of subunits in different complexes as an edge, then we can construct a protein complex network. The powerful graphic capability of MATLAB makes it possible to visualize the cancer protein complex network. In this network, proteins and interactions are drawn as nodes and edges respectively. OCP, TSP and non-cancer proteins are represented by different shapes and presented by animations. We labeled each cancer class in the PPI network by a different color and shape. On the other hand, recent experimental studies indicated that a protein complex can be visualized as a unit composed of cores, modules and attachments. Core proteins are proteins that have relatively more interactions among themselves. For clarity, the visualization graph is divided into two parts, with PPI and without PPI. The results are displayed by animation, and a web service has been set up which can be accessed at http://bioinfo.csie.nfu.edu.tw:8080/ProteinComplex/Default.aspx.

Fig.2 is a view of a non-cancer-related protein complex without PPI. In this case, there is no interaction among the subunits; therefore, no connecting line is shown in the graph.

4 Conclusion

The results indicated that 248 (13.6%) and 36 (1.98%) protein complexes are enriched with cancer proteins at the 0.05 and 0.01 significance levels respectively. It is also found that CRPC pairs seldom interact with one another: compared with the interactions between CRPC and non-CRPC pairs, the ratio is 0.124, which also means the observed pattern of interaction is suppressed relative to the expected value. The complexes enriched with cancer proteins and the interactions between CRPCs deserve further study of their relation to cancer formation mechanisms in the future.
Fig.1 A cancer-related protein complex with PPI among its subunits.

Fig.1 is a cancer-related protein complex with PPI among its subunits. The different types of proteins are displayed by different shapes of nodes; that is, TSP, OCP and non-cancer-related proteins are represented by ovals, squares and circles, respectively. The number inside a node is the UniProt ID of the corresponding protein.

Fig.2 A non-cancer-related protein complex without PPI.

Acknowledgement

The work of Chien-Hung Huang and Ka-Lok Ng is supported by the National Science Council of Taiwan under grants NSC 101-2221-E-150-088MY2 and NSC 100-2221-E-468-013, respectively.

References:
[1] Chan HH, Identification of novel tumor-associated gene (TAG) by bioinformatics analysis, MSc. Thesis, National Cheng Kung University, Taiwan.
[2] Goh KI, Cusick ME, Valle D, Childs B, Vidal M and Barabási AL, The human disease network, Proc Natl Acad Sci U S A, Vol. 104, No. 21, 2007, pp. 8685-8690.
[3] Guimarães KS, Jothi R, Zotenko E and Przytycka TM, Predicting domain-domain interactions using a parsimony approach, Genome Biol, Vol. 7, No. 11, 2006, pp. R104.
[4] Güldener U, Münsterkötter M, Oesterheld M, Pagel P, Ruepp A, Mewes HW and Stümpflen V, MPact: the MIPS protein interaction resource on yeast, Nucleic Acids Res, 34, 2006, pp. D436-441.
[5] Jonsson PF and Bates PA, Global topological features of cancer proteins in the human interactome, Bioinformatics, Vol. 22, No. 18, 2006, pp. 2291-2297.
[6] Kar G, Gursoy A and Keskin O, Human cancer protein-protein interaction network: a structural perspective, PLoS Comput Biol, Vol. 5, No. 12, 2009, pp. e1000601.
[7] Krycer JR, Pang CN and Wilkins MR, High throughput protein-protein interaction data: clues for the architecture of protein complexes, Proteome Sci, 6:32, 2008.
[8] Lee H, Deng M, Sun F and Chen T, An integrated approach to the prediction of domain-domain interactions, BMC Bioinformatics, 7:269, 2006.
[9] Mishra GR, Suresh M and Kumaran K et al., Human protein reference database--2006 update, Nucleic Acids Res, 34, 2006, pp. D411-414.
[10] Oti M, Snel B, Huynen MA and Brunner HG, Predicting disease genes using protein-protein interactions, J Med Genet, Vol. 43, No. 8, 2006, pp. 691-698.
[11] Roh MH and Margolis B, Composition and function of PDZ protein complexes during cell polarization, Am J Physiol Renal Physiol, Vol. 285, No. 3, 2003, pp. F377-F387.
[12] Schuster-Böckler B and Bateman A, Reuse of structural domain-domain interactions in protein networks, BMC Bioinformatics, 18;8:259, 2007.
[13] Sugiura T and Berditchevski F, Function of alpha3beta1-tetraspanin protein complexes in tumor cell invasion. Evidence for the role of the complexes in production of matrix metalloproteinase 2 (MMP-2), J Cell Biol, Vol. 146, No. 6, 1999, pp. 1375-1389.
[14] Xenarios I, Rice DW, Salwinski L, Baron MK, Marcotte EM and Eisenberg D, DIP: the database of interacting proteins, Nucleic Acids Res, Vol. 28, No. 1, 2000, pp. 289-291.