LECTURE 3

• Introduction to PCA and PLS
• K-means clustering
• Protein function prediction using network concepts
• Network centrality measures

Handling Multivariate Data

Multivariate data example

Student  Math  Chem  Phy  Bio  Eco  Soc
A        7     8     7    8    7    7
B        8     7     7    6    8    7
C        9     7     8    7    6    7
D        7     7     7    7    9    8
E        7     6     6    6    8    8
F        7     7     7    7    8    8
G        6     6     6    7    7    7
H        9     8     8    6    6    6
I        8     8     8    7    6    6
J        7     7     6    6    8    9

Principal Component Analysis (PCA) and Partial Least Squares (PLS)

• Two major common effects of using PCA or PLS
  - Convert a group of correlated predictive variables into a group of uncorrelated variables
  - Construct a "strong" predictive variable from several "weaker" predictive variables
• Major difference between PCA and PLS
  - PCA is performed without considering the target variable, so PCA is an unsupervised analysis
  - PLS is performed to maximize the covariance between the target variable and the predictive variables, so PLS is a supervised analysis

[Figure: PCA maps the data matrix A (n x p) to the principal component matrix PC (n x p). PLS works in two steps: (1) decomposition step, X (n x p) is decomposed into factors T (n x c) and Y (n x q) into factors U (n x c) with maximum covariance between T and U; (2) regression step, Y is regressed on T.]

PCA notation: A = data matrix, PC = principal component matrix, n = # of observations, p = # of variables
PLS notation: X = matrix of predictors, Y = matrix of responses, T = factors of predictors, U = factors of responses, n = # of observations, p = # of predictors, q = # of responses, c = # of extracted factors

Principal Component Analysis (PCA)

In Principal Component Analysis, we look for a few linear combinations of the predictive variables that can be used to summarize the data without losing too much information. Intuitively, PCA is a method of extracting information from higher-dimensional data by projecting it onto a lower dimension.

Example: Consider the scatter plot of 3-dimensional data (3 variables). The data across the 3 variables are highly correlated and the majority of the points cluster around the center of the space. The direction along which the points spread the most is also the direction of the 1st PC, which gives roughly equal weight to the 3 variables:

$PC_1 = -0.56 X_1 - 0.57 X_2 - 0.59 X_3$

Properties of Principal Components

• $\mathrm{Var}(PC_i) = \lambda_i$, the i-th largest eigenvalue
• $\mathrm{Cov}(PC_i, PC_j) = 0$ for $i \neq j$
• $\mathrm{Var}(PC_1) \geq \mathrm{Var}(PC_2) \geq \dots \geq \mathrm{Var}(PC_p)$

Numerical Example

The table under "Multivariate data example" gives the high school grades of 10 students in 6 subjects (scale 1-10):
• Math = Mathematics
• Chem = Chemistry
• Phy = Physics
• Bio = Biology
• Eco = Economics
• Soc = Sociology

Results

            PC1     PC2     PC3     PC4     PC5     PC6
Eigenvalue  3.020   0.708   0.497   0.219   0.167   0.023
Proportion  0.652   0.153   0.107   0.047   0.036   0.005
Cumulative  0.652   0.804   0.912   0.959   0.995   1.000

Eigenvectors
            PC1     PC2     PC3     PC4     PC5     PC6
Math        0.461   0.621  -0.088   0.168   0.267  -0.542
Chem        0.302  -0.059  -0.594   0.016  -0.740  -0.074
Phy         0.428   0.110  -0.365  -0.064   0.386   0.720
Bio         0.054  -0.666  -0.410   0.248   0.445  -0.355
Eco        -0.533   0.271  -0.526  -0.559   0.185  -0.140
Soc        -0.475   0.286  -0.248   0.771  -0.020   0.192

The eigenvalues sum to 4.634, the total variance of the six subjects, so this PCA is based on the covariance matrix of the grades. PC1 alone explains 65.2% of the total variance.

Partial Least Squares (PLS)

• Unlike PCA, the PLS technique works by successively extracting factors from both the predictive and the target variables such that the covariance between the extracted factors is maximized
• Decomposition step: $X = T W^{T} + E$, $Y = U V^{T} + F$
• Regression step: $Y = T B + D = X W B + D = X B_{PLS} + D$, with $B_{PLS} = W B$

Numerical Example

Student  Math  Chem  Phy  Bio  Eco  Soc  GPA
A        7     8     7    8    7    7    2.9
B        8     7     7    6    8    7    3.1
C        9     7     8    7    6    7    3.6
D        7     7     7    7    9    8    3.3
E        7     6     6    6    8    8    3.0
F        7     7     7    7    8    8    2.9
G        6     6     6    7    7    7    3.2
H        9     8     8    6    6    6    3.4
I        8     8     8    7    6    6    2.8
J        7     7     6    6    8    9    3.5

The table gives the same high school grades of the 10 students in the 6 subjects (scale 1-10), together with the corresponding GPA score at undergraduate level.

Objective: Can we use information on students' performance during high school to predict their GPA score when they enter undergraduate level?
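The PCA numbers in the numerical example can be checked with a few lines of plain Python. The sketch below assumes that the PCA was run on the sample covariance matrix (divisor n - 1), which is consistent with the eigenvalues summing to the total variance. It finds only the first principal component, by power iteration. The last function computes the first PLS weight vector in the single-response (PLS1) formulation, where it is proportional to the centered predictors times the centered response; the lecture does not say which PLS algorithm was used, so that part is an assumption. All function names are illustrative.

```python
import math

# Columns: Math, Chem, Phy, Bio, Eco, Soc; rows: students A..J (lecture table).
GRADES = [
    [7, 8, 7, 8, 7, 7], [8, 7, 7, 6, 8, 7], [9, 7, 8, 7, 6, 7],
    [7, 7, 7, 7, 9, 8], [7, 6, 6, 6, 8, 8], [7, 7, 7, 7, 8, 8],
    [6, 6, 6, 7, 7, 7], [9, 8, 8, 6, 6, 6], [8, 8, 8, 7, 6, 6],
    [7, 7, 6, 6, 8, 9],
]
GPA = [2.9, 3.1, 3.6, 3.3, 3.0, 2.9, 3.2, 3.4, 2.8, 3.5]


def center(rows):
    """Subtract the column mean from every entry."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix (divisor n - 1) of the columns."""
    xc = center(rows)
    n, p = len(xc), len(xc[0])
    return [[sum(r[i] * r[j] for r in xc) / (n - 1) for j in range(p)]
            for i in range(p)]


def leading_eigenpair(matrix, iterations=200):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    p = len(matrix)
    v = [1.0] + [0.0] * (p - 1)
    for _ in range(iterations):
        w = [sum(a * b for a, b in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(a * b for a, b in zip(row, v)) for row in matrix]
    return sum(a * b for a, b in zip(v, mv)), v


def first_pls_weight(rows, y):
    """First PLS1 weight vector (assumed variant): X_c^T y_c, unit length."""
    xc = center(rows)
    y_mean = sum(y) / len(y)
    yc = [val - y_mean for val in y]
    w = [sum(r[j] * yi for r, yi in zip(xc, yc)) for j in range(len(xc[0]))]
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w]


cov = covariance(GRADES)
total_variance = sum(cov[i][i] for i in range(len(cov)))
lambda1, pc1 = leading_eigenpair(cov)
pls_w1 = first_pls_weight(GRADES, GPA)

print("total variance:", round(total_variance, 3))
print("first eigenvalue:", round(lambda1, 3))
print("share explained by PC1:", round(lambda1 / total_variance, 3))
print("PC1 loadings:", [round(x, 3) for x in pc1])
print("first PLS weight:", [round(x, 3) for x in pls_w1])
```

The sign of an eigenvector is arbitrary, so the PC1 loadings may come out with all signs flipped relative to a statistics package. The difference between the two methods is visible in the code: `leading_eigenpair` never sees `GPA`, while `first_pls_weight` is built from it.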
K-means clustering

Source: "Clustering Challenges in Biological Networks", edited by S. Butenko et al.
Source: Teknomo, Kardi. K-Means Clustering Tutorials. http://people.revoledu.com/kardi/tutorial/kMean/

1. Initial value of centroids: Suppose we use medicine A and medicine B as the first centroids. Let c1 and c2 denote the coordinates of the centroids; then c1 = (1,1) and c2 = (2,1).

Protein function prediction using network concepts

The topology of a protein-protein interaction (PPI) network is informative, but further analysis can reveal additional information.

A popular assumption, which is true in many cases, is that proteins of similar function interact with each other.

Based on this assumption, we have developed methods to predict protein functions and protein complexes from PPI networks, mainly based on cluster analysis.

Cluster Analysis

Cluster analysis, also called data segmentation, means grouping or segmenting a collection of objects into subsets or "clusters", such that the objects within each cluster are more closely related to one another than to objects assigned to different clusters.

In the context of a graph, densely connected nodes are considered as clusters.

[Figure: visually we can detect two clusters in this graph]

K-cores of Protein-Protein Interaction Networks

Definition

Let a graph $G = (V, E)$ consist of a finite set of nodes $V$ and a finite set of edges $E$. A subgraph $S = (V', E')$, where $V' \subseteq V$ and $E' \subseteq E$, is a k-core or a core of order k of $G$ if and only if $\forall v \in V': \deg(v) \geq k$ within $S$, and $S$ is the maximal subgraph with this property.
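The definition suggests a simple way to find a k-core: keep deleting nodes whose degree inside the remaining subgraph is below k until no such node is left. The sketch below implements this pruning on a small graph invented for illustration (a 4-clique, a triangle and a pendant node); the function names and the example edges are assumptions, not taken from the lecture.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def k_core(adjacency, k):
    """Node set of the k-core, found by repeatedly deleting nodes of degree < k."""
    alive = {node: set(neigh) for node, neigh in adjacency.items()}
    changed = True
    while changed:
        changed = False
        low = [node for node, neigh in alive.items() if len(neigh) < k]
        for node in low:
            for other in alive.pop(node):
                if other in alive:
                    alive[other].discard(node)
            changed = True
    return set(alive)


def highest_core(adjacency):
    """Largest k for which the k-core is non-empty, with that core."""
    k, best = 0, set(adjacency)
    while True:
        core = k_core(adjacency, k + 1)
        if not core:
            return k, best
        k, best = k + 1, core


# Hypothetical example: 4-clique {1,2,3,4}, triangle {5,6,7}, pendant node 8.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),
         (4, 5), (5, 6), (5, 7), (6, 7), (7, 8)]
graph = build_adjacency(EDGES)

for k in (1, 2, 3, 4):
    print(k, sorted(k_core(graph, k)))
print("highest core:", highest_core(graph))
```

Deleting a node can lower the degree of its neighbours below k, which is why the pruning is repeated until nothing changes. In this example the triangle survives in the 2-core but disappears from the 3-core.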
Concept of a k-core graph

[Figure sequence: a graph G and its 1-core, 2-core and 3-core subgraphs]

• 1-core graph: the degree of every node is one or more
• 2-core graph: the degree of every node is two or more
• 3-core graph: the degree of every node is three or more

The 3-core is the highest k-core subgraph of the graph G.

Application of a k-core graph

"Analyzing protein-protein interaction data obtained from different sources", G. D. Bader and C. W. V. Hogue, Nature Biotechnology, Vol 20, 2002

Protein function prediction using k-core graphs

Introduction: Function prediction

Schwikowski, B., Uetz, P. and Fields, S. A network of protein-protein interactions in yeast. Nature Biotech. 18, 1257-1261 (2000)
• Deals with a network of 2039 proteins and 2709 interactions.
• 65% of interactions occurred between protein pairs with at least one common function.

Hishigaki, H., Nakai, K., Ono, T., Tanigami, A., and Takagi, T. Assessment of prediction accuracy of protein function from protein-protein interaction data. Yeast 18, 523-531 (2001)
• Reported similar results.

Hypothesis

Function-unknown proteins that form a densely connected subgraph with proteins of a particular function may belong to that functional group.

[Figure: unclassified proteins densely connected to proteins of Class A]

We utilize this concept by determining k-cores of strategically constructed sub-networks.

Prediction of Protein Functions Based on K-cores of Protein-Protein Interaction Networks

"Prediction of Protein Functions Based on K-cores of Protein-Protein Interaction Networks and Amino Acid Sequences", Md. Altaf-Ul-Amin, Kensaku Nishikata, Toshihiro Koma, Teppei Miyasato, Yoko Shinbo, Md. Arifuzzaman, Chieko Wada, Maki Maeda, Taku Oshima, Hirotada Mori, Shigehiko Kanaya. The 14th International Conference on Genome Informatics, December 14-17, 2003, Yokohama, Japan.
E. coli PPI network

• Total 3007 proteins and 11531 interactions
• Around 2000 are function-unknown proteins
• The highest k-core of this total graph is not so helpful

[Figure: the 10-core graph, the highest k-core of the E. coli PPI network]

We separate 1072 interactions (out of 11531) involving protein synthesis proteins and function-unknown proteins.

[Figure: 6-core graph of this sub-network, with nodes labelled P. S. and U. F.]

Function-unknown proteins of this 6-core graph are likely to be involved in protein synthesis.

Extending the k-core based function prediction method and its application to PPI data of Arabidopsis thaliana

"Protein Function Prediction based on k-cores of Interaction Networks", Norihiko Kamakura, Hiroki Takahashi, Kensuke Nakamura, Shigehiko Kanaya and Md. Altaf-Ul-Amin, Proceedings of 2010 International Conference on Bioinformatics and Biomedical Technology (ICBBT 2010)

Materials and Methods: Dataset

All PPI data of Arabidopsis thaliana
• 3118 interactions involving 1302 proteins.
• Collected from databases and scientific literature by our laboratory.

[Figure: the PPI network. Green = unknown proteins (289 proteins), pink = known proteins (1013 proteins)]

Materials and Methods: Dataset

Functional groups in the network

The PPI dataset contains proteins of 19 different functions according to the first level categories of the KNApSAcK database.
Function name                                             Number of proteins
CELL CYCLE AND DNA PROCESSING                              69
CELL FATE                                                   5
CELL RESCUE, DEFENSE AND VIRULENCE                         32
CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION MECHANISM      171
CONTROL OF CELLULAR ORGANIZATION                            3
DEVELOPMENT (Systemic)                                      9
ENERGY                                                     51
Endoplasmic reticulum biogenesis                            4
METABOLISM                                                120
Mitochondria biogenesis                                     4
PROTEIN ACTIVITY REGULATION                                 1
PROTEIN FATE (folding, modification, destination)         112
PROTEIN SYNTHESIS                                          20
REGULATION OF / INTERACTION WITH CELLULAR ENVIRONMENT       1
STORAGE PROTEIN                                             1
SYSTEMIC REGULATION OF / INTERACTION WITH ENVIRONMENT       2
TRANSCRIPTION                                             362
TRANSPORT FACILITATION                                     46
UNCLASSIFIED PROTEINS                                     289

Materials and Methods: Dataset

The trends of interactions in the context of functional similarity

Entry (i, j) is the number of interactions between proteins of function i and proteins of function j. Diagonal elements show the number of interactions between proteins of similar function.

No  Function name
 1  METABOLISM
 2  UNCLASSIFIED PROTEINS
 3  CELL RESCUE, DEFENSE AND VIRULENCE
 4  TRANSCRIPTION
 5  PROTEIN FATE (folding, modification, destination)
 6  DEVELOPMENT (Systemic)
 7  CELL FATE
 8  PROTEIN SYNTHESIS
 9  CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION MECHANISM
10  Mitochondria biogenesis
11  ENERGY
12  SYSTEMIC REGULATION OF / INTERACTION WITH ENVIRONMENT
13  CELL CYCLE AND DNA PROCESSING
14  TRANSPORT FACILITATION
15  CONTROL OF CELLULAR ORGANIZATION
16  REGULATION OF / INTERACTION WITH CELLULAR ENVIRONMENT
17  PROTEIN ACTIVITY REGULATION
18  STORAGE PROTEIN
19  Endoplasmic reticulum biogenesis

No    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
 1   72  23   1   9  10   0   1   0  67   0  29   0   4   3   0   0   0   0   0
 2   23  82  19 166 279   9   3   4 189   0  35   0  35  16   0   0   0   0   1
 3    1  19   9  15   7   0   0   0  38   0   1   0   3   4   0   0   0   0   0
 4    9 166  15 689  64   6   1   0 354   0   2   3  22   7   0   0   0   1   0
 5   10 279   7  64 137   0   9   2  20   0  22   2   7   5   0   0   0   0   0
 6    0   9   0   6   0   1   0   0   1   0   0   0   0   2   0   0   0   0   0
 7    1   3   0   1   9   0   1   0   2   0   0   0   0   1   0   0   0   0   0
 8    0   4   0   0   2   0   0  17   2   0   1   0   1   1   0   0   0   0   0
 9   67 189  38 354  20   1   2   2 374   0  24   0  35  11   0   0   1   1   0
10    0   0   0   0   0   0   0   0   0   3   0   0   0   0   0   0   0   0   0
11   29  35   1   2  22   0   0   1  24   0  64   0   3   8   0   0   0   0   0
12    0   0   0   3   2   0   0   0   0   0   0   0   0   0   0   0   0   0   0
13    4  35   3  22   7   0   0   1  35   0   3   0  44   2   2   0   0   0   0
14    3  16   4   7   5   2   1   1  11   0   8   0   2  17   0   2   0   0   3
15    0   0   0   0   0   0   0   0   0   0   0   0   2   0   1   0   0   0   0
16    0   0   0   0   0   0   0   0   0   0   0   0   0   2   0   0   0   0   0
17    0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
18    0   0   0   1   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
19    0   1   0   0   0   0   0   0   0   0   0   0   0   3   0   0   0   0   6

Materials and Methods: Flowchart of the method

1. Input: a PPI network
2. Make a sub-network corresponding to a functional group
3. Remove the components consisting of only unknown proteins
4. Determine k-cores and assign the corresponding function to the unknown proteins included in the k-cores (for k = 3 or more)
5. Output: predicted functions for some unknown proteins

Results: Subnetworks

[Table: subnetwork name and number of interactions]

In this work we do not consider the sub-networks that contain fewer than 100 interactions. Finally, we consider sub-networks corresponding to 9 functional classes.

Results: Subnetwork corresponding to cellular communication

As an example, here we show the sub-network and k-cores corresponding to cellular communication.

Subnetwork extraction

We extracted the following 3 types of interactions:
• Cellular communication - Cellular communication
• Cellular communication - Unknown
• Unknown - Unknown

Total: 603 interactions

Results: k-cores corresponding to cellular communication

[Figures: the 1-core, 2-core, 3-core, 4-core, 5-core, 6-core and 7-core of the cellular communication sub-network. Red nodes: known proteins. Green nodes: function-unknown proteins.]

These figures imply that determining k-cores in strategically constructed sub-networks can reveal which unknown proteins are densely connected to proteins of a particular functional class.
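The flowchart of the method can be sketched in plain Python as follows. The protein names, the edge list and the function labels are invented for illustration; `None` marks a function-unknown protein. The k-core is found by pruning low-degree nodes, which is one standard way to compute it and is an assumption about how the step was implemented.

```python
from collections import deque


def build_adjacency(edges):
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def k_core(adjacency, k):
    """Node set of the k-core (repeated deletion of nodes with degree < k)."""
    alive = {node: set(neigh) for node, neigh in adjacency.items()}
    changed = True
    while changed:
        changed = False
        for node in [n for n, neigh in alive.items() if len(neigh) < k]:
            for other in alive.pop(node):
                if other in alive:
                    alive[other].discard(node)
            changed = True
    return set(alive)


def components(adjacency):
    """Connected components as a list of node sets (breadth-first search)."""
    seen, result = set(), []
    for start in adjacency:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for other in adjacency[node]:
                if other not in comp:
                    comp.add(other)
                    queue.append(other)
        seen |= comp
        result.append(comp)
    return result


def predict(edges, function_of, target, k=3):
    """Unknown proteins that fall inside the k-core of the target sub-network."""
    unknown = {p for p, f in function_of.items() if f is None}
    allowed = unknown | {p for p, f in function_of.items() if f == target}
    # Step 2: keep target-target, target-unknown and unknown-unknown edges.
    adjacency = build_adjacency(
        [(u, v) for u, v in edges if u in allowed and v in allowed])
    # Step 3: drop components that contain only unknown proteins.
    for comp in components(adjacency):
        if comp <= unknown:
            for node in comp:
                del adjacency[node]
    # Step 4: unknown proteins inside the k-core receive the target function.
    return k_core(adjacency, k) & unknown


# Hypothetical toy data: K* have function "F", X1 has function "G".
FUNCTION_OF = {"K1": "F", "K2": "F", "K3": "F", "X1": "G", "U1": None,
               "U2": None, "U3": None, "U4": None, "U5": None, "U6": None}
EDGES = [("K1", "K2"), ("K1", "K3"), ("K2", "K3"),
         ("U1", "K1"), ("U1", "K2"), ("U1", "K3"), ("U2", "K1"),
         ("U3", "U4"), ("U3", "U5"), ("U3", "U6"),
         ("U4", "U5"), ("U4", "U6"), ("U5", "U6"),
         ("X1", "K1"), ("X1", "U1")]

predicted = predict(EDGES, FUNCTION_OF, target="F", k=3)
print(sorted(predicted))
```

In this toy network U3 to U6 form a dense clique of their own, but they are removed in step 3 because no known protein is connected to them. U2 touches only one known protein and falls out of the 3-core, so only U1 receives the function "F".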
Results: Function Predictions

The number of unknown genes included in different k-cores corresponding to different functional groups

k-core 2 cell_cycle 11 cell_rescue 4 cellular_communication k-core 3 k-core 4 k-core 5 k-core 6 k-core 7 k-core 8 7 37 33 23 15 12 8 energy 5 2 2 2 2 2 metabo 5 1 1 69 35 25 25 15 10 24 14 11 8 8 88 64 52 36 27 protein_fate protein_synthesis transcription transport_facilitation total 2 2 33 2 129 2

Results: Function Predictions

Prediction based on 2-cores, 3-cores and 4-cores

[Figure: 2-core, 3-core and 4-core panels. Legend: CELL CYCLE AND DNA PROCESSING; CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION; CELL RESCUE, DEFENSE AND VIRULENCE; ENERGY; METABOLISM; PROTEIN FATE (folding, modification, destination); PROTEIN SYNTHESIS; TRANSCRIPTION; TRANSPORT FACILITATION]

• 4-core: most proteins have been assigned unique functions.
• 3-core: most proteins have been assigned unique functions and some have been assigned multiple functions.

Assessment of Predictions

As most of the proteins with predicted functions are still unknown, their annotations do not contain clear information on their functions.

When k is much larger than one, the effect of false positives is greatly reduced.

However, to assess the predictions statistically, we constructed 1000 random graphs consisting of the same 1,302 proteins with 3,118 randomly inserted edges, and constructed sub-networks from them.

Assessment of Predictions

[Figure: box plots for Cell Cycle, Energy, Protein Synthesis, Cell Rescue, Metabolism, Transcription, Cellular Communication, Protein Fate and Transport]

The box plots show the distribution of k-cores with respect to their size in the 1000 random graphs corresponding to each sub-network, and the filled triangles show the size of the k-cores in the real PPI sub-networks.

Assessment of Predictions

• It can be concluded that the existence of higher-order k-cores in the PPI sub-networks, compared with random graphs of the same size, is likely to be due to interactions between proteins of similar function.
• Therefore we assume that function predictions based on k-cores with k greater than the highest possible value of k for the corresponding random graphs are statistically significant predictions.
• Based on this we predicted the functions of 67 proteins (the list is available online at http://kanaya.naist.jp/Kcore/supplementary/Function_prediction.xls).

"Prediction of Protein Functions Based on Protein-Protein Interaction Networks: A Min-Cut Approach", Md. Altaf-Ul-Amin, Toshihiro Koma, Ken Kurokawa, Shigehiko Kanaya, Proceedings of the Workshop on Biomedical Data Engineering (BMDE), Tokyo, Japan, pp. 37-43, April 3-4, 2005.

Outline
• Introduction
• The concept of Min-Cut
• Problem Formulation
• A Heuristic Method
• Evaluation of the Proposed Method
• Conclusions

Introduction

After the complete sequencing of several genomes, the challenging problem now is to determine the functions of proteins:
1) Determining protein functions experimentally
2) Using various computational methods based on
   a) sequence
   b) structure
   c) gene neighborhood
   d) gene fusions
   e) cellular localization
   f) protein-protein interactions

Introduction

The present work predicts protein functions based on the protein-protein interaction network.

For the purpose of prediction, we consider the interactions of
• function-unknown proteins with function-known proteins and
• function-unknown proteins with function-unknown proteins
in the context of the whole network.

Introduction

The majority of protein-protein interactions are between protein pairs of similar function.

Therefore, we assign function-unknown proteins to different functional groups in such a way that the number of inter-group interactions becomes the minimum.

Hence we call the proposed approach a Min-Cut approach.
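The quantity being minimised, the number of inter-group interactions (CUT), can be sketched for a toy network as follows. The network, the group labels and the function names are invented for illustration. The sketch assumes that every known protein belongs to exactly one group, and it leaves interactions between two known proteins out of the count, as the problem formulation later in the lecture states.

```python
def cut_size(edges, group_of_known, assignment):
    """Number of edges that join proteins placed in different groups.

    Edges between two known proteins are skipped: their groups are fixed,
    so they cannot be changed by any assignment of the unknown proteins.
    """
    label = dict(group_of_known)
    label.update(assignment)
    cut = 0
    for u, v in edges:
        touches_unknown = u in assignment or v in assignment
        if touches_unknown and label[u] != label[v]:
            cut += 1
    return cut


def majority_assignment(edges, group_of_known, unknown):
    """Give each unknown protein the group most common among its known neighbours."""
    votes = {u: {} for u in unknown}
    for a, b in edges:
        for x, y in ((a, b), (b, a)):
            if x in votes and y in group_of_known:
                group = group_of_known[y]
                votes[x][group] = votes[x].get(group, 0) + 1
    return {u: max(count, key=count.get) for u, count in votes.items()}


# Hypothetical toy network: K1, K2 in group G1; K3, K4 in group G2.
GROUP_OF_KNOWN = {"K1": "G1", "K2": "G1", "K3": "G2", "K4": "G2"}
EDGES = [("U1", "K1"), ("U1", "U2"), ("U1", "U3"),
         ("U2", "K3"), ("U2", "K4"), ("U2", "U3"), ("U3", "K3"),
         ("K1", "K2"), ("K3", "K4"), ("K2", "K3")]

by_majority = majority_assignment(EDGES, GROUP_OF_KNOWN, ["U1", "U2", "U3"])
alternative = {"U1": "G2", "U2": "G2", "U3": "G2"}

print(by_majority, cut_size(EDGES, GROUP_OF_KNOWN, by_majority))
print(alternative, cut_size(EDGES, GROUP_OF_KNOWN, alternative))
```

Voting by known neighbours alone puts U1 in G1 and cuts its two links to U2 and U3. Moving U1 to G2 cuts only its single link to K1, so an assignment that looks at the whole network can have a smaller CUT than the neighbour-majority one.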
The concept of Min-Cut

[Figure: a typical small network of known proteins (K) and unknown proteins (U1 to U4), with the known proteins divided into two groups G1 and G2]

[Figure: unknown proteins assigned to known groups based on majority interactions. Number of CUT = 4]

[Figure: an alternative assignment of unknown proteins. Number of CUT = 2]

For every assignment of unknown proteins, there is a value of CUT. The Min-Cut approach looks for an assignment for which the number of CUT is minimum.

Problem Formulation

Let $G_1, G_2, \dots, G_n$ be n sets/groups of function-known proteins such that all proteins of a group are of similar function. Multiple-function proteins are members of more than one group. Therefore, the set of all function-known proteins is $G = \bigcup_{k=1}^{n} G_k$.

The set of function-unknown proteins is denoted by $U$.

$N(V, E)$ is a graph/network where $v_i \in V$ is a node representing a protein and $e_{ij} = (v_i, v_j) \in E$ is an edge representing…….

Here we explain some points with a typical example.

Problem Formulation

[Figure: example network N(V, E) with known proteins K1 to K10 in three groups G1, G2, G3 and unknown proteins U1 to U8]

V = set of all nodes
E = set of all edges
G = {K1, K2, K3, K4, K5, K6, K7, K8, K9, K10}
U = {U1, U2, U3, U4, U5, U6, U7, U8}

Problem Formulation

We generate $U' \subseteq U$ such that each protein of $U'$ is connected in $N$ with at least one protein of group $G$ by a path of length 1 or length 2.
[Figure: the example network; U8 has no path of length 1 or 2 to any known protein]

U´ = {U1, U2, U3, U4, U5, U6, U7}

Problem Formulation

We can assign proteins of U´ to different groups and calculate CUT.

Interactions between known protein pairs can never be part of CUT.

[Figure: one assignment of the unknown proteins in the example network. For this assignment of unknown proteins, CUT = 6]

Problem Formulation

The problem we are trying to solve is to assign the proteins of set $U'$ to the known groups $G_1, G_2, \dots, G_n$ in such a way that the CUT becomes the minimum.

A Heuristic Method

• The problem at hand is a variant of the network partitioning problem.
• It is known that network partitioning problems are NP-hard.
• Therefore, we resort to heuristics to find a solution that is as good as possible.

A Heuristic Method

Flowchart:
1. min_cut = |E|, iteration = 0
2. Make a table for each protein of U´ containing a maximum of 3 IDs of its respective priority groups
3. Assign each protein of U´ to some randomly or intentionally chosen group from among its priority groups
4. Calculate CUT
5. If CUT < min_cut: set min_cut = CUT and record the current assignment
6. iteration = iteration + 1
7. If iteration < max_value, go back to step 3; otherwise print min_cut and the corresponding assignment, and exit

A Heuristic Method

The priority groups of each unknown protein are determined from its paths of length 1 and length 2 to the known groups in the example network (an "x" marks an empty slot).

U1 has one path of length 1 with G2 and two paths of length 2 with G1:

Protein  Priority groups
U1       G2  G1  x

U4 has two paths of length 1 with G1, one path of length 1 with G2 and one path of length 2 with G3:

Protein  Priority groups
U1       G2  G1  x
U2       G2  G1  x
U3       G2  G1  x
U4       G1  G2  G3
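The construction of U´ and of the priority table can be sketched as follows. The slides do not spell out the ranking rule, so the sketch assumes that groups are ranked first by the number of length-1 paths and then by the number of length-2 paths, which reproduces the order given for U1. The toy network below is invented so that U1 has one path of length 1 to G2 and two paths of length 2 to G1; it is not the network of the figure.

```python
def build_adjacency(edges):
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def path_counts(adjacency, group_of_known, protein):
    """For each known group: [number of length-1 paths, number of length-2 paths]."""
    counts = {}
    for mid in adjacency.get(protein, ()):
        if mid in group_of_known:
            counts.setdefault(group_of_known[mid], [0, 0])[0] += 1
        for end in adjacency[mid]:
            if end != protein and end in group_of_known:
                counts.setdefault(group_of_known[end], [0, 0])[1] += 1
    return counts


def priority_groups(counts, limit=3):
    """Assumed rule: more length-1 paths first, then more length-2 paths."""
    ranked = sorted(counts, key=lambda g: (-counts[g][0], -counts[g][1], g))
    return ranked[:limit]


# Hypothetical toy network (not the one in the lecture figure).
GROUP_OF_KNOWN = {"K3": "G1", "K4": "G1", "K5": "G2"}
UNKNOWN = ["U1", "U2", "U9", "U10"]
EDGES = [("U1", "K5"), ("U1", "U2"), ("U2", "K3"), ("U2", "K4"),
         ("U9", "U10")]

adjacency = build_adjacency(EDGES)
table = {}
for protein in UNKNOWN:
    counts = path_counts(adjacency, GROUP_OF_KNOWN, protein)
    if counts:  # reachable from a known protein within 2 steps: member of U'
        table[protein] = priority_groups(counts)

u_prime = set(table)
print(sorted(u_prime))
print(table)
```

U9 and U10 are connected only to each other, so they have no path of length 1 or 2 to a known protein and are left out of U´, in the same way that U8 is left out in the lecture example.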
A Heuristic Method

The complete priority table for the example network:

Protein  Priority groups
U1       G2  G1  x
U2       G2  G1  x
U3       G2  G1  x
U4       G1  G2  G3
U5       G1  G2  G3
U6       G1  G3  G2
U7       G3  G2  x

Each iteration of the heuristic assigns every protein of U´ to one of its priority groups and calculates CUT.

[Figure: by assigning all the unknown proteins to their respective highest priority groups, CUT = 6]

[Figure: for another assignment of unknown proteins, CUT = 7]

[Figure: for a third assignment of unknown proteins, CUT = 4]

Evaluation of the Proposed Approach

• The proposed method is a general one and can be applied to any organism and any type of functional classification.
• Here we applied it to the protein-protein interaction network of the yeast Saccharomyces cerevisiae.
• We obtained the protein-protein interaction data from ftp://ftpmips.gsf.de/yeast/PPI/, which contains 15613 genetic and physical interactions.
Evaluation of the Proposed Approach

We discard self-interactions and extract a set of 12487 unique binary interactions involving 4648 proteins.

YAR019c  YMR001c
YAR019c  YNL098c
YAR019c  YOR101w
YAR019c  YPR111w
YAR027w  YAR030c
YAR027w  YBR135w
YAR031w  YBR217w
-------  -------
Total 12487 pairs

Evaluation of the Proposed Approach

[Figure: a network of 12487 interactions and 4648 proteins is reasonably big]

Evaluation of the Proposed Approach

We collect the classification data from http://mips.gsf.de/genre/proj/yeast/index.jsp

Table 1

Name of functional class                                                         # of proteins
METABOLISM                                                                        984
ENERGY                                                                            260
CELL CYCLE AND DNA PROCESSING                                                     690
TRANSCRIPTION                                                                     842
PROTEIN SYNTHESIS                                                                 381
PROTEIN FATE (folding, modification, destination)                                 631
PROTEIN WITH BINDING FUNCTION OR COFACTOR REQUIREMENT (structural or catalytic)    39
PROTEIN ACTIVITY REGULATION                                                        27
CELLULAR TRANSPORT, TRANSPORT FACILITATION AND TRANSPORT ROUTES                   719
CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION MECHANISM                               94
CELL RESCUE, DEFENSE AND VIRULENCE                                                296
INTERACTION WITH THE CELLULAR ENVIRONMENT                                         336
TRANSPOSABLE ELEMENTS, VIRAL AND PLASMID PROTEINS                                 118
BIOGENESIS OF CELLULAR COMPONENTS                                                 451
CELL TYPE DIFFERENTIATION                                                         339

• The proposed approach is intended to predict the functions of function-unknown proteins.
• However, when the functions of truly function-unknown proteins are predicted, it is not possible to determine the correctness of the predictions.
• We therefore treat around 10% randomly selected proteins of each group of Table 1 as function-unknown proteins.
• The union of the 10% taken from all groups consists of 604 proteins. This is the unknown group U.
• The union of the remaining 90% of each functional group constitutes the set of known proteins G. There are 3783 proteins in total in G.
• We generate $U' \subseteq U$ such that each protein of $U'$ is connected in $N$ with at least one protein of group $G$ by a path of length 1 or length 2. There are 470 proteins in $U'$.
• We predicted the functions of these 470 proteins using the proposed method.

Evaluation of the Proposed Approach

We applied the heuristic algorithm (the flowchart given under "A Heuristic Method") with max_value = 50000 to predict the functions of the 470 proteins.
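The loop of the flowchart can be sketched as a random search. The network, the priority table and the small `max_value` below are invented for illustration, and the sketch only draws random assignments; the "intentionally chosen" option mentioned in the flowchart is not modelled. Each known protein is assumed to belong to a single group.

```python
import random


def cut_size(edges, group_of_known, assignment):
    """Inter-group edges that involve at least one unknown protein."""
    label = dict(group_of_known)
    label.update(assignment)
    return sum(1 for u, v in edges
               if (u in assignment or v in assignment) and label[u] != label[v])


def min_cut_search(edges, group_of_known, priority, max_value, seed=0):
    """Random search over assignments drawn from the priority groups."""
    rng = random.Random(seed)          # fixed seed so that runs are repeatable
    min_cut = len(edges)               # min_cut = |E|
    best = None
    iteration = 0
    while iteration < max_value:
        # Assign each unknown protein to one of its priority groups.
        assignment = {u: rng.choice(groups) for u, groups in priority.items()}
        cut = cut_size(edges, group_of_known, assignment)
        if cut < min_cut:              # keep the best assignment seen so far
            min_cut, best = cut, assignment
        iteration += 1
    return min_cut, best


# Hypothetical toy problem: K1, K2 in G1; K3, K4 in G2.
GROUP_OF_KNOWN = {"K1": "G1", "K2": "G1", "K3": "G2", "K4": "G2"}
EDGES = [("U1", "K1"), ("U1", "K2"), ("U1", "U2"), ("U2", "K3"),
         ("U2", "U3"), ("U3", "K3"), ("U3", "K4"),
         ("K1", "K2"), ("K3", "K4"), ("K2", "K3")]
PRIORITY = {"U1": ["G1", "G2"], "U2": ["G2", "G1"], "U3": ["G2"]}

min_cut, best = min_cut_search(EDGES, GROUP_OF_KNOWN, PRIORITY, max_value=200)
print(min_cut, best)
```

Restricting each protein to at most 3 priority groups keeps the search space small: the toy problem has only 4 possible assignments, while 470 proteins with up to 3 choices each cannot be enumerated, which is why a fixed number of random trials is used.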
Evaluation of the Proposed Approach

• We cannot guarantee that the minimum CUT corresponds to the maximum number of successful predictions.
• However, the trend of the results in the figure shows that, very likely, the lower the value of CUT, the greater the number of successful predictions.

Evaluation of the Proposed Approach

We then examine the relation between successful predictions and the degrees of the proteins in the network.

[Figure: the example network. Degree of U4 = 7, degree of U7 = 3]

Evaluation of the Proposed Approach

Degree  Number of proteins  Successful predictions  Percentage
1       128                 39                      30.46
2       80                  39                      48.75
3       60                  32                      53.33
4       33                  24                      72.72
5       23                  15                      65.21
6       24                  14                      58.33
7       17                  12                      70.58
>7      105                 71                      67.61
Total   470                 246                     52.34

[Figure: success percentage plotted against degree]

• The success rate of prediction is as low as 30.46% for proteins that have a degree of only one in the interaction network.
• However, it is 67.61% for proteins that have a degree of 8 or more.
• This implies that the reliability of the prediction can be improved by providing a reasonable amount of interaction information.

Centrality measures of nodes

Centrality measures

Within graph theory and network analysis, there are various measures of the centrality of a vertex that determine the relative importance of the vertex within the graph.

We will discuss the following centrality measures:
• Degree centrality
• Betweenness centrality
• Closeness centrality
• Eigenvector centrality
• Subgraph centrality

Degree centrality

Degree centrality is defined as the number of links incident upon a node, i.e. the degree of the node.

Degree centrality is often interpreted in terms of the immediate risk of the node for catching whatever is flowing through the network (such as a virus, or some information).
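Degree centrality follows directly from an edge list. The six-node graph below is a hypothetical example, chosen to be consistent with the betweenness calculation for node c given later in the lecture; it is not claimed to be the graph of the figure.

```python
from collections import Counter


def degree_centrality(edges):
    """Number of edges incident on each node of an undirected graph."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)


# Hypothetical graph: c is a hub joining the pair {a, b} to the triangle {d, e, f}.
EDGES = [("a", "b"), ("a", "c"), ("b", "c"),
         ("c", "d"), ("c", "e"), ("c", "f"),
         ("d", "e"), ("d", "f"), ("e", "f")]

degrees = degree_centrality(EDGES)
print(sorted(degrees.items()))
```

Node c has the highest degree centrality because it touches five of the nine edges.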
[Figure: the degree centrality of the blue nodes is higher]

Betweenness centrality

The vertex betweenness centrality BC(v) of a vertex v is defined as follows:

$BC(v) = \sum_{u \neq v \neq w} \frac{\sigma_{uw}(v)}{\sigma_{uw}}$

Here $\sigma_{uw}$ is the total number of shortest paths between nodes u and w, and $\sigma_{uw}(v)$ is the number of shortest paths between nodes u and w that pass through node v.

Vertices that occur on many shortest paths between other vertices have higher betweenness than those that do not.

Betweenness centrality

[Figure: example graph with nodes a, b, c, d, e, f]

Calculation for node c:

Pair (u,w)  σuw  σuw(c)  σuw(c)/σuw
(a,b)       1    0       0
(a,d)       1    1       1
(a,e)       1    1       1
(a,f)       1    1       1
(b,d)       1    1       1
(b,e)       1    1       1
(b,f)       1    1       1
(d,e)       1    0       0
(d,f)       1    0       0
(e,f)       1    0       0

Betweenness centrality of node c = 6
Betweenness centrality of node a = 0

Betweenness centrality

• Nodes of high betweenness centrality are important for transport.
• If they are blocked, transport becomes less efficient; on the other hand, if their capacity is improved, transport becomes more efficient.
• Edge betweenness is calculated using a similar concept.

[Figure: hue (from red = 0 to blue = max) shows the node betweenness. http://en.wikipedia.org/wiki/Betweenness_centrality#betweenness]

Closeness centrality

The farness of a vertex is the sum of the shortest-path distances from the vertex to every other vertex in the graph. The reciprocal of farness is the closeness centrality (CC):

$CC(v) = \frac{1}{\sum_{t \in V \setminus v} d(v, t)}$

Here, d(v,t) is the shortest distance between vertex v and vertex t.

Closeness centrality can be viewed as the efficiency of a vertex in spreading information to all other vertices.

Eigenvector centrality

Let A be the adjacency matrix of a graph, λ the largest eigenvalue of A and x the corresponding eigenvector. Then

$A x = \lambda x$ -----(1)

where A is N×N and x is N×1. The eigenvalues are obtained from $|A - \lambda I| = 0$, where I is an identity matrix.

The ith component of the eigenvector x then gives the eigenvector centrality score of the ith node in the network.

From (1):

$x_i = \frac{1}{\lambda} \sum_{j=1}^{N} A_{i,j} \, x_j$

• Therefore, for any node, the eigenvector centrality score is proportional to the sum of the scores of all nodes which are connected to it.
• Consequently, a node has a high value of EC either if it is connected to many other nodes or if it is connected to others that themselves have high EC.

Subgraph centrality

The number of closed walks of length k starting and ending on vertex i in the network is given by the local spectral moments $\mu_k(i)$, which are simply defined as the ith diagonal entry of the kth power of the adjacency matrix A:

$\mu_k(i) = (A^k)_{ii}$

Source: Subgraph Centrality in Complex Networks, Physical Review E 71, 056103 (2005)

Closed walks can be trivial or nontrivial and are directly related to the subgraphs of the network.

Subgraph centrality

Adjacency matrix M of an example network with 14 nodes:

M =
0 1 0 0 0 0 0 0 0 0 0 0 0 0
1 0 1 1 0 1 0 0 0 0 0 0 0 0
0 1 0 1 1 1 0 0 0 0 0 0 0 0
0 1 1 0 1 1 0 1 0 0 0 0 0 0
0 0 1 1 0 1 0 0 0 0 0 0 0 0
0 1 1 1 1 0 1 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 1 0 0 0
0 0 0 1 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 1 0 1 0 0 1 1
0 0 0 0 0 0 0 0 1 0 1 0 1 1
0 0 0 0 0 0 1 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 1 1 0 1 0 1
0 0 0 0 0 0 0 0 1 1 0 0 1 0

$M_{uv} = 1$ if there is an edge between nodes u and v, and 0 otherwise.

Subgraph centrality

M² =
1 0 1 1 0 1 0 0 0 0 0 0 0 0
0 4 2 2 3 2 1 1 0 0 0 0 0 0
1 2 4 3 2 3 1 1 0 0 0 0 0 0
1 2 3 5 2 3 1 0 1 0 0 0 0 0
0 3 2 2 3 2 1 1 0 0 0 0 0 0
1 2 3 3 2 5 0 1 0 0 1 0 0 0
0 1 1 1 1 0 2 0 0 1 0 0 0 0
0 1 1 0 1 1 0 2 0 1 0 0 1 1
0 0 0 1 0 0 0 0 4 2 1 1 2 2
0 0 0 0 0 0 1 1 2 4 0 1 2 2
0 0 0 0 0 1 0 0 1 0 2 0 1 1
0 0 0 0 0 0 0 0 1 1 0 1 0 1
0 0 0 0 0 0 0 1 2 2 1 0 4 2
0 0 0 0 0 0 0 1 2 2 1 1 2 3

$(M^2)_{uv}$ for $u \neq v$ represents the number of common neighbors of the nodes u and v. The diagonal entry $(M^2)_{uu}$ is the local spectral moment $\mu_2(u)$, which equals the degree of node u.

Subgraph centrality

The subgraph centrality of the node i is given by

$SC(i) = \sum_{k=0}^{\infty} \frac{\mu_k(i)}{k!}$

Let λ be the main eigenvalue of the adjacency matrix A. It can be shown that $\mu_k(i) \leq \lambda^k$, so that

$\sum_{k=0}^{\infty} \frac{\mu_k(i)}{k!} \leq \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = e^{\lambda}$

Thus, the subgraph centrality of any vertex i is bounded above by $SC(i) \leq e^{\lambda}$.

[Table 2. Summary of results of eight real-world complex networks.]

Exploration of biological network centralities with CentiBiN. Björn H Junker, Dirk Koschützki and Falk Schreiber. Department of Molecular Genetics, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Corrensstr. 3, 06466 Gatersleben, Germany. BMC Bioinformatics 2006, 7:219. doi:10.1186/1471-2105-7-219
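The betweenness, closeness, eigenvector and subgraph centralities discussed in this lecture can be computed for a small graph with the standard library alone. The six-node graph below is a hypothetical one, chosen so that it agrees with the betweenness table for node c; the real figure may differ. The subgraph centrality series is truncated after 40 terms, and the power iteration is run on A + I, which has the same eigenvectors as A; both are implementation choices of this sketch.

```python
import math
from collections import deque

NODES = ["a", "b", "c", "d", "e", "f"]
EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("c", "e"),
         ("c", "f"), ("d", "e"), ("d", "f"), ("e", "f")]
N = len(NODES)
INDEX = {name: i for i, name in enumerate(NODES)}
A = [[0] * N for _ in range(N)]
for u, v in EDGES:
    A[INDEX[u]][INDEX[v]] = A[INDEX[v]][INDEX[u]] = 1


def bfs(source):
    """Shortest distances from source and number of shortest paths (sigma)."""
    dist, sigma = [None] * N, [0] * N
    dist[source], sigma[source] = 0, 1
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in range(N):
            if A[u][w]:
                if dist[w] is None:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
    return dist, sigma


INFO = [bfs(s) for s in range(N)]


def betweenness(v):
    """Sum over unordered pairs (u, w) of sigma_uw(v) / sigma_uw."""
    total = 0.0
    for u in range(N):
        for w in range(u + 1, N):
            if v in (u, w) or INFO[u][0][w] is None or INFO[u][0][v] is None:
                continue
            if INFO[u][0][v] + INFO[v][0][w] == INFO[u][0][w]:
                total += INFO[u][1][v] * INFO[v][1][w] / INFO[u][1][w]
    return total


def closeness(v):
    """Reciprocal of farness (the graph is assumed to be connected)."""
    return 1.0 / sum(d for t, d in enumerate(INFO[v][0]) if t != v)


def eigenvector_centrality(iterations=200):
    """Power iteration on A + I; returns (largest eigenvalue of A, scores)."""
    x = [1.0] * N
    for _ in range(iterations):
        y = [x[i] + sum(A[i][j] * x[j] for j in range(N)) for i in range(N)]
        norm = math.sqrt(sum(val * val for val in y))
        x = [val / norm for val in y]
    lam = sum(x[i] * sum(A[i][j] * x[j] for j in range(N)) for i in range(N))
    return lam, x


def subgraph_centrality(terms=40):
    """SC(i) = sum over k of (A^k)_ii / k!, truncated after `terms` terms."""
    power = [[int(i == j) for j in range(N)] for i in range(N)]  # A^0
    sc = [0.0] * N
    for k in range(terms):
        for i in range(N):
            sc[i] += power[i][i] / math.factorial(k)
        power = [[sum(power[i][m] * A[m][j] for m in range(N))
                  for j in range(N)] for i in range(N)]
    return sc


bc = {name: betweenness(INDEX[name]) for name in NODES}
cc = {name: closeness(INDEX[name]) for name in NODES}
lam, ec = eigenvector_centrality()
sc = subgraph_centrality()
print("betweenness:", bc)
print("closeness:", cc)
print("eigenvector:", [round(val, 3) for val in ec])
print("subgraph:", [round(val, 3) for val in sc], "bound:", round(math.exp(lam), 3))
```

Node c comes out on top for all four measures in this graph, which is expected for a hub that joins two otherwise separate groups. On larger networks the measures can rank nodes differently, which is the reason for having several of them.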