Integrated Mining of PPI Networks: A Case for Ensemble Clustering
Srinivasan Parthasarathy
Department of Computer Science and Engineering
The Ohio State University
Joint work with Sitaram Asur and Duygu Ucar
Copyright 2006, Data Mining Research Laboratory

I. Preliminaries and Motivation

Proteins
• Central component of cell machinery and life
  – It is the proteins dynamically generated by a cell that execute the genetic program [Kahn 1995]
• Proteins work with other proteins [Von Mering et al 2002]
  – Form large interaction networks, typically referred to as protein-protein interaction (PPI) networks
  – Regulate and support each other for specific functionality or process

Protein-Protein Interaction Networks
• Why analyze?
  – To fully understand cellular machinery, simply listing proteins is not enough
  – (Clusters of) interactions need to be delineated as well [v.Mering 2002]
• Understanding the organism
  – Protein function prediction
    • E.g. no functional annotations for one-third of baker’s yeast
  – Drug design
• Goal: To find modular clusters

Challenges in analyzing PPI Networks
– Noisy data
  • False positives [Deane 2002], false negatives [Hsu 06]
– Existence of hub nodes
  • Particularly problematic for standard clustering and graph partitioning algorithms; they lead to very large core clusters and not much else!
– Proteins can be multi-faceted
  • Can belong to multiple functional groups; most clustering algorithms are hard, hence the need for soft or fuzzy clustering
– Data integration issues
  • Multiple sources: 2-Hybrid (Y2H), Mass Spectrometry, genetic co-occurrence
  • Different targets
    – Y2H, Mass Spec: target binding
    – Gene co-occurrence: target functional
  • Different weaknesses (missing certain interactions)
    – Y2H: translation
    – Mass spectrometry: transport & sensing

Ensemble Clustering
• A useful approach to combine the results from multiple clustering arrangements into a single arrangement based on consensus [SG03]
• Objective: Mapping between clusters obtained by different algorithms to a single clustering arrangement
• Our hypothesis: Potentially offers a viable solution for all of these problems simultaneously
  – Given the nice theory for ensembles in the context of classification, it is likely to be particularly useful in a noisy environment.
    • A weak analogy: the audience vote in Millionaire
  – Naturally handles arrangements produced from different sources or domain-driven segmentation.

Ensemble Clustering on PPI networks: Key Questions
• What are the base clustering methods and arrangements to use in the context of interaction networks?
  – How to handle the influence of noise and hubs?
• How do we scale to problems the size of interaction networks?
• How do we address the issue of soft clustering?
• How to address the issue of data integration?
  – Another day, another time

II. Ensemble Clustering Framework

Birds-eye-view (coarse grained)
[Diagram: a scale-free graph is weighted by x topology-based similarity metrics and partitioned by y clustering algorithms, giving xy base clustering arrangements; (soft) consensus clustering over a cluster representation yields the final clusters]

Similarity Metrics
• Central to any clustering algorithm
• Key idea:
  – Leverage topological information to determine the similarity between two proteins in the interaction network
  – With the ensemble approach we are not limited to one!
• Metrics:
  – Clustering coefficient based (edge oriented, local)
  – Edge betweenness based (edge oriented, global)
  – Neighborhood based (local, non-edge oriented)

Clustering coefficient-based similarity
• Clustering coefficient
  – "all-my-friends-know-each-other" property
  – Measures the interconnectivity of a node’s neighbors
[Figure: example graph with nodes vi and vj]
• Clustering coefficient-based similarity of two connected nodes vi and vj
  – Measures the contribution of the edge between the nodes towards the clustering coefficient of the nodes

Edge betweenness-based similarity
• Shortest path edge betweenness [Newman et al]
  – “I-am-between-every-pair” property
  – Computes the fraction of shortest paths passing through an edge
  – Edges that lie between communities have high values of betweenness
  – Edge betweenness-based similarity
[Figure: example graph with nodes 1 to 8]

Neighborhood-based similarity
• “my-friends-are-your-friends” property
• Based on the number of common neighbors between nodes (Czekanowski-Dice metric [Brun et al, 2004])
  – In the metric, Int(i) = number of neighbors of node i
[Figure: example graph with nodes 1 to 6]

Base Clustering
• Base clustering algorithms: different criteria
  – kMetis
  – Repeated bisections
  – Direct k-way partitioning
• Topology-based similarity measures: weight interactions
  – Clustering coefficient-based: local, targets false positives (FP)
  – Edge betweenness-based: global, targets FP
  – Neighborhood: local, potentially targets false negatives (FN) & FP
• 3 × 3 = 9 arrangements (variance is good!)
  – K clusters per arrangement (K clusters)

PCA-based Consensus Technique
• Cluster Purification
• Dimensionality Reduction
• Consensus Clustering

Cluster Purification
• Goal: Prune unreliable base clusters
• Intra-cluster similarity measure, in which SP(i,j) represents the shortest path between i and j
• Low intra-cluster distance => high reliability
• Remove clusters with low reliability

Dimensionality Reduction
• Cluster membership matrix to represent pruned base clusters
• Dimensions likely to be high (9 × k)
• Clustering is inefficient for high-dimensional data
  – Distance metric computations do not scale well
• Lot of noise and redundancy in the matrix
• Solution: Reduce dimensions of the matrix
  – Apply logistic PCA
  – Variant of PCA for binary data (Schein et al, 2003)

Consensus Clustering
• Agglomerative Hierarchical Clustering
  – Bottom-up clustering algorithm
  – Begin with each point in a separate cluster
  – Iteratively merge clusters that are similar
• Recursive Bisection (RBR) algorithm
• Soft Clustering Variants
  – Find initial clusters using agglo or RBR
  – Assign points to multiple clusters based on similarity
  – Hub nodes have high propensity for multiple membership

Ensemble Framework (Detailed View)
[Diagram labels: Topological Metrics, Weights, Weighted Graph, Base Clustering, Base clustering arrangements, Cluster Purification, Pruning, Consensus Clustering, Principal Component Analysis, Agglomerative Clustering, Soft, PCA-agglo, PCA-rbr, PCA-softvariants, Final clusters]

III. Evaluation

Validation Metrics: Domain Independent
• Topological measure: Modularity [Newman&Girvan04]
  – Measures the modularity within clusters
  – dij represents the fraction of edges linking nodes in clusters i and j
• Information theoretic measure: Normalized Mutual Information [Strehl & Ghosh03]
  – Measures the shared information between the consensus and base clustering arrangements

Validation Metric: Domain Dependent
• Domain-based measure:
  – Gene ontology annotations for each cluster of proteins
    • Cellular Component
    • Molecular Function
    • Biological Process
  – P-value to measure statistical significance of clusters
    • Computes the probability of the grouping being random
    • Smaller p-values represent higher biological significance
  – Clustering Score to measure overall clustering arrangement

Experimental Setup
• Algorithms proposed by Strehl et al, 2003
  – HyperGraph Partitioning Algorithm (HGPA)
    • Minimal hyperedge separator using HMetis
  – Meta-CLustering Algorithm (MCLA)
    • Group related hyperedges to form meta-clusters
    • Assign each point to the closest meta-cluster
  – Cluster-based Similarity Partitioning Algorithm (CSPA)
    • Pairwise similarity matrix is partitioned with METIS
• Algorithms proposed by Gionis et al, ICDE 2005
  – Agglomerative algorithm (CE-agglo)
  – Density-based clustering algorithm (CE-balls)
  – Use strict thresholds and are non-parametric
• Database of Interacting Proteins (DIP)
  – 4928 proteins, 17194 interactions

Modularity and NMI
• CSPA algorithm ran out of memory
• CE-agglo and CE-balls algorithms resulted in pairs and singleton clusters (cluster-sizes 2121 and 2783 respectively)

Algorithm    Modularity   NMI
PCA-agglo    0.471        0.66
PCA-rbr      0.46         0.656
MCLA         0.41         0.614
HGPA         0.1          0.275

PCA-based consensus methods provide the best scores!
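The two domain-independent measures reported above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the reported numbers. It assumes the standard Newman-Girvan form Q = sum_i (d_ii - a_i^2) with a_i = sum_j d_ij, and the Strehl and Ghosh normalization NMI = I(X;Y) / sqrt(H(X) H(Y)). Averaging NMI over the base arrangements is also an assumption, as are the function names and the list-based label encoding.

```python
import math
from collections import Counter


def modularity(edges, labels):
    """Newman-Girvan modularity for an undirected, unweighted graph.

    edges  : list of (u, v) node-index pairs
    labels : labels[u] is the cluster of node u (hard clustering assumed)
    """
    m = len(edges)
    d_in = Counter()  # d_ii: fraction of edges with both ends in cluster i
    a = Counter()     # a_i: fraction of edge ends attached to cluster i
    for u, v in edges:
        cu, cv = labels[u], labels[v]
        if cu == cv:
            d_in[cu] += 1.0 / m
        a[cu] += 0.5 / m
        a[cv] += 0.5 / m
    return sum(d_in[c] - a[c] ** 2 for c in a)


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(x, y):
    """Normalized mutual information between two hard clusterings."""
    n = len(x)
    cx, cy = Counter(x), Counter(y)
    joint = Counter(zip(x, y))
    mi = sum((c / n) * math.log(c * n / (cx[p] * cy[q]))
             for (p, q), c in joint.items())
    hx, hy = entropy(x), entropy(y)
    if hx == 0 or hy == 0:
        return 0.0  # a single-cluster arrangement carries no information
    return mi / math.sqrt(hx * hy)


def average_nmi(consensus, base_arrangements):
    """Mean NMI between the consensus and each base arrangement."""
    scores = [nmi(consensus, base) for base in base_arrangements]
    return sum(scores) / len(scores)
```

As a sanity check, two triangles joined by a single edge, with one cluster per triangle, give a modularity of 5/14 (about 0.357), while placing all six nodes in one cluster gives 0.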
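For the domain-dependent validation, the slides do not say which statistical test produces the p-value. A common choice for Gene Ontology term enrichment is the hypergeometric tail probability, so the sketch below assumes that test. The function name and all parameter names are hypothetical.

```python
from math import comb


def go_enrichment_pvalue(population_size, annotated_in_population,
                         cluster_size, annotated_in_cluster):
    """Hypergeometric tail P(X >= annotated_in_cluster).

    X is the number of proteins carrying a given GO term in a random
    draw of cluster_size proteins from the whole network. A small
    value means the cluster is unlikely to be a random grouping.
    """
    upper = min(cluster_size, annotated_in_population)
    total = comb(population_size, cluster_size)
    tail = sum(
        comb(annotated_in_population, i)
        * comb(population_size - annotated_in_population, cluster_size - i)
        for i in range(annotated_in_cluster, upper + 1)
    )
    return tail / total
```

For a toy network of 10 proteins, 4 of which carry the term, a cluster of 3 proteins containing at least 2 annotated ones has p = 40/120 = 1/3. On the DIP network the population size would be 4928, and the annotation counts would come from the GO annotation files.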

Comparison with Ensemble Algorithms
[Figure: bar chart of Clustering Score (axis 0 to 1) for Process, Function and Component across the ensemble algorithms CE-balls, CE-agglo, HGPA, PCA-agglo, PCA-rbr, MCLA and Wt-agglo]
• PCA-based consensus methods outperform all other algorithms!
• MCLA performs best of the other algorithms

Existing Solutions to Identify Dense Regions
• Molecular Complex Detection (MCODE) [Bader et al, 2003]
  – Use local neighborhood density to identify seed vertices
  – Group highly weighted vertices around seed vertices
• Markov Cluster Algorithm (MCL) [Dongen et al 2000]
  – Random walks on the graph will infrequently go from one natural cluster to another
  – Cluster structure separates out
  – Fast, scalable and non-parametric

Comparison with MCODE and MCL
• MCODE produced only 59 clusters
  – Not all proteins clustered (794/4928)
  – 10-20 clusters insignificant
• MCL produced 1246 clusters
  – Most of the clusters insignificant (close to 75-80%)

Algorithm    Modularity
PCA-agglo    0.471
MCL          0.217
MCODE        0.372

Soft Clustering: Comparison with Hub Duplication (Ucar 2006)
[Flowchart labels: For hub Hi; Hub-induced subgraph Si; Dense components of Si; Duplicate Hi; i++; D’i; Graph partitioning]

Benefits of Soft Ensemble Clustering

A closer look at soft clustering performance
• CKA1 (hub protein)
[Table: columns Base Algorithm, Annotation, PCA-agglo, PCA-softagglo; cell alignment was lost, so values are listed per row in source order]
Direct-bet: Kinase CK2 complex; Kinase CK2 complex; Kinase CK2 complex
Direct-cc: rRNA metabolism; rRNA metabolism
RBR-bet: Kinase CK2 complex; Cell organization and biogenesis
RBR-cc: Kinase CK2 complex
Metis-bet: Cell organization and biogenesis
Metis-cc

Concluding Remarks
• Clustering PPI networks is challenging
  – Noise
  – Presence of hubs
  – Need for soft clustering
  – Integration
• Ensemble clustering shows promise as a unified method to handle these problems
  – Competes well against existing stand-alone solutions
  – Scalable: straightforward parallelization for the most part
• Ongoing work
  – General applicability
    • WWW applications
    • Social network analysis
  – Explicit modeling of domain knowledge
    • E.g. encoding directionality
  – Data integration
    • Key is to weight edges and/or components of the ensemble
  – Leveraging graphical models
  – More robust base models
    • Extrinsic similarity measures
    • Impact of anomalies

Questions?
• We acknowledge the following grants for support
  – NSF: CAREER-IIS-0347662
  – NSF: NGS-CNS-0406386
  – NSF: RI-CNS-0403342
  – DOE: ECPI-FG02
• Graduate Student Colleagues
  – S. Asur and D. Ucar
• Details
  – http://dmrl.cse.ohio-state.edu
  – www.cse.ohio-state.edu/~srini/