Network-based analysis of functional genomics data
Jianhua Ruan, PhD
Computer Science Department, University of Texas at San Antonio (UTSA)
http://www.cs.utsa.edu/~jruan

Final project
• Final project report due Sat, Dec 15
• Presentations: Mon, Dec 17, 8-10:30 pm
  • 10 teams to present
  • Each team will have up to 13 minutes (10 min presentation, 3 min questions)
• Since time is limited, you don’t need to cover all the details in your presentation
  • Focus on the most important concepts
  • More details in your project report

• Human: ~22,000 genes
• Dog: ~20,000 genes
• Rice: ~35,000 genes
• C. elegans: ~20,000 genes
• Mouse: ~22,000 genes
It is not just the genes, but the networks!

Why networks?
• For complex systems, the actual output may not be predictable by looking at only individual components: the whole is greater than the sum of its parts
• Studying genes/proteins on the network level allows us to:
  • Assess the role of individual genes/proteins in the overall pathway
  • Evaluate redundancy of network components
  • Identify candidate genes involved in genetic diseases
  • Set up the framework for mathematical models

Graph model of biological networks
• An abstraction of the complex relationships among molecules in the cell
• Vertex: molecule
  • Gene, protein, metabolite, DNA, RNA
• Edges: relationships
  • Physical interaction
  • Functional association
• Share many common statistical properties with real-world networks
  • Small-world
  • Scale-free
  • Hierarchical
  • Modular (community structure)
(Jeong et al., 2001)

Research Overview
• Genetic network reverse-engineering
• Data mining, tree models, DNA motif finding
• Network analysis algorithms
• Community discovery
• Network-based disease studies
• Data integration, classification, graph algorithms

Agenda
• Community discovery in biological networks
• Network-based analysis of microarray data
• Network-based biomarker discovery for metastatic breast cancer
• Conclusions

Network communities
• Communities
  • Are relatively densely connected sub-networks (modules)
  • Appear in many types of networks
  • Independently studied in many fields: social science, computer science, physics, biology, etc.
• Significance
  • Biological systems are modular
    • Metabolic pathways
    • Protein complexes
    • Transcriptional regulatory modules
  • Biological systems are large and complex
    • Communities provide a high-level overview of the networks
  • Guilt-by-association
    • Predict gene functions based on community memberships

Community discovery problem
• Divide a network into relatively densely connected subnetworks
• Similar to clustering, but # of clusters is determined automatically
[Figure: adjacency matrix after vertex reorder]

Modularity function (Q)
• Measures the strength of community structures (Newman, Phys Rev E, 2003)
• Q = \sum_i (e_{ii} - \hat{e}_{ii}), where e_{ii} is the fraction of edges inside community i and \hat{e}_{ii} is the fraction expected by chance (a_i^2 in Newman's notation)
[Figure: adjacency matrix with diagonal blocks e11, e22, e33, e44, e55]
[Figure: example partitions with Q = 0.45, Q = 0, Q = 0.40, Q = 0.56, Q = 0.54]
• Modularity automatically determines # of communities!

Methods for community discovery
• Previous methods
  • Fast but inaccurate (CNM, Phys Rev E, 2004)
  • Accurate but slow (Guimera & Amaral, Nature 2005)
  • Relatively accurate, relatively fast (Ruan & Zhang, AAAI 2006, ICDM 2007)
  • Relatively accurate, fast, memory intensive (Newman, PNAS 2006)
• Our new algorithm: Qcut/HQcut (Ruan & Zhang, Phys Rev E 2008)
  • Accurate, fast and memory friendly
  • HQcut solves the resolution limit of Q

Algorithm Qcut
1. Recursive multi-way partitioning until Q is max
   [Figure: eigenvector embedding (eig) followed by k-means (kmeans)]
2. Improve Q by efficient heuristic search
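The modularity function Q from the slides above can be made concrete with a few lines of code. The sketch below is not Qcut; it only evaluates Q for a given partition, assuming Newman's standard form Q = sum_i (e_ii - a_i^2) on an undirected, unweighted graph. The function name `modularity`, its arguments, and the toy graph are illustrative choices, not taken from the talk.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity Q = sum_i (e_ii - a_i^2) of one partition.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every vertex to its community label.
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        return 0.0
    within = defaultdict(int)  # number of edges with both ends in community i
    degree = defaultdict(int)  # summed degree of the vertices in community i
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            within[cu] += 1
    # e_ii = within[c] / m, a_i = degree[c] / (2m)
    return sum(within[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph (hypothetical): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {v: "A" for v in range(6)}

print(modularity(edges, two_groups))  # 5/14, about 0.357
print(modularity(edges, one_group))   # 0.0: no split gives Q = 0
```

Putting every vertex in one community gives Q = 0, which matches the Q = 0 example on the slide, while the natural two-triangle split scores higher. A search procedure such as Qcut would explore partitions to maximize this value.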
[Figure: accuracy versus inter-community edge probability, our method compared with Newman’s]

Resolution limit and HQcut
• Q is known to have a resolution limit problem
  • Cannot detect small communities
  • Q decreases slightly if forced to split
• HQcut solves this problem
  • Apply Qcut to get communities with largest Q
  • Recursively search for sub-communities within each community without dramatic change to Q
  • Statistical test for termination criteria
• Ruan & Zhang, Physical Review E 2008

Graphical user interface for Qcut/HQcut

Application: protein complex prediction
• Input: a yeast PPI network
  • Data from Krogan et al., Nature. 2006;440:637-43
  • 2708 vertices (proteins), 7123 interactions
• 289 communities
  • Sizes range from 2 to 49
• Evaluation: compare communities with known complexes manually curated in MIPS database

Communities in a yeast PPI network
• Small ribosomal subunit (90%)
• RNA poly II mediator (83%)
• Proteasome core (90%)
• gamma-tubulin (77%)
• Exosome (94%)
• respiratory chain complex IV (82%)

Communities vs. complexes
[Figure: predicted complexes versus known complexes]
• Communities and complexes have good one to one correspondence
• Overall accuracy: 0.81
  • Newman: 0.58
• Agreement between a predicted complex (P) and a known complex (K) = |P∩K| / sqrt(|P| x |K|).
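The agreement score defined above is easy to compute directly from two sets of protein identifiers. The following is a minimal sketch of that formula only; how the per-pair scores were aggregated into the overall accuracy is not stated on the slide and is not attempted here. The names `agreement` and `best_match` and the protein labels are hypothetical.

```python
from math import sqrt


def agreement(predicted, known):
    """|P ∩ K| / sqrt(|P| * |K|) for a predicted complex P and a known complex K."""
    p, k = set(predicted), set(known)
    if not p or not k:
        return 0.0  # an empty set cannot agree with anything
    return len(p & k) / sqrt(len(p) * len(k))


def best_match(community, complexes):
    """Return (name, score) of the known complex that agrees best with one community.

    complexes: non-empty dict mapping a complex name to its set of members.
    """
    scored = ((name, agreement(community, members)) for name, members in complexes.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical example: 3 shared proteins, |P| = 4, |K| = 9, so 3 / sqrt(36) = 0.5
predicted = {"p1", "p2", "p3", "p4"}
known = {"p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9", "p10"}
print(agreement(predicted, known))
```

The score is 1 only when the two sets are identical and 0 when they share no protein, so it penalizes both oversized and undersized predictions.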
Work-in-progress: Random walk-based improvement
• Motivation:
  • PPI networks often contain both false positive and false negative edges
  • Hub genes prevent good partitioning
• Three goals:
  • Eliminate false positive edges
  • Predict missing links
  • Reduce the impact of hub genes
• Intuition:
  • Two proteins with high topological similarity, whether connected or not, may belong to the same complex
  • Two proteins with a direct link but very different topological properties may belong to different complexes

Method overview
[Figure: pipeline from the original network (adjacency matrix) and initial probability vectors, through a random walk with resistance, to equilibrium probability vectors; a distance calculation gives a similarity matrix, which is thresholded (guided by topology) to produce the new network]

Preliminary results on yeast PPI
• Predicted PPIs have much higher functional relevance than removed PPIs, using several sources of evidence
  • Gene Ontology
  • Gene expression
  • etc.
• New network significantly improved accuracy of protein complex predictions
  • Using HQcut: 0.50 to 0.55
  • Using MCL: 0.48 to 0.59

Agenda
• Community structure in biological networks
  • Prediction of protein complexes
• Network-based microarray data analysis
• Network-based biomarker discovery for metastatic breast cancer
• Conclusions

Microarray data analysis
• Gene network structure is unknown
• Microarray measures gene expression (activity) level
[Figure: expression matrix of genes by conditions, followed by clustering]
• Clustering is the most common analysis tool
• Many clustering algorithms available
  • K-means
  • Hierarchical
  • Self organizing maps
• Parameter (e.g., k) hard to guess
• Does not consider network structure
• Common functions? Common regulation? Predict functions for unknown genes?
Network-based microarray data analysis
[Figure: sample-by-gene expression matrix used to construct a co-expression network between genes i and j]
• Construct co-expression network
  • Genes i and j connected if their expression patterns are “sufficiently similar”
    • Pearson correlation coefficient > arbitrary threshold
    • K nearest neighbors (KNN)
• Key: how to get the “best” network?

Motivation
• Can we use the idea of community discovery for clustering microarray data?
• Advantages:
  • Parameter free
  • Network topology considered
  • Constructed network may have other interesting applications beyond clustering

Our idea (Ruan, ICDM 2009)
[Figure: microarray data to similarity matrix to a network series from Net_1 (most dense) to Net_m (most sparse), each partitioned with Qcut]
• Intuition: the real network is naturally modular
  • Can be measured by modularity (Q)
  • If constructed right, should have the highest Q

Our idea (cont’d)
[Figure: modularity versus network density for the true network, a random network, and their difference]
• Therefore, use ∆Q to determine the best network parameter and obtain the best community structure

Results: synthetic data set 1
• High dimensional data generated by synDeca
  • 20 clusters of high dimensional points, plus some scatter points
  • Clusters are of various shapes: ellipse, rectangle, random
[Figure: QReal, QRandom, ∆Q = QReal - QRandom, and clustering accuracy versus number of neighbors]

Comparison
[Figure: clustering accuracy versus dimension for mKNN-HQcut with the optimum k, mKNN-HQcut with automatically determined k, and K-means]

Results: synthetic data set 2
• Gene expression data (Thalamuthu et al, 2006)
  • 600 data sets
  • ~600 genes, 50 conditions, 15 clusters
  • 0 or 1x outliers
[Figure: results without outliers and with outliers, for mKNN-HQcut with optimal k and mKNN-HQcut with auto k]

Comparison with other methods
• Ruan et al., BioKDD 2010

Results on yeast stress response data
• 3000 genes, 173 samples
• Best k = 140.
• Resulting in 75 clusters

Results on yeast stress response data
• Enrichment of common functions
  • Cumulative hypergeometric test (Fisher’s exact test)
[Figure: enrichment map of genes versus GO function terms]
  • Protein biosynthesis (p < 10^-96)
  • Peroxisome (p < 10^-13)
  • Nuclear transport (p < 10^-50)
  • mt ribosome (p < 10^-63)
  • DNA repair (p < 10^-66)
  • RNA splicing (p < 10^-105)
  • Nitrogen compound metabolism (p < 10^-37)

Comparison with k-means
[Figure: overall enrichment score, mkNN-HQcut using automatically determined k = 140 versus K-means]

An interesting community
• A 25-gene community missed by other methods
  • 4 telomere maintenance genes (p < 10^-7)
  • 16 unknown genes, all located in chromosome telomeric regions
  • 5 other genes at rim of the sub-network
• 4 transcription factors regulate many genes in the community

Application to Arabidopsis data
• ~22000 genes, 1138 samples
• 1150 singletons
• 800 (300) modules of size >= 10 (20)
• > 80% (90%) of modules have enriched functions
• Much more significant than all five existing studies on the same data set
[Figure: top 40 most significant modules]

Cis-regulatory network of Arabidopsis
[Figure: motif-module network]
• Ruan et al., BMC Bioinfo, to appear

Beyond gene clustering (1)
• Gene specific studies
• Collaborator is interested in Gibberellins
  • A hormone important for the growth and development of plants
  • Commercially important
  • Biosynthesis and signaling well studied
  • Transcriptional regulation of biosynthesis and signaling not yet clear
• 3 important gene families, GA20ox, GA3ox and GA2ox, for biosynthesis
• Receptor gene family: GID1A,B,C
• Analyze the co-expression network around these genes
[Figure: co-expression network; node labels include GID1A, GID1B, GID1C, GA3, 20ox1 to 20ox5, 3ox1 to 3ox4, 2ox1 to 2ox4 and 2ox6 to 2ox8]

Beyond gene clustering (2)
• Cancer classification
• Sample: patient or cell lines
[Figure: gene-by-sample expression data converted to a sample-sample network and partitioned with Qcut]
• Alizadeh et al., Nature, 2000

Network of cell samples
• Shape: cell line / cancer type
• Color: clustering results
[Figure: sample network with groups labeled Transformed cell lines, Activated Blood B, Resting Blood B, Blood T, Follicular lymphoma (FL), Chronic lymphocytic leukemia (CLL), and Diffuse large B-cell lymphoma (DLBCL)]

Survival rate after chemotherapy
[Figure: three DLBCL groups, DLBCL-1, DLBCL-2 and DLBCL-3, annotated with the following outcomes]
• Survival rate: 73%, median survival time: 71.3 months
• Survival rate: 40%, median survival time: 22.3 months
• Survival rate: 20%, median survival time: 12.5 months

Beyond gene clustering (3)
• Topology vs function
[Figure: % of essential proteins versus number of connections (Jeong et al., Nature 2001)]

Community participation vs. essentiality (PPI)
[Figure: % essential versus community participation, hub and non-hub]

Community participation vs. essentiality (coexp)
[Figure: % essential versus community participation for hub and non-hub genes; % essential versus number of connections for participation < 0.2 and participation >= 0.2]
• Key: how to systematically search for such relationships?
  • Data mining: association rule?

Agenda
• Community structure in biological networks
  • Prediction of protein complexes
• Network-based microarray data analysis
• Network-based biomarker discovery for metastatic breast cancer
• Conclusions

Background
• Metastasis is the spread of cancer from one organ to another non-adjacent organ or part.
• Challenge: predict metastasis
  • If metastasis is likely => aggressive adjuvant therapy
  • How to decide the likelihood?
• Traditional predictive factors are not good enough

Microarray-based marker discovery
• Examine genome-wide expression profiles
• Idea:
  • Score individual genes for how well they discriminate between different classes of disease
  • Establish gene expression signature
• Limitations:
  • # genes >> # patients
  • Downstream effects
  • Individual variations not attributed to cancer
• Consequences:
  • Low reproducibility across data sets
  • Missing biological insight
[Figure: expression matrix, M by N]

Pathway vs. PPI Subnetwork as Marker
• Remedy to the problems: use pathway information
• Limitation: majority of human genes not yet assigned to a definitive pathway
• Alternatively: protein-protein interaction (PPI) networks
• Fewer features => better stability
• Biological insight
• Genes in the same pathway may have short distance in PPI
• Subnetworks (potential pathways) as markers (Chuang et al. 2007, Mol Syst Bio.)
• Cannot differentiate causal vs. downstream genes

Our approach
• Key observation: many known disease genes are not differentially expressed (DE) between metastasis and non-metastasis, but their neighbors are
  • e.g., P53 and BRCA2
• Intuition: find a small number of intermediate genes that connect DE genes
  • Known as the Steiner tree problem in computer science

Steiner tree problem
• Given a connected graph and a list of input nodes, find the smallest number of additional nodes so that there is a tree connecting the input nodes and those additional nodes

Method overview

Experiment setup
• Two breast cancer microarray data sets
  • van de Vijver et al, 2002 (Agilent, 78 +ve vs 217 -ve)
  • Wang et al, 2005 (Affy, 106 +ve vs 180 -ve)
• Two human PPI networks
  • PINA (Wu et al 2009), 10,920 proteins and 61,746 interactions
  • Chuang et al 2007, 11,203 proteins and 57,235 interactions

Steiner tree example for Wang data set

Cross-dataset stability of markers
Marker set                                           van de Vijver dataset   Wang dataset   Common genes (% of overlap)   p-value
Steiner tree-based markers, Chuang-PPI               1123                    1100           384 (20.88%)                  2E-127
Steiner tree-based markers, PINA-PPI                 981                     1135           400 (23.31%)                  5E-158
Top scoring Steiner tree-based markers, Chuang-PPI   203                     194            63 (18.86%)                   7E-064
Top scoring Steiner tree-based markers, PINA-PPI     164                     185            80 (29.74%)                   3E-103
Differentially expressed genes                       324                     319            34 (5.58%)                    1E-008
Approach in Chuang et al (2007)                      618                     906            175 (12.8%)                   6E-054
Approach in original study                           70                      76             3 (2.09%)                     .03

Enriched cancer-related pathways
• Enrichment score for Wang dataset with Chuang-PPI network.
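To make the Steiner tree formulation from the earlier slides concrete, here is a small illustrative heuristic. The exact Steiner tree problem is NP-hard in general, so this sketch uses a common greedy shortest-path approximation: start from one input node and repeatedly attach the nearest remaining input node along a shortest path. It is not the algorithm used in the talk, and the function names, the toy network, and the gene labels are all hypothetical.

```python
from collections import deque


def bfs_path(adj, sources, targets):
    """Shortest path from any node in `sources` to the nearest node in `targets`."""
    parent = {s: None for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        if u in targets:
            path = []
            while u is not None:  # walk back to the source that reached u
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj.get(u, ()):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


def steiner_heuristic(adj, terminals):
    """Greedy approximation of a Steiner tree on an unweighted graph.

    adj: dict mapping each node to a list of neighbors (symmetric).
    terminals: input nodes that must be connected (e.g. DE genes).
    Returns (tree_nodes, tree_edges); tree_nodes - terminals are the added nodes.
    """
    terminals = list(terminals)
    tree_nodes = {terminals[0]}
    tree_edges = set()
    remaining = set(terminals) - tree_nodes
    while remaining:
        path = bfs_path(adj, tree_nodes, remaining)
        if path is None:
            raise ValueError("input nodes are not all in one connected component")
        tree_edges.update(frozenset(pair) for pair in zip(path, path[1:]))
        tree_nodes.update(path)
        remaining -= tree_nodes
    return tree_nodes, tree_edges


# Toy PPI network (hypothetical): DE genes A, B, C all touch the non-DE gene H.
adj = {
    "A": ["H", "x"], "B": ["H", "y"], "C": ["H"],
    "H": ["A", "B", "C"], "x": ["A", "y"], "y": ["x", "B"],
}
nodes, edges = steiner_heuristic(adj, ["A", "B", "C"])
print(sorted(nodes - {"A", "B", "C"}))  # the intermediate node recovered: ['H']
```

In the toy network the only added node is H, which plays the role of a disease gene that is not itself differentially expressed but links the genes that are.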
Overlap with known breast cancer genes
Figure: Overlap of 60 known breast cancer genes with STMs, t-STMs, DE genes, Chuang et al (2007) method and corresponding original studies. (A) Wang dataset (B) van de Vijver dataset.

Classification accuracy
Figure: Classification accuracy based on AUC metric. The dataset label represents features taken from that dataset. For cross-data classification, features are taken from the labeled dataset and applied to the other dataset. (A) Bayesian logistic regression (B) random tree classifier.

Conclusions
• Methods for community discovery
  • Fully automated, i.e. parameter-free
  • Higher accuracy than competing methods that require extensive parameter tuning
  • Improves protein complex prediction and microarray data clustering, including tumor subtype classification
  • Many other applications
• Steiner-tree based biomarker discovery
  • Improves stability of metastatic breast cancer markers, and cross-dataset classification accuracy
  • A generic method for identifying the hidden causal genes given downstream targets

Future work
• Network-based gene-specific studies
• Combine random walk with Steiner tree algorithms to improve biomarker discovery and cancer classification
• Other types of data
• Other diseases

Questions?