CS 6293 Advanced Topics: Current Bioinformatics
Biological networks: Theory and applications

Lecture outline
• Basic terminology and concepts in networks
• Some interesting results relating network properties to biological functions
• Network clustering / community discovery
• Applications of network clustering methods

Network
• A network refers to a graph
• A useful concept for analyzing the interactions of different components in a system

Biological networks
• An abstraction of the complex relationships among molecules in the cell
• Many types:
  – Protein-protein interaction networks
  – Protein-DNA (RNA) interaction networks
  – Genetic interaction networks
  – Metabolic networks
  – Signal transduction networks
  – (Real) neural networks
  – Many others
• In some networks, edges have a precise meaning; in others, the meaning of an edge is obscure

Protein-protein interaction networks
• Yeast PPI network
• Nodes – proteins
• Edges – interactions
• The color of a node indicates the phenotypic effect of removing the corresponding protein (red = lethal, green = non-lethal, orange = slow growth, yellow = unknown).

Obtaining biological networks
• Direct experimental methods
  – Protein-protein interaction networks
    • Yeast two-hybrid
    • Tandem affinity purification
    • Co-immunoprecipitation
  – Protein-DNA interaction
    • Chromatin immunoprecipitation followed by microarray or sequencing (ChIP-chip, ChIP-seq)
  – High level of noise (false positives and false negatives)
• Computational prediction methods
  – Often cannot differentiate direct from indirect interactions

Why networks?
• Studying genes/proteins at the network level allows us to:
  – Assess the role of individual genes/proteins in the overall pathway
  – Evaluate redundancy of network components
  – Identify candidate genes involved in genetic diseases
  – Set up the framework for mathematical models
• For complex systems, the actual output may not be predictable by looking only at individual components: the whole is greater than the sum of its parts

Graphs
• A graph G = (V, E)
  – V = set of vertices
  – E = set of edges, $E \subseteq V \times V$
  – Thus $|E| = O(|V|^2)$
• Example: Vertices: {1, 2, 3, 4}; Edges: {(1, 2), (2, 3), (1, 3), (4, 3)}

Graph Variations (1)
• Directed / undirected:
  – In an undirected graph:
    • Edge $(u, v) \in E$ implies edge $(v, u) \in E$
    • Example: road networks between cities
  – In a directed graph:
    • Edge $(u, v)$: $u \to v$ does not imply $v \to u$
    • Example: street networks in downtown
  – Degree of vertex v:
    • The number of edges incident to v
    • For a directed graph, there are an in-degree and an out-degree
• [Figure: in the directed example graph, vertex 3 has in-degree 3 and out-degree 0; in the undirected version it has degree 3]

Graph Variations (2)
• Weighted / unweighted:
  – In a weighted graph, each edge or vertex has an associated weight (a numerical value)
    • E.g., a road map: edges might be weighted with distance
• [Figure: the example graph unweighted, and weighted with edge weights 0.3, 0.4, 1.2 and 1.9]

Graph Variations (3)
• Connected / disconnected:
  – A connected graph has a path from every vertex to every other vertex
  – A directed graph is strongly connected if there is a directed path between any two vertices
• [Figure: directed graph on vertices 1 to 4 that is connected but not strongly connected]

Graph Variations (4)
• Dense / sparse:
  – Graphs are sparse when the number of edges is linear in the number of vertices
    • $|E| = O(|V|)$
  – Graphs are dense when the number of edges is quadratic in the number of vertices
    • $|E| = \Theta(|V|^2)$
  – Most graphs of interest are sparse
  – If you know whether you are dealing with dense or sparse graphs, different data structures may make sense

Representing Graphs
• Assume V = {1, 2, …, n}
• An adjacency matrix represents the graph as an n x n matrix A:
  – $A[i, j] = 1$ if edge $(i, j) \in E$, and $A[i, j] = 0$ if edge $(i, j) \notin E$
• For a weighted graph
  – $A[i, j] = w_{ij}$ if edge $(i, j) \in E$, and $A[i, j] = 0$ if edge $(i, j) \notin E$
• For an undirected graph
  – The matrix is symmetric: $A[i, j] = A[j, i]$

Graphs: Adjacency Matrix
• Example: directed graph with edges (1, 2), (1, 3), (2, 3), (4, 3)

  A   1  2  3  4
  1   0  1  1  0
  2   0  0  1  0
  3   0  0  0  0
  4   0  0  1  0

• How much storage does the adjacency matrix require? A: $O(V^2)$

Graphs: Adjacency Matrix
• Example: the same graph, undirected

  A   1  2  3  4
  1   0  1  1  0
  2   1  0  1  0
  3   1  1  0  1
  4   0  0  1  0

Graphs: Adjacency Matrix
• Example: weighted graph with w(1, 2) = 5, w(1, 3) = 6, w(2, 3) = 9, w(3, 4) = 4

  A   1  2  3  4
  1   0  5  6  0
  2   5  0  9  0
  3   6  9  0  4
  4   0  0  4  0

Graphs: Adjacency Matrix
• Time to answer whether there is an edge between vertices u and v: $\Theta(1)$
• Memory required: $\Theta(n^2)$ regardless of |E|
  – Usually too much storage for large graphs
  – But can be very efficient for small graphs
• Most large interesting graphs are sparse
  – E.g., road networks (due to the limit on junctions)
  – For this reason the adjacency list is often a more appropriate representation

Graphs: Adjacency List
• Adjacency list: for each vertex $v \in V$, store a list of the vertices adjacent to v
• Example (directed graph above):
  – Adj[1] = {2, 3}
  – Adj[2] = {3}
  – Adj[3] = {}
  – Adj[4] = {3}
• Variation: can also keep a list of the edges coming into each vertex

Graph representations
• Adjacency list of the directed example: 1 → 2, 3; 2 → 3; 3 → (empty); 4 → 3
• How much storage does the adjacency list require? A: $O(V + E)$

Graph representations
• Undirected graph: 1 → 2, 3; 2 → 1, 3; 3 → 1, 2, 4; 4 → 3 (same graph as the symmetric adjacency matrix above)

Graph representations
• Weighted graph, entries written as (neighbor, weight): 1 → (2, 5), (3, 6); 2 → (1, 5), (3, 9); 3 → (1, 6), (2, 9), (4, 4); 4 → (3, 4) (same graph as the weighted adjacency matrix above)

Graphs: Adjacency List
• How much storage is required?
• For directed graphs
  – |adj[v]| = out-degree(v)
  – Total number of items in the adjacency lists is $\sum_v \text{out-degree}(v) = |E|$
• For undirected graphs
  – |adj[v]| = degree(v)
  – Number of items in the adjacency lists is $\sum_v \text{degree}(v) = 2|E|$
• So: adjacency lists take $\Theta(V + E)$ storage
• Time needed to test whether edge $(u, v) \in E$ is $O(n)$

Tradeoffs between the two representations
|V| = n, |E| = m

                      Adj Matrix   Adj List
  test (u, v) ∈ E     Θ(1)         O(n)
  Degree(u)           Θ(n)         O(n)
  Memory              Θ(n^2)       Θ(n+m)
  Edge insertion      Θ(1)         Θ(1)
  Edge deletion       Θ(1)         O(n)
  Graph traversal     Θ(n^2)       Θ(n+m)

Both representations are very useful and have different properties, although adjacency lists are probably better for most problems.

Structural properties of networks
• Degree distribution
• Average shortest path length
• Clustering coefficient
• Community structure
• Degree correlation
• Motivation to study structural properties:
  – Structure determines function
  – Functional structural properties may be shared by different types of real networks (biological or non-biological)

Degree distribution P(k)
• The probability that a selected node has exactly (or approximately) k links.
  – P(k) is obtained by counting the number of nodes N(k) with k = 1, 2, … links and dividing by the total number of nodes N: $P(k) = N(k) / N$

Erdős–Rényi model
• Each pair of nodes has a probability p of forming an edge
• Most nodes have about the same number of connections
• Degree distribution is binomial or Poisson

Real networks: scale-free
• Heavy-tailed distribution
  – Power-law distribution
• $P(k) \propto k^{-r}$
• [Figure: histogram of number of genes versus number of connections]

Comparing random and scale-free distributions
• In the random network, the five nodes with the most links (in red) are connected to only 27% of all nodes (green).
In the scale-free network, the five most connected nodes (red) are connected to 60% of all nodes (green) (source: Nature)
• Robust yet fragile nature of networks

Shortest and mean path length
• Distance in networks is measured by path length
• As there are many alternative paths between two nodes, the shortest path between the selected nodes has a special role.
• In directed networks,
  – The distance from A to B is often different from the distance from B to A
  – Often there is no direct path between two nodes.
• The average path length between all pairs of nodes offers a measure of a network's overall navigability.
• Most pairs of vertices in a biological network seem to be connected by a short path: the small-world property

Clustering coefficient
• Your clustering coefficient: the probability that two of your friends are also friends
  – You have m friends
  – Among your m friends, n pairs are friends with each other
    • The maximum is $m(m-1)/2$
    • $C = 2n / (m^2 - m)$
• Clustering coefficient of a network: the average clustering coefficient of all individuals

Clustering Coefficient
• The i-th node has $k_i$ neighbors linked to it
• $E_i$ is the actual number of links between the $k_i$ neighbors
• The maximal number of links between the $k_i$ neighbors is $k_i(k_i - 1)/2$
• $C_i = 2E_i / (k_i(k_i - 1))$, which equals 2/9 in the pictured example
• Interpretation: the probability that two of your friends are also friends
• Clustering coefficient of a network: average clustering coefficient of all nodes

Degree correlation
• Do rich people tend to hang together with rich people (rich club)?
• Or do they tend to interact with less wealthy people?
• Do high-degree nodes tend to connect to low-degree nodes or to high-degree ones?

Some interesting findings from biological networks
• Jeong et al. Lethality and centrality in protein networks. Nature 411, 41-42 (3 May 2001)
• Roger Guimerà and Luís A. Nunes Amaral. Functional cartography of complex metabolic networks. Nature 433, 895-900 (24 February 2005)
• Han et al. Evidence for dynamically organized modularity in the yeast protein–protein interaction network.
Nature 430, 88-93 (1 July 2004)

Connectivity vs essentiality
• [Figure: % of essential proteins versus number of connections; Jeong et al., Nature 2001]

Community role vs essentiality
• The effect of a perturbation cannot depend on the node's degree only!
• Many hub genes are not essential
• Some non-hub genes are essential
• Maybe a gene's role in its community is also important
  – Local leader? Global leader? Ambassador?
  – Guimerà and Amaral, Nature 433, 2005

Community structure
• Roles 1, 2, 3: non-hubs with increasing participation indices
• Roles 5, 6: hubs with increasing participation indices

Dynamically organized modularity in the yeast PPI network
• Protein interaction networks are static
• Two proteins cannot interact if one is not expressed
• We should look at the gene expression level
• Han et al., Nature 430, 2004

Obtaining Data
• Distinguish party hubs from date hubs
• [Figure legend: red curve – hubs; cyan curve – nonhubs; black curve – randomized]
• Partners of date hubs are significantly more diverse in spatial distribution than partners of party hubs

Effect of removal of nodes on average geodesic distance
• [Figure panels: original network; on removal of date hubs; on removal of party hubs. Legend: green – nonhub nodes; brown – hubs; red – date hubs; blue – party hubs]
• The 'breakdown point' is the threshold after which the main component of the network starts disintegrating.

Dynamically organized modularity
• [Figure legend: red circles – date hubs; blue squares – modules]

Han-Yu Chuang, Eunjung Lee, Yu-Tseung Liu, Doheon Lee, Trey Ideker. Network-based classification of breast cancer metastasis. Mol Syst Biol. 2007; 3: 140.

Challenge: Predict Metastasis
• If metastasis is likely => aggressive adjuvant therapy
  – How to decide the likelihood?
• Traditional predictive factors are not good

Recently: Gene Marker Sets
• Examine genome-wide expression profiles
  – Score individual genes for how well they discriminate between different classes of disease
• Establish a gene expression signature
  – Problem: # genes >> # patients

Pathway Expression vs.
PPI Subnetwork as Marker
• Score known pathways for coherence of gene expression changes?
  – The majority of human genes are not yet assigned to a definitive pathway
• Large protein-protein interaction networks recently became available
  – Extract subnetworks from PPI networks as markers

Subnetwork Marker Identification: Data Used
• 2 separate cohorts of breast cancer patients
  – van 't Veer et al. and Wang et al.
  – Roughly half had developed metastasis
• Used a protein-protein interaction network obtained by assembling a pooled dataset
  – 57,235 interactions among 11,203 proteins

Goal: Find Significantly Discriminative Subnetworks
• Use a scoring system to search for subnetworks highly discriminative of metastasis

Discriminative Score Function S
• Step 1: Assign activity scores to a subnetwork of genes
• Step 2: Assign discriminative score S to the subnetwork
• Score(subnetwork) = mutual information between a subnetwork's activity score vector and the phenotype vector over all patients
  – S(k) = MI(a, c)

Find Candidate Subnetworks using S and Greedy Search
• Use a single PPI node as seed
  – At each iteration, add the neighbor resulting in the highest score improvement
  – Stop when no addition increases the score by a rate of at least r = 0.05, or when the distance from the seed exceeds 2
  – Report the candidate subnetwork and repeat with the next node as seed

Identify Significant Subnets from 3 Null Distributions
• p1: 100 expression permutation trials, p < 0.05
  – Expression vectors of individual genes randomly permuted on the network
• p2: 100 random subnetworks seeded at protein i, p < 0.05
• p3: 20,000 phenotype permutation trials, p < 0.00005

Results: Correspondence to hallmarks of cancer
• For two datasets of 295 and 286 patients, 149 and 243 (resp.)
discriminative subnets found
• 47% and 65% of subnets enriched for a common biological process
• 66 and 153 subnets were enriched for processes involved in major events of cancer progression

Results: Reproducibility
• Subnetwork markers are significantly more reproducible between datasets than individual gene markers
• [Figure panels: Dataset 1, Dataset 2]
• Shared network motifs with differences in differential expression; the left-hand side is from Dataset 1 and the right-hand side is from Dataset 2

Results: Subnetwork Markers as Classifiers
• Averaged expression values for each subnetwork were used as features for a classifier based on logistic regression
• For comparison, the top individual gene markers were instead used as features
• Markers from one dataset were used as predictors of metastasis on the other dataset
• [Figure: Dataset 1 markers tested on Dataset 2, and vice versa]

Results: Informative of Nondiscriminative Disease Genes
• Network analyses can identify proteins that are not differentially expressed, but are required to connect higher-scoring proteins in a significant subnetwork
• 85.9% and 96.7% of the significant subnetworks contained at least one protein that was not significantly differentially expressed in metastasis
• Several established prognostic markers were not present among the individual gene expression markers, but played a central, interconnecting role in discriminative subnetworks
  – MYC, ERBB2

Community discovery: motivations
• Biological networks are modular
  – Metabolic pathways
  – Protein complexes
  – Transcriptional regulatory modules
• Provide a high-level overview of the networks
• Predict gene functions based on communities

Community discovery problem
• Divide a network into relatively densely connected sub-networks
• [Figure: adjacency matrix before and after vertex reordering]

Challenges
• How many communities?
• Is there any community at all?
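The "vertex reorder" picture in the community discovery problem can be illustrated with a small sketch. The 6-vertex graph and the hand-picked community labels below are hypothetical and not taken from the lecture; the point is only that listing the members of each community next to each other turns densely connected sub-networks into dense diagonal blocks of the adjacency matrix.

```python
# Hypothetical toy graph: triangles {0, 2, 4} and {1, 3, 5} joined by the edge (4, 1).
edges = [(0, 2), (2, 4), (0, 4), (1, 3), (3, 5), (1, 5), (4, 1)]
n = 6


def adjacency_matrix(n, edges):
    """Build a symmetric 0/1 adjacency matrix for an undirected graph."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = 1
        A[v][u] = 1
    return A


def reorder(A, order):
    """Permute rows and columns of A so that vertex order[i] sits at index i."""
    return [[A[u][v] for v in order] for u in order]


# Assumed community labels, chosen by hand for this illustration.
community = {0: "a", 2: "a", 4: "a", 1: "b", 3: "b", 5: "b"}

# Sort vertices by community so that members of a community are contiguous.
order = sorted(range(n), key=lambda v: (community[v], v))

A = adjacency_matrix(n, edges)
B = reorder(A, order)

print("original order:")
for row in A:
    print(" ".join(map(str, row)))
print("reordered as", order)
for row in B:
    print(" ".join(map(str, row)))
```

In the reordered matrix the two 3 x 3 diagonal blocks hold the within-community edges, and the single between-community edge shows up as one symmetric pair of entries outside the blocks. Finding a good labeling in the first place is the hard part, which the rest of the lecture addresses.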
Community structures
• Also known as modules
• Relatively densely connected sub-networks
• Quite common in real networks
  – Social networks
  – Internet
  – Biological networks
  – Transportation
  – Power grid

History
• Social science: clustering
  – Based on affinities / similarities
  – Need to give the number of clusters
  – Can always find clusters
• Computer science: graph partitioning
  – Minimizing cut / cut ratio
  – Need to give the number of partitions
  – Can always produce partitions
• Preferred approach: natural division
  – Automatically determine the number of communities
  – Do not partition if there is no community

Modularity function (Q)
• Measures the strength of community structure
  – Newman, Phys Rev E, 2003
• $Q = \sum_{i=1}^{k} \left[ \frac{e_{ii}}{M} - \left( \frac{a_i}{M} \right)^2 \right]$
  – k: number of communities
  – $e_{ii}/M$: observed fraction of edges falling in community i
  – $(a_i/M)^2$: expected fraction of edges falling in community i
  – For two communities: $a_1 = e_{11} + e_{12}$, $a_2 = e_{21} + e_{22}$, $M = a_1 + a_2$
  – $-1 < Q < 1$; $Q = 0$ if $k = 1$
• [Figure: example partitions with Q = 0.45 and Q = 0]
• Goal: find the partition that has the highest Q value
• But: optimizing Q is NP-hard (Brandes et al., 2006)
• [Figure: partitions of the same network with Q = 0.40, Q = 0.56 and Q = 0.54]

Heuristic algorithms
• k-way spectral partitioning approximately optimizes Q if k is known
  – White & Smyth, SDM 2005
  – [Figure: eigenvector embedding ("eig") followed by k-means]
• k is unknown: test all possible k's

k-way spectral partitioning
• k = 2: Q = 0.40; k = 3: Q = 0.56; k = 4: Q = 0.54
• Good accuracy
• ~$O(n^3)$ time complexity; n: number of vertices

Recursive bi-partitioning
• [Figure: successive splits with Q = 0.40, Q = 0.56, and a rejected split with Q = 0.54]
• ~$O(m \log n)$ time complexity; m: number of edges
• Accuracy worse than k-way partitioning

Can we do better?
• Objectives
  – Efficiency of the recursive algorithm
  – Accuracy of the k-way algorithm (or even better)
• Ideas
  – Flexible l-way recursive partition (l = 2 to 5)
    • As efficient as recursive bi-partitioning
    • Accuracy similar to the k-way algorithm
    • Ruan and Zhang, ICDM 2007
  – Take the results of the recursive algorithm as the starting point and do local improvement
    • Ruan and Zhang, Physical Review E 2008

Algorithm Qcut
1. Recursive partitioning until a local maximum of Q
2. Refine the solution by greedy search
  – Consider two types of operations
    • Move a vertex to a different community
    • Merge two communities
  – Take the one with the largest improvement of Q
  – Repeat until no improvement of Q can be made
  – Go back to step 1 if necessary
• Key: quickly find the operation that gives the largest improvement of Q

Identifying candidate moves
• If vertex v moves from community i to community j:
  $\Delta Q = \frac{x_j - x_i}{M} + \frac{(a_i - a_j - x)\, x}{M^2}$
  – $x_i$: degree of v in community i
  – $x$: degree of v
  – $a_i$: total degree of the vertices in community i
• Compute all potential $\Delta Q$ from the initial state
• Each update is almost constant-time for scale-free networks
• Additional heuristics improve efficiency

Results on synthetic networks
• State of the art: Newman, PNAS 2006
• Relative Q = Qfound − Qtrue
• [Figure: accuracy versus N_out and relative Q versus N_out]

An example
• [Figure panels, vertices reordered: real structure; result of Qcut (accuracy: 99%); result of Newman (accuracy: 77%)]

Results on real-world networks
SA: simulated annealing, Guimerà & Amaral, Nature 2005

Modularity
  Network      #Vertices  #Edges   Newman  SA     Qcut
  Social       67         142      0.573   0.608  0.587
  Neuron       297        2359     0.396   0.408  0.398
  Ecoli Reg    418        519      0.766   0.752  0.776
  Circuit      512        819      0.804   0.670  0.815
  Yeast Reg    688        1079     0.759   0.740  0.766
  Ecoli PPI    1440       5871     0.367   0.387  0.387
  Internet     3015       5156     0.611   0.624  0.632
  Physicists   27519      116181   --      --     0.744

Running time (seconds)
  Network      #Vertices  #Edges   Newman  SA     Qcut
  Social       67         142      0.0     5.4    2.0
  Neuron       297        2359     0.4     139    1.9
  Ecoli Reg    418        519      0.7     147    12.7
  Circuit      512        819      1.8     143    6.1
  Yeast Reg    688        1079     3.0     1350   13.4
  Ecoli PPI    1440       5871     33.2    5868   41.5
  Internet     3015       5156     253.7   11040  43.0
  Physicists   27519      116181   --      --     2852

Graphical user interface for biologists

A real-world example
• A classic social network: the karate club
• Node – club member; edge – friendship
• The club was split due to a dispute
• Can we predict the split given the network?

Network of football teams
• Vertices: football teams in NCAA Division I-A
• Edges: games played in year 2000
• 110 teams
• 11 conferences (excluding independents)
• Most games are within conferences

Conference vs. Community
• [Figure: conferences (e.g., Big 12, Big East, Mountain West, Pacific Ten) compared with the communities discovered by Qcut / Newman]

Whose fault is it?
• Communities discovered by Qcut / Newman: Q = 0.6239
• Forcing the two conferences (Mountain West and Pacific Ten) to be separated: Q = 0.6237

Resolution limit of the Q function
• [Figure: two small communities c1 and c2 attached to a large network; Q1 with c1 and c2 merged, Q2 with c1 and c2 separated]
• C1 and C2 are separable only if $Q_2 - Q_1 > 0$
• $Q_2 - Q_1 \propto a_1 a_2 / 2M - e_{12}$
  – $a_1 a_2 / 2M$: expected number of edges between C1 and C2
  – $e_{12}$: actual number of edges between C1 and C2
• If C1 and C2 are small relative to the network
  – Expected number of edges < 1
  – C1 and C2 are non-separable even if connected by only one edge
  – But that edge may be due to noise in the data

Resolution limit
• Optimizing Q
  – may miss small communities
  – is sensitive to false-positive edges
  – cannot reveal hierarchical structures
    • A community containing some sub-communities
• Real-world networks
  – contain both large and small communities
  – may have false-positive edges
    • Biological data are extremely noisy
  – have hierarchies

A solution: HQcut
• Ruan & Zhang, Physical Review E 2008
• Apply Qcut to get the communities with the largest Q
• Recursively search for sub-communities within each community
• When to stop?
  – The Q value of the sub-network is small, or
  – Q is not statistically significant
    • Estimated by a Monte Carlo method
• [Figure: three sub-networks of a large network, each compared with its randomized versions]
  – Q = 0.49, randQ = 0.15 ± 0.016: Z-score = (0.49 - 0.15) / 0.016 = 21
  – Q = 0.18, randQ = 0.15 ± 0.016: Z-score = (0.18 - 0.15) / 0.016 = 1.9
  – Q = 0.49, randQ = 0.52 ± 0.031: Z-score = (0.49 - 0.52) / 0.031 = -1.3

Test on synthetic networks
• Network: 1000 vertices
• Community sizes vary from 15 to 100
• [Figure: accuracy; example communities discovered by Qcut and discovered by HQcut]

Results for the NCAA teams
• [Figure: communities by Qcut / Newman versus communities by HQcut; Mountain West and Pacific Ten highlighted]

Applications to a PPI network
• Protein-protein interaction (PPI) network
  – Vertices: proteins
  – Edges: interactions detected by experiments
• Motivation:
  – Community = protein complex?
• Protein complex
  – Group of proteins associated via interactions
  – Elementary functional unit in the cell
  – Prediction from the PPI network is important

Experiments
• Data set
  – A yeast protein-protein interaction network
    • Krogan et al., Nature 2006
  – 2708 proteins, 7123 interactions
• Algorithms:
  – Qcut, HQcut, Newman
• Evaluation
  – ~300 known protein complexes in MIPS
  – How well does a community match a known protein complex?

Results
                                         Newman   Qcut      HQcut
  # of communities                       56       93        316
  Max community size                     312      264       60
  # of matched communities               53       52        216
  Communities with matching score = 1    5 (9%)   7 (13%)   43 (20%)
  Average matching score                 0.56     0.55      0.70
  # of novel predictions                 3        41        100

Communities found by HQcut
• Small ribosomal subunit (90%)
• RNA poly II mediator (83%)
• Proteasome core (90%)
• gamma-tubulin (77%)
• Exosome (94%)
• Respiratory chain complex IV (82%)

Example hierarchical community

Microarray data
• Data organized into a matrix (genes by samples)
  – Rows are genes
  – Columns are samples representing different time points, conditions, tissues, etc.
• Analysis techniques
  – Differential expression analysis
  – Classification and clustering
  – Regulatory network construction
  – Enrichment analysis
• Characteristics of microarray data
  – High dimensionality and noise
  – Underlying topology unknown, often of irregular shape
• [Figure legend: red – high activity; green – low activity]

Microarray data clustering
• Analyze the genes in each cluster
  – Common functions?
  – Common regulation?
  – Predict functions for unknown genes?
• Many clustering algorithms available
  – K-means
  – Hierarchical
  – Self-organizing maps
  – Parameters hard to tune
  – Do not consider network topology
• [Figure: gene-by-sample heat map; red – high activity; green – low activity]

Network-based data analysis
• Construct a co-expression network from the gene-by-sample matrix
• Genes i and j are connected if their expression patterns are "sufficiently similar"
  – Similarity > threshold (long list of references)
  – K nearest neighbors (recently became popular)
• Many interesting applications beyond clustering
• Focus here is clustering

Motivation
• Can we use the idea of community finding for clustering microarray data?
• Advantages:
  – Parameter free
  – Network topology considered
  – The constructed network may have other uses

Network-based microarray data analysis
• How to get the networks?
  – Threshold-based
  – Nearest neighbors
• How to determine the right cutoff?
• Can we use a complete weight matrix?
  – Complete graph, with weighted edges
  – In general, no, since Q is ill-defined on weighted networks

Network-based microarray data analysis
• There is an implicit network structure behind the gene-by-condition data
• Motivation: the true network should be naturally modular
  – Can be measured by modularity (Q)
  – If constructed right, it should have the highest Q

Method overview
• Microarray data → similarity matrix → network series Net_1 (most dense), …, Net_m (most sparse)
• Run Qcut on each network in the series

Method overview (cont'd)
• [Figure: modularity of the true network, modularity of a random network, and their difference, versus network density]
• Therefore, use ∆Q to determine the best network parameter and obtain the best community structure
• We actually run HQcut, a variant of Qcut, in order to avoid the resolution limit (Ruan & Zhang, Phys Rev E 2008)

Network construction methods
• Value-based method
  – Remove edges with similarities < ε
  – Fixed ε for all vertices
  – May have problems detecting weakly correlated modules
• Asymmetric k-nearest neighbors (aKNN)
  – Connect each vertex to k other vertices
  – Fixed k for all vertices (k < 10 is good enough)
  – Minimum degree = k; maximum = ?
  – Sensitive to outliers
• Mutual k-nearest neighbors (mKNN)
  – Association confirmed by both ends
  – Maximum degree = k, minimum = 0 (k larger than in aKNN)
  – Outliers can be detected
  – Ruan, ICDM 2009

Results: synthetic data set 1
• High dimensional data generated by synDeca.
  – 20 clusters of high dimensional points, plus some scatter points
  – Clusters are of various shapes: ellipse, rectangle, random
• [Figure: Qreal, Qrandom, ∆Q = Qreal − Qrandom, and clustering accuracy versus number of neighbors]

Comparison
• [Figure: clustering accuracy versus dimension for mKNN-HQcut with the optimum k, mKNN-HQcut with automatically determined k (this work), and k-means]

Results: synthetic data set 2
• Gene expression data
  – Thalamuthu et al., 2006
  – 600 data sets
  – ~600 genes, 50 conditions, 15 clusters
  – 0 or 1x outliers
• [Figure panels: without outliers; with outliers. Curves: mKNN-HQcut with optimal k; mKNN-HQcut with auto k]
• Comparison with other methods

Results on yeast stress response data
• 3000 genes, 173 samples
• Best k = 140, resulting in 75 clusters

Results on yeast stress response data
• Enrichment of common functions
  – Accumulative hypergeometric test
• [Figure: gene clusters versus GO function terms]
  – Protein biosynthesis ($p < 10^{-96}$)
  – Peroxisome ($p < 10^{-13}$)
  – Nuclear transport ($p < 10^{-50}$)
  – mt ribosome ($p < 10^{-63}$)
  – DNA repair ($p < 10^{-66}$)
  – RNA splicing ($p < 10^{-105}$)
  – Nitrogen compound metabolism ($p < 10^{-37}$)

Comparison with k-means
• Using automatically determined k = 140
• [Figure: overall function coherence, mKNN-HQcut versus k-means]

Application to Arabidopsis data
• ~22000 genes, 1138 samples
• 1150 singletons
• 800 (300) modules of size >= 10 (20)
• > 80% (90%) of modules have enriched functions
• Much more significant than all five existing studies on the same data set
• [Figure: top 40 most significant modules]

Cis-regulatory network of Arabidopsis
• [Figure: network linking motifs and modules]

Beyond gene clusters (1)
• Gene-specific studies
  – Collaborator is interested in gibberellins
  – A hormone important for the growth and development of plants
  – Commercially important
  – Biosynthesis and signaling well studied
  – Transcriptional regulation of biosynthesis and signaling not yet clear
  – 3 important gene
families, GA20ox, GA3ox and GA2ox, for biosynthesis
  – Receptor gene family: GID1A, GID1B, GID1C
  – Analyze the co-expression network around these genes
• [Figure: co-expression network around the GA20ox, GA3ox, GA2ox and GID1 genes]

Beyond gene clusters (2)
• Cancer classification
  – Samples: tumor / normal cells
  – Alizadeh et al., Nature 2000
• Apply Qcut to the network of cell samples (sample-by-sample network built from the gene-by-sample matrix)
• [Figure: network of cell samples; black: normal cells; blue: tumor cells. Groups: follicular lymphoma (FL), transformed cell lines, activated blood B, resting blood B, blood T, chronic lymphocytic leukemia (CLL), diffuse large B-cell lymphoma (DLBCL, split into several communities)]

Survival rate after chemotherapy
• DLBCL samples fall into three communities (DLBCL-1, DLBCL-2, DLBCL-3) with different outcomes:
  – Survival rate: 73%; median survival time: 71.3 months
  – Survival rate: 40%; median survival time: 22.3 months
  – Survival rate: 20%; median survival time: 12.5 months

Beyond gene clustering (3)
• Topology vs function
• [Figure: % of essential proteins versus number of connections (hub vs non-hub); Jeong et al., Nature 2001]
• [Figure: community participation vs. essentiality; % essential for participation < 0.2 and participation >= 0.2, by number of connections]
• Key: how to systematically search for such relationships?
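One way to start such a search is to compute, for every node, the two topological features compared above: its number of connections and its community participation. The lecture does not define the participation index, so the sketch below assumes the participation coefficient of Guimerà and Amaral (Nature 2005), $P_i = 1 - \sum_s (k_{is}/k_i)^2$, where $k_{is}$ is the number of links from node i into community s and $k_i$ is its degree. The toy network, the community labels and the function name `participation` are hypothetical; only the 0.2 cut-off comes from the slide.

```python
from collections import defaultdict


def participation(adj, community):
    """Return {node: P} with P = 1 - sum_s (k_is / k_i)^2.

    adj maps each node to the set of its neighbors; community maps each
    node to a community label. Isolated nodes get P = 0 by convention here.
    """
    result = {}
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k == 0:
            result[v] = 0.0
            continue
        links_per_comm = defaultdict(int)
        for u in nbrs:
            links_per_comm[community[u]] += 1
        result[v] = 1.0 - sum((c / k) ** 2 for c in links_per_comm.values())
    return result


# Hypothetical toy network: two triangles bridged by the edge A-D.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A", "E", "F"},
    "E": {"D", "F"},
    "F": {"D", "E"},
}
# Assumed community labels (e.g., as an algorithm such as Qcut might return).
community = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2, "F": 2}

P = participation(adj, community)
for v in sorted(adj):
    group = "participation >= 0.2" if P[v] >= 0.2 else "participation < 0.2"
    print(v, "degree =", len(adj[v]), "P =", round(P[v], 3), group)
```

With per-node degree and participation in hand, essentiality annotations (not included here) could be tabulated against both features, in the spirit of the hub / non-hub and participation < 0.2 / >= 0.2 comparison on the slide.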