CS 6293 Advanced Topics: Current Bioinformatics
Biological networks: Theory and applications
Lecture outline
• Basic terminology and concepts in
networks
• Some interesting results between network
properties and biological functions
• Network clustering / community discovery
• Applications of network clustering methods
Network
• A network refers to a graph
• A useful concept for analyzing the
interactions of different components in a
system
Biological networks
• An abstraction of the complex relationships among
molecules in the cell
• Many types:
– Protein-protein interaction networks
– Protein-DNA(RNA) interaction networks
– Genetic interaction networks
– Metabolic networks
– Signal transduction networks
– (real) neural networks
– Many others
• In some networks, edges have a precise meaning; in
others, the meaning of edges is obscure
Protein-protein interaction networks
• Yeast PPI network
• Nodes – proteins
• Edges – interactions
The color of a node
indicates the phenotypic
effect of removing the
corresponding protein
(red = lethal, green =
non-lethal, orange = slow
growth, yellow =
unknown).
Obtaining biological networks
• Direct experimental methods
– Protein-protein interaction networks
• Yeast-2-hybrid
• Tandem affinity purification
• Co-immunoprecipitation
– Protein-DNA interaction
• Chromatin Immunoprecipitation (followed by microarray or
sequencing, ChIP-chip, ChIP-seq)
– High levels of noise (false positives and false negatives)
• Computational prediction methods
– Often cannot differentiate direct and indirect
interactions
Why networks?
• Studying genes/proteins on the network level
allows us to:
– Assess the role of individual genes/proteins in the
overall pathway
– Evaluate redundancy of network components
– Identify candidate genes involved in genetic diseases
– Set up a framework for mathematical models
For complex systems, the actual output may not be
predictable by looking at only individual components:
The whole is greater than the sum of its parts
Graphs
• A graph G = (V, E)
– V = set of vertices
– E = set of edges = subset of V × V
– Thus |E| = O(|V|²)
Example (figure): vertices {1, 2, 3, 4},
edges {(1, 2), (2, 3), (1, 3), (4, 3)}
Graph Variations (1)
• Directed / undirected:
– In an undirected graph:
• Edge (u,v) ∈ E implies edge (v,u) ∈ E
• Road networks between cities
– In a directed graph:
• Edge (u,v): u→v does not imply v→u
• Street networks in downtown
– Degree of vertex v:
• The number of edges adjacent to v
• For directed graphs, there are in-degrees and out-degrees
Example (figure): in the directed graph, vertex 3 has in-degree 3
and out-degree 0; in the undirected version, vertex 3 has degree 3
Graph Variations (2)
• Weighted / unweighted:
– In a weighted graph, each edge or vertex has an
associated weight (numerical value)
• E.g., a road map: edges might be weighted w/ distance
Example (figure): an unweighted graph on vertices {1, 2, 3, 4}, and
the same graph with edge weights 0.3, 0.4, 1.2, and 1.9
Graph Variations (3)
• Connected / disconnected:
– A connected graph has a path from every
vertex to every other
– A directed graph is strongly connected if there
is a directed path between any two vertices
Example (figure): a directed graph on vertices {1, 2, 3, 4} that is
connected but not strongly connected
Graph Variations (4)
• Dense / sparse:
– Graphs are sparse when the number of edges is
linear in the number of vertices
• |E| ∈ O(|V|)
– Graphs are dense when the number of edges is
quadratic in the number of vertices
• |E| ∈ O(|V|²)
– Most graphs of interest are sparse
– If you know you are dealing with dense or sparse
graphs, different data structures may make sense
Representing Graphs
• Assume V = {1, 2, …, n}
• An adjacency matrix represents the graph as an
n × n matrix A:
– A[i, j]
= 1 if edge (i, j) ∈ E
= 0 if edge (i, j) ∉ E
• For a weighted graph
– A[i, j]
= wij if edge (i, j) ∈ E
= 0 if edge (i, j) ∉ E
• For undirected graph
– Matrix is symmetric: A[i, j] = A[j, i]
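A minimal sketch of this construction in Python (the helper name and the 1-based vertex numbering are illustrative, matching the slides' examples, and the optional weights mirror the weighted-graph case above):

```python
def adjacency_matrix(n, edges, directed=True, weights=None):
    """Build an n x n adjacency matrix for vertices numbered 1..n."""
    A = [[0] * n for _ in range(n)]
    for idx, (u, v) in enumerate(edges):
        w = weights[idx] if weights else 1
        A[u - 1][v - 1] = w
        if not directed:          # undirected: mirror the entry, A symmetric
            A[v - 1][u - 1] = w
    return A

# Directed example from the slides: edges {(1,2), (2,3), (1,3), (4,3)}
A = adjacency_matrix(4, [(1, 2), (2, 3), (1, 3), (4, 3)])
```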
Graphs: Adjacency Matrix
• Example (figure): for the directed graph with edges
{(1, 2), (1, 3), (2, 3), (4, 3)}, what is A?
Graphs: Adjacency Matrix
• Example: directed graph with edges {(1, 2), (1, 3), (2, 3), (4, 3)}
A  1 2 3 4
1  0 1 1 0
2  0 0 1 0
3  0 0 0 0
4  0 0 1 0
How much storage does the adjacency matrix require?
A: O(V²)
Graphs: Adjacency Matrix
• Example: undirected graph
A  1 2 3 4
1  0 1 1 0
2  1 0 1 0
3  1 1 0 1
4  0 0 1 0
Graphs: Adjacency Matrix
• Example: weighted graph with edge weights
w(1,2) = 5, w(1,3) = 6, w(2,3) = 9, w(3,4) = 4
A  1 2 3 4
1  0 5 6 0
2  5 0 9 0
3  6 9 0 4
4  0 0 4 0
Graphs: Adjacency Matrix
• Time to answer if there is an edge
between vertex u and v: Θ(1)
• Memory required: Θ(n²) regardless of |E|
– Usually too much storage for large graphs
– But can be very efficient for small graphs
• Most large interesting graphs are sparse
– E.g., road networks (due to limit on junctions)
– For this reason the adjacency list is often a
more appropriate representation
Graphs: Adjacency List
• Adjacency list: for each vertex v ∈ V, store a list
of vertices adjacent to v
• Example (figure):
– Adj[1] = {2, 3}
– Adj[2] = {3}
– Adj[3] = {}
– Adj[4] = {3}
• Variation: can also keep
a list of edges coming into each vertex
Graph representations
• Adjacency list (directed graph):
1 → 2, 3
2 → 3
3 → (none)
4 → 3
How much storage does the adjacency list require?
A: O(V+E)
Graph representations
• Undirected graph: adjacency list and matrix
1 → 2, 3
2 → 1, 3
3 → 1, 2, 4
4 → 3
A  1 2 3 4
1  0 1 1 0
2  1 0 1 0
3  1 1 0 1
4  0 0 1 0
Graph representations
• Weighted graph: each list stores (neighbor, weight) pairs
1 → (2,5), (3,6)
2 → (1,5), (3,9)
3 → (1,6), (2,9), (4,4)
4 → (3,4)
A  1 2 3 4
1  0 5 6 0
2  5 0 9 0
3  6 9 0 4
4  0 0 4 0
Graphs: Adjacency List
• How much storage is required?
• For directed graphs
– |adj[v]| = out-degree(v)
– Total # of items in adjacency lists is
Σ out-degree(v) = |E|
• For undirected graphs
– |adj[v]| = degree(v)
– # items in adjacency lists is
Σ degree(v) = 2|E|
• So: adjacency lists take Θ(V+E) storage
• Time needed to test if edge (u, v) ∈ E is O(n)
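The storage argument above can be checked directly with a small sketch (illustrative helper name; a dict of lists stands in for the array of lists):

```python
from collections import defaultdict

def adjacency_list(edges, directed=True):
    """For each vertex, store the list of vertices adjacent to it."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        if not directed:
            adj[v].append(u)   # undirected: record the edge from both ends
    return adj

edges = [(1, 2), (1, 3), (2, 3), (4, 3)]
# Total list length is |E| for a directed graph, 2|E| for an undirected one.
directed = adjacency_list(edges)
undirected = adjacency_list(edges, directed=False)
```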
Tradeoffs between the two representations
|V| = n, |E| = m
Operation         Adj Matrix   Adj List
test (u, v) ∈ E   Θ(1)         O(n)
Degree(u)         Θ(n)         O(n)
Memory            Θ(n²)        Θ(n+m)
Edge insertion    Θ(1)         Θ(1)
Edge deletion     Θ(1)         O(n)
Graph traversal   Θ(n²)        Θ(n+m)
Both representations are very useful and have different properties,
although adjacency lists are probably better for most problems
Structural properties of networks
• Degree distribution
• Average shortest path length
• Clustering coefficient
• Community structure
• Degree correlation
Motivation to study structural properties:
– Structure determines function
– Functional structural properties may be shared by
different types of real networks (bio or non-bio)
Degree distribution P(k)
• The probability that a selected node has
exactly (or approximately) k links.
– P(k) is obtained by counting the number of nodes
N(k) with k = 1, 2… links divided by the total
number of nodes N.
Erdos-Renyi model
• Each pair of nodes has
a probability p of forming
an edge
• Most nodes have about
the same # of
connections
• Degree distribution is
binomial or Poisson
Real networks: scale-free
• Heavy tail distribution
– Power-law distribution
• P(k) ∝ k^(−r)
(figure: number of genes vs. number of connections for a real network,
showing a heavy tail)
Comparing random and scale-free distributions
• In the random network, the five nodes with the most links
(in red) are connected to only 27% of all nodes (green). In
the scale-free network, the five most connected nodes (red)
are connected to 60% of all nodes (green) (source: Nature)
Robust yet fragile nature of networks
Shortest and mean path length
• Distance in networks is measured
with the path length
• As there are many alternative paths
between two nodes, the shortest
path between the selected nodes
has a special role.
• In directed networks,
– the A→B path is often different from the B→A path
– often there is no directed path between
two nodes.
• The average path length between all
pairs of nodes offers a measure of a
network’s overall navigability.
• Most pairs of vertices in a biological
network seem to be connected by a
short path – the small-world property
Clustering coefficient
• Your clustering coefficient: the probability
that two of your friends are also friends
– You have m friends
– Among your m friends, there are n pairs of
friends
• The maximum is m * (m-1) / 2
• C = 2 n / (m^2-m)
• Clustering coefficient of a network: the
average clustering coefficient of all
individuals
Clustering Coefficient
• The ith node has k_i neighbors linking with it
• E_i is the actual number of links between the k_i neighbors
• The maximal number of links between k_i neighbors is k_i(k_i−1)/2
• C_i = 2E_i / (k_i(k_i−1)) = 2/9 in the figure
• The probability that two of your friends are also friends
• Clustering coefficient of a network: average clustering coefficient of all nodes
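The C_i = 2E_i/(k_i(k_i−1)) definition can be sketched directly (illustrative function names; the adjacency structure is a dict of neighbor lists):

```python
def clustering_coefficient(adj, v):
    """C_v = 2*E_v / (k_v*(k_v-1)): fraction of neighbor pairs that are linked."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0
    # E_v: count neighbor pairs that are themselves connected
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if neighbors[j] in adj[neighbors[i]])
    return 2.0 * links / (k * (k - 1))

def network_clustering(adj):
    """Clustering coefficient of a network: average over all vertices."""
    return sum(clustering_coefficient(adj, v) for v in adj) / len(adj)

# Triangle plus a pendant vertex: vertex 1 has neighbors {2, 3, 4}, of which
# only the pair (2, 3) is connected, so C_1 = 2*1/(3*2) = 1/3.
adj = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2], 4: [1]}
```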
Degree correlation
• Do rich people tend to hang together with
rich people (rich-club)?
• Or do they tend to interact with less
wealthy people?
• Do high degree nodes tend to connect to
low degree nodes or high degree ones?
Some interesting findings from
biological networks
• Jeong, Lethality and centrality in protein
networks. Nature 411, 41-42 (3 May 2001)
• Roger Guimerà and Luís A. Nunes Amaral,
Functional cartography of complex metabolic
networks. Nature 433, 895-900 (24 February
2005)
• Han et al., Evidence for dynamically organized
modularity in the yeast protein–protein
interaction network. Nature 430, 88-93 (1 July
2004)
Connectivity vs. essentiality
(figure: % of essential proteins vs. number of connections;
Jeong et al., Nature 2001)
Community role vs essentiality
• Effect of a perturbation cannot depend on
the node’s degree only!
• Many hub genes are not essential
• Some non-hub genes are essential
• Maybe a gene’s role in its community is
also important
– Local leader? Global leader? Ambassador?
– Guimerà and Amaral, Nature 433, 2005
Community structure
• Role 1, 2, 3: non-hubs with increasing
participation indices
• Role 5, 6: hubs with increasing participation
indices
Dynamically organized modularity in the
yeast PPI network
• Protein interaction networks are static
• But two proteins cannot interact if one is not expressed
• We should also look at gene expression levels
• Han et al., Nature 430, 2004
Obtaining Data
Distinguish party hubs from date hubs
Red curve – hubs
Cyan curve – nonhubs
Black curve – randomized
• Partners of date hubs are significantly more diverse in spatial distribution
than partners of party hubs
Effect of removal of nodes on average
geodesic distance
Original Network
On removal of date hubs
On removal of party hubs
Green – nonhub nodes
Brown – hubs
Red – date hubs
Blue – party hubs
The ‘breakdown point’ is the threshold after which the main
component of the network starts disintegrating.
Dynamically organized modularity
Red circles – Date hubs
Blue squares - Modules
Han-Yu Chuang, Eunjung Lee, Yu-Tseung Liu, Doheon
Lee, Trey Ideker, Network-based classification of breast
cancer metastasis, Mol Syst Biol. 2007; 3: 140.
Challenge: Predict Metastasis
• If metastasis is likely => aggressive
adjuvant therapy
– How to decide the likelihood?
• Traditional predictive factors are not good
Recently: Gene Marker Sets
• Examine genome-wide expression profiles
– Score individual genes for how well they
discriminate between different classes of disease
• Establish gene expression signature
– Problem: # genes >> # patients
Pathway Expression vs. PPI
Subnetwork as Marker
• Score known pathways for
coherence of gene expression
changes?
– Majority of human genes not yet
assigned to a definitive pathway
• Large Protein-Protein Interaction
networks recently became
available
– Extract subnetworks from PPI
networks as markers
Subnetwork Marker Identification:
Data Used
• 2 separate cohorts of breast cancer
patients
– van 't Veer et al. and Wang et al.
– Roughly half had developed metastasis
• Used Protein-Protein Interaction network
obtained by assembling a pooled dataset
– 57,235 interactions among 11,203 proteins
Goal: Find Significantly
Discriminative Subnetworks
• Use a scoring system to search for
subnetworks highly discriminative of
metastasis
Discriminative Score Function
(figure) Step 1: Assign activity scores to a subnetwork of genes.
Step 2: Assign discriminative score S to the subnetwork.
• Score(subnetwork) = Mutual Information
between a subnetwork’s activity score
vector and phenotype vector over all
patients
– S(k) = MI (a,c)
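A minimal sketch of this scoring scheme (my own illustration: the three-bin discretization of the activity score and the function names are assumptions; the paper specifies its own binning):

```python
from collections import Counter
from math import log2

def mutual_information(a, c):
    """MI(a; c) in bits for two equal-length discrete label sequences."""
    n = len(a)
    pa, pc, pac = Counter(a), Counter(c), Counter(zip(a, c))
    return sum((n_xy / n) * log2((n_xy / n) / ((pa[x] / n) * (pc[y] / n)))
               for (x, y), n_xy in pac.items())

def subnetwork_score(expr, genes, phenotype, bins=3):
    """Activity = mean expression of the subnetwork's genes per patient,
    discretized into equal-width bins; score S = MI(activity, phenotype)."""
    activity = [sum(expr[g][p] for g in genes) / len(genes)
                for p in range(len(phenotype))]
    lo, hi = min(activity), max(activity)
    width = (hi - lo) / bins or 1.0
    binned = [min(int((v - lo) / width), bins - 1) for v in activity]
    return mutual_information(binned, phenotype)

# Hypothetical toy data: two genes whose average activity cleanly separates
# metastatic (1) from non-metastatic (0) patients -> MI = 1 bit.
expr = {'g1': [0.1, 0.2, 0.9, 1.0], 'g2': [0.0, 0.1, 0.8, 0.9]}
phenotype = [0, 0, 1, 1]
score = subnetwork_score(expr, ['g1', 'g2'], phenotype)
```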
Find Candidate Subnetworks
using S and Greedy Search
• Use a single PPI node as seed
– At each iteration, add the neighbor
resulting in highest score improvement
– Stop when no addition increases the score
by rate r = 0.05, or distance from seed > 2
– Report candidate subnetwork and
repeat with next node as seed
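The expansion loop above can be sketched as follows (a generic version: the score function is passed in as a callable, and the toy score in the example is purely hypothetical):

```python
def greedy_subnetwork(adj, seed, score, rate=0.05, max_dist=2):
    """Grow a candidate subnetwork from `seed`: repeatedly add the neighbor
    giving the largest score improvement; stop when the relative gain drops
    below `rate` or candidates lie more than `max_dist` hops from the seed."""
    # Precompute hop distances from the seed with a BFS.
    dist, frontier = {seed: 0}, [seed]
    while frontier:
        nxt = []
        for v in frontier:
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    nxt.append(u)
        frontier = nxt
    sub = {seed}
    current = score(sub)
    while True:
        candidates = {u for v in sub for u in adj[v]
                      if u not in sub and dist.get(u, max_dist + 1) <= max_dist}
        best, best_score = None, current
        for u in candidates:
            s = score(sub | {u})
            if s > best_score:
                best, best_score = u, s
        if best is None or best_score - current < rate * abs(current):
            break
        sub.add(best)
        current = best_score
    return sub

# Toy example: a path 0-1-2-3 with a hypothetical score that rewards the
# target set {0, 1, 2} and slightly penalizes anything else.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
target = {0, 1, 2}
toy_score = lambda s: len(s & target) - 0.1 * len(s - target)
```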
Identify Significant Subnets
from 3 Null Distributions
• p1: 100 expression perm. trials, p < 0.05
– Expression vectors of individual genes
randomly permuted on the network
• p2: 100 random subnetworks seeded at
protein i, p < 0.05
• p3: 20,000 phenotype perm. trials, p <
0.00005
Results: Correspondence
to hallmarks of cancer
• For two datasets of 295 and
286 patients, 149 and 243
(resp.) discriminative
subnets found
• 47% and 65% of subnets
enriched for common
biological process
• 66 and 153 subnets were
enriched for processes
involved in major events of
cancer progression
Results: Reproducibility
• Subnetwork markers significantly more
reproducible between datasets than individual
gene markers
Results: Reproducibility
(figure: marker overlap between Dataset 1 and Dataset 2)
Results: Reproducibility
• Shared network motifs with differences in
differential expression
• Left-hand side is from Dataset 1 and right-hand
side is from Dataset 2
Results: Subnetwork Markers as
Classifiers
• Averaged expression values for each
subnetwork were used as features for a
classifier based on logistic regression
• For comparison, the top individual gene
markers were instead used as features
• Markers from one dataset were used as
predictors of metastasis on the other dataset
Results: Subnetwork Markers as
Classifiers
• Dataset 1 markers tested on Dataset
2, and vice versa
Results: Informative of Non-discriminative Disease Genes
• Network analyses can identify proteins not
differentially expressed, but required to
connect higher-scoring proteins in a
significant subnetwork
• 85.9% and 96.7% of the significant
subnetworks contained at least one protein
that was not significantly differentially
expressed in metastasis
Results: Informative of Non-discriminative Disease Genes
• Several established prognostic markers were
not present in individual gene expression
markers, but played a central, interconnecting
role in discriminative subnetworks
– MYC, ERBB2
Community discovery: motivations
• Biological networks are modular
– Metabolic pathways
– Protein complexes
– Transcriptional regulatory modules
• Provide a high-level overview of the
networks
• Predict gene functions based on
communities
Community discovery problem
• Divide a network into relatively densely
connected sub-networks
(figure: vertex reordering reveals the block structure)
Challenges
• How many communities?
• Is there any community at all?
Community structures
• Also known as modules
• Relatively densely connected sub-network
• Quite common in real networks
– Social networks
– Internet
– Biological networks
– Transportation
– Power grid
History
• Social science: clustering
– Based on affinities / similarities
– Need to give # of clusters
– Can always find clusters
• Computer science: graph partitioning
– Minimizing cut / cut ratio
– Need to give # of partitions
– Can always produce partitions
• Preferred approach: natural division
– Automatically determine # of communities
– Do not partition if no community
Modularity function (Q)
• Measures the strength of community structures
– Newman, Phys Rev E, 2003

Q = Σ_{i=1..k} [ e_ii/M − (a_i/M)² ]

– k: number of communities
– e_ii/M: observed fraction of edges falling in community i
– (a_i/M)²: expected fraction of edges falling in community i
– a_i = Σ_j e_ij (e.g., a_1 = e_11 + e_12); M = Σ_i a_i
• −1 < Q < 1; Q = 0 if k = 1
(figures: example partitions with Q = 0.45 vs. Q = 0, and
Q = 0.40, 0.56, 0.54)
• Goal: find the partition that has the highest Q value
• But: optimizing Q is NP-hard (Brandes et al., 2006)
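A minimal sketch of the modularity computation, written in the standard degree-based form Q = Σ_i (e_ii/m − (a_i/2m)²), which matches the definition above up to the normalization convention for M (assumptions: undirected, unweighted graph stored as a dict of neighbor lists):

```python
def modularity(adj, communities):
    """Q = sum over communities of (within-edges/m - (degree-sum/2m)^2),
    where m is the total number of edges.  adj maps vertex -> neighbor list
    (undirected: each edge listed from both ends)."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    label = {v: i for i, comm in enumerate(communities) for v in comm}
    Q = 0.0
    for comm in communities:
        # Each within-community edge is seen from both endpoints -> divide by 2.
        within = sum(1 for v in comm for u in adj[v] if label[u] == label[v]) / 2
        degree = sum(len(adj[v]) for v in comm)
        Q += within / m - (degree / (2 * m)) ** 2
    return Q

# Two triangles joined by a single bridge edge: splitting at the bridge
# gives a strongly modular partition (Q = 5/14), while the trivial
# one-community partition gives Q = 0, as stated on the slide.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
Q = modularity(adj, [{0, 1, 2}, {3, 4, 5}])
```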
Heuristic algorithms
• k-way spectral partitioning approximately
optimizes Q if k is known
– White & Smyth, SDM 2005
(figure: spectral embedding followed by k-means recovers three communities)
• k is unknown: test all possible k’s
k-way spectral partitioning
k=2
Q = 0.40
k=3
Q = 0.56
k=4
Q = 0.54
• Good accuracy
• ~O(n³) time complexity; n: # of vertices
Recursive bi-partitioning
(figure: recursive bi-partitioning reaches Q = 0.40, then Q = 0.54,
missing the better 3-way split with Q = 0.56)
• ~O(m logn) time complexity; m: # of edges
• Accuracy worse than k-way partitioning
Can we do better?
• Objectives
– Efficiency of the recursive algorithm
– Accuracy of the k-way algorithm (or even better)
• Ideas
– Flexible l-way recursive partition (l = 2-5)
• As efficient as recursive bi-partition
• Accuracy similar to K-way algorithm
• Ruan and Zhang, ICDM 2007
– Take the results of recursive algorithm as the starting
point, do local improvement
• Ruan and Zhang, Physical Review E 2008
Algorithm Qcut
1. Recursive partitioning until local maximum of Q
2. Refine solution by greedy search
– Consider two types of operations:
• Move a vertex to a different community
• Merge two communities
– Take the one with the largest improvement of Q
– Repeat until no improvement of Q can be made
– Go back to step 1 if necessary
• Key: quickly find the operation that gives the largest
improvement of Q
Identifying candidate moves
• If vertex v moves from community i to j:

∆Q = (x_j − x_i)/M + (a_i − a_j − x)·x / M²

– x_i: degree of v within community i (x_j: within community j)
– x: degree of v
– a_i: total degree of the vertices in community i
• Compute all potential ∆Q values from the initial state
• Update is almost constant for scale-free
networks
• Additional heuristics to improve efficiency
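For intuition, a naive version of the move search can simply recompute Q for every candidate move (my own sketch with illustrative names; Qcut itself uses the incremental ∆Q formula above rather than full recomputation):

```python
def partition_q(adj, label):
    """Modularity Q for a partition given as a vertex -> community map."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    Q = 0.0
    for c in set(label.values()):
        members = [v for v in adj if label[v] == c]
        within = sum(1 for v in members for u in adj[v] if label[u] == c) / 2
        degree = sum(len(adj[v]) for v in members)
        Q += within / m - (degree / (2 * m)) ** 2
    return Q

def best_single_move(adj, label):
    """Try moving each vertex to each other community; return the move with
    the largest Q gain.  O(n*k) full recomputations -- the point of the
    incremental update is to avoid exactly this cost."""
    base = partition_q(adj, label)
    best = (0.0, None, None)
    for v in adj:
        for c in set(label.values()) - {label[v]}:
            trial = dict(label)
            trial[v] = c
            gain = partition_q(adj, trial) - base
            if gain > best[0]:
                best = (gain, v, c)
    return best

# Two triangles joined by one bridge, with vertex 3 initially mislabeled:
# the best single move puts vertex 3 back with its triangle.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
label = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1}
gain, v, c = best_single_move(adj, label)
```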
Results on synthetic networks
• State of the art: Newman, PNAS 2006
• Relative Q = Qfound − Qtrue
(figures: accuracy and relative Q vs. N_out; an example's vertex-reordered
adjacency matrices show the real structure, the result of Qcut
(accuracy: 99%), and the result of Newman (accuracy: 77%))
Results on real-world networks
SA: simulated annealing, Guimera & Amaral, Nature 2005

Modularity:
Network     #Vertices  #Edges   Newman  SA     Qcut
Social      67         142      0.573   0.608  0.587
Neuron      297        2359     0.396   0.408  0.398
Ecoli Reg   418        519      0.766   0.752  0.776
Circuit     512        819      0.804   0.670  0.815
Yeast Reg   688        1079     0.759   0.740  0.766
Ecoli PPI   1440       5871     0.367   0.387  0.387
Internet    3015       5156     0.611   0.624  0.632
Physicists  27519      116181   --      --     0.744
Running time (seconds):
Network     #Vertices  #Edges   Newman  SA     Qcut
Social      67         142      0.0     5.4    2.0
Neuron      297        2359     0.4     139    1.9
Ecoli Reg   418        519      0.7     147    12.7
Circuit     512        819      1.8     143    6.1
Yeast Reg   688        1079     3.0     1350   13.4
Ecoli PPI   1440       5871     33.2    5868   41.5
Internet    3015       5156     253.7   11040  43.0
Physicists  27519      116181   --      --     2852
Graphical user interface for biologists
A real-world example
• A classic social network: the karate club
• Node – club member; edge – friendship
• The club was split due to a dispute
• Can we predict the split given the network?
Network of football teams
• Vertices: football
teams in NCAA
Division I-A
• Edges: games
played in year 2000
• 110 teams
• 11 conferences
(excluding
independents)
• Most games are
within conferences
Big 12
Big East
Conference vs. Community
Conferences
Communities discovered
by Qcut / Newman
Mountain West
Pacific Ten
Whose fault is it?
Communities discovered
by Qcut / Newman
Q = 0.6239
Force the two
conferences to be
separated:
Q = 0.6237
Resolution limit of the Q
function
(figure: two small communities C1 and C2 attached to a large network,
shown merged (Q1) and separated (Q2))
• C1 and C2 are separable only if Q2 − Q1 > 0
• Q2 − Q1 ∝ a1·a2/2M − e12
– a1·a2/2M: expected # of edges between C1 and C2
– e12: actual # of edges between C1 and C2
• If C1 and C2 are small relative to the network
– Expected # of edges < 1
– C1 and C2 are non-separable even if connected by a single edge
– But that edge may be due to noise in the data
Resolution limit
• Optimizing Q
– may miss small communities
– is sensitive to false-positive edges
– cannot reveal hierarchical structures
• A community containing some sub-communities
• Real-world networks
– contain both large and small communities
– may have false positive edges
• Biological data are extremely noisy
– have hierarchies
A solution: HQcut
• Ruan & Zhang, Physical Review E 2008
• Apply Qcut to get communities with largest
Q
• Recursively search for sub-communities
within each community
• When to stop?
– Q value of sub-network is small, or
– Q is not statistically significant
• Estimated by Monte-Carlo method
Example: a network with clear structure has Q = 0.49 while randomized
versions give randQ = 0.15 ± 0.016, so Z-score = (0.49 − 0.15)/0.016 = 21.
A weakly modular network has Q = 0.18, giving
Z-score = (0.18 − 0.15)/0.016 = 1.9. A random-looking subnetwork can even
score Q = 0.49 while its randomizations give randQ = 0.52 ± 0.031, for a
Z-score of −1.3.
(figure: a large network whose sub-communities have Z-scores 21, 1.9, and −1.3)
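The Monte Carlo test can be sketched as follows (a simplified version: the randomization here just shuffles edge endpoints, which preserves degrees only approximately, and `find_q` stands in for whatever routine returns the best Q, e.g. Qcut):

```python
import random
from statistics import mean, stdev

def modularity_zscore(edges, find_q, n_trials=100, seed=0):
    """Z-score of the observed best Q against Q values of randomized networks."""
    rng = random.Random(seed)
    q_real = find_q(edges)
    q_rand = []
    for _ in range(n_trials):
        # Shuffle one endpoint column to destroy community structure.
        tails = [v for _, v in edges]
        rng.shuffle(tails)
        q_rand.append(find_q([(u, w) for (u, _), w in zip(edges, tails)]))
    return (q_real - mean(q_rand)) / stdev(q_rand)

# Toy check: two 5-cliques.  "find_q" here is just the fraction of edges
# staying inside a fixed two-group split -- a crude stand-in for Qcut.
edges = [(i, j) for g in (range(0, 5), range(5, 10))
         for i in g for j in g if i < j]
same_side = lambda es: sum((u < 5) == (v < 5) for u, v in es) / len(es)
z = modularity_zscore(edges, same_side)   # clear structure -> large Z-score
```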
Test on synthetic networks
• Network: 1000 vertices
• Community sizes vary from 15 to 100
Accuracy
Example communities
Discovered by Qcut
Discovered by HQcut
Results for the NCAA teams
Communities by Qcut/Newman
Communities by HQcut
Mountain West
Pacific Ten
Applications to a PPI network
• Protein-protein interaction (PPI) network
– Vertices: proteins
– Edges: interactions detected by experiments
• Motivation:
– Community = protein complex?
• Protein complex
– Group of proteins associated via interactions
– Elementary functional unit in the cell
– Prediction from PPI network is important
Experiments
• Data set
– A yeast protein-protein interaction network
• Krogan et al., Nature, 2006
– 2708 proteins, 7123 interactions
• Algorithms:
– Qcut, HQcut, Newman
• Evaluation
– ~300 Known protein complexes in MIPS
– How well does a community match to a
known protein complex?
Results
                                 Newman   Qcut     HQcut
# of communities                 56       93       316
Max community size               312      264      60
# of matched communities         53       52       216
Communities with match score = 1 5 (9%)   7 (13%)  43 (20%)
Average matching score           0.56     0.55     0.70
# of novel predictions           3        41       100
Communities found by HQcut
Small ribosomal
subunit (90%)
RNA poly II
mediator (83%)
Proteasome
core (90%)
gamma-tubulin
(77%)
Exosome (94%)
respiratory chain
complex IV (82%)
Example hierarchical community
Microarray data
(figure: gene × sample matrix; red: high activity, green: low activity)
• Data organized into a matrix
– Rows are genes
– Columns are samples representing
different time points, conditions,
tissues, etc.
• Analysis techniques
– Differential expression analysis
– Classification and clustering
– Regulatory network construction
– Enrichment analysis
• Characteristics of microarray data
– High dimensionality and noise
– Underlying topology unknown, often
irregular shape
Microarray data clustering
(figure: gene × sample matrix; red: high activity, green: low activity)
• Analyze genes in each cluster
– Common functions?
– Common regulation?
– Predict functions for unknown genes?
• Many clustering algorithms available
– K-means
– Hierarchical
– Self-organizing maps
• But: parameters are hard to tune, and they
do not consider network topology
Network-based data analysis
(figure: gene × sample matrix → co-expression network)
• Genes i and j are connected if their expression
patterns are “sufficiently similar”
– Similarity > threshold
– K nearest neighbors
– Long list of references
• Recently became popular
• Many interesting applications beyond clustering
• Focus here is clustering
Motivation
• Can we use the idea of community
finding for clustering microarray data?
• Advantages:
– Parameter free
– Network topology considered
– Constructed network may have other uses
Network-based microarray data
analysis
Sample
Construct
Co-expression
network
=
Gene
i
j
• How to get the networks?
– Threshold-based
– Nearest neighbors
How to determine the right cutoff?
• Can we use a complete weight matrix?
– Complete graph, with weighted edges
– In general, no, since Q is ill-defined on weighted networks
Network-based microarray data
analysis
• There is an implicit network structure
(figure: gene × condition matrix → clustering)
• Motivation: the true network should be naturally
modular
– Can be measured by modularity (Q)
– If constructed right, it should have the highest Q
Method overview
(figure: microarray data → similarity matrix → a series of networks from
most dense (Net_1) to most sparse (Net_m), each clustered with Qcut)
Method overview (cont’d)
(figure: modularity vs. network density for the true network and a random
network; their difference peaks at the best density)
• Therefore, use ∆Q to determine the best network
parameter and obtain the best community structure
• We actually run HQcut, a variant of Qcut, in order to avoid the
resolution limit (Ruan & Zhang, Phys Rev E 2008)
Network construction methods
• Value-based method
– Remove edges with similarities < ε
– Fixed ε for all vertices
– May have problems detecting weakly correlated modules
• Asymmetric k-nearest neighbors (aKNN)
– Connect each vertex to its k most similar vertices
– Fixed k for all vertices (k < 10 is good enough)
– Minimum degree = k; maximum = ?
– Sensitive to outliers
• Mutual k-nearest neighbors (mKNN)
– Association confirmed by both ends
– Maximum degree = k, minimum = 0 (k larger than in aKNN)
– Outliers can be detected
– Ruan, ICDM 2009
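A minimal sketch of the mKNN construction (illustrative function name; the toy similarity matrix is my own example showing how an outlier ends up isolated):

```python
def mutual_knn(similarity, k):
    """Build a mutual k-nearest-neighbor graph: i and j are connected only
    if each is among the other's k most similar vertices."""
    n = len(similarity)
    knn = []
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: similarity[i][j], reverse=True)
        knn.append(set(ranked[:k]))
    # Keep an edge only when the association is confirmed by both ends.
    return {(i, j) for i in range(n) for j in knn[i]
            if i < j and i in knn[j]}

# Toy similarity matrix: vertices 0-2 are mutually similar; vertex 3 is an
# outlier that picks neighbors nobody picks back, so it gets degree 0.
S = [[1.0, 0.9, 0.8, 0.1],
     [0.9, 1.0, 0.85, 0.1],
     [0.8, 0.85, 1.0, 0.1],
     [0.1, 0.1, 0.1, 1.0]]
network = mutual_knn(S, 2)
```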
Results: synthetic data set 1
• High-dimensional data generated by synDeca
– 20 clusters of high-dimensional points, plus some
scattered points
– Clusters are of various shapes: ellipse, rectangle,
random
(figures: QReal, QRandom, and ∆Q = Qreal − Qrandom vs. number of
neighbors, with clustering accuracy peaking where ∆Q peaks)
Comparison
(figure: clustering accuracy vs. dimension for this work (mKNN-HQcut
with the optimal k and with automatically determined k), k-means,
optimal kNN, and HQcut)
Results: synthetic data set 2
• Gene expression data
– Thalamuthu et al., 2006
– 600 data sets
– ~600 genes, 50 conditions, 15 clusters
– 0 or 1x outliers
(figures: results with and without outliers, for mKNN-HQcut with the
optimal k and with automatically determined k)
Comparison with other methods
Results on yeast stress response
data
• 3000 genes, 173 samples
• Best k = 140, resulting in 75 clusters
Results on yeast stress response
data
• Enrichment of common functions
– Cumulative hypergeometric test
(figure: clustered gene matrix with enriched GO function terms:
protein biosynthesis (p < 10^-96), peroxisome (p < 10^-13),
nuclear transport (p < 10^-50), mitochondrial ribosome (p < 10^-63),
DNA repair (p < 10^-66), RNA splicing (p < 10^-105),
nitrogen compound metabolism (p < 10^-37))
Comparison with k-means, using the automatically determined k = 140:
(figure: overall function coherence, mKNN-HQcut vs. k-means)
Overall function coherence
mkNN-HQcut
K-means
Application to Arabidopsis data
• ~22000 genes, 1138 samples
• 1150 singletons
• 800 (300) modules of size ≥ 10 (20)
• > 80% (90%) of modules have enriched functions
• Much more significant than all five existing
studies on the same data set
Top 40 most significant modules
Cis-regulatory network of Arabidopsis
Motif
Module
Beyond gene clusters (1)
• Gene-specific studies
– A collaborator is interested in gibberellins
– A hormone important for the growth and development
of plants
– Commercially important
– Biosynthesis and signaling are well studied
– Transcriptional regulation of biosynthesis and
signaling is not yet clear
– 3 important gene families for biosynthesis:
GA20ox, GA3ox, and GA2ox
– Receptor gene family: GID1A, B, C
– Analyze the co-expression network around these genes
(figure: co-expression network around the GA20ox, GA3ox, GA2ox,
and GID1 family genes)
Beyond gene clusters (2)
• Cancer classification: samples are tumor/normal cells
(figure: gene × sample matrix → network of cell samples → Qcut)
• Alizadeh et al., Nature, 2000
• In the network of cell samples (black: normal cells, blue: tumor cells),
communities correspond to follicular lymphoma (FL), transformed cell
lines, chronic lymphocytic leukemia (CLL), resting and activated blood
B cells, blood T cells, and diffuse large B-cell lymphoma (DLBCL)
• Qcut further splits DLBCL into three sub-communities (DLBCL-1, -2, -3)
with distinct outcomes after chemotherapy: survival rates of 73%, 40%,
and 20%, with median survival times of 71.3, 22.3, and 12.5 months
Beyond gene clustering (3)
• Topology vs. function
(figure: % of essential proteins vs. number of connections;
Jeong et al., Nature 2001)
(figures: % essential for hubs vs. non-hubs, and for community
participation < 0.2 vs. ≥ 0.2)
• Key: how to systematically search for such relationships?