Ruan Research Overview - University of Texas at San Antonio

Network-based analysis of functional genomics data
Jianhua Ruan, PhD
Computer Science Department
University of Texas at San Antonio
http://www.cs.utsa.edu/~jruan
Final project
Final Project Report due Sat, Dec 15
 Presentations: Mon, Dec 17, 8-10:30 pm
 10 teams to present
 Each team will have up to 13 minutes (10 min presentation, 3 min questions)
 Since time is limited, you don’t need to cover all the details in your presentation.
   Focus on the most important concepts
   More details in your project report
UTSA
Human: ~22,000 genes
Dog: ~20,000 genes
Rice: ~35,000 genes
C. elegans: ~20,000 genes
Mouse: ~22,000 genes
It is not just the genes, but the networks!
Why networks?

For complex systems, the actual output may not be predictable by looking at only individual components:
  The whole is greater than the sum of its parts
Studying genes/proteins on the network level allows us to:
  Assess the role of individual genes/proteins in the overall pathway
  Evaluate redundancy of network components
  Identify candidate genes involved in genetic diseases
  Set up the framework for mathematical models
Graph model of biological networks


An abstraction of the complex relationships among molecules in the cell
Vertex: molecule
  Gene, protein, metabolite, DNA, RNA
Edges: relationships
  Physical interaction
  Functional association
Share many common statistical properties with real-world networks (Jeong et al., 2001)
  Small-world
  Scale-free
  Hierarchical
  Modular (community structure)
Research Overview
Genetic network reverse-engineering
Data Mining, tree models, DNA motif finding
Network analysis algorithms
Community discovery
Network-based disease studies
Data integration, classification, graph algorithms
Agenda
 Community discovery in biological networks
 Network-based analysis of microarray data
 Network-based biomarker discovery for metastatic breast cancer
 Conclusions

Network communities

Communities
  Are relatively densely connected sub-networks (modules)
  Appear in many types of networks
  Independently studied in many fields: social science, computer science, physics, biology, etc.
Significance
  Biological systems are modular
    Metabolic pathways
    Protein complexes
    Transcriptional regulatory modules
  Biological systems are large and complex
    Communities provide a high-level overview of the networks
  Guilt-by-association
    Predict gene functions based on community memberships
Community discovery problem


Divide a network into relatively densely connected subnetworks
Similar to clustering, but # of clusters is determined automatically
[Figure: vertex reordering]
Modularity function (Q)

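Measures the strength of community structure (Newman, Phys Rev E, 2003)
$$Q = \sum_i \left( e_{ii} - \hat{e}_{ii} \right)$$
e_ii: fraction of edges inside community i; \hat{e}_{ii}: its expected value in a random network
[Figure: adjacency matrix reordered into five diagonal blocks, e11 to e55]
[Figure: example partitions with Q = 0.45, Q = 0, Q = 0.40, Q = 0.56, Q = 0.54]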
Modularity automatically determines # of communities!
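A minimal sketch of how Q can be computed for a given partition. It assumes the common Newman form Q = sum_i (e_ii - a_i^2), where a_i is the fraction of edge ends attached to community i; the function name and the toy graph are illustrative, not taken from the slides.

```python
from collections import defaultdict


def modularity(edges, community):
    """Q = sum_i (e_ii - a_i^2) for an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping vertex -> community label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    within = defaultdict(int)  # edges with both ends inside community i
    degree = defaultdict(int)  # edge ends attached to community i
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            within[cu] += 1
    # e_ii = within / m, a_i = degree / (2m)
    return sum(within[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph (assumption): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {v: "all" for v in range(6)}
print(modularity(edges, two_groups))  # 5/14, about 0.357
print(modularity(edges, one_group))   # 0.0
```

Putting every vertex in one community gives Q = 0, so a partition is only rewarded when it has more internal edges than expected by chance.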
Methods for community discovery

Previous methods
  Fast but inaccurate (CNM, Phys Rev E, 2004)
  Accurate but slow (Guimera & Amaral, Nature 2005)
  Relatively accurate, relatively fast (Ruan & Zhang, AAAI 2006, ICDM 2007)
  Relatively accurate, fast, memory intensive (Newman, PNAS 2006)
Our new algorithm: Qcut/HQcut (Ruan & Zhang, Phys Rev E 2008)
  Accurate, fast and memory friendly
  HQcut solves the resolution limit of Q
Algorithm Qcut
1. Recursive multi-way partitioning until Q is max
[Figure: eig followed by kmeans, shown on a reordered adjacency matrix]
2. Improve Q by efficient heuristic search
[Figure: accuracy vs. inter-community edge probability, our method and Newman’s]
Resolution limit and HQcut

Q is known to have a resolution limit problem
  Cannot detect small communities
  Q slightly decreases if forced to split
HQcut solves this problem
  Apply Qcut to get communities with largest Q
  Recursively search for sub-communities within each community without dramatic change to Q
  Statistical test for termination criteria
Ruan & Zhang, Physical Review E 2008
Graphical user interface for Qcut/HQcut
Application: protein complex prediction

Input: a yeast PPI network
  Data from Krogan et al., Nature 2006;440:637-43
  2708 vertices (proteins), 7123 interactions
289 communities
  Sizes range from 2 to 49
Evaluation: compare communities with known complexes manually curated in MIPS database
Communities in a yeast PPI network
Small ribosomal subunit (90%)
RNA poly II mediator (83%)
Proteasome core (90%)
gamma-tubulin (77%)
Exosome (94%)
respiratory chain complex IV (82%)
Communities vs. complexes
[Figure: predicted complex vs. known complex]
Communities and complexes have good one-to-one correspondence
Overall accuracy: 0.81 (Newman: 0.58)
Agreement between a predicted complex (P) and a known complex (K) = |P∩K| / sqrt(|P| × |K|).
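The agreement score above can be sketched in a few lines. The function name and the protein identifiers in the example are hypothetical; how per-complex scores are combined into the overall accuracy is not stated on the slide and is not modeled here.

```python
import math


def agreement(predicted, known):
    """|P ∩ K| / sqrt(|P| * |K|) for two collections of protein identifiers."""
    p, k = set(predicted), set(known)
    if not p or not k:
        return 0.0
    return len(p & k) / math.sqrt(len(p) * len(k))


# Hypothetical identifiers: 2 shared proteins, both sets of size 4.
print(agreement({"p1", "p2", "p3", "p4"}, {"p3", "p4", "p5", "p6"}))  # 0.5
```

The score is 1 only when the predicted and known complexes are identical, and 0 when they share no protein.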
Work-in-progress: Random walk-based improvement

Motivation:
  PPI networks often contain both false positive and false negative edges
  Hub genes prevent good partitioning
Three goals:
  Eliminate false positive edges
  Predict missing links
  Reduce the impact of hub genes
Intuition:
  Two proteins with high topological similarity, whether connected or not, may belong to the same complex
  Two proteins with a direct link but very different topological properties may belong to different complexes
Method overview
[Figure: original network (adjacency matrix) → initial prob vectors → random walk with resistance → equilibrium prob vectors → distance calculation → similarity matrix → threshold (guided by topology) → new network]
Preliminary results on yeast PPI

Predicted PPIs have much higher functional relevance than removed PPIs, using several sources of evidence
  Gene Ontology
  Gene expression
  etc.
New network significantly improved accuracy of protein complex predictions
  Using HQcut: 0.50 to 0.55
  Using MCL: 0.48 to 0.59
Agenda

 Community structure in biological networks
   Prediction of protein complexes
 Network-based microarray data analysis
 Network-based biomarker discovery for metastatic breast cancer
 Conclusions

Microarray data analysis


Gene network structure is unknown
Microarray measures gene expression (activity) level
[Figure: expression matrix (genes × conditions) and its clustering]
Clustering is the most common analysis tool
Many clustering algorithms available
  K-means
  Hierarchical
  Self organizing maps
  Parameter (e.g., k) hard to guess
  Does not consider network structure
Common functions?
Common regulation?
Predict functions for unknown genes?
Network-based microarray data analysis
[Figure: expression matrix (gene × sample) used to construct a co-expression network linking genes i and j]
Genes i and j connected if their expression patterns are “sufficiently similar”
  Pearson correlation coefficient > arbitrary threshold
  K nearest neighbors (KNN)
Key: how to get the “best” network?
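The two construction rules named on this slide can be sketched as follows. This is a plain illustration under assumptions: the function names, the toy expression profiles, and the choice to rank neighbors by Pearson correlation are not taken from the slides.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant profile: correlation undefined, treat as 0
    return sxy / math.sqrt(sxx * syy)


def threshold_network(profiles, cutoff):
    """Connect genes whose correlation exceeds a fixed cutoff."""
    genes = sorted(profiles)
    return {(g, h)
            for i, g in enumerate(genes) for h in genes[i + 1:]
            if pearson(profiles[g], profiles[h]) > cutoff}


def knn_network(profiles, k):
    """Connect each gene to its k most correlated genes (undirected)."""
    genes = sorted(profiles)
    edges = set()
    for g in genes:
        others = [h for h in genes if h != g]
        others.sort(key=lambda h: pearson(profiles[g], profiles[h]),
                    reverse=True)
        for h in others[:k]:
            edges.add(tuple(sorted((g, h))))
    return edges


# Toy profiles (assumption): g1/g2 rise together, g3/g4 fall together.
profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8.5],
    "g3": [4, 3, 2, 1],
    "g4": [8, 6, 4, 2.5],
}
print(sorted(threshold_network(profiles, 0.9)))
print(sorted(knn_network(profiles, 1)))
```

Both rules have one free parameter (the cutoff or k), which is the choice the "best network" question refers to.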
Motivation


Can we use the idea of community discovery for clustering microarray data?
Advantages:
  Parameter free
  Network topology considered
  Constructed network may have other interesting applications beyond clustering
Our idea
Ruan, ICDM 2009
[Figure: microarray data → similarity matrix → network series Net_1 (most dense) … Net_m (most sparse), each partitioned with Qcut]
Intuition: the real network is naturally modular
  Can be measured by modularity (Q)
  If constructed right, should have the highest Q
Our idea (cont’d)
[Figure: modularity vs. network density for the true network and a random network, and their difference]
• Therefore, use ∆Q to determine the best network parameter and obtain the best community structure
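A minimal sketch of the selection step, assuming the modularity of the real and randomized networks has already been computed for each candidate parameter value. The function name is hypothetical and the numbers are invented purely for illustration; they are not results from the slides.

```python
def best_parameter(q_real, q_random):
    """Return the parameter value with the largest Q_real - Q_random."""
    return max(q_real, key=lambda k: q_real[k] - q_random[k])


# Made-up values keyed by a network parameter such as the number of neighbors.
q_real = {10: 0.80, 20: 0.75, 40: 0.60}
q_random = {10: 0.70, 20: 0.45, 40: 0.35}
print(best_parameter(q_real, q_random))  # 20
```

In this toy case the sparsest network has the highest raw Q, but the random network scores almost as well there, so ∆Q picks a different parameter.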
Results: synthetic data set 1

High dimensional data generated by synDeca.
  20 clusters of high dimensional points, plus some scatter points
  Clusters are of various shapes: ellipse, rectangle, random
[Figure: QReal, QRandom, ∆Q (Qreal - Qrandom) and clustering accuracy vs. number of neighbors]
Comparison
[Figure: clustering accuracy vs. dimension for mKNN-HQcut with the optimum k, mKNN-HQcut with automatically determined k (this work), and K-means]
Results: synthetic data set 2

Gene expression data (Thalamuthu et al., 2006)
  600 data sets
  ~600 genes, 50 conditions, 15 clusters
  0 or 1x outliers
[Figure: results without outliers and with outliers, for mKNN-HQcut with optimal k and with auto k]
Comparison with other methods
Ruan et al., BioKDD 2010
Results on yeast stress response data

3000 genes, 173 samples
Best k = 140, resulting in 75 clusters
Results on yeast stress response data

Enrichment of common functions
  Cumulative hypergeometric test (Fisher’s exact test)
[Figure: genes vs. GO function terms; enriched clusters:]
  Protein biosynthesis (p < 10^-96)
  Peroxisome (p < 10^-13)
  Nuclear transport (p < 10^-50)
  mt ribosome (p < 10^-63)
  DNA repair (p < 10^-66)
  RNA splicing (p < 10^-105)
  Nitrogen compound metabolism (p < 10^-37)
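A sketch of the cumulative hypergeometric test used for enrichment: the probability of seeing at least the observed number of annotated genes in a cluster by chance. The function and parameter names are assumptions, and the small numbers in the example are for illustration only.

```python
from math import comb


def enrichment_pvalue(population, annotated, cluster, overlap):
    """P(X >= overlap) when `cluster` genes are drawn without replacement
    from `population` genes, of which `annotated` carry the function term."""
    upper = min(annotated, cluster)
    total = comb(population, cluster)
    tail = sum(comb(annotated, x) * comb(population - annotated, cluster - x)
               for x in range(overlap, upper + 1))
    return tail / total


# 10 genes, 4 annotated; a cluster of 3 contains 2 annotated genes.
print(enrichment_pvalue(10, 4, 3, 2))  # 40/120, about 0.333
```

Exact integer arithmetic keeps the tail sum accurate; only the final division is done in floating point, so extremely small p-values may underflow to 0.0.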
Comparison with k-means
[Figure: overall enrichment score, mkNN-HQcut (using automatically determined k = 140) vs. K-means]
An interesting community

A 25-gene community missed by other methods
4 telomere maintenance genes (p < 10^-7)
16 unknown genes, all located in chromosome telomeric regions
5 other genes at rim of the sub-network
4 transcription factors regulate many genes in the community
Application to Arabidopsis data





~22000 genes, 1138 samples
1150 singletons
800 (300) modules of size >= 10 (20)
> 80% (90%) of modules have enriched functions
Much more significant than all five existing studies on the same data set
[Figure: top 40 most significant modules]
Cis-regulatory network of Arabidopsis
[Figure: motif-module network]
Ruan et al., BMC Bioinfo, to appear
Beyond gene clustering (1)

Gene specific studies
  Collaborator is interested in Gibberellins
    A hormone important for the growth and development of plants
    Commercially important
    Biosynthesis and signaling well studied
    Transcriptional regulation of biosynthesis and signaling not yet clear
  3 important gene families for biosynthesis: GA20ox, GA3ox and GA2ox
  Receptor gene family: GID1A, B, C
  Analyze the co-expression network around these genes
[Figure: co-expression network around the 20ox, 3ox, 2ox and GID1 genes; labeled nodes: 20ox1 to 20ox5, 3ox1 to 3ox4, 2ox1 to 2ox4, 2ox6 to 2ox8, GA3, GID1A, GID1B, GID1C]
Beyond gene clusters (2)

Cancer classification
[Figure: gene × sample expression matrix → sample × sample network → Qcut]
Sample: patient or cell lines
Alizadeh et al., Nature, 2000
Network of cell samples
Shape: cell line / cancer type
Color: clustering results
[Figure labels: Transformed cell lines; Activated Blood B; Follicular lymphoma (FL); DLBCL; Resting Blood B; Blood T; Chronic lymphocytic leukemia (CLL); Diffuse large B-cell Lymphoma (DLBCL)]
Survival rate after chemotherapy
[Figure: DLBCL subgroups DLBCL-1, DLBCL-2 and DLBCL-3, annotated with:]
Survival rate: 73%, median survival time: 71.3 months
Survival rate: 40%, median survival time: 22.3 months
Survival rate: 20%, median survival time: 12.5 months
Beyond gene clustering (3)
Topology vs function
[Figure: % of essential proteins vs. number of connections (Jeong et al., Nature 2001)]
Community participation vs. essentiality (PPI)
[Figure: % essential vs. community participation, for hub and non-hub proteins]
Community participation vs. essentiality (coexp)
[Figure: % essential vs. community participation for hub and non-hub genes; % essential vs. number of connections for participation < 0.2 and participation >= 0.2]
Key: how to systematically search for such relationships?
  Data mining – association rule?
Agenda

 Community structure in biological networks
   Prediction of protein complexes
 Network-based microarray data analysis
 Network-based biomarker discovery for metastatic breast cancer
 Conclusions

Background


Metastasis is the spread of cancer from one organ to another non-adjacent organ or part.
Challenge: Predict Metastasis
  If metastasis is likely => aggressive adjuvant therapy
  How to decide the likelihood?
  Traditional predictive factors are not good enough
Microarray-based marker discovery

Examine genome-wide expression profiles
Idea:
  Score individual genes for how well they discriminate between different classes of disease
  Establish gene expression signature
Limitations:
  # genes >> # patients
  Downstream effects
  Individual variations not attributed to cancer
Consequences:
  Low reproducibility across data sets
  Missing biological insight
[Figure: expression matrix with dimensions M and N]
Pathway vs. PPI Subnetwork as Marker

Remedy to the problems: use pathway information
  Fewer features => better stability
  Biological insight
Limitation:
  Majority of human genes not yet assigned to a definitive pathway
Alternatively: protein-protein interaction (PPI) networks
  Genes in the same pathway may have short distance in PPI
  Subnetworks (potential pathways) as markers
  Chuang et al. 2007, Mol Syst Bio.
  Cannot differentiate causal vs. downstream genes
Our approach

Key observation: many known disease genes are not differentially expressed (DE) between metastasis and non-metastasis, but their neighbors are
  e.g., P53 and BRCA2
Intuition: find a small number of intermediate genes that connect DE genes
  Known as the Steiner tree problem in computer science
Steiner tree problem

Given a connected graph and a list of input
nodes, find the smallest number of additional
nodes so that there is a tree connecting the input
nodes and those additional nodes
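The exact Steiner tree problem is NP-hard, so heuristics are common in practice. The sketch below is a simple shortest-path heuristic, not the method used in this work: it grows a tree from one input node by repeatedly attaching the closest remaining input node through a breadth-first shortest path. The function name and the toy graph are illustrative, and the result is not guaranteed to be optimal.

```python
from collections import deque


def steiner_heuristic(adj, terminals):
    """Approximate Steiner tree on an unweighted graph.

    adj: dict mapping node -> set of neighbor nodes.
    terminals: list of input nodes that must be connected.
    Returns (tree_nodes, tree_edges), edges as frozensets of two nodes.
    """
    terminals = list(terminals)
    tree_nodes = {terminals[0]}
    tree_edges = set()
    remaining = set(terminals[1:]) - tree_nodes
    while remaining:
        # Multi-source BFS starting from every node already in the tree.
        parent = {n: None for n in tree_nodes}
        queue = deque(sorted(tree_nodes))
        hit = None
        while queue:
            node = queue.popleft()
            if node in remaining:
                hit = node
                break
            for nb in sorted(adj[node]):
                if nb not in parent:
                    parent[nb] = node
                    queue.append(nb)
        if hit is None:
            raise ValueError("input nodes are not all connected")
        # Walk back from the reached terminal to the tree, adding the path.
        node = hit
        while parent[node] is not None:
            tree_edges.add(frozenset((node, parent[node])))
            tree_nodes.add(node)
            node = parent[node]
        remaining.discard(hit)
    return tree_nodes, tree_edges


# Toy graph (assumption): X touches all three input nodes; Y-Z is a detour.
adj = {
    "A": {"X", "Y"}, "B": {"X", "Z"}, "C": {"X"},
    "X": {"A", "B", "C"}, "Y": {"A", "Z"}, "Z": {"Y", "B"},
}
nodes, tree = steiner_heuristic(adj, ["A", "B", "C"])
print(sorted(nodes))  # ['A', 'B', 'C', 'X']
```

Here the only additional node is X, which plays the role of an intermediate gene connecting the input genes.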
Method overview
Experiment setup

Two breast cancer microarray data sets
  van de Vijver et al., 2002 (Agilent, 78 +ve vs 217 -ve)
  Wang et al., 2005 (Affymetrix, 106 +ve vs 180 -ve)
Two human PPI networks
  PINA (Wu et al., 2009): 10,920 proteins and 61,746 interactions
  Chuang et al., 2007: 11,203 proteins and 57,235 interactions
Steiner tree example for Wang data set
Cross-dataset stability of markers
Markers                           PPI network   van de Vijver   Wang      Common genes    p-value
                                                dataset         dataset   (% of overlap)
Steiner tree-based markers        Chuang-PPI    1123            1100      384 (20.88%)    2E-127
                                  PINA-PPI      981             1135      400 (23.31%)    5E-158
Top scoring Steiner tree-based    Chuang-PPI    203             194       63 (18.86%)     7E-064
markers                           PINA-PPI      164             185       80 (29.74%)     3E-103
Differentially expressed genes                  324             319       34 (5.58%)      1E-008
Approach in Chuang et al (2007)                 618             906       175 (12.8%)     6E-054
Approach in original study                      70              76        3 (2.09%)       .03
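The slide does not say how the overlap percentage is defined. The numbers in the first rows are consistent with common genes divided by the size of the union of the two marker sets (the Jaccard index); the sketch below is that inference, not a stated definition, and the function name is hypothetical.

```python
def overlap_percent(n_a, n_b, n_common):
    """Common genes as a percentage of the union of two marker sets."""
    union = n_a + n_b - n_common
    return 100.0 * n_common / union


# Steiner tree-based markers from the cross-dataset stability table.
print(round(overlap_percent(1123, 1100, 384), 2))  # 20.88 (Chuang-PPI)
print(round(overlap_percent(981, 1135, 400), 2))   # 23.31 (PINA-PPI)
```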
Enriched cancer-related pathways

Enrichment score for Wang dataset with Chuang-PPI network.
Overlap with known breast cancer genes
[Figure, panels (A) and (B)] Overlap of 60 known breast cancer genes with STMs, t-STMs, DE genes, the Chuang et al (2007) method and corresponding original studies. (A) Wang dataset; (B) van de Vijver dataset.
Classification accuracy
Classification accuracy based on the AUC metric. The dataset label indicates the dataset from which the features were taken. For cross-dataset classification, features were taken from the labeled dataset and applied to the other dataset. (A) Bayesian logistic regression; (B) random tree classifier.
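For reference, AUC can be computed directly from classifier scores as the fraction of (positive, negative) pairs ranked correctly, with ties counted as half. This is a generic sketch; the function name, scores and labels are made up and are not from the experiments.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: classifier outputs, higher means more likely positive.
    labels: 1 for positive (e.g. metastasis), 0 for negative.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.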
Conclusions

Methods for community discovery
  Fully automated, i.e. parameter-free
  Higher accuracy than competing methods that require extensive parameter tuning
  Improves protein complex prediction and microarray data clustering, including tumor subtype classification
  Many other applications
Steiner-tree based biomarker discovery
  Improves stability of metastatic breast cancer markers, and cross-dataset classification accuracy
  A generic method for identifying the hidden causal genes given downstream targets
Future work
 Network-based gene-specific studies
 Combine random walk with Steiner tree algorithms to improve biomarker discovery and cancer classification
 Other types of data
 Other diseases

Questions?