Lecture 3

Introduction to PCA and PLS
K-mean clustering
Protein function prediction using network concepts
Network Centrality measures
Handling Multivariate data
Student  Math  Chem  Phy  Bio  Eco  Soc
A        7     8     7    8    7    7
B        8     7     7    6    8    7
C        9     7     8    7    6    7
D        7     7     7    7    9    8
E        7     6     6    6    8    8
F        7     7     7    7    8    8
G        6     6     6    7    7    7
H        9     8     8    6    6    6
I        8     8     8    7    6    6
J        7     7     6    6    8    9
Multivariate data example
Principal Component Analysis (PCA) and
Partial Least Squares (PLS)
• Two major common effects of using PCA or PLS
  - Convert a group of correlated predictive variables to a group of uncorrelated variables
  - Construct a "strong" predictive variable from several "weaker" predictive variables
• Major difference between PCA and PLS
  - PCA is performed without consideration of the target variable. So PCA is an unsupervised analysis
  - PLS is performed to maximize the covariance between the target variable and the predictive variables. So PLS is a supervised analysis
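[Figure: schematic comparison of PCA and PLS. PCA: the data matrix A (n x p) is transformed into the principal component matrix PC (n x p). PLS: in step 1 (decomposition), X (n x p) is decomposed into factors T (n x c) and Y (n x q) into factors U (n x c), with maximum covariance (max cov.) between T and U; in step 2 (regression), Y is regressed on T.]
1 = Decomposition step
2 = Regression step
PCA:
A = data matrix
PC = principal component matrix
n = # of observations
p = # of variables
PLS:
X = matrix of predictors
Y = matrix of responses
T = factors of predictors
U = factors of responses
n = # of observations
p = # of predictors
q = # of responses
c = # of extracted factors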
Principal Component Analysis (PCA)
- In Principal Component Analysis, we look for a few linear combinations of the predictive variables which can be used to summarize the data without losing too much information.
- Intuitively, principal component analysis is a method of extracting information from higher-dimensional data by projecting it onto a lower dimension.
Example: Consider the scatter plot of 3-dimensional data (3 variables). The data across the 3 variables are highly correlated and the majority of the points cluster around the center of the space. This is also the direction of the 1st PC, which gives roughly equal weight to the 3 variables:
PC1 = -0.56 X1 - 0.57 X2 - 0.59 X3
Properties of Principal Components
• $\mathrm{Var}(PC_i) = \lambda_i$, where $\lambda_i$ is the i-th eigenvalue
• $\mathrm{Cov}(PC_i, PC_j) = 0$ for $i \neq j$
• $\mathrm{Var}(PC_1) \geq \mathrm{Var}(PC_2) \geq \ldots \geq \mathrm{Var}(PC_p)$
Numerical Example
The following are the high school grades of 10 students in 6 subjects (scale 1-10):
Student  Math  Chem  Phy  Bio  Eco  Soc
A        7     8     7    8    7    7
B        8     7     7    6    8    7
C        9     7     8    7    6    7
D        7     7     7    7    9    8
E        7     6     6    6    8    8
F        7     7     7    7    8    8
G        6     6     6    7    7    7
H        9     8     8    6    6    6
I        8     8     8    7    6    6
J        7     7     6    6    8    9
• Math = Mathematics
• Chem = Chemistry
• Phy = Physics
• Bio = Biology
• Eco = Economics
• Soc = Sociology
Results
             PC1     PC2     PC3     PC4     PC5     PC6
Eigenvalue   3.020   0.708   0.497   0.219   0.167   0.023
Proportion   0.652   0.153   0.107   0.047   0.036   0.005
Cumulative   0.652   0.804   0.912   0.959   0.995   1
Eigenvectors
Math         0.461   0.621  -0.088   0.168   0.267  -0.542
Chem         0.302  -0.059  -0.594   0.016  -0.740  -0.074
Phy          0.428   0.110  -0.365  -0.064   0.386   0.720
Bio          0.054  -0.666  -0.410   0.248   0.445  -0.355
Eco         -0.533   0.271  -0.526  -0.559   0.185  -0.140
Soc         -0.475   0.286  -0.248   0.771  -0.020   0.192
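The properties of principal components listed earlier can be checked numerically on the grade table. The following is a minimal sketch, assuming NumPy is available and assuming that the PCA is carried out on the sample covariance matrix (divisor n - 1); the function name `pca` and the variable names are illustrative only. Under that assumption the total variance of the six subjects is 41.7/9 ≈ 4.633, which is close to the sum of the eigenvalues in the Results table. The signs of the eigenvectors returned by a numerical routine are arbitrary and may differ from the table.

```python
import numpy as np

# Grades of students A..J in Math, Chem, Phy, Bio, Eco, Soc (from the grade table).
grades = np.array([
    [7, 8, 7, 8, 7, 7],
    [8, 7, 7, 6, 8, 7],
    [9, 7, 8, 7, 6, 7],
    [7, 7, 7, 7, 9, 8],
    [7, 6, 6, 6, 8, 8],
    [7, 7, 7, 7, 8, 8],
    [6, 6, 6, 7, 7, 7],
    [9, 8, 8, 6, 6, 6],
    [8, 8, 8, 7, 6, 6],
    [7, 7, 6, 6, 8, 9],
], dtype=float)


def pca(data):
    """Return (eigenvalues, eigenvectors, scores), sorted by decreasing eigenvalue."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)      # p x p sample covariance (divisor n - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs               # principal component scores (n x p)
    return eigvals, eigvecs, scores


eigvals, eigvecs, scores = pca(grades)
score_cov = np.cov(scores, rowvar=False)      # covariance matrix of the PCs

print("eigenvalues:", np.round(eigvals, 3))
print("proportion :", np.round(eigvals / eigvals.sum(), 3))
```

The diagonal of `score_cov` holds Var(PCi) and its off-diagonal entries hold Cov(PCi, PCj), so the three properties can be read off directly from that matrix.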
Partial Least Squares (PLS)
• Unlike PCA, the PLS technique works by successively extracting factors from both predictive and target variables such that the covariance between the extracted factors is maximized
• Decomposition step
  - $X = T W^{T} + E$
  - $Y = U V^{T} + F$
• Regression step
  - $Y = TB + D = XWB + D = X B_{PLS} + D$, with $B_{PLS} = WB$
Numerical Example
The following are the high school grades of 10 students in 6 subjects (scale 1-10) and the corresponding GPA scores at undergraduate level:
Student  Math  Chem  Phy  Bio  Eco  Soc  GPA
A        7     8     7    8    7    7    2.9
B        8     7     7    6    8    7    3.1
C        9     7     8    7    6    7    3.6
D        7     7     7    7    9    8    3.3
E        7     6     6    6    8    8    3.0
F        7     7     7    7    8    8    2.9
G        6     6     6    7    7    7    3.2
H        9     8     8    6    6    6    3.4
I        8     8     8    7    6    6    2.8
J        7     7     6    6    8    9    3.5
• Math = Mathematics
• Chem = Chemistry
• Phy = Physics
• Bio = Biology
• Eco = Economics
• Soc = Sociology
Objective:
Can we use information on students' performance during high school to predict their GPA scores when they enter undergraduate level?
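As an illustration of the objective, the sketch below extracts only the first PLS factor for a single response (GPA), assuming NumPy is available. It relies on the standard fact that, for centered data, the unit weight vector w maximizing the covariance between t = Xw and y is proportional to Xᵀy. The function name `first_pls_factor` is hypothetical, and a full PLS fit would extract further factors after deflating X; this is not a reproduction of any particular software's output.

```python
import numpy as np

# Columns: Math, Chem, Phy, Bio, Eco, Soc for students A..J (from the table above).
X = np.array([
    [7, 8, 7, 8, 7, 7], [8, 7, 7, 6, 8, 7], [9, 7, 8, 7, 6, 7],
    [7, 7, 7, 7, 9, 8], [7, 6, 6, 6, 8, 8], [7, 7, 7, 7, 8, 8],
    [6, 6, 6, 7, 7, 7], [9, 8, 8, 6, 6, 6], [8, 8, 8, 7, 6, 6],
    [7, 7, 6, 6, 8, 9],
], dtype=float)
y = np.array([2.9, 3.1, 3.6, 3.3, 3.0, 2.9, 3.2, 3.4, 2.8, 3.5])


def first_pls_factor(X, y):
    """One-factor PLS for a single response (illustrative sketch only)."""
    Xc = X - X.mean(axis=0)            # center predictors
    yc = y - y.mean()                  # center response
    w = Xc.T @ yc
    w = w / np.linalg.norm(w)          # unit weights maximizing cov(Xc @ w, yc)
    t = Xc @ w                         # decomposition step: factor scores
    b = (t @ yc) / (t @ t)             # regression step: regress y on t
    y_hat = y.mean() + t * b           # fitted GPA values
    return w, t, b, y_hat


w, t, b, y_hat = first_pls_factor(X, y)
print("weights :", np.round(w, 3))
print("fitted  :", np.round(y_hat, 2))
```

With only ten students and one factor, the fitted values should be read as a demonstration of the two steps (decomposition, then regression), not as evidence of predictive accuracy.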
K-means clustering
Source: "Clustering Challenges in Biological Networks" edited by S. Butenko et al.
Source: Teknomo, Kardi. K-Means Clustering Tutorials, http://people.revoledu.com/kardi/tutorial/kMean/
1. Initial value of centroids: Suppose we use medicine A and medicine B as the first centroids. Let c1 and c2 denote the coordinates of the centroids; then c1 = (1,1) and c2 = (2,1)
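The remaining steps of the procedure (assign each object to the nearest centroid, recompute the centroids, repeat until nothing changes) can be sketched as follows. Only the coordinates of medicines A and B appear on the slide; the coordinates used here for two further objects, C = (4,3) and D = (5,4), are an assumption made for illustration, as is the function name `k_means`.

```python
from math import dist


def k_means(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = list(centroids)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break                      # no object changed cluster: converged
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:                # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment


# A and B are from the slide; C and D are assumed for illustration.
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 3.0), (5.0, 4.0)]
centroids, assignment = k_means(points, [points[0], points[1]])
print(centroids)    # [(1.5, 1.0), (4.5, 3.5)]
print(assignment)   # [0, 0, 1, 1]
```

With these four points the procedure stabilizes after the third assignment pass, grouping A with B and C with D.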
Protein function prediction using network concepts
The topology of a protein-protein interaction (PPI) network is informative, but further analysis can reveal other information.
A popular assumption, which is true in many cases, is that proteins of similar function interact with each other.
Based on this assumption, we have developed methods to predict protein functions and protein complexes from PPI networks, mainly based on cluster analysis.
Cluster Analysis
Cluster Analysis, also called data segmentation, implies grouping
or segmenting a collection of objects into subsets or "clusters",
such that those within each cluster are more closely related to
one another than objects assigned to different clusters.
In the context of a graph, densely connected nodes are considered as clusters.
[Figure: example graph.] Visually we can detect two clusters in this graph.
K-cores of Protein-Protein Interaction Networks
Definition
Let a graph $G = (V, E)$ consist of a finite set of nodes $V$ and a finite set of edges $E$.
A subgraph $S = (V', E')$, where $V' \subseteq V$ and $E' \subseteq E$, is a k-core or a core of order k of $G$ if and only if $\forall v \in V'$: $\deg(v) \geq k$ within $S$, and $S$ is the maximal subgraph with this property.
Concept of a k-core graph
[Figures: a graph G and its successive k-cores.]
1-core graph: the degree of every node is one or more
2-core graph: the degree of every node is two or more
3-core graph: the degree of every node is three or more
The 3-core is the highest k-core subgraph of the graph G
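A k-core can be computed by repeatedly deleting nodes whose degree within the remaining graph is below k. The sketch below applies this peeling idea to a small hypothetical graph (not the graph G of the figures): a clique on a, b, c, d, a node e attached to a and b, and a node f attached to e. The function names are illustrative.

```python
def k_core(adjacency, k):
    """Return the node set of the k-core (empty set if there is none)."""
    alive = set(adjacency)
    changed = True
    while changed:
        changed = False
        for node in list(alive):
            # Degree is counted only over neighbours that are still present.
            if sum(1 for nb in adjacency[node] if nb in alive) < k:
                alive.discard(node)
                changed = True
    return alive


def highest_core(adjacency):
    """Return (k, nodes) for the largest k with a non-empty k-core."""
    k = 0
    while k_core(adjacency, k + 1):
        k += 1
    return k, k_core(adjacency, k)


# Hypothetical example graph, given as node -> set of neighbours.
graph = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c", "d", "e"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
    "e": {"a", "b", "f"},
    "f": {"e"},
}

for k in (1, 2, 3, 4):
    print(k, sorted(k_core(graph, k)))
```

In this example the 2-core drops f, the 3-core additionally drops e, and no 4-core exists, so the 3-core is the highest k-core.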
Application of a k-core graph
Analyzing protein-protein interaction data obtained from
different sources, G. D. Bader and C.W.V. Hogue, Nature
biotechnology, Vol 20, 2002
Protein function prediction using k-core graphs
Introduction : Function prediction
Schwikowski, B., Uetz, P. and Fields, S. A network of protein-protein interactions in yeast. Nature Biotech. 18, 1257-1261 (2000)
Deals with a network of 2039 proteins and 2709 interactions.
65% of interactions occurred between protein pairs with at least one common function.
Hishigaki, H., Nakai, K., Ono, T., Tanigami, A., and Tagaki, T. Assessment of prediction accuracy of protein function from protein-protein interaction data. Yeast 18, 523-531 (2001)
Reported similar results.
Introduction : Function prediction
Hypothesis
Unknown-function proteins that form a densely connected subgraph with proteins of a particular function may belong to that functional group.
[Figure: a densely connected subgraph of CLASS A proteins and UNCLASSIFIED PROTEINS.]
We utilize this concept by determining k-cores of strategically constructed sub-networks.
Prediction of Protein Functions Based on K-cores of
Protein-Protein Interaction Networks
“Prediction of Protein Functions Based on K-cores of
Protein-Protein Interaction Networks and Amino Acid
Sequences”, Md. Altaf-Ul-Amin, Kensaku Nishikata,
Toshihiro Koma, Teppei Miyasato, Yoko Shinbo, Md.
Arifuzzaman, Chieko Wada, Maki Maeda, Taku Oshima,
Hirotada Mori, Shigehiko Kanaya The 14th International
Conference on Genome Informatics December 14-17,
2003, Yokohama Japan.
E. coli PPI network
Total: 3007 proteins and 11531 interactions. Around 2000 are proteins of unknown function.
The highest k-core of this total graph is not so helpful.
[Figure: the 10-core graph, the highest k-core of the E. coli PPI network.]
We separate 1072 interactions (out of 11531) involving protein synthesis proteins and function-unknown proteins.
[Figure: the resulting 6-core graph, with nodes labelled P. S. (protein synthesis) and U. F. (function unknown).]
Function-unknown proteins of this 6-core graph are likely to be involved in protein synthesis.
Extending the k-core based function prediction method and its
application to PPI data of Arabidopsis thaliana
Protein Function Prediction based on k-cores of Interaction
Networks, Norihiko Kamakura, Hiroki Takahashi, Kensuke
Nakamura, Shigehiko Kanaya and Md. Altaf-Ul-Amin,
Proceedings of 2010 International Conference on
Bioinformatics and Biomedical Technology (ICBBT 2010)
Materials and Methods : Dataset
All PPI data of Arabidopsis thaliana
• 3118 interactions involving 1302 proteins.
• Collected from databases and scientific literature by our laboratory.
[Figure: the network. Green = unknown proteins (289 proteins); pink = known proteins (1013 proteins).]
Materials and Methods : Dataset
Functional groups in the network
The PPI dataset contains proteins of 19 different functions according to the first-level categories of the KNApSAcK database.
Function name                                             Number of proteins
CELL CYCLE AND DNA PROCESSING                             69
CELL FATE                                                 5
CELL RESCUE, DEFENSE AND VIRULENCE                        32
CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION MECHANISM      171
CONTROL OF CELLULAR ORGANIZATION                          3
DEVELOPMENT (Systemic)                                    9
ENERGY                                                    51
Endoplasmic reticulum biogenesis                          4
METABOLISM                                                120
Mitochondria biogenesis                                   4
PROTEIN ACTIVITY REGULATION                               1
PROTEIN FATE (folding, modification, destination)         112
PROTEIN SYNTHESIS                                         20
REGULATION OF / INTERACTION WITH CELLULAR ENVIRONMENT     1
STORAGE PROTEIN                                           1
SYSTEMIC REGULATION OF / INTERACTION WITH ENVIRONMENT     2
TRANSCRIPTION                                             362
TRANSPORT FACILITATION                                    46
UNCLASSIFIED PROTEINS                                     289
Materials and Methods : Dataset
The trends of interactions in the context of functional similarity
Diagonal elements show the number of interactions between proteins of similar function.
No    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19  Function name
 1   72   23    1    9   10    0    1    0   67    0   29    0    4    3    0    0    0    0    0  METABOLISM
 2   23   82   19  166  279    9    3    4  189    0   35    0   35   16    0    0    0    0    1  UNCLASSIFIED PROTEINS
 3    1   19    9   15    7    0    0    0   38    0    1    0    3    4    0    0    0    0    0  CELL RESCUE, DEFENSE AND VIRULENCE
 4    9  166   15  689   64    6    1    0  354    0    2    3   22    7    0    0    0    1    0  TRANSCRIPTION
 5   10  279    7   64  137    0    9    2   20    0   22    2    7    5    0    0    0    0    0  PROTEIN FATE (folding, modification, destination)
 6    0    9    0    6    0    1    0    0    1    0    0    0    0    2    0    0    0    0    0  DEVELOPMENT (Systemic)
 7    1    3    0    1    9    0    1    0    2    0    0    0    0    1    0    0    0    0    0  CELL FATE
 8    0    4    0    0    2    0    0   17    2    0    1    0    1    1    0    0    0    0    0  PROTEIN SYNTHESIS
 9   67  189   38  354   20    1    2    2  374    0   24    0   35   11    0    0    1    1    0  CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION MECHANISM
10    0    0    0    0    0    0    0    0    0    3    0    0    0    0    0    0    0    0    0  Mitochondria biogenesis
11   29   35    1    2   22    0    0    1   24    0   64    0    3    8    0    0    0    0    0  ENERGY
12    0    0    0    3    2    0    0    0    0    0    0    0    0    0    0    0    0    0    0  SYSTEMIC REGULATION OF / INTERACTION WITH ENVIRONMENT
13    4   35    3   22    7    0    0    1   35    0    3    0   44    2    2    0    0    0    0  CELL CYCLE AND DNA PROCESSING
14    3   16    4    7    5    2    1    1   11    0    8    0    2   17    0    2    0    0    3  TRANSPORT FACILITATION
15    0    0    0    0    0    0    0    0    0    0    0    0    2    0    1    0    0    0    0  CONTROL OF CELLULAR ORGANIZATION
16    0    0    0    0    0    0    0    0    0    0    0    0    0    2    0    0    0    0    0  REGULATION OF / INTERACTION WITH CELLULAR ENVIRONMENT
17    0    0    0    0    0    0    0    0    1    0    0    0    0    0    0    0    0    0    0  PROTEIN ACTIVITY REGULATION
18    0    0    0    1    0    0    0    0    1    0    0    0    0    0    0    0    0    0    0  STORAGE PROTEIN
19    0    1    0    0    0    0    0    0    0    0    0    0    0    3    0    0    0    0    6  Endoplasmic reticulum biogenesis
Materials And Methods : Flowchart of the method
Input: A PPI network
Make a sub-network corresponding to a functional group
Remove the components consisting of only unknown proteins
Determine k-cores and assign the corresponding function to the unknown proteins included in the k-cores (for k = 3 or more)
Output: Predicted functions for some unknown proteins
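The flowchart can be sketched in code as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: `annotation` is a hypothetical dictionary mapping each known protein to a single functional group, proteins absent from it are treated as unknown, and the tiny network at the bottom is invented for the example.

```python
def k_core(adjacency, k):
    """Node set of the k-core, found by repeatedly removing nodes of degree < k."""
    alive = set(adjacency)
    changed = True
    while changed:
        changed = False
        for node in list(alive):
            if sum(1 for nb in adjacency[node] if nb in alive) < k:
                alive.discard(node)
                changed = True
    return alive


def predict_unknowns(edges, annotation, group, k=3):
    """Unknown proteins lying in the k-core of the sub-network for `group`."""
    def keep(p):
        # Keep proteins of the chosen group and unknown proteins (no annotation).
        return annotation.get(p) in (group, None)

    # Step 1: sub-network of group-group, group-unknown, unknown-unknown edges.
    adjacency = {}
    for u, v in edges:
        if u != v and keep(u) and keep(v):
            adjacency.setdefault(u, set()).add(v)
            adjacency.setdefault(v, set()).add(u)

    # Step 2: drop connected components that consist only of unknown proteins.
    seen, kept = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:
            for nb in adjacency[stack.pop()]:
                if nb not in component:
                    component.add(nb)
                    stack.append(nb)
        seen |= component
        if any(annotation.get(p) == group for p in component):
            kept |= component
    sub = {p: adjacency[p] & kept for p in kept}

    # Step 3: unknown proteins inside the k-core receive the group's function.
    return {p for p in k_core(sub, k) if annotation.get(p) is None}


# Invented example: K* are known proteins, U* are unknown proteins.
annotation = {"K1": "ENERGY", "K2": "ENERGY", "K3": "ENERGY", "M1": "METABOLISM"}
edges = [("K1", "K2"), ("K1", "K3"), ("K2", "K3"),
         ("U1", "K1"), ("U1", "K2"), ("U1", "K3"),
         ("U2", "K1"), ("U3", "U4"), ("M1", "U2"), ("M1", "K1")]

print(predict_unknowns(edges, annotation, "ENERGY", k=3))   # {'U1'}
```

Here U1 is densely connected to the three ENERGY proteins and ends up in the 3-core, U2 is peeled away, and the component made of U3 and U4 is removed before the k-cores are determined.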
Results : Subnetworks
[Table: subnetwork name versus number of interactions.]
In this work we do not consider the sub-networks that contain fewer than 100 interactions. Finally, we consider the sub-networks corresponding to 9 functional classes.
Results : Subnetwork corresponding to
cellular communication
As an example, here we show the subnetworks and k-cores corresponding to cellular communication.
Subnetwork extraction
We extracted the following 3 types of interactions:
Cellular communication-Cellular communication,
Cellular communication-Unknown,
Unknown-Unknown
Total: 603 interactions
Results : Subnetwork corresponding to
cellular communication
1-core
The red nodes : known proteins.
The green nodes : unknown proteins.
Results : k-cores corresponding to cellular
communication
2-core
3-core
The red nodes : known proteins.
The green nodes : unknown proteins.
Results : k-cores corresponding to cellular
communication
[Figure panels: 4-core, 5-core, 6-core and 7-core. The red nodes: known proteins. The green nodes: unknown proteins.]
This figure implies that determination of k-cores in strategically constructed sub-networks can reveal which unknown proteins are densely connected to proteins of a particular functional class.
Results : Function Predictions
The number of unknown genes included in different k-cores
corresponding to different functional groups
k-core 2
cell_cycle
11
cell_rescue
4
cellular_communication
k-core 3 k-core 4 k-core 5 k-core 6 k-core 7 k-core 8
7
37
33
23
15
12
8
energy
5
2
2
2
2
2
metabo
5
1
1
69
35
25
25
15
10
24
14
11
8
8
88
64
52
36
27
protein_fate
protein_synthesis
transcription
transport_facilitation
total
2
2
33
2
129
2
Results : Function Predictions
Prediction based on 2-cores, 3-cores and 4-cores
2-core
4-core
Most proteins have been assigned
unique functions
CELL CYCLE AND DNA PROCESSING
CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION
CELL RESCUE, DEFENSE AND VIRULENCE
ENERGY
3-core
Most proteins have been assigned
unique functions and some have
been assigned multiple functions
METABOLISM
PROTEIN FATE (folding, modification, destination)
PROTEIN SYNTHESIS
TRANSCRIPTION
TRANSPORT FACILITATION
Assessment of Predictions
As most of the function-predicted proteins are still unknown, their annotations do not contain clear information on their functions.
When k is much larger than one, the effect of false positives is greatly reduced.
However, to assess the predictions statistically, we constructed 1000 random graphs consisting of the same 1,302 proteins, inserted 3,118 edges randomly into each, and constructed sub-networks from them.
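The idea of comparing k-cores of the real network with those of random graphs of the same size can be sketched as below. This is a simplified illustration: it compares only the highest core order of whole graphs, whereas the assessment described here works on sub-networks and on k-core sizes. The small "real" graph (a 5-clique with five pendant nodes) and all function names are assumptions made for the example.

```python
import random
from itertools import combinations


def k_core(adjacency, k):
    """Node set of the k-core, found by repeatedly removing nodes of degree < k."""
    alive = set(adjacency)
    changed = True
    while changed:
        changed = False
        for node in list(alive):
            if sum(1 for nb in adjacency[node] if nb in alive) < k:
                alive.discard(node)
                changed = True
    return alive


def highest_core_order(adjacency):
    """Largest k for which the k-core is non-empty."""
    k = 0
    while k_core(adjacency, k + 1):
        k += 1
    return k


def random_graph(nodes, n_edges, rng):
    """Same nodes, n_edges distinct random edges (no self-loops, no duplicates)."""
    adjacency = {n: set() for n in nodes}
    for u, v in rng.sample(list(combinations(nodes, 2)), n_edges):
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def null_distribution(nodes, n_edges, n_graphs, seed=0):
    """Highest core order of each of n_graphs random graphs."""
    rng = random.Random(seed)
    return [highest_core_order(random_graph(nodes, n_edges, rng))
            for _ in range(n_graphs)]


# Hypothetical "real" network: clique on 0..4 plus pendant nodes 5..9.
nodes = list(range(10))
real = {n: set() for n in nodes}
for u, v in list(combinations(range(5), 2)) + [(i, i + 5) for i in range(5)]:
    real[u].add(v)
    real[v].add(u)

n_edges = sum(len(nbs) for nbs in real.values()) // 2
null = null_distribution(nodes, n_edges, n_graphs=20)
print("real highest core order:", highest_core_order(real))
print("random graphs          :", sorted(null))
```

A real core order that exceeds everything seen in the random graphs is the kind of evidence the box plots are meant to convey; with 1,302 proteins one would sample edges without materializing every node pair.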
Assessment of Predictions
[Figure: box plots for the nine sub-networks: Cell Cycle, Energy, Protein Synthesis, Cell Rescue, Metabolism, Transcription, Cellular Communication, Protein Fate and Transport.]
The box plots show the distribution of k-cores with respect to their size in the 1000 random graphs corresponding to each sub-network, and the filled triangles show the size of the k-cores in the real PPI sub-networks.
Assessment of Predictions
• It can be theoretically concluded that the existence of higher-order k-core graphs in the PPI sub-networks, compared with random graphs of the same size, is likely to be because of interactions between proteins of similar function.
• Therefore, we assume that function predictions based on k-cores for values of k greater than the highest possible value of k for the corresponding random graphs are statistically significant predictions.
• Based on this, we predicted the functions of 67 proteins (the list is available online at http://kanaya.naist.jp/Kcore/supplementary/Function_prediction.xls).
"Prediction of Protein Functions Based on Protein-Protein Interaction Networks: A Min-Cut Approach", Md. Altaf-Ul-Amin, Toshihiro Koma, Ken Kurokawa, Shigehiko Kanaya, Proceedings of the Workshop on Biomedical Data Engineering (BMDE), Tokyo, Japan, pp. 37-43, April 3-4, 2005.
Outline
•Introduction
•The concept of Min-Cut
•Problem Formulation
•A Heuristic Method
•Evaluation of the Proposed Method
•Conclusions
Introduction
After the complete sequencing of several genomes, the
challenging problem now is to determine the functions of
proteins
1) Determining protein functions experimentally
2) Using various computational methods
a) sequence
b) structure
c) gene neighborhood
d) gene fusions
e) cellular localization
f) protein-protein interactions
Introduction
The present work predicts protein functions based on the protein-protein interaction network.
For the purpose of prediction, we consider the interactions of
• function-unknown proteins with function-known proteins and
• function-unknown proteins with function-unknown proteins
in the context of the whole network.
Introduction
The majority of protein-protein interactions are between protein pairs of similar function.
Therefore, we assign function-unknown proteins to different functional groups in such a way that the number of inter-group interactions becomes the minimum.
Hence we call the proposed approach a Min-Cut approach.
The concept of Min-Cut
[Figure: a typical and small network of known proteins (K), belonging to groups G1 and G2, and unknown proteins U1 to U4.]
[Figure: unknown proteins assigned to known groups based on majority interactions. Number of CUT = 4.]
[Figure: an alternative assignment of unknown proteins. Number of CUT = 2.]
For every assignment of unknown proteins, there is a value of CUT. The Min-Cut approach looks for an assignment for which the number of CUT is minimum.
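Counting CUT for a given assignment is straightforward, as the sketch below shows on a small hypothetical network (not the network of the figures). Following the rule stated later in the problem formulation, interactions between two known proteins are never counted. For a network this small, every assignment can simply be enumerated to find the minimum; names such as `cut_size` are illustrative.

```python
from itertools import product


def cut_size(edges, known_group, assignment):
    """Number of inter-group interactions that involve at least one unknown protein."""
    cut = 0
    for u, v in edges:
        if u in known_group and v in known_group:
            continue                   # known-known pairs are never part of CUT
        group_u = known_group.get(u, assignment.get(u))
        group_v = known_group.get(v, assignment.get(v))
        if group_u != group_v:
            cut += 1
    return cut


# Hypothetical network: K1, K2 belong to G1; K3, K4 belong to G2; U1, U2 are unknown.
known_group = {"K1": "G1", "K2": "G1", "K3": "G2", "K4": "G2"}
edges = [("K1", "K3"), ("U1", "K1"), ("U1", "K2"), ("U1", "K3"),
         ("U2", "K3"), ("U2", "K4"), ("U1", "U2")]

# Enumerate every assignment of the two unknown proteins to G1 or G2.
all_cuts = {
    (g1, g2): cut_size(edges, known_group, {"U1": g1, "U2": g2})
    for g1, g2 in product(["G1", "G2"], repeat=2)
}
for assignment, cut in sorted(all_cuts.items()):
    print(assignment, "CUT =", cut)
print("minimum CUT =", min(all_cuts.values()))
```

Exhaustive enumeration grows exponentially with the number of unknown proteins, which is why a heuristic is needed for real networks.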
Problem Formulation
Let $G_1, G_2, \ldots, G_n$ be n sets/groups of function-known proteins such that all proteins of a group are of similar function. Multiple-function proteins are members of more than one group. Therefore, the set of all function-known proteins is $G = \bigcup_{k=1}^{n} G_k$. The set of function-unknown proteins is denoted by $U$. $N(V, E)$ is a graph/network where $v_i \in V$ is a node representing a protein and $e_{ij}(v_i, v_j) \in E$ is an edge representing…….
Here we explain some points with a typical example.
Problem Formulation
[Figure: example network N(V, E) with known proteins K1 to K10 in groups G1, G2 and G3, and unknown proteins U1 to U8.]
V = set of all nodes
E = set of all edges
G = {K1, K2, K3, K4, K5, K6, K7, K8, K9, K10}
U = {U1, U2, U3, U4, U5, U6, U7, U8}
We generate $U' \subseteq U$ such that each protein of U´ is connected in N with at least one protein of group G by a path of length 1 or length 2.
U´ = {U1, U2, U3, U4, U5, U6, U7}
We can assign proteins of U´ to different groups and calculate CUT. Interactions between known protein pairs can never be part of CUT.
[Figure: one assignment of the unknown proteins.] For this assignment of unknown proteins, CUT = 6.
Problem Formulation
The problem we are trying to solve is to assign the proteins of set U´ to the known groups $G_1, G_2, \ldots, G_n$ in such a way that the CUT becomes the minimum.
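The set U´ of unknown proteins that are within a path of length 1 or 2 of some known protein can be built with a simple neighbourhood check. The sketch below uses a hypothetical chain K1 - U1 - U2 - U3 in which U3 is three steps away from the only known protein and is therefore excluded; the function name `reachable_unknowns` is illustrative.

```python
def reachable_unknowns(adjacency, known):
    """Unknown proteins joined to at least one known protein by a path of length 1 or 2."""
    u_prime = set()
    for protein, neighbours in adjacency.items():
        if protein in known:
            continue
        direct = bool(neighbours & known)                          # path of length 1
        two_step = any(adjacency[nb] & known for nb in neighbours)  # path of length 2
        if direct or two_step:
            u_prime.add(protein)
    return u_prime


# Hypothetical chain: K1 - U1 - U2 - U3
adjacency = {
    "K1": {"U1"},
    "U1": {"K1", "U2"},
    "U2": {"U1", "U3"},
    "U3": {"U2"},
}
known = {"K1"}

print(sorted(reachable_unknowns(adjacency, known)))   # ['U1', 'U2']
```

Proteins left outside U´ have too little connection to any known group for a prediction to be attempted.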
A Heuristic Method
•The problem at hand is a variant of the network partitioning problem.
•It is known that network partitioning problems are NP-hard.
•Therefore, we resort to heuristics to find as good a solution as possible.
A Heuristic Method
1. Set min_cut = |E| and iteration = 0.
2. Make a table for each protein of U´ (U1 to U7 in the example) containing a maximum of 3 IDs of its respective priority groups.
3. Assign each protein of U´ to some randomly or intentionally chosen group from among its priority groups.
4. Calculate CUT.
5. If CUT < min_cut (YES), set min_cut = CUT and record the current assignment; otherwise (NO) continue.
6. Set iteration = iteration + 1.
7. If iteration < max_value (YES), go back to step 3; otherwise (NO) print min_cut and the corresponding assignment, and exit.
A Heuristic Method
[Figures: the example network with the priority table being filled in step by step.]
U1 has one path of length 1 with G2 and two paths of length two with G1.
U4 has two paths of length 1 with G1, one path of length one with G2 and one path of length two with G3.
Priority groups (x = no further group):
Protein  1st  2nd  3rd
U1       G2   G1   x
U2       G2   G1   x
U3       G2   G1   x
U4       G1   G2   G3
U5       G1   G2   G3
U6       G1   G3   G2
U7       G3   G2   x
A Heuristic Method
1. Set min_cut = |E| and iteration = 0.
2. Make a table for each protein of U´ containing a maximum of 3 IDs of its respective priority groups:
   U1  G2 G1 x
   U2  G2 G1 x
   U3  G2 G1 x
   U4  G1 G2 G3
   U5  G1 G2 G3
   U6  G1 G3 G2
   U7  G3 G2 x
3. Assign each protein of U´ to some randomly or intentionally chosen group from among its priority groups.
4. Calculate CUT.
5. If CUT < min_cut (YES), set min_cut = CUT and record the current assignment; otherwise (NO) continue.
6. Set iteration = iteration + 1.
7. If iteration < max_value (YES), go back to step 3; otherwise (NO) print min_cut and the corresponding assignment, and exit.
A Heuristic Method
[Figure: the example network with all unknown proteins assigned to their highest priority groups.] By assigning all the unknown proteins to their respective highest priority groups, CUT = 6.
[Figure: a second assignment.] For this assignment of unknown proteins, CUT = 7.
[Figure: a third assignment.] For this assignment of unknown proteins, CUT = 4.
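The heuristic can be sketched in code as follows. This is an illustration under stated assumptions rather than the authors' program: the network and its priority lists are invented, the first iteration deliberately uses the highest priority groups (one possible "intentional" choice), and all later iterations choose randomly among the priority groups.

```python
import random


def cut_size(edges, known_group, assignment):
    """Inter-group interactions involving at least one unknown protein."""
    cut = 0
    for u, v in edges:
        if u in known_group and v in known_group:
            continue                   # known-known pairs are never part of CUT
        group_u = known_group.get(u, assignment.get(u))
        group_v = known_group.get(v, assignment.get(v))
        if group_u is not None and group_v is not None and group_u != group_v:
            cut += 1
    return cut


def min_cut_heuristic(edges, known_group, priority, max_value=200, seed=0):
    """Repeated assignment from priority groups, keeping the lowest CUT seen."""
    rng = random.Random(seed)
    min_cut, best = len(edges), None   # min_cut = |E|
    for iteration in range(max_value):
        if iteration == 0:
            # Intentional choice: every protein goes to its highest priority group.
            assignment = {p: groups[0] for p, groups in priority.items()}
        else:
            assignment = {p: rng.choice(groups) for p, groups in priority.items()}
        cut = cut_size(edges, known_group, assignment)
        if cut < min_cut:
            min_cut, best = cut, assignment
    return min_cut, best


# Hypothetical network: K1, K2 in G1; K3..K6 in G2; U1..U3 unknown.
known_group = {"K1": "G1", "K2": "G1",
               "K3": "G2", "K4": "G2", "K5": "G2", "K6": "G2"}
edges = [("K1", "K2"), ("U1", "K1"), ("U1", "U2"), ("U1", "U3"),
         ("U2", "K3"), ("U2", "K4"), ("U3", "K5"), ("U3", "K6")]
# Priority lists (at most 3 groups each), ordered by closeness to each group.
priority = {"U1": ["G1", "G2"], "U2": ["G2", "G1"], "U3": ["G2", "G1"]}

highest_priority_cut = cut_size(
    edges, known_group, {p: groups[0] for p, groups in priority.items()})
min_cut, best = min_cut_heuristic(edges, known_group, priority)
print("CUT with highest priority groups:", highest_priority_cut)
print("best CUT found:", min_cut, best)
```

As in the slides, assigning every protein to its highest priority group is not necessarily optimal: here it gives CUT = 2, while placing U1 in G2 together with its unknown neighbours gives CUT = 1.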
Evaluation of the Proposed Approach
•The proposed method is a general one and can be
applied to any organism and any type of functional
classification.
•Here we applied it to yeast Saccharomyces cerevisiae
protein-protein interaction network
•We obtain the protein-protein interaction data from
ftp://ftpmips.gsf.de/yeast/PPI/ which contains 15613
genetic and physical interactions.
Evaluation of the Proposed Approach
We discard self-interactions and extract a set of 12487 unique binary interactions involving 4648 proteins.
YAR019c   YMR001c
YAR019c   YNL098c
YAR019c   YOR101w
YAR019c   YPR111w
YAR027w   YAR030c
YAR027w   YBR135w
YAR031w   YBR217w
-------   -------
Total 12487 pairs
Evaluation of the Proposed Approach
A network of 12487 interactions and 4648 proteins is reasonably big
Evaluation of the Proposed Approach
We collect the classification data from http://mips.gsf.de/genre/proj/yeast/index.jsp
Table 1.
Name of functional class                                                            # of proteins
METABOLISM                                                                          984
ENERGY                                                                              260
CELL CYCLE AND DNA PROCESSING                                                       690
TRANSCRIPTION                                                                       842
PROTEIN SYNTHESIS                                                                   381
PROTEIN FATE (folding, modification, destination)                                   631
PROTEIN WITH BINDING FUNCTION OR COFACTOR REQUIREMENT (structural or catalytic)     39
PROTEIN ACTIVITY REGULATION                                                         27
CELLULAR TRANSPORT, TRANSPORT FACILITATION AND TRANSPORT ROUTES                     719
CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION MECHANISM                                94
CELL RESCUE, DEFENSE AND VIRULENCE                                                  296
INTERACTION WITH THE CELLULAR ENVIRONMENT                                           336
TRANSPOSABLE ELEMENTS, VIRAL AND PLASMID PROTEINS                                   118
BIOGENESIS OF CELLULAR COMPONENTS                                                   451
CELL TYPE DIFFERENTIATION                                                           339
Evaluation of the Proposed Approach
•The proposed approach is
intended to predict the functions
of function-unknown proteins.
•However, by predicting the
functions of function-unknown
proteins, it is not possible to
determine the correctness of the
predictions.
•We consider around 10%
randomly selected proteins of
each group of Table 1 as
function-unknown proteins.
Evaluation of the Proposed Approach
•The union of the 10% selections from all groups consists of 604 proteins. This is the unknown group U.
•The union of the remaining 90% of each of the functional groups constitutes the set of known proteins G. There are 3783 proteins in G in total.
•We generate $U' \subseteq U$ such that each protein of U´ is connected in N with at least one protein of group G by a path of length 1 or length 2. There are 470 proteins in U´.
•We predicted the functions of these 470 proteins using the proposed method.
Evaluation of the Proposed Approach
We applied the heuristic algorithm described above (start from min_cut = |E|, repeatedly assign each protein of U´ to one of its priority groups, calculate CUT, and record the assignment whenever CUT < min_cut) with max_value = 50000 to predict the functions of the 470 proteins.
Evaluation of the Proposed Approach
•We cannot guarantee that the minimum CUT corresponds to the maximum number of successful predictions.
•However, the trend of the results in the figure above shows that, very likely, the lower the value of CUT, the greater the number of successful predictions.
Evaluation of the Proposed Approach
We then examine the relation between successful predictions and the degrees of the proteins in the network.
[Figure: the example network; for instance, degree of U4 = 7 and degree of U7 = 3.]
Evaluation of the Proposed Approach
Degree   Number of proteins   Successful predictions   Percentage
1        128                  39                       30.46
2        80                   39                       48.75
3        60                   32                       53.33
4        33                   24                       72.72
5        23                   15                       65.21
6        24                   14                       58.33
7        17                   12                       70.58
>7       105                  71                       67.61
Total    470                  246                      52.34
•The success rate of prediction is as low as 30.46% for proteins that have a degree of only one in the interaction network.
•However, it is 67.61% for proteins that have a degree of 8 or more.
•This implies that the reliability of the prediction can be improved by providing a reasonable amount of interaction information.
[Figure: success percentage (0 to 100) plotted against degree (0 to 8).]
Centrality measures of nodes
Centrality measures
Within graph theory and network analysis, there are
various measures of the centrality of a vertex within a
graph that determine the relative importance of a
vertex within the graph.
We will discuss the following centrality measures:
•Degree centrality
•Betweenness centrality
•Closeness centrality
•Eigenvector centrality
•Subgraph centrality
Degree centrality
Degree centrality is defined as the number of links incident upon a node, i.e. the degree of the node.
Degree centrality is often interpreted in terms of the immediate risk of the node for catching whatever is flowing through the network (such as a virus, or some information).
[Figure: example graph.] The degree centrality of the blue nodes is higher.
Betweenness centrality
The vertex betweenness centrality BC(v) of a vertex v is defined as follows:
$BC(v) = \sum_{u \neq v \neq w} \frac{\sigma_{uw}(v)}{\sigma_{uw}}$
Here $\sigma_{uw}$ is the total number of shortest paths between nodes u and w, and $\sigma_{uw}(v)$ is the number of shortest paths between nodes u and w that pass through node v.
Vertices that occur on many shortest paths between other vertices have higher betweenness than those that do not.
Betweenness centrality
[Figure: example graph with nodes a, b, c, d, e and f.]
Calculation for node c (v = c):
Pair (u,w)   σuw   σuw(v)   σuw(v)/σuw
(a,b)        1     0        0
(a,d)        1     1        1
(a,e)        1     1        1
(a,f)        1     1        1
(b,d)        1     1        1
(b,e)        1     1        1
(b,f)        1     1        1
(d,e)        1     0        0
(d,f)        1     0        0
(e,f)        1     0        0
Betweenness centrality of node c = 6
Betweenness centrality of node a = 0
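The calculation for node c can be reproduced with breadth-first search, which yields both shortest-path distances and shortest-path counts. The edge list of the example graph is not recoverable from the slide, so the graph below is an assumption: it is one graph on nodes a to f that is consistent with the values in the calculation for node c. The function names are illustrative.

```python
from collections import deque
from itertools import combinations


def bfs_counts(adjacency, source):
    """Distances from source and number of shortest paths to each reachable node."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adjacency[u]:
            if w not in dist:              # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:     # u precedes w on a shortest path
                sigma[w] += sigma[u]
    return dist, sigma


def betweenness(adjacency, v):
    """Sum over unordered pairs (u, w), both different from v, of sigma_uw(v) / sigma_uw."""
    info = {s: bfs_counts(adjacency, s) for s in adjacency}
    dist_v, sigma_v = info[v]
    total = 0.0
    for u, w in combinations([n for n in adjacency if n != v], 2):
        dist_u, sigma_u = info[u]
        if v not in dist_u or w not in dist_u or w not in dist_v:
            continue                       # u, v, w are not all connected
        if dist_u[v] + dist_v[w] == dist_u[w]:
            # Shortest u-w paths through v = (paths u to v) * (paths v to w).
            total += sigma_u[v] * sigma_v[w] / sigma_u[w]
    return total


# Assumed example graph, consistent with the table for node c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d", "f"},
    "f": {"e"},
}

print("BC(c) =", betweenness(graph, "c"))   # 6.0
print("BC(a) =", betweenness(graph, "a"))   # 0.0
```

Computing one BFS per node and then checking every pair is adequate for small graphs; for large networks Brandes' algorithm is the usual choice.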
Betweenness centrality
•Nodes of high betweenness centrality are important for transport.
•If they are blocked, transport becomes less efficient; on the other hand, if their capacity is improved, transport becomes more efficient.
•Using a similar concept, edge betweenness is calculated.
[Figure: hue (from red = 0 to blue = max) shows the node betweenness. Source: http://en.wikipedia.org/wiki/Betweenness_centrality#betweenness]
Closeness centrality
The farness of a vertex is the sum of the shortest-path distances from the vertex to every other vertex in the graph. The reciprocal of farness is the closeness centrality (CC):
$CC(v) = \frac{1}{\sum_{t \in V \setminus v} d(v, t)}$
Here, d(v,t) is the shortest distance between vertex v and vertex t.
Closeness centrality can be viewed as the efficiency of a vertex in spreading information to all other vertices.
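Closeness centrality follows directly from a breadth-first search. The sketch below assumes an unweighted, connected graph and uses a hypothetical path graph a - b - c as the example, where the middle vertex has farness 2 and the end vertices have farness 3.

```python
from collections import deque


def closeness(adjacency, v):
    """Reciprocal of the sum of shortest-path distances from v (connected graph assumed)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adjacency[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    farness = sum(dist.values())       # d(v, v) = 0 contributes nothing
    return 1.0 / farness


# Hypothetical path graph a - b - c.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}

for node in path:
    print(node, round(closeness(path, node), 3))
```

The middle vertex b obtains the highest value (0.5), reflecting that it can reach every other vertex in the fewest steps.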
Eigenvector centrality
Let A be the adjacency matrix of a graph, λ the largest eigenvalue of A and x the corresponding eigenvector. Then
$A x = \lambda x$ -----(1)
where A is N×N and x is N×1. The eigenvalues satisfy $|A - \lambda I| = 0$, where I is an identity matrix.
The ith component of the eigenvector x then gives the eigenvector centrality score of the ith node in the network.
From (1):
$x_i = \frac{1}{\lambda} \sum_{j=1}^{N} A_{i,j} x_j$
•Therefore, for any node, the eigenvector centrality score is proportional to the sum of the scores of all nodes which are connected to it.
•Consequently, a node has a high value of EC either if it is connected to many other nodes or if it is connected to others that themselves have high EC.
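For a connected graph, the leading eigenvector can be approximated by power iteration. The following sketch multiplies by A + I rather than A, which leaves the eigenvectors unchanged but avoids the oscillation that plain power iteration shows on bipartite graphs. The example graph (a triangle 0-1-2 with an extra node 3 attached to node 2) is an assumption made for illustration.

```python
def eigenvector_centrality(A, iterations=200):
    """Return (lambda, x): largest eigenvalue and unit-length leading eigenvector of A."""
    n = len(A)
    x = [1.0] * n
    for _ in range(iterations):
        # One step of power iteration on (A + I).
        y = [x[i] + sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(value * value for value in y) ** 0.5
        x = [value / norm for value in y]
    Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    lam = sum(a * b for a, b in zip(Ax, x))    # Rayleigh quotient, since |x| = 1
    return lam, x


# Hypothetical graph: triangle 0-1-2 plus node 3 attached to node 2.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

lam, x = eigenvector_centrality(A)
print("lambda =", round(lam, 4))
print("scores =", [round(value, 4) for value in x])
```

Node 2, which is linked to every other node, receives the highest score, and node 3 scores lowest even though its single neighbour is the most central node.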
Subgraph centrality
The number of closed walks of length k starting and ending on vertex i in the network is given by the local spectral moments μk(i), which are simply defined as the ith diagonal entry of the kth power of the adjacency matrix, A:
$\mu_k(i) = (A^k)_{ii}$
Closed walks can be trivial or nontrivial and are directly related to the subgraphs of the network.
Source: Subgraph Centrality in Complex Networks, Physical Review E 71, 056103 (2005)
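Subgraph centrality
Adjacency matrix: Muv = 1 if there is an edge between nodes u and v and 0 otherwise.
M =
0 1 0 0 0 0 0 0 0 0 0 0 0 0
1 0 1 1 0 1 0 0 0 0 0 0 0 0
0 1 0 1 1 1 0 0 0 0 0 0 0 0
0 1 1 0 1 1 0 1 0 0 0 0 0 0
0 0 1 1 0 1 0 0 0 0 0 0 0 0
0 1 1 1 1 0 1 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 1 0 0 0
0 0 0 1 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 1 0 1 0 0 1 1
0 0 0 0 0 0 0 0 1 0 1 0 1 1
0 0 0 0 0 0 1 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 1 1 0 1 0 1
0 0 0 0 0 0 0 0 1 1 0 0 1 0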
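Subgraph centrality
(M²)uv for u ≠ v represents the number of common neighbours of the nodes u and v. The diagonal entries (M²)uu are the local spectral moments μ2(u).
M² =
1 0 1 1 0 1 0 0 0 0 0 0 0 0
0 4 2 2 3 2 1 1 0 0 0 0 0 0
1 2 4 3 2 3 1 1 0 0 0 0 0 0
1 2 3 5 2 3 1 0 1 0 0 0 0 0
0 3 2 2 3 2 1 1 0 0 0 0 0 0
1 2 3 3 2 5 0 1 0 0 1 0 0 0
0 1 1 1 1 0 2 0 0 1 0 0 0 0
0 1 1 0 1 1 0 2 0 1 0 0 1 1
0 0 0 1 0 0 0 0 4 2 1 1 2 2
0 0 0 0 0 0 1 1 2 4 0 1 2 2
0 0 0 0 0 1 0 0 1 0 2 0 1 1
0 0 0 0 0 0 0 0 1 1 0 1 0 1
0 0 0 0 0 0 0 1 2 2 1 0 4 2
0 0 0 0 0 0 0 1 2 2 1 1 2 3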
Subgraph centrality
The subgraph centrality of the node i is given by
$SC(i) = \sum_{k=0}^{\infty} \frac{\mu_k(i)}{k!}$
Let λ be the main eigenvalue of the adjacency matrix A. It can be shown that
$\mu_k(i) \leq \lambda^k$
Thus, the subgraph centrality of any vertex i is bounded above by
$SC(i) \leq e^{\lambda}$
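The local spectral moments and the subgraph centrality can be computed directly from powers of the adjacency matrix. The sketch below uses the 14-node adjacency matrix M shown earlier and the series SC(i) = Σ μk(i)/k! from the cited paper, truncated after a fixed number of terms (the truncation length is an assumption; the terms decay factorially). The bound printed at the end uses the maximum degree in place of the main eigenvalue, which is never smaller than that eigenvalue.

```python
import math

# Rows of the 14 x 14 adjacency matrix M from the slide.
rows = [
    "01000000000000", "10110100000000", "01011100000000", "01101101000000",
    "00110100000000", "01111010000000", "00000100001000", "00010000100000",
    "00000001010011", "00000000101011", "00000010010000", "00000000000010",
    "00000000110101", "00000000110010",
]
M = [[int(ch) for ch in row] for row in rows]


def matmul(A, B):
    """Plain matrix product of two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def subgraph_centrality(A, terms=40):
    """SC(i) = sum over k of (A^k)_ii / k!, truncated after `terms` powers."""
    n = len(A)
    power = [[int(i == j) for j in range(n)] for i in range(n)]   # A^0 = I
    sc = [0.0] * n
    factorial = 1.0
    for k in range(terms):
        if k > 0:
            power = matmul(power, A)       # power now holds A^k
            factorial *= k                 # factorial now holds k!
        for i in range(n):
            sc[i] += power[i][i] / factorial   # add mu_k(i) / k!
    return sc


M2 = matmul(M, M)
degrees = [sum(row) for row in M]
sc = subgraph_centrality(M)
bound = math.exp(max(degrees))             # e^(max degree) >= e^lambda

print("mu_2 (diagonal of M^2):", [M2[i][i] for i in range(len(M))])
print("subgraph centrality   :", [round(value, 2) for value in sc])
```

The diagonal of M² equals the node degrees, because every closed walk of length 2 goes to a neighbour and back; longer closed walks are what distinguish subgraph centrality from degree.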
Table 2. Summary of results of eight real-world complex networks.
Exploration of biological network centralities with CentiBiN
Björn H Junker, Dirk Koschützki* and Falk Schreiber
Address: Department of Molecular Genetics, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Corrensstr. 3, 06466 Gatersleben, Germany
Email: Björn H Junker - junker@ipk-gatersleben.de; Dirk Koschützki* - koschuet@ipk-gatersleben.de; Falk Schreiber - schreibe@ipk-gatersleben.de
BMC Bioinformatics 2006, 7:219 doi:10.1186/1471-2105-7-219