Inferring Functional Information from Domain Co-evolution
Yohan Kim, Mehmet Koyuturk, Umut Topkara, Ananth Grama and
Shankar Subramaniam
Gaurav Chadha
Deepak Desore
Layout
 Motivation
 Computational Methods and Algorithms
 Results
 Conclusion
 Questions
Motivation (1 of 2..)
 Prior Work

Focused on understanding Protein function at the level
of entire protein sequences

Assumption: Complete Sequence follows single
evolutionary trajectory
 It is well known that a domain can exist in various
contexts, which invalidates the above assumption for
multi-domain protein sequences
Motivation (2 of 2 ..)
 Our approach



An improvement on the multiple-profile method
Constructs a co-evolutionary matrix to assign a
phylogenetic similarity score to each protein
pair
Identifies co-evolving regions using residue-level conservation
Computational Methods &
Algorithms
 Constructing phylogenetic profiles



Protein(single) phylogenetic profiles
Segment(Multiple) phylogenetic profiles
Residue phylogenetic profiles
 Computing Co-evolutionary matrices
 Deriving phylogenetic similarity scores
Protein phylogenetic profiles
 A phylogenetic profile is a
vector that records whether
a protein is present in
each genome.
 Let P = {P1,P2,…,Pn} be the
set of proteins and,
G = {G1,G2,…,Gm} be the set
of Genomes
 Every row represents binary
phylogenetic profile of a
protein.
Protein phylogenetic profiles(contd.)
 Single phylogenetic profile ψi for protein Pi is,
$\psi_i(j) = -\dfrac{1}{\log E_{ij}}, \quad 1 \le j \le m$
where $E_{ij}$ is the minimum BLAST E-value of a local
alignment between Pi and Gj
 Advantage: gives degree of sequence divergence
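A minimal sketch of the profile computation above, in Python. The function name, the use of the natural logarithm, the truncation of values at 1.0, and the handling of E-values of 0 or at least 1 are all assumptions made for illustration; the slides only give the formula -1/log(Eij).

```python
import math

def protein_profile(e_values):
    """Map one protein's minimum BLAST E-values (one per genome) to psi_i(j).

    Assumptions (not stated in the slides):
      * natural logarithm;
      * E >= 1 is treated as "no hit" and mapped to 1.0;
      * values above 1.0 are truncated to 1.0;
      * E == 0 maps to 0.0, the limit of -1/log(E) as E -> 0.
    """
    profile = []
    for e in e_values:
        if e <= 0.0:
            profile.append(0.0)
        elif e >= 1.0:
            profile.append(1.0)
        else:
            profile.append(min(1.0, -1.0 / math.log(e)))
    return profile

# Hypothetical E-values of one protein against four genomes.
example = protein_profile([1e-10, 0.5, 2.0, 0.0])
```

Smaller entries correspond to stronger hits, so the profile keeps the degree of divergence instead of collapsing it to presence or absence.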
Protein phylogenetic profiles(contd.)
 Mutual Information I(X,Y) defined as,
I(X,Y) = H(X) + H(Y) – H(X,Y),
where H(X), the Shannon entropy of X, is defined as
$H(X) = -\sum_{x \in X} p_x \log p_x$,
and $p_x = P[X = x]$
 Phylogenetic similarity between ψi and ψj is
$\mu_S(P_i, P_j) = I(\psi_i, \psi_j)$
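The mutual information between two profiles can be sketched as follows. Real-valued profiles have to be discretized before entropies can be estimated; the equal-width binning on [0, 1], the default of 10 bins, the base-2 logarithm, and the function names are assumptions for illustration, not details given in the slides.

```python
import math
from collections import Counter

def entropy(symbols):
    """Shannon entropy H(X) = -sum p_x log2 p_x, estimated from counts."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

def discretize(profile, bins=10):
    """Assign each value in [0, 1] to one of `bins` equal-width bins (assumption)."""
    return [min(int(v * bins), bins - 1) for v in profile]

def mutual_information(profile_a, profile_b, bins=10):
    """I(X,Y) = H(X) + H(Y) - H(X,Y) over the discretized profiles."""
    x = discretize(profile_a, bins)
    y = discretize(profile_b, bins)
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

# Identical profiles share all their information; independent ones share none.
same = mutual_information([0.0, 1.0, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0])
indep = mutual_information([0.0, 0.0, 1.0, 1.0], [0.0, 1.0, 0.0, 1.0])
```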
Segment phylogenetic profiles
 Single profile based methods could miss significant
interactions.
 Domain D12 of P2 follows evolutionary trajectory
similar to P1 and P3 which single profile method didn’t
capture.
Segment phylogen. profiles(contd.)
 Dividing each protein Pi into fixed size segments
$S^1_i, S^2_i, \ldots, S^k_i$
 Phylogenetic similarity between two proteins,
$\mu_M(P_i, P_j) = \max_{s,t} I(\psi^s_i, \psi^t_j)$,
where $\psi^s_i$ is the phylogenetic profile of segment $S^s_i$ of
protein Pi
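The segment-based score can be sketched in two steps: cut each protein into fixed-size intervals, then take the best pairwise score over all segment pairs. The helper names are hypothetical, the segments here are non-overlapping (an assumption), and `score` stands in for the mutual information I; the toy `agreement` function below is only a placeholder so the example runs on its own.

```python
def fixed_segments(length, size):
    """Split residues 1..length into consecutive intervals of `size` residues.

    The last interval may be shorter. Non-overlapping segments are an
    assumption made for this sketch.
    """
    return [(start, min(start + size - 1, length))
            for start in range(1, length + 1, size)]

def segment_similarity(profiles_i, profiles_j, score):
    """mu_M: maximum of score(psi_i^s, psi_j^t) over all segment pairs (s, t)."""
    return max(score(p, q) for p in profiles_i for q in profiles_j)

def agreement(p, q):
    """Toy stand-in for mutual information: fraction of matching entries."""
    return sum(a == b for a, b in zip(p, q)) / len(p)

segments = fixed_segments(10, 4)
best = segment_similarity([[1, 0, 1, 0], [1, 1, 0, 0]],
                          [[0, 0, 1, 1], [1, 1, 0, 0]],
                          agreement)
```

Because the maximum is taken over pairs, one well-matched segment is enough to give the protein pair a high score, which is what lets this method see a domain such as D12 that the single-profile method averages away.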
Residue phylogenetic profiles
 Problem with multiple phylogenetic profiles:
 Both domains covered together by the segment S22,
overriding their individual phylogenetic profiles.
 Significant local alignment between two proteins
corresponds to the residues covered in the alignment
rather than the whole sequences.
Residue phylog. profiles(contd.)
 A(Pi,Gj) – set of significant local alignments between
Protein Pi and Genome Gj
 T(A) = [rb,re] – interval of residues on Pi
corresponding to each alignment A Є A(Pi,Gj)
 For each residue r on Pi phylogenetic profile is
$\psi^r_i(j) = \min_{A \in \mathcal{A}_r} \left( -\dfrac{1}{\log E(A)} \right), \quad 1 \le j \le m$
where $\mathcal{A}_r = \{A \in \mathcal{A}(P_i, G_j) : r \in T(A)\}$ is the set of local
alignments that contain r
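A sketch of how residue profiles could be built from alignment intervals. Each alignment is represented as a hypothetical `(start, end, e_value)` tuple with 1-based inclusive residue positions. The natural logarithm, the truncation at 1.0, and the default value of 1.0 for residues that no alignment covers are assumptions, not details from the slides.

```python
import math

def alignment_score(e_value):
    """-1/log(E), truncated to [0, 1] (truncation is an assumption)."""
    if e_value <= 0.0:
        return 0.0
    if e_value >= 1.0:
        return 1.0
    return min(1.0, -1.0 / math.log(e_value))

def residue_profiles(length, alignments_per_genome):
    """Return a length x m table whose entry [r-1][j] is psi_i^r(j).

    alignments_per_genome[j] lists the significant local alignments of the
    protein against genome j as (start, end, e_value) tuples.
    """
    m = len(alignments_per_genome)
    profiles = [[1.0] * m for _ in range(length)]  # 1.0 = residue not covered
    for j, alignments in enumerate(alignments_per_genome):
        for start, end, e_value in alignments:
            s = alignment_score(e_value)
            for r in range(start, end + 1):
                # Keep the minimum over all alignments that contain residue r.
                profiles[r - 1][j] = min(profiles[r - 1][j], s)
    return profiles

# Hypothetical 5-residue protein: two overlapping hits in genome 0, none in genome 1.
table = residue_profiles(5, [[(1, 3, 1e-10), (2, 5, 1e-20)], []])
```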
Computing co-evolutionary matrices
 For each protein pair Pi and Pj with lengths li and lj,
the co-evolutionary matrix entry is
$M_{ij}(r,s) = I(\psi^r_i, \psi^s_j), \quad 1 \le r \le l_i, \; 1 \le s \le l_j$
 The Co-evolutionary Matrix contains

Information about which regions of the two proteins co-evolved

The co-evolved domain(s) appear as a block of high
mutual information scores in the matrix
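A sketch of the matrix construction, assuming the residue profiles have already been discretized into integer symbols (the discretization step and the base-2 logarithm are assumptions; the function names are hypothetical).

```python
import math
from collections import Counter

def mutual_information(x, y):
    """I(X,Y) = H(X) + H(Y) - H(X,Y) for two equal-length symbol lists."""
    def h(symbols):
        n = len(symbols)
        return -sum(c / n * math.log2(c / n) for c in Counter(symbols).values())
    return h(x) + h(y) - h(list(zip(x, y)))

def coevolution_matrix(residues_i, residues_j):
    """M[r][s] = I(profile of residue r on P_i, profile of residue s on P_j)."""
    return [[mutual_information(pr, ps) for ps in residues_j]
            for pr in residues_i]

# Two residues per protein, four genomes each (toy discretized profiles).
M = coevolution_matrix([[0, 0, 1, 1], [0, 1, 0, 1]],
                       [[1, 1, 0, 0], [0, 0, 0, 0]])
```

In this toy input only the first residue of each protein varies together across genomes, so only `M[0][0]` is high; in real data a co-evolved domain pair would show up as a contiguous block of such entries.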
Deriving phylogenetic similarity
scores
 The phylogenetic similarity score between two proteins
Pi and Pj is,
$\mu_C(P_i, P_j) = \max_{\substack{1 \le r \le l_i \\ 1 \le s \le l_j}} \; \min_{\substack{r \le a \le r+W \\ s \le b \le s+W}} M_{ij}(a,b)$
where W is the window parameter that quantifies the
minimum size of the region on a protein to be
considered as a conserved domain.
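The max-min score can be sketched directly from the formula. Each window spans a..r+W and b..s+W inclusive, so it covers (W+1) x (W+1) entries. Restricting windows to those that fit entirely inside the matrix is an assumption of this sketch, and the function name is hypothetical.

```python
def window_score(matrix, w):
    """mu_C: the largest, over all windows, of the smallest entry in the window.

    Windows cover rows r..r+w and columns s..s+w (inclusive) and must lie
    fully inside the matrix (assumption).
    """
    if not matrix or len(matrix) <= w or len(matrix[0]) <= w:
        raise ValueError("matrix is smaller than the window")
    rows, cols = len(matrix), len(matrix[0])
    best = float("-inf")
    for r in range(rows - w):
        for s in range(cols - w):
            block_min = min(matrix[a][b]
                            for a in range(r, r + w + 1)
                            for b in range(s, s + w + 1))
            best = max(best, block_min)
    return best

M = [[0.9, 0.1, 0.0, 0.0],
     [0.1, 0.8, 0.7, 0.0],
     [0.0, 0.6, 0.9, 0.0]]
score = window_score(M, 1)  # 0.6
```

The isolated 0.9 in the top-left corner does not set the score; the block of consistently high entries does, which is why W acts as a minimum domain size.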
Results
 Implemented and tested on 4311 E. coli proteins
 152 genomes (131 Bacteria, 17 Archaea, 4 Eukaryota)
 Down-sampling factor f = 30, window W = 2
 These values translate into overlapping segments 60
residues long
 Excluded homologous proteins from analysis
 Define p-value as fraction of non-homologous protein
pairs (N)
Results (contd.)





MIS – Mutual Information Score
PP – No. of predicted protein pairs
PPV = TP / (TP + FP)
For all μ*, coverage = TP + FP
TN and FN are the no. of protein pairs that do not meet the threshold
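A sketch of how these quantities could be computed at one score threshold. The function name, the dictionary layout, the use of `>=` for meeting the threshold, and the requirement that every known interacting pair also appears in `scores` are assumptions; sensitivity uses the standard definition TP / (TP + FN).

```python
def evaluate(scores, interacting, threshold):
    """Count TP/FP/FN/TN for predicted pairs and derive PPV and coverage.

    scores: {(protein_a, protein_b): mutual information score}
    interacting: set of pairs known to interact (all assumed present in scores)
    """
    predicted = {pair for pair, s in scores.items() if s >= threshold}
    tp = len(predicted & interacting)
    fp = len(predicted - interacting)
    fn = len(interacting - predicted)
    tn = len(scores) - tp - fp - fn
    return {
        "TP": tp, "FP": fp, "FN": fn, "TN": tn,
        "coverage": tp + fp,
        "PPV": tp / (tp + fp) if predicted else float("nan"),
        "sensitivity": tp / (tp + fn) if interacting else float("nan"),
    }

# Hypothetical scores for four protein pairs, three of which truly interact.
scores = {("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "c"): 0.2, ("c", "d"): 0.7}
result = evaluate(scores, {("a", "b"), ("b", "c"), ("c", "d")}, threshold=0.5)
```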
Results (contd.)


Co-evolutionary matrix has 1.5 times greater coverage at PPV = 0.7 than the
single profile method
At same no. of PP, Co-evolutionary matrix has better PPV and sensitivity
values than single profile method
Results (contd.)
Mutual Information score
distribution for interacting
and non-interacting protein
pairs
 At 0 MIS, SP shows a
peak while CM
does not. In other words,
SP places more protein pairs
at low MIS than CM does
Results (contd.)
 Shows p-values of Single Profile
method vs. Co-evolutionary
Matrix method
 Scattered circles show that
the two methods can predict
very differently
Results (contd.) – Phosphotransferase system

Domain IIA (residues 1-170) and domain IIB (residues 170-320)
 Darker region shows that the domains have co-evolved. So we can
conclude that IIB evolved with IIC rather than IIA

Top-20 predicted interacting partners of protein IIAB for both methods
Results (contd.) - Chemotaxis

N-terminus of CheA (residues 1-200) and C-terminus
of CheA (residues 540-670) co-evolved with the C-terminus region of CheB (residues 170-340)

Top-20 predicted interacting partners of protein CheA
using both methods
Results (contd.) – Kdp System

N-terminal domain of KdpD (residues 1-395)
co-evolved with KdpC

Top-10 predicted interacting partners of
protein KdpD using both methods
Conclusion
 Results in this paper strongly suggest that
co-evolution of proteins should be captured at the
domain level

Because domains with conflicting evolutionary histories
can co-exist in a single protein sequence

Regions that are important for supporting both
functional and physical interactions between proteins
can be detected
Questions
Thank You !!