PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming
Zhiyong Wang and Jinbo Xu
Toyota Technological Institute at Chicago
Web server at http://raptorx.uchicago.edu
See http://arxiv.org/abs/1308.1975 for an extended version
Problem Definition
Contact : Distance between two Cα or Cβ atoms < 8Å
1J8B
short range: 6-12 AAs apart
medium range: 12-24 AAs
long range: >24 AAs apart
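The definition above can be illustrated with a minimal Python sketch. It assumes one representative atom (Cα or Cβ) per residue, given as a 3D coordinate, and it assigns the shared boundary values 12 and 24 to the medium range; that boundary choice and the function names are assumptions, not taken from the slides.

```python
import math

def sequence_range(i, j):
    """Classify a residue pair by its sequence separation |i - j|."""
    sep = abs(i - j)
    if sep < 6:
        return None        # closer than 6 residues: not considered
    if sep < 12:
        return "short"
    if sep <= 24:          # assumption: 12 and 24 both count as medium
        return "medium"
    return "long"

def contact_map(coords, cutoff=8.0):
    """Return all pairs (i, j), i < j, whose atoms are closer than cutoff (in Angstrom)."""
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) < cutoff:
                contacts.add((i, j))
    return contacts

# Toy chain: 30 atoms on a straight line, 1 Angstrom apart.
toy_coords = [(float(k), 0.0, 0.0) for k in range(30)]
toy_contacts = contact_map(toy_coords)
```

In this toy chain, pair (0, 7) is a contact and pair (0, 8) is not, because the cutoff is a strict inequality.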
Existing Work
Residue co-evolution method: mutual information (MI), PSICOV, Evfold
– Needs a large number of homologous sequences
– PSICOV and Evfold are better than MI because they differentiate direct from indirect residue couplings (residues A and C are indirectly coupled if their coupling is due to direct A-B and B-C couplings)
– PSICOV and Evfold also enforce sparsity
Supervised learning method: NNcon, SVMcon, CMAPpro
– Uses mutual information, sequence profile and other features
– Predicts contacts one by one, ignoring their correlation
– Does not differentiate direct and indirect residue couplings
First-principle method: Astro-Fold
– No evolutionary information
– Minimizes contact potential
– Enforces physical feasibility, including sparsity
Our Method: PhyCMAP
1. Focus on proteins with few sequence homologs
– proteins with many sequence homologs very likely have similar templates in PDB
2. Integrate by machine learning
– seq profile, residue co-evolution and non-evolutionary info
– (implicitly) differentiate direct and indirect residue couplings through feature engineering
3. Enforce physical constraints, which imply sparsity
Info used by Random Forests
• Evolution info from a single protein family
– sequence profile
– co-evolution: 2 types of mutual information (MI)
• Non-evolution info from the whole structure space: residue contact potential
• Mixed info from the above 2 sources
– homologous pairwise contact score
– EPAD: context-specific evolutionary-based distance-dependent statistical potential
• amino acid physicochemical properties
Mutual Information
1. Contrastive Mutual Information (CMI): remove local background by measuring the MI difference between one pair and its neighbors.
2. Chaining effect of residue couplings: MI, MI^2, MI^3, MI^4, equivalent to (1-MI), (1-MI)^2, (1-MI)^3, (1-MI)^4 (see http://arxiv.org/abs/1308.1975 for more details)
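The chaining idea in item 2 can be sketched as follows, assuming that the powers are matrix powers of the L×L mutual information matrix (an assumption; the slides do not say whether the powers are matrix or element-wise). The helper names and the toy values are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mi_powers(mi, max_power=4):
    """Return [MI, MI^2, ..., MI^max_power], assuming matrix powers."""
    powers = [mi]
    for _ in range(max_power - 1):
        powers.append(mat_mul(powers[-1], mi))
    return powers

# Toy case: residues 0-1 and 1-2 are strongly coupled, 0-2 has no direct coupling.
toy_mi = [[0.0, 0.9, 0.0],
          [0.9, 0.0, 0.8],
          [0.0, 0.8, 0.0]]
toy_mi2 = mi_powers(toy_mi)[1]
```

Here the entry of the squared matrix for pair (0, 2) is 0.9 × 0.8, so the indirect coupling through residue 1 becomes visible as a separate feature.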
CMI Example: 1J8B
• Upper triangle: mutual information
• Lower triangle: contrastive mutual information
• Blue boxes: native contacts
Homologous Pairwise Contact Score
Probability of a residue pair forming a contact between 2
secondary structures.
PSbeta (a, b): prob of two AAs a and b forming a beta contact
PShelix (a, b): prob of two AAs a and b forming a helix contact
H: the set of sequence homologs in a multiple seq alignment
$PS(i,j) = \frac{1}{|H|} \sum_{h \in H} PS_{beta}(h_i, h_j)$ or $\frac{1}{|H|} \sum_{h \in H} PS_{helix}(h_i, h_j)$
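A minimal sketch of this average, assuming the multiple sequence alignment is a list of equal-length strings and the pairwise probabilities are stored in a dictionary keyed by amino acid pairs. The table values below are made up for illustration, and scoring missing pairs (for example gaps) as 0 is also an assumption.

```python
def homologous_pairwise_score(msa, i, j, ps_table):
    """Average ps_table[(a, b)] over the residues at columns i and j of every homolog."""
    total = 0.0
    for seq in msa:
        # Pairs absent from the table (e.g. gaps) contribute 0: an assumption.
        total += ps_table.get((seq[i], seq[j]), 0.0)
    return total / len(msa)

# Hypothetical alignment and hypothetical beta-contact probabilities.
toy_msa = ["AVL", "AIL", "GVL"]
toy_ps_beta = {("A", "L"): 0.4, ("G", "L"): 0.1}
toy_score = homologous_pairwise_score(toy_msa, 0, 2, toy_ps_beta)
```

The same function would be called with a helix table to obtain the helix variant of the score.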
Training Random Forests
• Training dataset
– Chosen before CASP10 started
– 900 non-redundant protein structures
– <25% sequence identity
– All contacts and 20% of non-contacts
• Model parameters
– Number of features: 300
– Number of trees: 500
– 5 fold cross validation
Select Physically Feasible Contacts by
Integer Linear Programming
X_{i,j}: indicates a contact between two residues i and j
R_r: a relaxation variable of the rth soft constraint
g(R): penalty for violation of physical constraints
Maximize the accumulated contact probability while minimizing violations of physical constraints:
$\max_{X,R} \sum_{j-i \ge 6} X_{i,j} P_{i,j} - g(R)$
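To make the objective concrete, the sketch below scores one candidate selection of contacts. It assumes g(R) is the plain sum of the relaxation variables, which the slides do not specify; the function name and the toy numbers are hypothetical.

```python
def objective(selected, prob, relax, min_sep=6):
    """Sum of P[i, j] over selected pairs with j - i >= min_sep, minus the penalty g(R)."""
    gain = sum(prob[(i, j)] for (i, j) in selected if j - i >= min_sep)
    penalty = sum(relax)  # assumed form of g(R): sum of all relaxation variables
    return gain - penalty

# Hypothetical predicted probabilities and a candidate selection.
toy_prob = {(0, 6): 0.9, (1, 9): 0.4, (2, 5): 0.8}
toy_selected = {(0, 6), (1, 9), (2, 5)}
toy_value = objective(toy_selected, toy_prob, relax=[1, 0])
```

Pair (2, 5) is ignored because its sequence separation is below 6, so the value is 0.9 + 0.4 minus a penalty of 1.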
Soft Constraints 1
The number of contacts that a residue of secondary structure type s1 forms with residues of type s2 is limited (H: helix, E: strand, C: coil)
s1,s2   95%   Max
H,H     5     12
H,E     3     10
H,C     4     11
E,H     4     12
E,E     9     13
E,C     6     15
C,H     3     12
C,E     5     12
C,C     6     20
$\forall i,\ SStype(i) = s_1:\quad \sum_{j:\ SStype(j) = s_2} X_{i,j} - R_1 \le b_{s_1,s_2}$
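As an illustration of this soft constraint, the sketch below computes the smallest non-negative relaxation that would satisfy the bound for one residue. Using the "95%" column as the bound b is an assumption, as are the function name and the toy data.

```python
# Limits per (s1, s2) pair, copied from the "95%" column of the table.
# Treating this column as the bound b is an assumption.
BOUND_95 = {("H", "H"): 5, ("H", "E"): 3, ("H", "C"): 4,
            ("E", "H"): 4, ("E", "E"): 9, ("E", "C"): 6,
            ("C", "H"): 3, ("C", "E"): 5, ("C", "C"): 6}

def relaxation_needed(contacts, ss, i, s2, bounds=BOUND_95):
    """Smallest R >= 0 such that (#contacts of i with type-s2 residues) - R <= bound."""
    count = sum(1 for (a, b) in contacts
                if (a == i and ss[b] == s2) or (b == i and ss[a] == s2))
    return max(0, count - bounds[(ss[i], s2)])

# Toy protein: an 8-residue helix followed by an 8-residue strand.
toy_ss = "H" * 8 + "E" * 8
toy_contacts_ss = {(0, 8), (0, 9), (0, 10), (0, 11), (0, 3)}
toy_r1 = relaxation_needed(toy_contacts_ss, toy_ss, 0, "E")
```

Residue 0 (helix) has four contacts with strand residues, one more than the H,E limit of 3, so a relaxation of 1 is needed.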
Soft Constraints 2
Upper and lower bounds for #contacts between
two beta strands
$\sum_{i \in SSeg(v),\ j \in SSeg(u)} X_{i,j} + R_2 \ge 3 \times S_{u,v} \times \min(Len(u), Len(v))$
$\sum_{i \in SSeg(v),\ j \in SSeg(u)} X_{i,j} \le 3.3 \times \max(Len(u), Len(v)) + R_3$
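A small sketch of how the two bounds interact, assuming the lower bound is relaxed by R2 and the upper bound by R3, and that both relaxations are non-negative integers. The function name and arguments are hypothetical.

```python
import math

def strand_pair_relaxations(n_contacts, len_u, len_v, paired=True):
    """Return the smallest integer (R2, R3) satisfying both strand-pair bounds."""
    s_uv = 1 if paired else 0                   # S_{u,v}: do strands u and v form a sheet?
    lower = 3 * s_uv * min(len_u, len_v)
    upper = 3.3 * max(len_u, len_v)
    r2 = max(0, lower - n_contacts)             # shortfall below the lower bound
    r3 = max(0, math.ceil(n_contacts - upper))  # excess above the upper bound
    return r2, r3

too_few = strand_pair_relaxations(12, 5, 7)    # lower bound is 15
too_many = strand_pair_relaxations(25, 5, 7)   # upper bound is about 23.1
```

With strands of length 5 and 7, 12 contacts fall 3 short of the lower bound, while 25 contacts exceed the upper bound and need an integer relaxation of 2.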
Soft Constraints 3
Statistics show that only 3.4% of loop segments have a contact between their start and end residues.
Hard Constraints 1
• For parallel contacts between two β strands,
the contacts of neighboring residue pairs
should satisfy the following constraints
$X_{i,j} \ge X_{i-1,j-1} + X_{i+1,j+1} - 1$
• For anti-parallel contacts
$X_{i,j} \ge X_{i-1,j+1} + X_{i+1,j-1} - 1$
Hard Constraints 2
1) One residue cannot form contacts with both j
and j+2 when j and j+2 are in the same alpha
helix
$X_{i,j} + X_{i,j+2} \le 1$
2) One beta-strand can form beta-sheets
with up to 2 other beta-strands.
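The first rule can be checked on a candidate contact set with a short sketch. It assumes helix membership is given as a dictionary from residue index to a helix label, with non-helix residues absent; the names are hypothetical.

```python
def helix_rule_violations(contacts, helix_id):
    """List triples (i, j, j + 2) where i contacts both j and j + 2 of the same helix."""
    pairs = set()
    for a, b in contacts:          # treat contacts as unordered pairs
        pairs.add((a, b))
        pairs.add((b, a))
    violations = []
    for (i, j) in sorted(pairs):
        same_helix = (helix_id.get(j) is not None
                      and helix_id.get(j) == helix_id.get(j + 2))
        if same_helix and (i, j + 2) in pairs:
            violations.append((i, j, j + 2))
    return violations

# Toy helix spanning residues 10-13.
toy_helix = {10: 0, 11: 0, 12: 0, 13: 0}
toy_violations = helix_rule_violations({(2, 10), (2, 12), (3, 11)}, toy_helix)
```

Residue 2 contacts both residues 10 and 12 of the same helix, so this set would be infeasible under the rule.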
Test Datasets
• CASP10: 123 proteins
– 36 are “hard”, i.e., no similar templates in PDB
– low sequence identity (<25%) among them
– low seq id with the training data, which were chosen
before CASP10 started
• Set600: 601 proteins
– share <25% seq ID with the training proteins
– each has ≥50 AAs and an X-ray structure with
resolution <1.9Å
– each has ≥5 AAs with predicted secondary structure
being alpha-helix or beta-strand
Accuracy w.r.t. #sequence homologs
1. Meff: #non-redundant sequence homologs of a protein
2. Divide the CASP10 targets into groups by Meff
3. Accuracy of the top L/10 predicted medium- and long-range contacts
[Figure: accuracy plotted against log Meff]
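The top L/k accuracy used throughout the results can be sketched as below: rank the predicted pairs by probability, keep the top L/k, and report the fraction that are native contacts. Treating a separation of at least 12 as medium- and long-range, and the function name itself, are assumptions.

```python
def top_accuracy(prob, native, length, k=10, min_sep=12):
    """Fraction of the top length // k predicted pairs (j - i >= min_sep) that are native."""
    ranked = sorted((pair for pair in prob if pair[1] - pair[0] >= min_sep),
                    key=lambda pair: prob[pair], reverse=True)
    top = ranked[: max(1, length // k)]
    if not top:
        return 0.0
    return sum(pair in native for pair in top) / len(top)

# Hypothetical predictions for a protein of length 40.
toy_pred = {(0, 20): 0.9, (1, 30): 0.8, (2, 15): 0.7,
            (3, 25): 0.6, (4, 18): 0.5, (5, 8): 0.95}
toy_native = {(0, 20), (2, 15), (5, 8)}
toy_acc = top_accuracy(toy_pred, toy_native, length=40, k=10)
```

Pair (5, 8) has the highest probability but is excluded by the separation filter, so two of the top four pairs are native.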
Results on CASP10 – Medium Range
[Figure: two comparison plots, PhyCMAP vs. CMAPpro and PhyCMAP vs. PSICOV]
Overall accuracy on top L/5 predicted Cβ contacts:
PhyCMAP: 0.465, CMAPpro: 0.370, PSICOV: 0.316
Results on CASP10 – Long Range
[Figure: two comparison plots, PhyCMAP vs. CMAPpro and PhyCMAP vs. PSICOV]
Overall accuracy on top L/5 predicted Cβ contacts:
PhyCMAP: 0.373, CMAPpro: 0.313, PSICOV: 0.315
Results on 36 hard CASP10 targets
[Figure: two comparison plots, PhyCMAP vs. CMAPpro and PhyCMAP vs. PSICOV]
Accuracy on top L/5 predicted medium- and long-range Cβ contacts:
PhyCMAP: 0.363, CMAPpro: 0.308, PSICOV: 0.180
Results on Set600 with few homologs
(Meff ≤ 100)
[Figure: two comparison plots, PhyCMAP vs. CMAPpro and PhyCMAP vs. PSICOV]
Accuracy on top L/5 predicted medium- and long-range Cβ contacts:
PhyCMAP: 0.345, CMAPpro: 0.287, PSICOV: 0.059
Example: T0677-D2
Dozens of sequence homologs Meff=31
Upper triangle: native Cβ contacts
Left lower triangle: PhyCMAP accuracy 0.357
Right lower triangle: Evfold accuracy ~0
Note contacts between alpha helices are not continuous
Example: T0693-D2
Many sequence homologs Meff=2208
Upper triangle: native Cβ contacts
Left lower triangle: PhyCMAP accuracy 0.744
Right lower triangle: Evfold accuracy 0.419
Example: T0701-D1
Many sequence homologs Meff=3300
Upper triangle: native Cβ contacts
Left lower triangle: PhyCMAP accuracy 0.794
Right lower triangle: Evfold accuracy 0.444
Example: T0756-D1
Many sequence homologs Meff=1824
Upper triangle: native Cβ contacts
Left lower triangle: PhyCMAP accuracy 0.944
Right lower triangle: Evfold accuracy 0.500
Summary
• Combining seq profile, residue co-evolution and non-evolutionary info can result in good accuracy even for proteins with 10-100 non-redundant seq homologs
• Physical constraints are helpful for proteins with few sequence homologs
[Figure: Cβ accuracy on 130 proteins with Meff ≤ 100, with vs. without physical constraints, for the top L/10 and L/5 short-range contacts and medium- and long-range contacts]
Acknowledgements
• Student: Zhiyong Wang
• Funding
– NIH R01GM0897532
– NSF CAREER award
– Alfred P. Sloan Research Fellowship
• Computational resources
– University of Chicago Beagle team
– TeraGrid
Protein contact
Contact : Distance between two Cα or Cβ atoms < 8Å; or
Distance between the closest atoms of 2 residues.
1J8B
short range: 6-12 AAs apart
medium range: 12-24 AAs
long range: >24 AAs apart
Why contact prediction?
• Contacts describe spatial and functional
relationship of residues
• Contains key information for 3D structure
• Useful for protein structure prediction
• Used for protein structure alignment and
classification
Contrastive Mutual Information
Contrastive Mutual Information (CMI) removes
local background, by measuring the MI
difference between one pair of residues and
neighboring pairs.
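One simple way to realize this idea is sketched below: subtract from each MI value the mean MI of its four neighboring pairs in the matrix. This particular form is an assumption chosen for illustration; the exact formula is in the extended version at http://arxiv.org/abs/1308.1975.

```python
def contrastive_mi(mi):
    """Subtract from each MI[i][j] the mean MI of its in-range grid neighbours.

    Illustrative form only; requires a matrix of size at least 2 x 2.
    """
    n = len(mi)
    cmi = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            neighbours = [mi[a][b]
                          for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                          if 0 <= a < n and 0 <= b < n]
            cmi[i][j] = mi[i][j] - sum(neighbours) / len(neighbours)
    return cmi

# Uniform background of 0.5 with one elevated pair in the middle.
toy_background = [[0.5, 0.5, 0.5],
                  [0.5, 0.9, 0.5],
                  [0.5, 0.5, 0.5]]
toy_cmi = contrastive_mi(toy_background)
```

In a region of uniform background the score is 0, while the elevated pair keeps its excess of 0.4 over its neighbors.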
Integer Linear Programming
• Objective function: $\max_{X,R} \sum_{j-i \ge 6} X_{i,j} P_{i,j} - g(R)$
• g(R): penalty for violation of physical constraints
Variables   Explanations
X_{i,j}     equal to 1 if there is a contact between two residues i and j.
AP_{u,v}    equal to 1 if two beta-strands u and v form an anti-parallel beta-sheet.
P_{u,v}     equal to 1 if two beta-strands u and v form a parallel beta-sheet.
S_{u,v}     equal to 1 if two beta-strands u and v form a beta-sheet.
T_{u,v}     equal to 1 if there is an alpha-bridge between two helices u and v.
R_r         a non-negative integral relaxation variable of the rth soft constraint.
Hard Constraints 3
One beta-strand can form beta-sheets with up
to 2 other beta-strands.
$\sum_{v:\ SStype(v) = beta} S_{u,v} \le 2$
Global constraints
• Antiparallel and parallel contacts
$AP_{u,v} + P_{u,v} = S_{u,v}$
• A residue contact implies a segment-wise
contact
$X_{i,j} \le S_{u,v},\ \forall i \in SSeg(u),\ j \in SSeg(v)$
• Limit the total number of contacts
$\sum_{1 \le i < j \le L,\ j - i \ge 6} X_{i,j} \le k$
– k is the number of top contacts we want to predict.
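To show how the contact limit k combines with a hard constraint, the toy sketch below selects contacts by exhaustive search rather than with an ILP solver, which is only feasible for a handful of candidate pairs. It enforces the rule that a residue cannot contact both j and j+2 of the same helix, checked only for pairs stored as (i, j) with j on the helix side; all names and numbers are hypothetical.

```python
from itertools import combinations

def best_contacts(prob, k, helix_id):
    """Brute-force search: at most k contacts with maximal total probability."""
    pairs = sorted(prob)
    best, best_score = (), 0.0
    for size in range(1, k + 1):
        for subset in combinations(pairs, size):
            chosen = set(subset)
            # Hard constraint: X[i, j] + X[i, j + 2] <= 1 within one helix.
            feasible = all(
                not ((i, j + 2) in chosen
                     and helix_id.get(j) is not None
                     and helix_id.get(j) == helix_id.get(j + 2))
                for (i, j) in chosen)
            score = sum(prob[p] for p in subset)
            if feasible and score > best_score:
                best, best_score = subset, score
    return set(best), best_score

toy_probs = {(0, 10): 0.9, (0, 12): 0.8, (1, 11): 0.6, (2, 20): 0.3}
toy_helix_ids = {10: 0, 11: 0, 12: 0}
toy_best, toy_best_score = best_contacts(toy_probs, k=2, helix_id=toy_helix_ids)
```

The two most probable pairs, (0, 10) and (0, 12), cannot both be selected, so the search falls back to (0, 10) together with (1, 11).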
Results on Set600 with many sequence
homologs (Meff > 100)
[Figure: two comparison plots, PhyCMAP vs. CMAPpro and PhyCMAP vs. PSICOV]
Accuracy on top L/5 predicted medium- and long-range Cβ contacts:
PhyCMAP: 0.611, CMAPpro: 0.515, PSICOV: 0.569
Contribution of HPS and CMI features
Average Cβ accuracy on the 471 proteins with Meff > 100
[Figure: accuracy with vs. without CMI and HPS features, for the top L/10 and L/5 short-range contacts and medium- and long-range contacts]
Contribution of physical constraints
Average Cβ accuracy on 130 proteins with Meff ≤ 100
[Figure: accuracy with vs. without physical constraints, for the top L/10 and L/5 short-range contacts and medium- and long-range contacts]