101 Optimal PDB Structure Alignments:
A Branch-and-Cut Algorithm for the Maximum Contact Map Overlap Problem
Giuseppe Lancia
Robert Carr
Brian Walenz
Sorin Istrail
Contact Maps
CONTACT MAPS
Unfolded protein
Folded protein = contacts
Contact map = graph
OBJECTIVE: align 3D folds of proteins = align contact maps
Contact Map of a Self-Avoiding Walk
Contact map (adjacency matrix) of the five-residue walk; rows and columns are residues 1 to 5:
    1 2 3 4 5
1   0 0 0 1 0
2   0 0 0 1 0
3   0 0 0 1 1
4   1 1 1 0 0
5   0 0 1 0 0
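As a minimal sketch, the matrix above can be turned into an explicit list of contacts (the edges of the contact-map graph). The function name `contacts_from_matrix` and the 1-based residue numbering are illustrative assumptions, not something the slides define.

```python
# Adjacency matrix copied from the slide (rows/columns are residues 1..5).
CONTACT_MAP = [
    "00010",
    "00010",
    "00011",
    "11100",
    "00100",
]


def contacts_from_matrix(rows):
    """Return the contacts (i, j) with i < j, 1-based, of a symmetric 0/1 matrix."""
    n = len(rows)
    contacts = []
    for i in range(n):
        for j in range(i + 1, n):
            if rows[i][j] == "1":
                # A contact map is an undirected graph, so the mirror entry must be set too.
                assert rows[j][i] == "1", "matrix is not symmetric"
                contacts.append((i + 1, j + 1))
    return contacts


print(contacts_from_matrix(CONTACT_MAP))
```

Each pair is one edge of the graph, so this walk has four contacts.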
Contact Map Alignments
Non-crossing Alignments
An alignment is a non-crossing map between the residues of protein 1 and the residues of protein 2
The value of an alignment
Value = 3
We want to maximize the value
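The value of an alignment can be sketched as a count of contacts of protein 1 whose two endpoints are mapped onto a contact of protein 2. The helper names, the second contact map and the alignment below are hypothetical; only the first contact map comes from the slides.

```python
def is_non_crossing(alignment):
    """True if no two lines (i, j) cross or share a residue."""
    lines = sorted(alignment)
    return all(a[0] < b[0] and a[1] < b[1] for a, b in zip(lines, lines[1:]))


def alignment_value(contacts1, contacts2, alignment):
    """Number of contacts shared by the two maps under the given alignment."""
    if not is_non_crossing(alignment):
        raise ValueError("alignment lines cross or share a residue")
    image = dict(alignment)  # residue of protein 1 -> residue of protein 2
    edges2 = {tuple(sorted(c)) for c in contacts2}
    value = 0
    for i, p in contacts1:
        # A contact is shared only if both of its endpoints are aligned
        # and their images are in contact in protein 2.
        if i in image and p in image:
            if tuple(sorted((image[i], image[p]))) in edges2:
                value += 1
    return value


contacts1 = [(1, 4), (2, 4), (3, 4), (3, 5)]            # contact map of the 5-residue walk
contacts2 = [(1, 4), (2, 4), (3, 4), (4, 6)]            # hypothetical second protein
alignment = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 6)]    # hypothetical alignment
print(alignment_value(contacts1, contacts2, alignment))
```

The maximum contact map overlap problem asks for the non-crossing alignment that makes this count as large as possible.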
Integer Programming Formulation
The use of Integer Linear Programming
* Exact solution
* Heuristic + guarantee (LP upper bound)
0-1 VARIABLES
$y_{ef}$ for each contact $e$ of protein 1 and each contact $f$ of protein 2
CONSTRAINTS
$y_{ef} + y_{e'f'} \le 1$ for incompatible pairs $(e,f)$ and $(e',f')$
OBJECTIVE
$\max \sum_{e} \sum_{f} y_{ef}$
Independent Set Problem
It’s just a huge max independent set problem in $G_y$:
• a node for each sharing
• an edge for each pair of incompatible sharings
$|G_y| = |E_1| \cdot |E_2|$ (approximately 5000; exactly 75 × 75 = 5625 for two proteins with 50 residues and 75 contacts each)
The best exact algorithms for independent set can handle at most a few hundred nodes
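A small sketch of the independent set view, for illustration only. It assumes that a sharing $(e, f)$ with $e = (i, p)$ and $f = (j, q)$ uses the two lines $(i, j)$ and $(p, q)$, and that two sharings are incompatible when their lines do not all fit into one non-crossing one-to-one map. The function names are hypothetical, and the brute-force search is only usable on toy instances.

```python
from itertools import combinations, product


def lines_of(sharing):
    """A sharing (e, f) with e = (i, p), f = (j, q) uses the lines (i, j) and (p, q)."""
    (i, p), (j, q) = sharing
    return {(i, j), (p, q)}


def lines_compatible(a, b):
    """Two lines fit in one alignment if they are equal or strictly ordered on both sides."""
    (i, j), (k, l) = a, b
    return a == b or (i < k and j < l) or (k < i and l < j)


def incompatible(s, t):
    lines = lines_of(s) | lines_of(t)
    return any(not lines_compatible(a, b) for a, b in combinations(lines, 2))


def build_gy(contacts1, contacts2):
    """Nodes are sharings; edges join incompatible sharings."""
    nodes = list(product(contacts1, contacts2))
    edges = {frozenset((s, t)) for s, t in combinations(nodes, 2) if incompatible(s, t)}
    return nodes, edges


def max_independent_set(nodes, edges):
    """Brute force over subsets, largest first (exponential; toy sizes only)."""
    for size in range(len(nodes), 0, -1):
        for subset in combinations(nodes, size):
            if all(frozenset(pair) not in edges for pair in combinations(subset, 2)):
                return list(subset)
    return []


contacts = [(1, 3), (2, 4), (3, 5)]      # hypothetical contact map, aligned with itself
nodes, edges = build_gy(contacts, contacts)
print(len(nodes), len(max_independent_set(nodes, edges)))
```

Even this toy instance has 3 × 3 = 9 nodes, which shows how quickly $G_y$ grows with the number of contacts.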
Node to Node Variables
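New variables x provide an easy check for the non-crossing conditions
NEW VARIABLES
$x_{ij}$ for each residue $i$ of protein 1 and each residue $j$ of protein 2
NEW CONSTRAINTS
$x_{ij} + x_{i'j'} \le 1$ for each pair of crossing lines $[i,j]$, $[i',j']$
$y_{(ip)(jq)} \le x_{ij}$ and $y_{(ip)(jq)} \le x_{pq}$ for each contact $(i,p)$ of protein 1 and each contact $(j,q)$ of protein 2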
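To illustrate how the x variables drive the y variables, here is a brute-force sketch for tiny instances. It enumerates every non-crossing alignment (a 0-1 choice of x), and for each one sets $y_{(ip)(jq)} = 1$ exactly when both $x_{ij}$ and $x_{pq}$ are 1, which is the best y allowed by the linking constraints. The function names are assumptions, and this enumeration is not the branch-and-cut method of the talk.

```python
from itertools import combinations


def non_crossing_alignments(n1, n2):
    """Yield every non-crossing one-to-one map between residues 1..n1 and 1..n2."""
    for k in range(min(n1, n2) + 1):
        for left in combinations(range(1, n1 + 1), k):
            for right in combinations(range(1, n2 + 1), k):
                # Matching two sorted subsets in order never creates a crossing.
                yield list(zip(left, right))


def best_alignment(contacts1, n1, contacts2, n2):
    """Exhaustive search; contacts must be given as (a, b) with a < b."""
    edges2 = set(contacts2)
    best_value, best_x = -1, None
    for x in non_crossing_alignments(n1, n2):
        image = dict(x)
        # y_(ip)(jq) can be 1 only if x_ij = x_pq = 1 and (j, q) is a contact.
        value = sum(
            1
            for i, p in contacts1
            if i in image and p in image and (image[i], image[p]) in edges2
        )
        if value > best_value:
            best_value, best_x = value, x
    return best_value, best_x


walk = [(1, 4), (2, 4), (3, 4), (3, 5)]   # contact map of the 5-residue walk
print(best_alignment(walk, 5, walk, 5))
```

The number of alignments grows as a binomial coefficient in the protein lengths, so this only serves to check small cases by hand.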
Clique Constraints
Variables x define a graph $G_x$:
• A node for each line
• An edge between each pair of crossing lines
• $G_x$ is much smaller than $G_y$
• $G_x$ has nice properties (it’s a perfect graph)
• It’s easier to find large independent sets in $G_x$
Clique Constraints
Non-crossing constraints can be extended to
CLIQUE CONSTRAINTS
$\sum_{[i,j] \in M} x_{ij} \le 1$
for all sets M of mutually incompatible (i.e. crossing) lines
All clique constraints satisfied (and $G_x$ perfect) imply a strong bound!
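A short sketch of why clique constraints are stronger than the pairwise ones. It assumes that two distinct lines are incompatible whenever they are not strictly ordered on both sides (so lines sharing a residue count as crossing); the names `crosses`, `is_clique` and `clique_lhs` are illustrative.

```python
from itertools import combinations


def crosses(a, b):
    """Distinct lines [i,j] and [k,l] are incompatible unless strictly ordered on both sides."""
    (i, j), (k, l) = a, b
    return a != b and not ((i < k and j < l) or (k < i and l < j))


def is_clique(lines):
    """True if the lines are mutually crossing, i.e. form a clique in G_x."""
    return all(crosses(a, b) for a, b in combinations(lines, 2))


def clique_lhs(x, lines):
    """Left-hand side of the clique inequality: sum of x over the lines in M."""
    return sum(x.get(line, 0.0) for line in lines)


M = [(1, 3), (2, 2), (3, 1)]                 # three mutually crossing lines
x_frac = {line: 0.5 for line in M}           # a fractional LP point

pairwise_ok = all(x_frac[a] + x_frac[b] <= 1 for a, b in combinations(M, 2))
print(is_clique(M), pairwise_ok, clique_lhs(x_frac, M))
```

The fractional point satisfies every pairwise constraint yet has a clique sum of 1.5, so the single clique inequality cuts it off.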
Structure of Maximal Cliques in $G_x$
1. Pick two subsets of the same size (one per protein)
2. Connect them in a zig-zag fashion
3. Throw in all lines included in a zig or a zag
The result is a maximal clique in $G_x$
Separation of Clique Inequalities
PROBLEM
There exist exponentially many such cliques ($O(2^{2n})$ inequalities).
How do we add them?
SOLUTION
We don’t add them to the original LP, but only when needed at run time. Not all of them will be needed, so we are fine as long as…
SEPARATION
…we can generate in polynomial time a clique inequality when needed, i.e., when violated by the current LP solution $x^*$:
$\sum_{[i,j] \in M} x^*_{ij} > 1$
THEOREM
We can find the most violated clique inequality in time $O(n^2)$
Separation of Clique Inequalities
PROOF (sketch)
1) Clique = zigzag path
2) Flip one graph: zigzag becomes left-to-right
3) Define lengths for arcs so that length(P) = $x^*$(clique(P))
4) Use dynamic programming to find P of max length in time $O(n^2)$
(Figure: nodes labeled 1 to 8; in the flipped graph they appear in reverse order, 8 to 1.)
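The slides do not give the arc lengths used by the dynamic program, so the sketch below is a different, brute-force stand-in: it enumerates cliques of crossing lines among those with $x^*_{ij} > 0$ and returns the heaviest one. It takes exponential time and only illustrates what the separation routine must return; `most_violated_clique` and the sample point are hypothetical.

```python
def crosses(a, b):
    """Distinct lines are incompatible unless strictly ordered on both sides."""
    (i, j), (k, l) = a, b
    return a != b and not ((i < k and j < l) or (k < i and l < j))


def most_violated_clique(x_star):
    """Return (weight, clique) maximizing the sum of x* over mutually crossing lines."""
    lines = [line for line, value in x_star.items() if value > 0]
    best_weight, best_clique = 0.0, []

    def extend(clique, weight, candidates):
        nonlocal best_weight, best_clique
        if weight > best_weight:
            best_weight, best_clique = weight, list(clique)
        for idx, line in enumerate(candidates):
            # Keep only later candidates that cross the line just added.
            still_ok = [m for m in candidates[idx + 1:] if crosses(line, m)]
            extend(clique + [line], weight + x_star[line], still_ok)

    extend([], 0.0, lines)
    return best_weight, best_clique


x_star = {(1, 3): 0.5, (2, 2): 0.5, (3, 1): 0.5, (4, 4): 1.0}   # hypothetical LP point
weight, clique = most_violated_clique(x_star)
print(weight, clique, "violated" if weight > 1 else "no cut found")
```

In a cutting-plane loop, a returned weight above 1 means the corresponding clique inequality is added to the LP and the LP is solved again.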
Genetic Algorithm Heuristic
Genetic Algorithm Overview
• A population of candidate solutions that evolve (improve) over time
• Recombination creates new candidate solutions via crossover and mutation
(Figure: population at time t, then recombination operators and evaluation function, then population at time t+1)
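The loop in the overview can be sketched generically as follows. The survivor rule (keep the best of parents plus offspring) is an assumption, since the slides do not state one, and the bit-string problem is a toy stand-in for alignments so that the example stays self-contained.

```python
import random


def evolve(population, evaluate, recombine, generations, rng):
    """Run a simple generational loop and return the final population."""
    size = len(population)
    for _ in range(generations):
        # Recombination: build one new candidate per slot from two random parents.
        offspring = [
            recombine(rng.choice(population), rng.choice(population), rng)
            for _ in range(size)
        ]
        # Evaluation: assumed elitist survival, keep the `size` best candidates.
        population = sorted(population + offspring, key=evaluate, reverse=True)[:size]
    return population


def toy_recombine(a, b, rng):
    """One-point crossover followed by a single bit flip (toy operators)."""
    cut = rng.randrange(1, len(a))
    child = list(a[:cut] + b[cut:])
    flip = rng.randrange(len(child))
    child[flip] = 1 - child[flip]
    return tuple(child)


rng = random.Random(0)
start = [tuple(rng.randint(0, 1) for _ in range(12)) for _ in range(8)]
final = evolve(start, sum, toy_recombine, generations=20, rng=rng)
print(max(map(sum, start)), max(map(sum, final)))
```

With elitist survival the best evaluation in the population can never decrease from one generation to the next.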
Crossover
• Crossover selects pieces from both parents and creates two offspring solutions
• Select a set of edges in one parent to copy to the child
  * Copy as many edges as possible from the other parent (edges that conflict with existing edges are not copied)
  * Add random edges to fill any remaining space
(Figure: Blue Parent, Red Parent, Offspring)
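A minimal sketch of the crossover steps above, with an alignment stored as a list of edges (i, j). How the initial set of edges is selected and how many random edges are tried are not specified in the slides, so the coin flip per edge and the `fill_attempts` parameter are assumptions.

```python
import random


def conflicts(a, b):
    """Edges conflict if they cross, share an endpoint, or are identical."""
    (i, j), (k, l) = a, b
    return not ((i < k and j < l) or (k < i and l < j))


def add_if_compatible(child, edge):
    """Append the edge only if it conflicts with nothing already in the child."""
    if all(not conflicts(edge, other) for other in child):
        child.append(edge)


def crossover(parent_a, parent_b, n1, n2, rng, fill_attempts=10):
    # Step 1: select a set of edges in one parent (here: each edge with probability 1/2).
    child = [edge for edge in parent_a if rng.random() < 0.5]
    # Step 2: copy as many edges as possible from the other parent.
    for edge in parent_b:
        add_if_compatible(child, edge)
    # Step 3: try random edges to fill any remaining space.
    for _ in range(fill_attempts):
        add_if_compatible(child, (rng.randint(1, n1), rng.randint(1, n2)))
    return sorted(child)


rng = random.Random(1)
blue = [(1, 1), (2, 3), (4, 4), (6, 5)]     # hypothetical parents on 6 + 6 residues
red = [(1, 2), (3, 3), (5, 5), (6, 6)]
first = crossover(blue, red, 6, 6, rng)
second = crossover(red, blue, 6, 6, rng)    # swapping roles gives the second offspring
print(first, second)
```

Because every edge is checked before it is added, both offspring are valid non-crossing alignments by construction.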
Mutation
• Mutation introduces small changes to existing solutions by shifting edge endpoints
• Select a set of endpoints to shift
  * Top or bottom?
  * All edges to the left or right of a selected edge?
  * Shift to the left or the right?
  * An edge that “falls off” the end of the contact map is removed
  * Randomly add new edges
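The mutation steps can be sketched as below. Several details are assumptions because the slides leave them open: the shift is by one position, the selected edge moves together with the edges on its chosen side, and a shifted edge that collides with an unshifted one is dropped.

```python
import random


def conflicts(a, b):
    """Edges conflict if they cross, share an endpoint, or are identical."""
    (i, j), (k, l) = a, b
    return not ((i < k and j < l) or (k < i and l < j))


def mutate(alignment, n1, n2, rng, add_attempts=3):
    edges = sorted(alignment)
    child = []
    if edges:
        side = rng.choice((0, 1))                 # 0 = top endpoints, 1 = bottom endpoints
        pivot = rng.randrange(len(edges))         # the selected edge
        right_part = rng.choice((True, False))    # shift edges right or left of the pivot
        delta = rng.choice((-1, 1))               # shift to the left or to the right
        limit = (n1, n2)[side]
        moved = []
        for idx, edge in enumerate(edges):
            selected = idx >= pivot if right_part else idx <= pivot
            if not selected:
                child.append(edge)
                continue
            shifted = list(edge)
            shifted[side] += delta
            if 1 <= shifted[side] <= limit:       # otherwise the edge "fell off"
                moved.append(tuple(shifted))
        for edge in moved:
            if all(not conflicts(edge, other) for other in child):
                child.append(edge)
    # Randomly add new edges where they fit.
    for _ in range(add_attempts):
        edge = (rng.randint(1, n1), rng.randint(1, n2))
        if all(not conflicts(edge, other) for other in child):
            child.append(edge)
    return sorted(child)


rng = random.Random(2)
print(mutate([(1, 1), (2, 3), (4, 4), (6, 5)], 6, 6, rng))
```

Shifting a whole side of the alignment by one position models a small insertion or deletion while keeping most shared contacts intact.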
Computational Results
Branch-and-Cut Results
• 269 proteins
  * 64 to 72 residues
  * 80 to 140 contacts
• Selected 597 pairs of proteins out of 36046 possible
  * roughly as many similar pairs as dissimilar pairs
Optimality  Number of  Average/Max    Average/Max    Num. GA  Num. LS1  Num. LS2
Gap         Instances  Num. Residues  Num. Contacts  Best     Best      Best
0           42         66.4/69        61.1/92        38       25        5
1           48         66.8/72        56.3/89        44       20        0
2           72         66.7/71        57.3/93        63       35        0
3           71         67.0/72        59.7/95        61       31        1
4           76         67.0/71        61.5/88        64       33        5
5           95         66.8/72        64.7/89        74       35        12
>5          193        66.8/72        71.4/133       155      82        53
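As a quick consistency check on the reported numbers, the count of possible pairs follows from the number of proteins, and the instance counts per optimality gap (42, 48, 72, 71, 76, 95, 193) should add up to the number of selected pairs. The variable names are illustrative.

```python
from math import comb

proteins = 269
possible_pairs = comb(proteins, 2)      # unordered pairs of distinct proteins

# Number of instances per optimality gap, as listed in the results.
instances_per_gap = {"0": 42, "1": 48, "2": 72, "3": 71, "4": 76, "5": 95, ">5": 193}
selected_pairs = sum(instances_per_gap.values())

print(possible_pairs, selected_pairs)
```

Both totals agree with the figures quoted on the slide (36046 possible pairs, 597 selected).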
Skolnick Clustering Test
Skolnick Test Results
Four Families
1 Flavodoxin-like fold, CheY-related
  alpha-beta; 8 structures; up to 124 residues; 15-30% sequence similarity; < 3 Å RMSD
  Proteins: 1b00, 1dbw, 1nat, 1ntr, 1qmp, 1rnl, 3cah, 4tmy
2 Plastocyanin
  beta; 8 structures; up to 99 residues; 35-90% sequence similarity; < 2 Å RMSD
  Proteins: 1baw, 1byo, 1kdi, 1nin, 1pla, 3b3i, 2pcy, 2plt
3 TIM Barrel
  alpha-beta; 11 structures; up to 250 residues; 30-90% sequence similarity; < 2 Å RMSD
  Proteins: 1amk, 1aw2, 1b9b, 1btm, 1hti, 1tmh, 1tre, 1tri, 1ydv, 3ypi, 8tim
4 Ferritin
  alpha; 6 structures; up to 170 residues; 7-70% sequence similarity; < 4 Å RMSD
  Proteins: 1b71, 1bcf, 1dps, 1fha, 1ier, 1rcd
Clustering
Define score(P1, P2) as
$0 \le \frac{\#\ \text{shared contacts}}{\min(\#\ \text{contacts of } P_1,\ \#\ \text{contacts of } P_2)} \le 1$
Put P1, P2 in same family if score(P1, P2) >= threshold
If P1, P2 are too big, use the G.A. and local search to compute the score
The L.P. then gives bounds:
HEURISTIC score <= OPT score <= LP bound
and we know how far off OPT we are
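The score and its bounds can be sketched as follows. The numbers in the usage lines and the threshold value are hypothetical; rounding the LP bound down relies only on the fact that the optimal number of shared contacts is an integer.

```python
import math


def score(shared_contacts, contacts_p1, contacts_p2):
    """Shared contacts divided by the smaller contact count; always in [0, 1]."""
    return shared_contacts / min(contacts_p1, contacts_p2)


def score_bounds(heuristic_value, lp_bound, contacts_p1, contacts_p2):
    """Lower and upper bound on the optimal score.

    heuristic_value: shared contacts found by the G.A. or local search.
    lp_bound: LP relaxation value; OPT is an integer, so it is at most floor(lp_bound).
    """
    upper_value = math.floor(lp_bound + 1e-9)   # small tolerance for numerical noise
    lower = score(heuristic_value, contacts_p1, contacts_p2)
    upper = score(upper_value, contacts_p1, contacts_p2)
    return lower, upper


def same_family(score_value, threshold):
    return score_value >= threshold


low, high = score_bounds(30, 33.7, 80, 95)      # hypothetical instance
print(low, high, same_family(low, threshold=0.35))
```

If the lower bound already reaches the threshold, or the upper bound stays below it, the family decision does not depend on closing the optimality gap.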
Clustering validation
We got some known families from biologists, PDB.
Experiment: Take a family F of proteins and align them against each other and against the remaining proteins.
TYPICAL BEHAVIOUR
score    proteins were…
0.05     MISMATCH
0.1      MISMATCH
0.15     MISMATCH
0.2      MISMATCH
0.25     MISMATCH
0.3      MISMATCH
0.35     MATCH
……       ……
1.0      MATCH
Skolnick Results

Performance: 528 alignments
  * 1.3% false negative
  * 0.0% false positive
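The number of alignments matches all pairs among the 8 + 8 + 11 + 6 structures of the four families. The sketch below checks that count and shows one way to tally false negatives and false positives of the threshold rule; the scores, labels and threshold in the usage lines are hypothetical, and the slides do not say which denominator their percentages use.

```python
from math import comb

FAMILY_SIZES = {1: 8, 2: 8, 3: 11, 4: 6}
total_structures = sum(FAMILY_SIZES.values())
all_pairs = comb(total_structures, 2)
same_family_pairs = sum(comb(size, 2) for size in FAMILY_SIZES.values())


def count_errors(scores, family_of, threshold):
    """Return (false_negatives, false_positives) of the rule score >= threshold."""
    false_neg = false_pos = 0
    for (a, b), value in scores.items():
        same = family_of[a] == family_of[b]
        predicted_same = value >= threshold
        if same and not predicted_same:
            false_neg += 1
        elif predicted_same and not same:
            false_pos += 1
    return false_neg, false_pos


family_of = {"a1": 1, "a2": 1, "b1": 2}                                # hypothetical labels
scores = {("a1", "a2"): 0.62, ("a1", "b1"): 0.12, ("a2", "b1"): 0.31}  # hypothetical scores
print(all_pairs, same_family_pairs, count_errors(scores, family_of, 0.35))
```

Raising the threshold trades false positives for false negatives, which is why the gap between the MISMATCH and MATCH score ranges matters.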
Structural Genomics
Structure Similarity
Structure Alignment
Fold Recognition
Fold Assignment
Structure Alignment
“Structural Twins”: structurally similar, but sequentially unrelated proteins
Agreement that two proteins (or a group) are similar, but no agreement on how they are similar
Variety of similarity criteria proposed, e.g., global similarity based on RMS on alpha carbons, or local similarity based on packing and interaction patterns
Standard of Truth
Structure alignment regarded as a “standard of truth” for sequence alignment
• Best scoring matrices
• Alignment strategies
Published structure alignments differ in almost all positions
Quantifying Similarity
RMS optimal superposition of alpha carbon atoms
Difference between distance maps
Contact map overlap
Scoring Functions
Non-locality of scoring functions
Insertions and deletions
Capturing geometric pattern
Contact Map
Homology Modeling: The Differential Protein Folding Problem
The prediction of change in structure from change in sequence
The hypothesis: small changes in sequence often produce relatively small changes in protein structure
Fold Recognition: The Sequence-Structure Alignment Problem
Identify a folding pattern of a protein as similar to one or more known structures in a library of protein folds
Measure of sequence-structure compatibility
Ab initio Protein Folding Problem
• Full or partial set of three-dimensional coordinates
• Assignment of secondary structures
• Sets of contacts between residues
• Sets of contacts between helices and strands
Detection of Structural Similarity
Similarity of two sets of atoms with known correspondences
sets have same size; correspondence order preserving
Similarity of two sets of atoms with unknown correspondence
sets have different sizes; insertions/deletions; correspondence order preserving
Similarity of two sets of atoms with unknown correspondence
no restriction on the correspondence
The Assessment of Structure Prediction Problem
Less general than the problem of finding maximal common structures of proteins because the alignment is fixed
Finding common substructures is not well formulated because substructures of different sizes will fit to different accuracies