http://dimacs.rutgers.edu/SpecialYears/2001_Data/Algorithms/MDSTalk.ppt

A Novel Geometric Build-Up Algorithm
for Solving the Distance Geometry Problem and Its
Application to Multidimensional Scaling
Zhijun Wu
Department of Mathematics
Program on Bio-Informatics and Computational Biology
Iowa State University
Joint Work with
Tauqir Bibi, Feng Cui, Qunfeng Dong,
Peter Vedell, Di Wu
Distance Geometry
Multidimensional Scaling
mapping from semi-metric to metric spaces
Euclidean and non-Euclidean
data classification
geometric mapping of data
fundamental problem:
find the coordinates for a set of points, given the distances for all pairs of points
Cayley-Menger determinant
necessary & sufficient conditions of embedding
singular-value decomposition method
strain/stress minimization
Molecular Conformation
embedding in 3D Euclidean space
protein structure prediction and determination
sparse, inexact distances, bounds on the distances, probability distributions
Proteins are building blocks of life
and key ingredients of biological
processes.
A biological system may have up to
hundreds of thousands of different
proteins, each with a specific role in
the system.
A protein is formed by a polypeptide
chain with typically several hundred
amino acids and tens of thousands
of atoms.
A protein has a unique 3D structure,
which determines in many ways the
function of the protein.
An example: HIV Retrotranscriptase
(554 amino acids, 4200 atoms)
Molecular Distance Geometry Problem
Given n atoms $a_1, \ldots, a_n$ and a set of distances $d_{i,j}$ between $a_i$ and
$a_j$, $(i,j) \in S$,
find the coordinates $x_1, \ldots, x_n$ for $a_1, \ldots, a_n$ such that
\[ \|x_i - x_j\| = d_{i,j}, \quad (i,j) \in S \]
Problems and Complexity
problems with all distances:
\[ \|x_i - x_j\| = d_{i,j}, \quad (i,j) \in D = \{(i,j) \mid 1 \le i < j \le n\} \]
solvable in O(n^3) using SVD
problems with sparse sets of distances:
\[ \|x_i - x_j\| = d_{i,j}, \quad (i,j) \in S \subset D \]
NP-complete (Saxe 1979)
problems with distance ranges (NMR results):
\[ l_{i,j} \le \|x_i - x_j\| \le u_{i,j}, \quad (i,j) \in S \subset D \]
NP-complete (Moré and Wu 1997), if the ranges are small
problems with probability distributions of distances:
\[ \|x_i - x_j\| = d_{i,j}, \quad (i,j) \in S \subset D, \quad d_{i,j} \sim ([l_{i,j}, u_{i,j}],\ p_{i,j}) \]
stochastic multidimensional scaling, structure prediction
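The "all distances" case can be illustrated with a minimal sketch of classical multidimensional scaling. It is an illustration under stated assumptions, not the code used in the talk: the function names `classical_mds` and `pairwise_distances` are hypothetical, and a symmetric eigendecomposition is used in place of an SVD (for the symmetric positive semidefinite Gram matrix the two factorizations coincide).

```python
import numpy as np

def classical_mds(D, dim=3):
    """Recover coordinates from a full matrix of pairwise Euclidean distances.

    The result is determined only up to rotation, translation and reflection.
    """
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    G = -0.5 * J @ (D ** 2) @ J             # Gram matrix of the centered points
    w, V = np.linalg.eigh(G)                # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dim]         # indices of the dim largest eigenvalues
    w_top = np.clip(w[idx], 0.0, None)      # guard against tiny negative round-off
    return V[:, idx] * np.sqrt(w_top)

def pairwise_distances(X):
    """Return the n x n matrix of Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
```

The dense eigendecomposition is the step whose cost grows cubically with n, which matches the O(n^3) figure quoted for the SVD approach.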
Current Approaches
• Embed Algorithm by Crippen and Havel
• CNS Partial Metrization by Brünger et al
• Graph Reduction by Hendrickson
• Alternating Projection by Glunt and Hayden
• Global Optimization by Moré and Wu
• Multidimensional Scaling by Trosset, et al
Embed Algorithm
time consuming in O(n^3) to O(n^4)
1. bound smoothing; keep distances consistent
2. distance metrization; estimate the missing distances
3. repeat (say 1000 times):
4.    randomly generate D in between L and U
5.    find X using SVD with D
6.    if X is found, stop
7. select the best approximation X
8. refine X with simulated annealing
9. final optimization
Crippen and Havel 1988 (DGII, DGEOM)
Brünger et al 1992, 1998 (XPLOR, CNS)
costly in O(n^2) to O(n^3)
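Step 1 of the list above (bound smoothing) can be sketched as a triangle-inequality pass over the bound matrices. This is a simplified illustration, not the DGII or CNS implementation: the function name `smooth_bounds` is hypothetical, the matrices are assumed symmetric with zero diagonals, and missing upper bounds are assumed to be filled with a large number beforehand.

```python
import numpy as np

def smooth_bounds(L, U):
    """Tighten lower bounds L and upper bounds U using the triangle inequality."""
    L = np.asarray(L, dtype=float).copy()
    U = np.asarray(U, dtype=float).copy()
    n = U.shape[0]
    # Upper bounds: u_ij <= u_ik + u_kj (a Floyd-Warshall shortest-path sweep).
    for k in range(n):
        U = np.minimum(U, U[:, [k]] + U[[k], :])
    # Lower bounds: l_ij >= l_ik - u_kj and l_ij >= l_kj - u_ik (one sweep over k).
    for k in range(n):
        L = np.maximum(L, L[:, [k]] - U[[k], :])
        L = np.maximum(L, L[[k], :] - U[:, [k]])
    return L, U
```

Each sweep visits all n intermediate atoms for all n^2 pairs, so this step alone already costs O(n^3).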
Geometric Build-Up
Independent Points: A set of k+1 points in R^k is called
independent if it is not contained in a (k-1)-dimensional
affine subspace, i.e., it is not a set of points in R^(k-1).
Metric Basis: A set of points B in a space S is a metric
basis of S provided each point of S is uniquely
determined by its distances from the points in B.
Fundamental Theorem: Any k+1 independent points
in R^k form a metric basis for R^k.
Blumenthal 1953: Theory and Applications of Distance Geometry
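The independence condition in the definitions above can be checked numerically. The following is a small sketch assuming the points are given by coordinates; the function name `is_independent` and the tolerance are illustrative choices, not part of the talk.

```python
import numpy as np

def is_independent(points, tol=1e-9):
    """Return True if k+1 points in R^k do not lie in a (k-1)-dimensional affine subspace."""
    P = np.asarray(points, dtype=float)
    m, k = P.shape
    if m != k + 1:
        raise ValueError("expected exactly k+1 points in R^k")
    edges = P[1:] - P[0]   # k difference vectors from the first point
    # The points are independent exactly when these k vectors span R^k.
    return bool(np.linalg.matrix_rank(edges, tol=tol) == k)
```

For example, the four vertices of a tetrahedron are independent in R^3, while four coplanar points are not and therefore cannot serve as a metric basis.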
Geometric Build-Up in two dimensions (figure)
Geometric Build-Up in three dimensions (figure)
Geometric Build-Up
Determined atoms 1, 2, 3, 4:
x1 = (u1, v1, w1)
x2 = (u2, v2, w2)
x3 = (u3, v3, w3)
x4 = (u4, v4, w4)
Undetermined atom i: xi = (ui, vi, wi) = ?
||xi - x1|| = di,1
||xi - x2|| = di,2
||xi - x3|| = di,3
||xi - x4|| = di,4
Undetermined atom j: xj = (uj, vj, wj) = ?
||xj - x1|| = dj,1
||xj - x2|| = dj,2
||xj - x3|| = dj,3
||xj - x4|| = dj,4
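The four distance equations for one undetermined atom can be turned into a 3x3 linear system: squaring each equation and subtracting the first from the other three cancels the quadratic term. The sketch below shows this under the assumption that the four base atoms are independent and the distances are exact; the function name `locate_atom` is hypothetical.

```python
import numpy as np

def locate_atom(base, dists):
    """Solve ||x - x_k|| = d_k, k = 1..4, for x in R^3.

    base  : (4, 3) coordinates of four independent, already determined atoms
    dists : (4,) distances from the unknown atom to those four atoms
    """
    base = np.asarray(base, dtype=float)
    dists = np.asarray(dists, dtype=float)
    # Subtracting the squared equation for atom 1 from those for atoms 2..4 gives
    #   2 (x_k - x_1) . x = ||x_k||^2 - ||x_1||^2 - (d_k^2 - d_1^2)
    A = 2.0 * (base[1:] - base[0])
    b = (np.sum(base[1:] ** 2, axis=1) - np.sum(base[0] ** 2)) \
        - (dists[1:] ** 2 - dists[0] ** 2)
    return np.linalg.solve(A, b)
```

The matrix A is nonsingular exactly when the four base atoms are independent, which is the condition required by the fundamental theorem.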
The geometric build-up algorithm solves a molecular
distance geometry problem in O(n) when distances between
all pairs of atoms are given, while the singular value
decomposition algorithm requires O(n^2) to O(n^3) computing time!
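A minimal sketch of how a linear-time build-up might look when all distances are available: fix the first four atoms in a convenient frame, then place every remaining atom with one small linear solve. It assumes the first four atoms are independent and the distances are exact; `initial_base` and `build_up` are hypothetical names, and the sketch makes no claim to reproduce the operation counts reported in the talk.

```python
import numpy as np

def initial_base(D):
    """Place atoms 0..3 from their mutual distances: atom 0 at the origin,
    atom 1 on the x-axis, atom 2 in the xy-plane, atom 3 with z > 0."""
    d01, d02, d03 = D[0, 1], D[0, 2], D[0, 3]
    d12, d13, d23 = D[1, 2], D[1, 3], D[2, 3]
    x2 = (d02**2 + d01**2 - d12**2) / (2.0 * d01)
    y2 = np.sqrt(d02**2 - x2**2)
    x3 = (d03**2 + d01**2 - d13**2) / (2.0 * d01)
    y3 = (d03**2 - d23**2 + d02**2 - 2.0 * x2 * x3) / (2.0 * y2)
    z3 = np.sqrt(d03**2 - x3**2 - y3**2)
    return np.array([[0.0, 0.0, 0.0], [d01, 0.0, 0.0], [x2, y2, 0.0], [x3, y3, z3]])

def build_up(D):
    """Determine all coordinates from a full distance matrix D (n x n, n >= 4)."""
    n = D.shape[0]
    X = np.zeros((n, 3))
    X[:4] = initial_base(D)
    # The linear system is the same for every atom; only the right-hand side changes.
    A = 2.0 * (X[1:4] - X[0])
    base_sq = np.sum(X[1:4] ** 2, axis=1) - np.sum(X[0] ** 2)
    for i in range(4, n):
        b = base_sq - (D[i, 1:4] ** 2 - D[i, 0] ** 2)
        X[i] = np.linalg.solve(A, b)   # constant work per atom
    return X
```

Only the 4n distances to the base atoms are read and each atom costs a fixed amount of work, which is where the O(n) behaviour comes from. The recovered structure matches the original up to a rigid motion and possibly a reflection.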
The X-ray crystallography structure (left) of the HIV-1 RT p66 protein (4200
atoms) and the structure (right) determined by the geometric build-up algorithm
using the distances for all pairs of atoms in the protein. The algorithm took only
188,859 floating-point operations to obtain the structure, while a conventional
singular-value decomposition algorithm required 1,268,200,000 floating-point
operations. The RMSD of the two structures is ~10^-4 Å.
Problems with Sparse Sets of Distances
Control of Rounding Errors
Tolerate Distance Errors
\[ \min_{x_i} \sum_{j:\,(i,j) \in S} \left( \|x_i - x_j\|^2 - d_{i,j}^2 \right)^2 \]
where the x_j are determined.
The objective function is convex
and the problem can be solved using
a standard Newton method.
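The least-squares placement of one atom can be sketched with a plain Newton iteration on the sum of squared errors in the squared distances. This is an illustrative sketch: the function name `newton_locate` is hypothetical, no line search or other safeguard is included, and a starting point reasonably close to the solution is assumed.

```python
import numpy as np

def newton_locate(neighbors, dists, x0, iters=50, tol=1e-12):
    """Minimize f(x) = sum_j (||x - x_j||^2 - d_j^2)^2 by Newton's method.

    neighbors : (m, 3) coordinates of already determined atoms x_j
    dists     : (m,) given (possibly inexact) distances d_j
    x0        : (3,) starting guess
    """
    P = np.asarray(neighbors, dtype=float)
    d2 = np.asarray(dists, dtype=float) ** 2
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        diff = x - P                               # rows are x - x_j
        r = np.sum(diff ** 2, axis=1) - d2         # residuals in squared distance
        grad = 4.0 * diff.T @ r                    # sum_j 4 r_j (x - x_j)
        hess = 8.0 * diff.T @ diff + 4.0 * r.sum() * np.eye(x.size)
        step = np.linalg.solve(hess, grad)
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x
```

Each iteration sums over the m determined neighbours of the atom, so the cost per evaluation grows linearly with the number of distances used.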
Each function evaluation requires
order of n floating point operations,
where n is the number of atoms.
In the ideal case when every atom
can be determined, n atoms require
O(n^2) floating point operations.
NMR Structure Determination
The distances are given with their
possible ranges.
find $x_i$ such that
\[ l_{i,j} \le \|x_i - x_j\| \le u_{i,j}, \quad (i,j) \in S \]
\[ \min_{x_i} \sum_{j:\,(i,j) \in S} \left[ \left( \|x_i - x_j\|^2 - u_{i,j}^2 \right)^2 + \left( l_{i,j}^2 - \|x_i - x_j\|^2 \right)^2 \right] \]
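To make the range formulation concrete, here is a small sketch that evaluates the two-term error function for a candidate position and separately checks whether the candidate satisfies every bound. The function names `bound_error` and `within_bounds` and the tolerance are illustrative assumptions.

```python
import numpy as np

def bound_error(x, neighbors, lower, upper):
    """Sum over neighbours of (||x - x_j||^2 - u_j^2)^2 + (l_j^2 - ||x - x_j||^2)^2."""
    P = np.asarray(neighbors, dtype=float)
    sq = np.sum((np.asarray(x, dtype=float) - P) ** 2, axis=1)   # squared distances
    l2 = np.asarray(lower, dtype=float) ** 2
    u2 = np.asarray(upper, dtype=float) ** 2
    return float(np.sum((sq - u2) ** 2 + (l2 - sq) ** 2))

def within_bounds(x, neighbors, lower, upper, tol=1e-9):
    """Return True if l_j <= ||x - x_j|| <= u_j holds for every neighbour j."""
    P = np.asarray(neighbors, dtype=float)
    dist = np.linalg.norm(np.asarray(x, dtype=float) - P, axis=1)
    lo = np.asarray(lower, dtype=float)
    hi = np.asarray(upper, dtype=float)
    return bool(np.all(dist >= lo - tol) and np.all(dist <= hi + tol))
```

For a single pair, the error term is smallest when the squared distance sits midway between the squared bounds, so a minimizer of this function tends to fall inside the allowed range when the bounds are mutually consistent.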
The structure of 4MBA (red lines) determined by using a
geometric build-up algorithm with a subset of all pairs of
inter-atomic distances. The X-ray crystallography structure is
shown in blue lines.
The total distance errors (red) for the partial structures of a
polypeptide chain obtained by using a geometric build-up are
all smaller than 1 Å, while those (blue) by using CNS (Brünger
et al) grow quickly with increasing numbers of atoms in the
chain.
Extension to Statistical Distance Data
the distributions of the distances in
structure database
\[ \max_{x_i} \sum_{(i,j) \in S} \log \left[ p_{i,j}(\|x_i - x_j\|) \right] \]
structure prediction
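The log-likelihood objective can be illustrated for one atom with a deliberately simple stand-in for the database-derived distributions. The sketch assumes each p_{i,j} is a Gaussian with a given mean and standard deviation; that choice, like the function name `log_likelihood`, is purely hypothetical, since the talk does not specify the form of the distributions.

```python
import numpy as np

def log_likelihood(x, neighbors, mean, sigma):
    """Sum of log p_j(||x - x_j||), with p_j assumed Gaussian (an assumption of this sketch).

    neighbors : (m, 3) coordinates of determined atoms x_j
    mean      : (m,) assumed mean distance for each pair
    sigma     : (m,) assumed standard deviation for each pair
    """
    P = np.asarray(neighbors, dtype=float)
    dist = np.linalg.norm(np.asarray(x, dtype=float) - P, axis=1)
    mean = np.asarray(mean, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    log_p = -0.5 * ((dist - mean) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))
    return float(np.sum(log_p))
```

Maximizing this value over x favours positions whose distances to the determined atoms are the most probable under the assumed distributions.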