Project Work Statement - IPAM

Institute for Pure and Applied Mathematics
Research in Industrial Projects for Students
Summer Session 2002
Methods to Compare Distance Matrices and/or Phylogenetic Trees
Georgiy Elfond, University of California Berkeley
Jason Gertz, Cornell University, Project Manager
Anna Shustrova, University of California Berkeley
Matt Weisinger, Harvard University
Shawn Cokus, Faculty Mentor
Bruce Rothschild, Faculty Mentor
Matteo Pellegrini, Technical Advisor
The Company
Protein Pathways, a privately held biotechnology company, is a pioneer in the development of
proprietary algorithms designed to functionally interpret sequenced genomes. Founded in 1999,
Protein Pathways has established itself as the forerunner in the discovery of molecular
interactivity and the understanding of network interactions between pathways, which will
inevitably facilitate the drug discovery process. Protein Pathways is an emerging leader in
utilizing computational methods to direct and optimize pathway-based therapeutic drug
development. Its proprietary computational technology, ProNexus, helps to establish a strategic
position in the industry-wide effort to construct a complete human protein interaction map. This
human protein interaction map will facilitate the compilation of complete intracellular pathways
and the identification of new drug targets. ProNexus utilizes and analyzes protein interactions
from many sources: fully sequenced genomes, expression microarrays, literature abstracts, and
proteomic measurements. All of this information is then analyzed using proprietary statistical
algorithms to determine intracellular pathways and regulatory networks, ultimately yielding the most
likely drug targets. With ProNexus technology at its core, Protein Pathways has created an
infrastructure for drug development that relies on a defined network of interactions of human
proteins, which in turn enables the identification and characterization of promising drug targets
and drugs.
Goal
We have access to the sequences of an extremely large number of proteins. The goal now is to
use this information to expand our knowledge about the interactions and relationships between
proteins and to then accelerate drug development and other biological research with the help of
computational methods.
Objectives
Proteins are composed of 20 different types of amino acids. The order of amino acids
determines a particular protein sequence. From the alignment of these protein sequences, values
that contain the “distances” between each pair of proteins can be generated. These “distances”
represent the dissimilarity between the amino acid sequences of the two proteins and
are calculated using existing alignment algorithms (see Approach). These values can then be
organized in an NxN distance matrix, where N is the number of proteins involved in the
alignment and the (i,j)th entry of the matrix corresponds to the value of the alignment between the
sequences of protein i and protein j. This information can also be partially displayed in two-dimensional
phylogenetic trees. We can embed the distance matrix into the two-dimensional
plane where the vertices are proteins and the lengths of the edges represent the corresponding
distances between the proteins.
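As a minimal illustration of such a matrix, the Python sketch below builds an NxN distance matrix from sequences that are already aligned. It assumes a simple p-distance (the fraction of aligned positions at which two sequences differ), which is a stand-in chosen for illustration and not the distance that an alignment program would report; the function names and the toy sequences are hypothetical.

```python
def p_distance(a, b):
    """Fraction of aligned positions at which sequences a and b differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs):
    """Return the symmetric N x N matrix whose (i, j) entry is the distance
    between sequence i and sequence j; diagonal entries are 0."""
    n = len(seqs)
    return [[p_distance(seqs[i], seqs[j]) for j in range(n)] for i in range(n)]


# Toy aligned sequences (hypothetical, not real proteins).
aligned = ["MKTAY", "MKTAV", "MRSAV"]
matrix = distance_matrix(aligned)
for row in matrix:
    print(row)
```

The matrix is symmetric with a zero diagonal, so only the N(N-1)/2 entries above the diagonal carry independent information.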
A particular application of protein alignment is the determination of the relationships between
ligands and receptors. Ligands are small molecules that bind to a protein or other structures, and
receptors are membrane-bound or membrane-enclosed molecules that bind to something more
mobile (ligands) with high specificity. With knowledge of ligand-receptor interactions, effective
drugs can be found more efficiently.
Our specific challenge is to develop and implement methods that identify which ligands bind to
which receptors. Because a typical dataset contains 100 proteins, there are $(100!)^2$ permutations
when comparing two of the datasets. Therefore, we aim to find an efficient approximate solution,
since a brute-force enumeration of all possible matches is not feasible.
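To give a sense of scale, the following Python sketch evaluates that count with exact integer arithmetic. It assumes the count is read as the square of 100!; the rate of one candidate per nanosecond is a hypothetical figure chosen only for illustration.

```python
import math

n = 100                                  # proteins per dataset, as stated above
candidates = math.factorial(n) ** 2      # (100!)^2, computed exactly
digits = len(str(candidates))
print(f"(100!)^2 has {digits} decimal digits")

# Hypothetical rate: one candidate checked per nanosecond.
seconds = candidates // 10**9
years = seconds // (365 * 24 * 3600)
print(f"brute force would take a number of years with {len(str(years))} digits")
```

Even under that optimistic rate the running time is astronomically large, which is why an approximate search is the practical option.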
Approach
Starting with protein sequences, we will use the program PSI-BLAST to generate multiple
sequence alignments to similar sequences in the PSI-BLAST database. Then, by entering this
data into the program ClustalW, we will obtain the respective distance matrices of these protein
sequence alignments. Given two distance matrices, we will find a “best fit” between subsets in
the two datasets. “Best fit” has to be rigorously defined, and this is also part of the project.
Some current techniques to measure fit are described below.
We will begin by developing a basic algorithm that will accept two distance matrices in a simple
ASCII format. The program will scale them (e.g., by dividing each element of the distance
matrix by the average value of its entries), in order to compare them without distorting the
relative distances in the matrix. Using a Monte Carlo method, it will maximize the fit between
the two matrices. After maximizing this fit, the program will indicate the best couplings between
the two sets of proteins.
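The steps in this paragraph can be sketched in Python as follows. This is only an illustration under stated assumptions, not the team's program: it scales each matrix by its mean entry, measures misfit as the sum of squared differences (so maximizing the fit becomes minimizing the misfit), and searches over one-to-one pairings with random swaps and a Metropolis acceptance rule. The function names, the step count, and the temperature are all hypothetical choices.

```python
import math
import random


def scale_by_mean(m):
    """Divide every entry by the mean of all entries of the matrix."""
    n = len(m)
    mean = sum(sum(row) for row in m) / (n * n)
    return [[v / mean for v in row] for row in m]


def misfit(x, y, perm):
    """Sum of squared differences between X and Y, where protein i of X is
    paired with protein perm[i] of Y (lower is better)."""
    n = len(x)
    return sum((x[i][j] - y[perm[i]][perm[j]]) ** 2
               for i in range(n) for j in range(n))


def monte_carlo_match(x, y, steps=5000, temp=0.05, seed=0):
    """Search over pairings by random swaps with Metropolis acceptance.
    Returns (best pairing found, its misfit)."""
    rng = random.Random(seed)
    x, y = scale_by_mean(x), scale_by_mean(y)
    n = len(x)
    perm = list(range(n))
    rng.shuffle(perm)
    cost = misfit(x, y, perm)
    best, best_cost = perm[:], cost
    for _ in range(steps):
        a, b = rng.sample(range(n), 2)
        perm[a], perm[b] = perm[b], perm[a]        # propose a swap
        new_cost = misfit(x, y, perm)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost                        # accept the swap
            if cost < best_cost:
                best, best_cost = perm[:], cost
        else:
            perm[a], perm[b] = perm[b], perm[a]    # reject: undo the swap
    return best, best_cost


# Toy check: Y is X with its proteins relabelled and all distances tripled.
pos = [0, 1, 3, 7]                                 # hypothetical points on a line
x = [[abs(a - b) for b in pos] for a in pos]
truth = [2, 0, 3, 1]                               # protein i of X is protein truth[i] of Y
n = len(pos)
y = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        y[truth[i]][truth[j]] = 3.0 * x[i][j]
pairing, cost = monte_carlo_match(x, y)
print(pairing, cost)
```

On this four-protein toy input the search should recover the hidden pairing, but a Monte Carlo search gives no such guarantee in general. For datasets of 100 proteins, recomputing the full misfit after every swap would be wasteful; updating only the rows and columns touched by the swap is the natural refinement.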
The efficiency and accuracy of our program depend heavily on the technique used to
measure the fit between the distance matrices, X and Y. One method is to sum the squared
differences between corresponding elements, i.e.
d : i 1  j 1 ( X ij  Yij ) 2 .
N
N
A different measure of fit is the correlation coefficient $r \in [-1, 1]$ given by
$$r := \frac{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} (X_{ij} - \bar{X})(Y_{ij} - \bar{Y})}{\sqrt{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} (X_{ij} - \bar{X})^2 \, \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} (Y_{ij} - \bar{Y})^2}} ,$$
where $\bar{X}$ is the mean of all $X_{ij}$-values and $\bar{Y}$ is the mean of all $Y_{ij}$-values (Goh et al., 2000).
When using the correlation coefficient, we are looking to maximize r and no scaling is required.
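The two measures of fit, and the remark that the correlation coefficient needs no scaling, can be checked with a short Python sketch. It assumes that the means in the correlation coefficient are taken over the same above-diagonal entries as the sums; the function names and the toy matrices are hypothetical.

```python
import math


def upper_triangle(m):
    """Entries m[i][j] with i < j, in row-major order."""
    n = len(m)
    return [m[i][j] for i in range(n - 1) for j in range(i + 1, n)]


def sum_squared_diff(x, y):
    """d: sum over all i, j of (X_ij - Y_ij)^2."""
    return sum((a - b) ** 2 for rx, ry in zip(x, y) for a, b in zip(rx, ry))


def correlation(x, y):
    """r: Pearson correlation of the above-diagonal entries of X and Y."""
    xs, ys = upper_triangle(x), upper_triangle(y)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = math.sqrt(sum((a - mx) ** 2 for a in xs) *
                    sum((b - my) ** 2 for b in ys))
    return num / den


x = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
y = [[5 * v for v in row] for row in x]    # same structure, different scale
d_value = sum_squared_diff(x, y)
r_value = correlation(x, y)
print(d_value)    # nonzero: d is sensitive to the overall scale
print(r_value)    # close to 1: r is not
```

Because r is unchanged when either matrix is multiplied by a positive constant, the scaling step is only needed when the sum of squared differences is used.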
We will refine our method by improving the algorithm and running extensive tests. We will
further develop our algorithm by removing the assumption of one-to-one matches, accounting for
possible clusters between the groups of proteins as when one receptor binds to multiple ligands.
If time permits, we will compare these matrices as phylogenetic trees with a new algorithm that
we would develop and implement, which will find the maximum common subgraph (or the most
similar portions between the two complete trees). In this case, we would use the PHYLIP
program to transform the distance matrices into the phylogenetic trees that we would use as
input.
We aim to produce a program that, on a PC with a Pentium III processor, will take less than a
day to process typical datasets of 100 proteins. If time permits, we will develop statistical
methods to determine the significance of matches produced by the code. We might represent the
final results graphically, in order to visualize the success of a particular alignment and to assist
the end user in making judgments about which tentative matches to explore in the lab.
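This statement does not fix a method for assessing significance. One common option, shown here only as an illustration, is a permutation test: compare the score of the proposed pairing with the scores of random pairings. In the Python sketch below, the function names, the number of trials, and the toy data are hypothetical.

```python
import random


def permutation_p_value(score, x, y, pairing, trials=1000, seed=0):
    """Estimate how often a random pairing scores at least as well as `pairing`.
    `score(x, y, perm)` is any fit measure where larger means better."""
    rng = random.Random(seed)
    observed = score(x, y, pairing)
    n = len(pairing)
    hits = 0
    for _ in range(trials):
        perm = list(range(n))
        rng.shuffle(perm)
        if score(x, y, perm) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)


def neg_sq_diff(x, y, perm):
    """Negated sum of squared differences under a pairing (larger is better)."""
    n = len(x)
    return -sum((x[i][j] - y[perm[i]][perm[j]]) ** 2
                for i in range(n) for j in range(n))


# Toy data: two identical matrices, tested with the identity pairing.
pos = [0, 1, 3, 7, 12]                     # hypothetical points on a line
x = [[abs(a - b) for b in pos] for a in pos]
p = permutation_p_value(neg_sq_diff, x, x, list(range(len(pos))))
print(p)
```

A small value indicates that few random pairings fit as well as the proposed one. With only five proteins there are just 120 pairings, so the estimate cannot fall much below 1/120; larger datasets allow much finer resolution.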
Deliverables
We will develop and implement an algorithm as well as improve it through a supporting research
study. The code written by the team will contain internal comments and be accompanied by a
document explaining its use. We will program in either MatLab Release 12 or ANSI C. The
target platform is an industry-standard PC with a Pentium III processor and 256 MB RAM.
Expectations from Protein Pathways
In order to test our program, we expect a suitable amount of data as ASCII-formatted distance
matrices. Alternatively, if we do not receive data we will generate it ourselves using the
programs described above. In addition, we anticipate open communication between our team
and Protein Pathways, including a weekly meeting with Matteo Pellegrini.
Schedule
Mid-term Report: Thursday, July 18, 2002
Mid-term Presentation: Monday, July 22, 2002
Mid-term Program Release: Monday, July 29, 2002
Practice Presentation: Monday, August 13, 2002
Final Project Presentation: Friday, August 16, 2002
Final Report: Thursday, August 22, 2002
Final Program Release: Thursday, August 22, 2002
References
ClustalW (origin 2). 3 Dec. 2002. 28 Jun. 2002. <http://www.clustalw.genome.ad.jp>.
Goh, C., Bogan, A., Joachimiak, M., Walther, D., and Cohen, F. (2000). “Co-evolution of
Proteins with their Interaction Partners.” J. Mol. Biol. 299: 283-293.
Havel, T., Crippen, G., and Kuntz, I. (1983). “The Combinatorial Distance Geometry Method
for the Calculation of Molecular Conformation. I. A New Approach to an Old Problem.” J.
Theor. Biol. 104: 359-81.
Havel, T., Crippen, G., and Kuntz, I. (1983). “The Theory and Practice of Distance Geometry.”
Bull. Math. Biol. 45: 665-720.
Havel, T., Crippen, G., Kuntz, I., and Blaney, J. (1983). “The Combinatorial Distance
Geometry Method for the Calculation of Molecular Conformation. II. Sample Problems and
Computational Statistics.” J. Theor. Biol. 104: 383-400.
Hendy, M., Little, C., and Penny, D. (1984). “Comparing Trees with Pendant Vertices
Labelled.” SIAM J. Appl. Math. 44: 1054-1067.
Holm, L. and Sander, C. (1993). “Protein Structure Comparison by Alignment of Distance
Matrices.” J. Mol. Biol. 233: 123-138.
NCBI BLAST Home Page. 29 Jan. 2001. 28 Jun. 2002. <http://www.ncbi.nlm.nih.gov/BLAST>.
PHYLIP Home Page. 1 Jan. 2002. 28 Jun. 2002. <http://evolution.genetics.washington.edu/phylip.html>.
Robinson, D. and Foulds, L. (1981). “Comparison of Phylogenetic Trees.” Math. Biosci. 53:
131-147.
Steel, M. (1988). “Distribution of the Symmetric Differences Metric on Phylogenetic Trees.”
SIAM J. Discr. Math. 1: 541-551.
Participants and Contact Coordinates
Team Members
Georgiy Elfond gelfond@uclink.berkeley.edu (310) 794-0083
Jason Gertz jg224@cornell.edu (310) 794-1316
Anna Shustrova nusick@uclink.berkeley.edu (310) 794-0069
Matt Weisinger weising@fas.harvard.edu (310) 794-1360
Faculty Mentors
Shawn Cokus cokus@{ipam,math}.ucla.edu (310) 825-2814
Bruce Rothschild blr@math.ucla.edu (310) 825-3174
Technical Advisor from Protein Pathways
Matteo Pellegrini matteope@proteinpathways.com (310) 940-3627