Institute for Pure and Applied Mathematics
Research in Industrial Projects for Students
Summer Session 2002

Methods to Compare Distance Matrices and/or Phylogenetic Trees

Georgiy Elfond, University of California Berkeley
Jason Gertz, Cornell University, Project Manager
Anna Shustrova, University of California Berkeley
Matt Weisinger, Harvard University

Shawn Cokus, Faculty Mentor
Bruce Rothschild, Faculty Mentor
Matteo Pellegrini, Technical Advisor

The Company

Protein Pathways, a privately held biotechnology company, is a pioneer in the development of proprietary algorithms designed to functionally interpret sequenced genomes. Founded in 1999, Protein Pathways has established itself as the forerunner in the discovery of molecular interactivity and the understanding of network interactions between pathways, which will inevitably facilitate the drug discovery process.

Protein Pathways is an emerging leader in utilizing computational methods to direct and optimize pathway-based therapeutic drug development. Its proprietary computational technology, ProNexus, helps to establish a strategic position in the industry-wide effort to construct a complete human protein interaction map. This human protein interaction map will facilitate the compilation of complete intracellular pathways and the identification of new drug targets.

ProNexus utilizes and analyzes protein interactions from many sources: fully sequenced genomes, expression microarrays, literature abstracts, and proteomic measurements. All of this information is then analyzed using proprietary statistical algorithms to determine intracellular pathways and regulatory networks, ultimately yielding the most likely drug targets. With ProNexus technology at its core, Protein Pathways has created an infrastructure for drug development that relies on a defined network of interactions of human proteins, which in turn enables the identification and characterization of promising drug targets and drugs.

Goal

We have access to the sequences of an extremely large number of proteins. The goal now is to use this information to expand our knowledge about the interactions and relationships between proteins, and then to accelerate drug development and other biological research with the help of computational methods.

Objectives

Proteins are composed of 20 different types of amino acids. The order of the amino acids determines a particular protein sequence. From the alignment of these protein sequences, values that represent the “distances” between each pair of proteins can be generated. These “distances” measure the dissimilarity between the amino acid sequences of two proteins and are calculated using existing alignment algorithms (see Approach). The values can then be organized in an N x N distance matrix, where N is the number of proteins involved in the alignment and the (i,j)th entry of the matrix is the distance between the sequences of protein i and protein j.

This information can also be partially displayed in two-dimensional phylogenetic trees. We can embed the distance matrix into the two-dimensional plane, where the vertices are proteins and the lengths of the edges represent the corresponding distances between the proteins.

A particular application of protein alignment is the determination of the relationships between ligands and receptors. Ligands are small molecules that bind to a protein or other structures, and receptors are membrane-bound or membrane-enclosed molecules that bind to something more mobile (ligands) with high specificity. With knowledge of ligand-receptor interactions, effective drugs can be found more efficiently.

Our specific challenge is to develop and implement methods that identify which ligands bind to which receptors. Because a typical dataset contains 100 proteins, there are $(100!)^2$ permutations when comparing two of the datasets.
Therefore, we aim to find an efficient approximate solution, since a brute-force enumeration of all matches is not feasible.

Approach

Starting with protein sequences, we will use the program PSI-BLAST to generate multiple sequence alignments to similar sequences in the PSI-BLAST database. Then, by entering this data into the program ClustalW, we will obtain the respective distance matrices of these protein sequence alignments. Given two distance matrices, we will find a “best fit” between subsets in the two datasets. “Best fit” has to be rigorously defined, and this is also part of the project. Some current techniques to measure fit are described below.

We will begin by developing a basic algorithm that will accept two distance matrices in a simple ASCII format. The program will scale them (e.g., by dividing each element of a distance matrix by the average value of its entries) so that they can be compared without distorting the relative distances within each matrix. Using a Monte Carlo method, it will maximize the fit between the two matrices. After maximizing this fit, the program will indicate the best couplings between the two sets of proteins.

The efficiency and accuracy of our program depend heavily on the technique used to measure the fit between the distance matrices X and Y. One method is to sum the squared differences between corresponding elements, i.e.

$$d := \sum_{i=1}^{N} \sum_{j=1}^{N} \left( X_{ij} - Y_{ij} \right)^2 .$$

With this measure, a better fit corresponds to a smaller value of d. A different measure of fit is the correlation coefficient $r \in [-1, 1]$ given by

$$r := \frac{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \left( X_{ij} - \bar{X} \right) \left( Y_{ij} - \bar{Y} \right)}{\sqrt{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \left( X_{ij} - \bar{X} \right)^2} \, \sqrt{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \left( Y_{ij} - \bar{Y} \right)^2}}$$

where $\bar{X}$ is the mean of all $X_{ij}$ values and $\bar{Y}$ is the mean of all $Y_{ij}$ values (Goh et al., 2000). When using the correlation coefficient, we are looking to maximize r, and no scaling is required.

We will refine our method by improving the algorithm and running extensive tests.
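
The scaling step, the two fit measures, and the Monte Carlo search described above can be illustrated with a minimal sketch. It is written in Python for brevity, which is not one of the languages named for the deliverable. All function names are hypothetical. The search shown here swaps two proteins at a time and accepts worse matches with a Metropolis-style probability at a fixed temperature; this is an assumption, since the text specifies only "a Monte Carlo method". The sketch also assumes one-to-one matches between two matrices of equal size.

```python
import math
import random


def scale_by_mean(matrix):
    """Divide every entry by the mean of all entries of the matrix."""
    n = len(matrix)
    mean = sum(sum(row) for row in matrix) / (n * n)
    return [[value / mean for value in row] for row in matrix]


def squared_difference(x, y, perm):
    """Sum of squared differences d, with protein i of X paired to perm[i] of Y."""
    n = len(x)
    return sum(
        (x[i][j] - y[perm[i]][perm[j]]) ** 2
        for i in range(n)
        for j in range(n)
    )


def correlation(x, y, perm):
    """Correlation coefficient r over the entries with i < j."""
    n = len(x)
    pairs = [(i, j) for i in range(n - 1) for j in range(i + 1, n)]
    xs = [x[i][j] for i, j in pairs]
    ys = [y[perm[i]][perm[j]] for i, j in pairs]
    # Assumption: the means are taken over the same i < j entries that are summed.
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    denominator = math.sqrt(
        sum((a - mean_x) ** 2 for a in xs) * sum((b - mean_y) ** 2 for b in ys)
    )
    return numerator / denominator if denominator > 0 else 0.0


def monte_carlo_match(x, y, steps=20000, temperature=0.05, seed=0):
    """Search for a pairing perm that makes d small (hypothetical parameters)."""
    rng = random.Random(seed)
    n = len(x)
    perm = list(range(n))
    rng.shuffle(perm)
    current = squared_difference(x, y, perm)
    best_perm, best = perm[:], current
    for _ in range(steps):
        # Propose a new pairing by swapping the partners of two proteins.
        a, b = rng.sample(range(n), 2)
        perm[a], perm[b] = perm[b], perm[a]
        candidate = squared_difference(x, y, perm)
        # Always accept improvements; accept worse pairings with small probability.
        if candidate <= current or rng.random() < math.exp(
            (current - candidate) / temperature
        ):
            current = candidate
            if current < best:
                best, best_perm = current, perm[:]
        else:
            perm[a], perm[b] = perm[b], perm[a]  # undo the rejected swap
    return best_perm, best
```

A call such as `monte_carlo_match(scale_by_mean(X), scale_by_mean(Y))` returns a candidate pairing and its value of d, and `correlation(X, Y, perm)` can then score the same pairing without scaling. Recomputing d from scratch costs on the order of N^2 operations per step; for datasets of 100 proteins, updating only the rows and columns affected by a swap would likely be the first optimization to try.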
We will further develop our algorithm by removing the assumption of one-to-one matches, accounting for possible clusters between the groups of proteins, as when one receptor binds to multiple ligands.

If time permits, we will compare these matrices as phylogenetic trees with a new algorithm that we would develop and implement, which will find the maximum common subgraph (that is, the most similar portions of the two complete trees). In this case, we would use the PHYLIP program to transform the distance matrices into the phylogenetic trees that we would use as input.

We aim to produce a program that, on a PC with a Pentium III processor, will take less than a day to process typical datasets of 100 proteins. If time permits, we will develop statistical methods to determine the significance of the matches produced by the code. We might represent the final results graphically, in order to visualize the success of a particular alignment and to assist the end user in judging which tentative matches to explore in the lab.

Deliverables

We will develop and implement an algorithm and improve it through a supporting research study. The code written by the team will contain internal comments and be accompanied by a document explaining its use. We will program in either MATLAB Release 12 or ANSI C. The target platform is an industry-standard PC with a Pentium III processor and 256 MB RAM.

Expectations from Protein Pathways

In order to test our program, we expect a suitable amount of data as ASCII-formatted distance matrices. Alternatively, if we do not receive data, we will generate it ourselves using the programs described above. In addition, we anticipate open communication between our team and Protein Pathways, including a weekly meeting with Matteo Pellegrini.

Schedule

Mid-term Report: Thursday, July 18, 2002
Mid-term Presentation: Monday, July 22, 2002
Mid-term Program Release: Monday, July 29, 2002
Practice Presentation: Monday, August 13, 2002
Final Project Presentation: Friday, August 16, 2002
Final Report: Thursday, August 22, 2002
Final Program Release: Thursday, August 22, 2002

References

ClustalW (origin 2). 3 Dec. 2002. 28 Jun. 2002. <http://www.clustalw.genome.ad.jp>.

Goh, C., Bogan, A., Joachimiak, M., Walther, D., and Cohen, F. (2000). “Co-evolution of Proteins with their Interaction Partners.” J. Mol. Biol. 299: 283-293.

Havel, T., Crippen, G., and Kuntz, I. (1983). “The Combinatorial Distance Geometry Method for the Calculation of Molecular Conformation. I. A New Approach to an Old Problem.” J. Theor. Biol. 104: 359-381.

Havel, T., Crippen, G., and Kuntz, I. (1983). “The Theory and Practice of Distance Geometry.” Bull. Math. Biol. 45: 665-720.

Havel, T., Crippen, G., Kuntz, I., and Blaney, J. (1983). “The Combinatorial Distance Geometry Method for the Calculation of Molecular Conformation. II. Sample Problems and Computational Statistics.” J. Theor. Biol. 104: 383-400.

Hendy, M., Little, C., and Penny, D. (1984). “Comparing Trees with Pendant Vertices Labelled.” SIAM J. Appl. Math. 44: 1054-1067.

Holm, L. and Sander, C. (1993). “Protein Structure Comparison by Alignment of Distance Matrices.” J. Mol. Biol. 233: 123-138.

NCBI BLAST Home Page. 29 Jan. 2001. 28 Jun. 2002. <http://www.ncbi.nlm.nih.gov/BLAST>.

PHYLIP Home Page. 1 Jan. 2002. 28 Jun. 2002. <http://evolution.genetics.washington.edu/phylip.html>.

Robinson, D. and Foulds, L. (1981). “Comparison of Phylogenetic Trees.” Math. Biosci. 53: 131-147.

Steel, M. (1988). “Distribution of the Symmetric Differences Metric on Phylogenetic Trees.” SIAM J. Discr. Math. 1: 541-551.

Participants and Contact Coordinates

Team Members
Georgiy Elfond      gelfond@uclink.berkeley.edu     (310) 794-0083
Jason Gertz         jg224@cornell.edu               (310) 794-1316
Anna Shustrova      nusick@uclink.berkeley.edu      (310) 794-0069
Matt Weisinger      weising@fas.harvard.edu         (310) 794-1360

Faculty Mentors
Shawn Cokus         cokus@{ipam,math}.ucla.edu      (310) 825-2814
Bruce Rothschild    blr@math.ucla.edu               (310) 825-3174

Technical Advisor from Protein Pathways
Matteo Pellegrini   matteope@proteinpathways.com    (310) 940-3627