Artificial Intelligence and Robotics Methods in Computational Biology: Papers from the AAAI 2013 Workshop

Using Protein Fragments for Searching and Data-Mining Protein Databases

Chen Keasar and Rachel Kolodny
Ben Gurion University & Haifa University, Israel
chen@cs.bgu.ac.il, trachel@cs.haifa.ac.il

Abstract

Proteins are macromolecules involved in virtually all life processes. Protein sequence and structure data are accumulating at an ever-increasing rate in publicly available databases. To extract knowledge from these databases, we need efficient and accurate tools; this is a major goal of computational structural biology. The tasks we consider are searching and mining protein data; we rely on protein fragment libraries to build more efficient tools. We describe FragBag, an example of using fragment libraries to improve protein structural search. To search for patterns in structure space, we discuss methods to generate efficient low-dimensional maps. In particular, we use these maps to identify patterns of functional diversity and sequence diversity. Finally, we discuss how to extend these methods to protein sequences. To do this, one needs to predict local structure from sequence; we survey previous work suggesting that this is a very feasible task. Furthermore, we show that such predictions can be used to improve sequence alignments. Namely, protein fragments can be used to leverage protein structural data to improve remote homology detection.

Searching and data-mining protein databases

To extract "knowledge" from the rapidly expanding protein databases, we need computational tools that can search and data-mine. We focus on the database of protein structures: the PDB (Berman, et al., 2000), with its over 80,000 entries; the databases of protein sequences are 1-2 orders of magnitude larger. These databases describe current-day proteins. Evolutionary theory suggests that these emerged from ancient proteins through a series of mutations and duplications at the sequence level. Thus, sequence similarity of two proteins is considered a sign of homology, or evolutionary relatedness. In cases where the sequences diverged and are no longer similar, similarity of the (more conserved) structures may hint at a remote homology.

Proteins are, arguably, the most important macromolecules in the process of life; they are also the target of almost all medicines. Hence, developing fast and accurate tools to study them is of great importance. "Scientists" and "practitioners" study proteins from two complementary perspectives. The scientists study current-day proteins to better understand protein evolution and to characterize the physical and chemical constraints that govern their behavior (e.g., protein folding and interaction). For this purpose, data-mining the protein database is useful. The practitioners, in turn, need tools that will help characterize proteins of interest. Most importantly, given a specific protein, they wish to identify related proteins that are better characterized, and from which relevant knowledge can be derived. For this purpose, fast and accurate search tools are useful. The two perspectives are intertwined: insights into the properties or the evolution of proteins can be used to design better tools, and novel tools can be used to gain insights about the nature of proteins.

We focus on computational methods that are structure-based. A protein structure can be described (sufficiently well) by its backbone, and even only by its C-alpha atoms. Namely, we represent a protein of length n by a sequence of coordinates $a_1, a_2, \ldots, a_n$, where $a_i \in \mathbb{R}^3$ for $1 \le i \le n$. There are fewer than 3n degrees of freedom in this chain, because there are additional geometric constraints (e.g., the distance between two consecutive atoms is nearly fixed). These additional constraints have been characterized empirically, and it was shown that the repertoire of short backbone segments (often denoted fragments) in PDB proteins is fairly limited (Kolodny, et al., 2002).

Protein structure can be discretized using protein fragments

The restricted repertoire of local protein structures can be described at different levels of detail. The coarsest level is secondary structure, which classifies each residue as part of a helix, strand, or coil. Fragment libraries describe local structure in more detail via collections of commonly occurring backbone fragments (Kolodny, et al., 2002). Fragment libraries vary in their construction, in their number and lengths of fragments, and consequently in the accuracy with which their elements can approximate protein structure (see, for example, the review in (Offmann, 2007)). From a practical perspective, fragment libraries are useful: we can describe protein structures as letter strings that correspond to the indices of the best-approximating library fragments for each segment along the protein backbone. This is, in effect, a discretization scheme for protein structure.

To some extent, local structure, or the structure of protein fragments, can be predicted from sequence. Secondary structure prediction classifies residues into one of three states and reaches a 75%-80% correct prediction rate (Rost, 2001). Fragment library predictions are also common; yet the success rates of predictions by different groups (Benros, et al., 2006; Bystroff and Baker, 1998; Bystroff, et al., 2000; De Brevern, et al., 2007; de Brevern, et al., 2000; de Brevern, et al., 2002; Etchebest, et al., 2005; Faraggi, et al., 2009; Hunter and Subramaniam, 2003; Offmann, 2007; Pei and Grishin, 2004; Sander, et al., 2006; Yang and Wang, 2003) cannot be directly compared, because they classify to different libraries, of different sizes and fragment lengths.
The comparison is further complicated because predicting the structure of shorter fragments is easier than that of longer fragments (the former can be deduced from the latter, and shorter fragments are better represented in the PDB (Chivian, et al., 2005; Simons, et al., 1999)). Current methods typically consider local structure prediction a classification problem, where a single class (secondary structure element or library fragment) is assigned to each residue. However, when designing a local structure prediction method, one must keep in mind the application of these predictions. Often, this requires optimizing the accuracy in terms of the geometric agreement with the true structure (or its best library approximation). Below, we describe several examples of such applications for search and data-mining methods, and highlight their corresponding optimization functions for protein fragment prediction.

of structures. Comparing a pair of structures is well-studied: this is the so-called protein structural alignment problem, which has many solutions (e.g., Kolodny, et al., 2005; Krissinel and Henrick, 2004; Shindyalov and Bourne, 1998; Subbiah, et al., 1993). Unfortunately, structural alignment methods are relatively slow, implying that the naïve implementation of structure search, which compares the query to all PDB proteins, one by one, using structural alignment, is far too expensive computationally. Instead, researchers have suggested the "filter-and-refine" paradigm for structure search (Aung and Tan, 2007). A filter method quickly sifts through a large set of structures, and selects a small candidate set to be structurally aligned by a more accurate, yet computationally expensive, method. Filter methods gain their speed by representing structures abstractly, typically as vectors, and quickly comparing these representations.
Furthermore, such vector representations can be stored in an inverted index, a data structure that enables fast retrieval of neighbors, even in huge datasets (e.g., Brin and Page, 1998). Filter methods include PRIDE (Carugo and Pongor, 2002), SGM (Rogen and Fain, 2003), a method by Choi et al. (Choi, et al., 2004), and a method by Zotenko et al. (Zotenko, et al., 2006).

FragBag is a fast and accurate filter method, which relies on a discrete representation of protein structures using fragment libraries (Budowski-Tal, et al., 2010). FragBag represents structures succinctly as a bag of words (BOW) of their backbone fragments. The BOW is a vector whose entries count the number of times each library fragment approximates a segment in the protein backbone. For example, using a library of 400 fragments, each 11 residues long, a structure is modeled by a vector of length 400 in which each entry counts the number of times a specific library fragment best approximates a backbone segment. Using FragBag vectors, one can approximate the similarity between two structures by the similarity between their corresponding vectors.

FragBag is more accurate than other publicly available filters (Carugo and Pongor, 2002; Rogen and Fain, 2003; Zotenko, et al., 2006). More importantly, FragBag detects homologues as reliably as two highly trusted structural alignment methods, STRUCTAL (Subbiah, et al., 1993) and CE (Shindyalov and Bourne, 1998), yet runs several orders of magnitude faster. Also, FragBag is rather robust: its performance is only mildly affected by the parameterization of the fragment library (Budowski-Tal, et al., 2010) and by fraglet assignment errors (data not shown). Interestingly, the success of FragBag reveals something about the distribution of protein structures in Nature, as it exploits the property that protein domain structures that have similar local composition tend to be globally similar.
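The bag-of-words representation described above can be illustrated with a minimal sketch. It assumes that a structure is given as an array of C-alpha coordinates and that "best approximates" means lowest RMSD after rigid superposition; the function names (`kabsch_rmsd`, `fragbag_vector`, `cosine_distance`) and the two-fragment toy library are hypothetical and are not taken from the FragBag implementation.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (k, 3) coordinate sets after optimal rigid superposition."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, S, Vt = np.linalg.svd(P.T @ Q)
    # A negative determinant means the best orthogonal map is a reflection;
    # a proper rotation then gives up the smallest singular value.
    S[-1] *= np.sign(np.linalg.det(U @ Vt))
    msd = (np.sum(P ** 2) + np.sum(Q ** 2) - 2.0 * S.sum()) / len(P)
    return float(np.sqrt(max(msd, 0.0)))

def fragbag_vector(ca, library):
    """Count, per library fragment, how many backbone windows it approximates best.

    ca:      (n, 3) array of C-alpha coordinates
    library: (K, k, 3) array of K fragments, each k residues long
    """
    k = library.shape[1]
    bag = np.zeros(len(library), dtype=int)
    for start in range(len(ca) - k + 1):
        window = ca[start:start + k]
        errors = [kabsch_rmsd(window, frag) for frag in library]
        bag[int(np.argmin(errors))] += 1
    return bag

def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero count vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy library (hypothetical): an extended and a bent 4-residue fragment.
straight = 3.8 * np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0]], dtype=float)
bent = 3.8 * np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=float)
toy_library = np.stack([straight, bent])

# A 6-residue extended chain has 3 overlapping windows, all best matched by `straight`.
chain = 3.8 * np.column_stack([np.arange(6.0), np.zeros(6), np.zeros(6)])
print(fragbag_vector(chain, toy_library))
```

With a real library (for example 400 fragments of 11 residues, as in the text), the resulting vectors would have length 400, and `cosine_distance` is one possible way to compare them.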
FragBag: A fast and accurate filter for structure search based on fragment descriptors

In protein structural search of the PDB, the query is a structure, specified by a chain in $\mathbb{R}^3$, and the goal is to find all PDB proteins whose structure is sufficiently similar to it (allowing insertions/deletions and modifications). More specifically, we focus on the task of identifying proteins with similar structures yet non-similar sequences, since identifying proteins whose sequences and structures are similar is far easier. Namely, we would like to quickly and accurately compare a single query structure vs. a large set

Figure 1: Sequence diversity maps of protein structure space (from two views); the SCOP-class maps are shown on the left for comparison. We see that sequence diversity is far higher in the all-beta regions than it is in the all-alpha regions.

functional and sequence diversity, which are defined at each point of structure space through a whole collection of structures in the vicinity of that point. By coloring the maps according to the values of these properties, we can visualize their distribution across structure space. Osadchy and Kolodny (2011) study functional diversity in structure space. Functional diversity measures the variability of function in the structural vicinity of each protein. Notice that to measure the functional diversity in a meaningful manner, we must map redundant, and hence fairly large, datasets. They show that protein structure space has a functionally diverse core and that diversity drops toward the periphery of the space; furthermore, this is a fundamental characteristic pattern of structure space (Osadchy and Kolodny, 2011). The highly diverse core of structure space includes mainly alpha/beta domains, which were suggested by phylogenetic analysis as most ancient (Winstanley, et al., 2005).
This observation has practical value for protein function prediction based on structural similarity: it suggests that if the protein lies in the periphery of structure space, then its neighbors have relatively few functions that need to be considered. If, on the other hand, the protein lies in the functionally diverse core, then its neighbors jointly have many functions to consider. To demonstrate that this is indeed the case, Osadchy and Kolodny (2011) analyze Watson et al.'s protein function predictions from structural similarity (Watson, et al., 2005), and show that these predictions were indeed more successful for proteins lying in less diverse regions of structure space.

Data-mining protein structure space using its three-dimensional maps

Using protein fragment libraries, we can efficiently calculate maps of protein structure space (Osadchy and Kolodny, 2011). Maps are low-dimensional (2 or 3) visualizations of structure space. In these maps, we hope to identify meaningful patterns; that is, to data-mine protein structure space. Maps of structure space were originally studied by Orengo et al. (Orengo, et al., 1993), Holm and Sander (Holm and Sander, 1996), and then by Kim and colleagues (Hou, et al., 2005; Hou, et al., 2003; Sims, et al., 2005). In these maps, protein structures are drawn as points (in 2 or 3 dimensions), so that the distance between any two points depends on the structural similarity of the proteins they represent. This provides a comprehensive visualization of structure space, which is not constrained by a hierarchical system such as the Structural Classification of Proteins (SCOP) (Murzin, et al., 1995). The maps in the above-mentioned studies were calculated using a computational procedure called multidimensional scaling (MDS) (Tenenbaum, et al., 2000). To calculate these maps for N protein structures, one first needs to calculate all N × N structural similarities amongst them (using any structural alignment method).
Then, MDS converts this N × N matrix to a 3 × N (or 2 × N) matrix of the 3 (or 2) dimensional coordinates for these N proteins. Unfortunately, the calculation requires an eigenvector decomposition of the N × N matrix, thus restricting N, the number of protein structures in a map.

Instead, one can rely on the FragBag model and calculate low-dimensional maps of structure space far more efficiently (Osadchy and Kolodny, 2011). To calculate these alternative maps for N protein structures, one needs to calculate the FragBag vectors of these proteins: each structure is a vector of size L (where L is the number of fragments in the library); these protein structures can thus be viewed as points in an L-dimensional space. Here, we first normalize each FragBag vector, so that the squared Euclidean distance in L dimensions is a constant multiple (a factor of 2) of the cosine distance between the FragBag vectors. To project these to a lower dimension, we can then use Principal Component Analysis (PCA). The maps generated by PCA and MDS are the same (up to a reflection and rotation of the entire space) if the distances in the MDS matrix are the Euclidean distances between the vectors in the PCA matrix. However, the PCA calculation is far more efficient because it only requires an eigenvalue decomposition of an L × L matrix (L = 400 in our case), regardless of the database size (N).

Studying Sequence Diversity

Figure 1 shows three-dimensional maps of protein structure space, colored by the sequence diversity in this space. To calculate sequence diversity, we align the sequence (using BLAST) of each protein in the dataset (of over 31,000 domains) to 20 (randomly sampled) proteins whose structures lie in its near vicinity in structure space (within 0.0005 in the three-dimensional space); then, we average the sequence similarity to these proteins. The average sequence similarity can be used as a measure of the sequence diversity in this vicinity.
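As a rough illustration of the PCA-based mapping described above, the following sketch normalizes count vectors to unit length, decomposes only the L × L covariance matrix, and projects to three dimensions. The function name `pca_map` and the random toy vectors are hypothetical stand-ins; real FragBag vectors would be computed from structures.

```python
import numpy as np

def pca_map(bags, dim=3):
    """Project FragBag-style count vectors (N x L) to `dim` dimensions with PCA."""
    X = np.asarray(bags, dtype=float)
    # Normalize each vector to unit length (assumes no all-zero rows).
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    Xc = X - X.mean(axis=0)
    # Only an L x L matrix is decomposed, regardless of the number of structures N.
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :dim]          # leading principal directions
    return Xc @ top                          # N x dim map coordinates

# Toy data (hypothetical): 1000 random count vectors over a 400-fragment library.
rng = np.random.default_rng(0)
toy_bags = rng.integers(1, 20, size=(1000, 400))
coords = pca_map(toy_bags, dim=3)
print(coords.shape)  # (1000, 3)

# For unit vectors, squared Euclidean distance equals twice the cosine distance.
u = toy_bags[0] / np.linalg.norm(toy_bags[0])
v = toy_bags[1] / np.linalg.norm(toy_bags[1])
print(np.isclose(np.sum((u - v) ** 2), 2.0 * (1.0 - u @ v)))
```

The cost of the eigendecomposition here depends on L (400 in the text) and not on N, which is the efficiency argument made above.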
We color-code each point by the sequence diversity: cases of low diversity (i.e., sequence similarity is high) are colored in red and those with high diversity (i.e., sequence similarity is low) are colored in light blue. This reveals another fundamental characteristic of structure space: sequence diversity in the all-beta regions is far higher than the sequence diversity in the all-alpha regions.

Studying Functional Diversity

Using fragment libraries and PCA, we can map a very large set of >30,000 protein structures. Rather than studying single structures, we focus on properties such as

each 11 residues long. We see that the filter based on local structure is the top performer, and it identifies almost all alignments that match similar sub-structures (e.g., GDT_TS greater than 0.8 or RMSD less than 2.5). As a side note, we mention that filtering by percent similarity outperforms filtering by E-value. Here, the filter is based on the true local structure of the matched residues. Nonetheless, this is a meaningful observation, because, as we survey above, local structure can be predicted from sequence. Hence, we propose to use local structure predictions to improve the accuracy of sequence alignment methods. Proposing a particular filter also suggests an optimization function for designing a local structure predictor, as well as a meaningful evaluation protocol. We expect the performance of the local structure filter to deteriorate when using predictions; however, since the performance is very good, we believe that this may still prove to be a valuable direction.

Local structure can be used to identify meaningful sequence alignments

Figure 2: Global structure similarity of sequence alignments found by HHSearch and PSI-BLAST, compared to those of filtered sets of alignments based on local structure similarity, percent similarity and E-values. We see that local structure can help identify the alignments that match globally similar sub-structures.
Future Directions

The tools that we describe above are structure-based, and use protein structure descriptors that are based on fragment libraries. Given the actual structure, it is straightforward to describe it in terms of a small set of fragments. Less straightforward is calculating this description from the protein sequence. Thus, we need to develop ways to predict the fragment approximations from the protein sequence. This will allow us to extend these tools to search and data-mine the far larger sequence databases. Fortunately, predicting local structure is easier than predicting full structures, and there is ample evidence that it is very feasible. Importantly, to be used in these applications, the local structure prediction scheme needs to be optimized in terms of the average (local) geometric similarity of the predictions with respect to the true structure. We believe that using protein fragment libraries will enable us to leverage protein structural information to create more sensitive tools to study protein sequences.

Next, we consider a seemingly different topic and study how well we can identify which sequence alignments match regions in the proteins that are structurally similar. This topic is closely related, as in this case too, we use fragment libraries to improve our computational tools. To do this, we measure the global geometric similarity of the sub-structures matched by sequence alignments. More specifically, we consider a representative set of (over 6,600) PDB proteins, and align all vs. all sequences using two leading sequence alignment methods: PSI-BLAST (Altschul, et al., 1997) and HHSearch (Soding, 2005). Since we know the structures of the proteins in our dataset, we can also measure the global geometric similarity of the sub-structures matched by the alignment; we use two measures: RMSD and GDT_TS.
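To make the two measures concrete, here is a minimal sketch that assumes the alignment yields two equally long arrays of matched C-alpha coordinates. The RMSD is computed after a Kabsch rigid superposition. The GDT_TS-like score is a simplification: it uses that single superposition and the customary 1, 2, 4 and 8 Å cutoffs, whereas the standard GDT_TS procedure searches over many superpositions, so this value is only a lower-bound style approximation. All function names are hypothetical.

```python
import numpy as np

def superpose(P, Q):
    """Rigidly move P onto Q (Kabsch); P and Q are (n, 3) arrays of matched atoms."""
    Pc, Qc = P.mean(axis=0), Q.mean(axis=0)
    U, _, Vt = np.linalg.svd((P - Pc).T @ (Q - Qc))
    # Force a proper rotation (determinant +1), never a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return (P - Pc) @ R.T + Qc

def rmsd(P, Q):
    """Root-mean-square deviation of matched atoms after superposition."""
    diff = superpose(P, Q) - Q
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))

def gdt_ts_single_superposition(P, Q, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS in [0, 1]: mean fraction of atoms within each cutoff."""
    dist = np.linalg.norm(superpose(P, Q) - Q, axis=1)
    return float(np.mean([(dist <= c).mean() for c in cutoffs]))

# Toy check (hypothetical data): a rotated and shifted copy superposes exactly.
rng = np.random.default_rng(0)
Q = rng.normal(scale=5.0, size=(50, 3))
angle = 0.7
R = np.array([[np.cos(angle), -np.sin(angle), 0.0],
              [np.sin(angle),  np.cos(angle), 0.0],
              [0.0,            0.0,           1.0]])
P = Q @ R.T + np.array([3.0, -2.0, 1.0])
print(round(rmsd(P, Q), 6), gdt_ts_single_superposition(P, Q))
```

Under these assumptions, an alignment would pass the example thresholds quoted in the text when the GDT_TS-like score exceeds 0.8 or the RMSD falls below 2.5.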
Figure 2 shows histograms of the geometric similarity scores RMSD and GDT_TS for the set of all sequence alignments found by the two sequence aligners, along with several filtered subsets of these alignments. An ideal filter will maintain only the alignments that match geometrically similar sub-structures (i.e., of low RMSD, or high GDT_TS). Notice that this filter is applied after the alignments are found (hence denoted a post-filter); it differs from a filter in the filter-and-refine paradigm mentioned above, which is applied before actually aligning the proteins. We consider three filters: (1) by the sequence similarity of the aligned residues (using BLOSUM62 and a threshold value of 30%), (2) by the sequence alignment E-value (with a threshold of $10^{-20}$), and (3) by the local structure agreement of matching fragments. The local structure agreement filter is based on the library of 400 fragments,

References

Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucl. Acids Res., 25, 3389-3402. Aung, Z. and Tan, K.-L. (2007) Rapid retrieval of protein structures from databases, Drug Discovery Today, 12, 732-739. Benros, C., et al. (2006) Assessing a novel approach for predicting local 3D protein structures from sequence, Proteins, 62, 865-880. Berman, H.M., et al. (2000) The Protein Data Bank, Nucleic Acids Res, 28, 235-242. Brin, S. and Page, L. (1998) The Anatomy of a Large-Scale Hypertextual Web Search Engine, Computer Networks and ISDN Systems, 30, 107-117. Budowski-Tal, I., Nov, Y. and Kolodny, R. (2010) FragBag, an accurate representation of protein structure, retrieves structural neighbors from the entire PDB quickly and accurately, Proc Natl Acad Sci U S A, 107, 3481-3486. Bystroff, C. and Baker, D. (1998) Prediction of local structure in proteins using a library of sequence-structure motifs, J Mol Biol, 281, 565-577. Bystroff, C., Thorsson, V. and Baker, D.
(2000) HMMSTR: a hidden Markov model for local sequence-structure correlations in proteins, J Mol Biol, 301, 173-190. Carugo, O. and Pongor, S. (2002) Protein fold similarity - comparison, Journal of Molecular Biology, 315, 887-898. Chivian, D., et al. (2005) Prediction of CASP-6 structures using automated Robetta protocols, Proteins: Structure, Function, and Bioinformatics, 9999, NA. Choi, I.G., Kwon, J. and Kim, S.H. (2004) Local feature frequency profile: a method to measure structural similarity in proteins, Proc Natl Acad Sci U S A, 101, 3797-3802. De Brevern, A.G., et al. (2007) "Pinning strategy": a novel approach for predicting the backbone structure in terms of protein blocks from sequence, J Biosci, 32, 51-70. de Brevern, A.G., Etchebest, C. and Hazout, S. (2000) Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks, Proteins, 41, 271-287. de Brevern, A.G., et al. (2002) Extension of a local backbone description using a structural alphabet: a new approach to the sequence-structure relationship, Protein Sci, 11, 2871-2886. Etchebest, C., et al. (2005) A structural alphabet for local protein structures: improved prediction methods, Proteins, 59, 810-827. Faraggi, E., et al. (2009) Predicting Continuous Local Structure and the Effect of Its Substitution for Secondary Structure in Fragment-Free Protein Structure Prediction, Structure (London, England : 1993), 17, 1515-1527. Holm, L. and Sander, C. (1996) Mapping the protein universe, Science, 273, 595-603. Hou, J., et al. (2005) Global mapping of the protein structure space and application in structure-based inference of protein function, Proc Natl Acad Sci U S A, 102, 3651-3656. Hou, J., et al. (2003) A global representation of the protein fold space, Proc Natl Acad Sci U S A, 100, 2386-2390. Hunter, C.G. and Subramaniam, S. (2003) Protein local structure prediction from sequence, Proteins: Structure, Function, and Bioinformatics, 50, 572-579. Kolodny, R., et al. 
(2002) Small libraries of protein fragments model native protein structures accurately, J Mol Biol, 323, 297-307. Kolodny, R., Koehl, P. and Levitt, M. (2005) Comprehensive Evaluation of Protein Structure Alignment Methods: Scoring by Geometric Measures, Journal of Molecular Biology, 346, 1173-1188. Krissinel, E. and Henrick, K. (2004) Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions, Acta Crystallogr D, 60, 2256-2268. Murzin, A.G., et al. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures, J Mol Biol, 247, 536-540. Offmann, B., Tyagi, M., and de Brevern, A.G. (2007) Local Protein Structures, Current Bioinformatics, 2, 165-202. Orengo, C.A., et al. (1993) Identification and classification of protein fold families, Protein Engineering, 6, 485-500. Osadchy, M. and Kolodny, R. (2011) Maps of protein structure space reveal a fundamental relationship between protein structure and function, Proceedings of the National Academy of Sciences, 108, 12301-12306. Pei, J. and Grishin, N.V. (2004) Combining evolutionary and structural information for local protein structure prediction, Proteins: Structure, Function, and Bioinformatics, 56, 782-794. Rogen, P. and Fain, B. (2003) Automatic classification of protein structure by using Gauss integrals, Proc Natl Acad Sci U S A, 100, 119-124. Rost, B. (2001) Review: Protein Secondary Structure Prediction Continues to Rise, Journal of Structural Biology, 134, 204-218. Sander, O., Sommer, I. and Lengauer, T. (2006) Local protein structure prediction using discriminative models, BMC Bioinformatics, 7, 14. Shindyalov, I.N. and Bourne, P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path, Protein Eng, 11, 739-747.
Simons, K.T., et al. (1999) Ab initio protein structure prediction of CASP III targets using ROSETTA, Proteins: Structure, Function, and Genetics, 37, 171-176. Sims, G.E., Choi, I.G. and Kim, S.H. (2005) Protein conformational space in higher order phi-Psi maps, Proc Natl Acad Sci U S A, 102, 618-621. Soding, J. (2005) Protein homology detection by HMM-HMM comparison, Bioinformatics, 21, 951-960. Subbiah, S., Laurents, D.V. and Levitt, M. (1993) Structural similarity of DNA-binding domains of bacteriophage repressors and the globin core, Curr Biol, 3, 141-148. Tenenbaum, J.B., Silva, V.d. and Langford, J.C. (2000) A Global Geometric Framework for Nonlinear Dimensionality Reduction, Science, 290, 2319-2323. Watson, J.D., Laskowski, R.A. and Thornton, J.M. (2005) Predicting protein function from sequence and structural data, Curr Opin Struct Biol, 15, 275-284. Winstanley, H.F., Abeln, S. and Deane, C.M. (2005) How old is your fold?, Bioinformatics, 21, 449-458. Yang, A.-S. and Wang, L.-y. (2003) Local structure prediction with local structure-based sequence profiles, Bioinformatics, 19, 1267-1274. Zotenko, E., O'Leary, D. and Przytycka, T. (2006) Secondary structure spatial conformation footprint: a novel method for fast protein structure comparison and classification, BMC Structural Biology, 6, 12.