Artificial Intelligence and Robotics Methods in Computational Biology: Papers from the AAAI 2013 Workshop
Using Protein Fragments for Searching
and Data-Mining Protein Databases
Chen Keasar and Rachel Kolodny
Ben Gurion University & Haifa University, Israel
chen@cs.bgu.ac.il, trachel@cs.haifa.ac.il
target of almost all medicines. Hence, developing fast and
accurate tools to study them is of great importance.
"Scientists" and "practitioners" study proteins from two
complementary perspectives. The scientists study currentday proteins to better understand protein evolution and to
characterize the physical and chemical constraints that
govern their behavior (e.g., protein folding and
interaction). For this purpose, data-mining the protein
database is useful. Alternatively, the practitioners need
tools that will help characterize proteins of interest. Most
importantly, given a specific protein, they wish to identify
related proteins that are better characterized, and from
which relevant knowledge can be derived. For this
purpose, fast and accurate search tools are useful. The two
perspectives are intertwined: insights into the properties or
the evolution of proteins can be used to design better tools,
and novel tools can be used to gain insights about the
nature of proteins.
We focus on computational methods that are structure-based. A protein structure can be described (sufficiently
well) by its backbone, and even only by its C-alpha atoms.
Namely, we represent a protein of length n by a sequence
of coordinates: $a_1, a_2, \ldots, a_n$, where $a_i \in \mathbb{R}^3$ for $1 \le i \le n$.
There are fewer than 3n degrees of freedom in this chain,
because there are additional geometric constraints (e.g., the
distance between two consecutive atoms is nearly fixed).
These additional constraints have been characterized
empirically, and it was shown that the repertoire of short
backbone segments (often denoted fragments) in PDB
proteins is fairly limited (Kolodny, et al., 2002).
Abstract
Proteins are macromolecules involved in virtually all of life's
processes. Protein sequence and structure data are
accumulated at an ever-increasing rate in publicly available
databases. To extract knowledge from these databases, we
need efficient and accurate tools; this is a major goal of
computational structural biology.
The tasks we consider are searching and mining protein
data; we rely on protein fragment libraries to build more
efficient tools. We describe FragBag – an example of using
fragment libraries to improve protein structural search. To
search for patterns in structure space, we discuss methods to
generate efficient low-dimensional maps. In particular, we
use these maps to identify patterns of functional diversity
and sequence diversity.
Finally, we discuss how to extend these methods to protein
sequences. To do this, one needs to predict local structure
from sequence; we survey previous work that suggests that
this is a very feasible task. Furthermore, we show that such
predictions can be used to improve sequence alignments.
Namely, protein fragments can be used to leverage protein
structural data to improve remote homology detection.
Searching and data-mining protein databases
To extract "knowledge" from the rapidly expanding protein
databases, we need computational tools that can search and
data-mine. We focus on the database of protein structures:
the PDB (Berman, et al., 2000), with its over 80,000
entries; the databases of protein sequences are 1-2 orders
of magnitude larger. These databases describe current-day
proteins. Evolutionary theory suggests that these emerged
from ancient proteins through a series of mutations and
duplications at the sequence level.
Thus, sequence
similarity of two proteins is considered a sign of
homology, or evolutionary relatedness. In cases where the
sequences diverged and are no longer similar, similarity of
the (more conserved) structures may hint to a remote
homology. Proteins are, arguably, the most important
macromolecules in the process of life; they are also the
Protein structure can be discretized using protein
fragments
The restricted repertoire of local protein structures can be
described at different levels of detail. The coarsest level is
secondary structure, which classifies each residue as part
of a helix, strand, or coil. Fragment libraries describe local
structure in more detail via collections of commonly
occurring backbone fragments (Kolodny, et al., 2002).
Fragment libraries vary in their construction, in their
number and lengths of fragments, and consequently in the
accuracy by which their elements can approximate protein
structure (see, for example, the review by Offmann (2007)).
From a practical perspective, fragment libraries are useful:
We can describe protein structures as letter strings which
correspond to the indices of the best-approximating library
fragments for each segment along the protein backbone.
This is, in effect, a discretization scheme for protein
structure.
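The discretization described above can be sketched as follows. This is a minimal illustration, not the procedure of any specific fragment library: the function names, the toy two-fragment library and its coordinates are all hypothetical. To keep the sketch self-contained, we score a window against a library fragment with distance-RMSD (the RMS difference of internal C-alpha distances) as a stand-in for coordinate RMSD after optimal superposition; unlike the latter, it cannot distinguish mirror images.

```python
import math

def pairwise_distances(points):
    """Return the list of distances between all pairs of 3D points."""
    return [math.dist(points[i], points[j])
            for i in range(len(points))
            for j in range(i + 1, len(points))]

def drmsd(seg_a, seg_b):
    """Distance-RMSD between two equal-length segments.

    Assumption: used here as a superposition-free stand-in for
    coordinate RMSD after optimal superposition.
    """
    da, db = pairwise_distances(seg_a), pairwise_distances(seg_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(da, db)) / len(da))

def discretize(backbone, library):
    """Map each sliding backbone window to the index of its closest library fragment."""
    frag_len = len(library[0])
    indices = []
    for start in range(len(backbone) - frag_len + 1):
        window = backbone[start:start + frag_len]
        best = min(range(len(library)), key=lambda k: drmsd(window, library[k]))
        indices.append(best)
    return indices

# Hypothetical toy library of two 3-residue fragments (3.8 units apart).
straight = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
library = [straight, bent]

# A 4-residue chain: a straight stretch followed by a right-angle turn.
backbone = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (7.6, 3.8, 0.0)]
print(discretize(backbone, library))
```

The resulting list of indices plays the role of the "letter string" for the structure.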
To some extent, one can predict local structure, i.e., the
structure of protein fragments, from sequence.
Secondary structure prediction classifies residues to one of
three states and reaches a 75%-80% correct prediction rate
(Rost, 2001). Fragment library predictions are also
common; yet the success rates of predictions by different
groups (Benros, et al., 2006; Bystroff and Baker, 1998;
Bystroff, et al., 2000; De Brevern, et al., 2007; de Brevern,
et al., 2000; de Brevern, et al., 2002; Etchebest, et al.,
2005; Faraggi, et al., 2009; Hunter and Subramaniam,
2003; Offmann, 2007; Pei and Grishin, 2004; Sander, et
al., 2006; Yang and Wang, 2003) cannot be directly
compared, because they classify to different libraries, of
different sizes and fragment lengths. The comparison is
further complicated because predicting the structure of
shorter fragments is easier than that of longer fragments
(the former can be deduced from the latter, and shorter
fragments are better represented in the PDB (Chivian, et
al., 2005; Simons, et al., 1999)). Current methods typically
consider local
structure prediction a classification problem, where a
single class (secondary structure element or library
fragment) is assigned to each residue. However, when
designing a local structure prediction method, one must
keep in mind the application of these predictions. Often,
this requires optimizing the accuracy in terms of the
geometric agreement with the true structure (or its best
library approximation).
Below, we describe several
examples of such applications for search and data-mining
methods, and highlight their corresponding optimization
functions for protein fragment prediction.
of structures. Comparing a pair of structures is well-studied: this is the so-called protein structural alignment
problem, which has many solutions (e.g., Kolodny, et al.,
2005; Krissinel and Henrick, 2004; Shindyalov and
Bourne, 1998; Subbiah, et al., 1993). Unfortunately,
structural alignment methods are relatively slow, implying
that the naïve implementation of structure search which
compares the query to all PDB proteins, one by one, using
structural alignment, is far too expensive computationally.
Instead, scholars suggested the "filter-and-refine" paradigm
for structure search (Aung and Tan, 2007). A filter method
quickly sifts through a large set of structures, and selects a
small candidate set to be structurally aligned by a more
accurate, yet computationally expensive, method. Filter
methods gain their speed by representing structures
abstractly – typically as vectors – and quickly comparing
these representations.
Furthermore, such vector
representations can be stored in an inverted index — a data
structure that enables fast retrieval of neighbors, even in
huge datasets [e.g., (Brin and Page, 1998)]. Filter methods
include PRIDE (Carugo and Pongor, 2002), SGM (Rogen
and Fain, 2003), a method by Choi et al. (Choi, et al.,
2004), and a method by Zotenko et al. (Zotenko, et al.,
2006).
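The filter-and-refine paradigm can be sketched in a few lines. This is a generic illustration under stated assumptions, not the implementation of any of the cited filters: the database layout (name, vector, structure), the function names and the toy distance and scoring functions are hypothetical placeholders for a real vector distance and a real structural aligner.

```python
import heapq

def filter_and_refine(query_vec, query_struct, database, cheap_dist, expensive_score, k):
    """Rank the k best filter candidates by an expensive similarity score.

    database: list of (name, vector, structure) tuples (hypothetical layout).
    cheap_dist: fast distance between vectors (lower is closer).
    expensive_score: slow similarity between structures (higher is better).
    """
    # Filter: keep the k entries whose vectors are closest to the query vector.
    candidates = heapq.nsmallest(k, database, key=lambda e: cheap_dist(query_vec, e[1]))
    # Refine: run the expensive comparison on the candidates only.
    scored = [(expensive_score(query_struct, e[2]), e[0]) for e in candidates]
    scored.sort(reverse=True)
    return [name for _, name in scored]

def l1_distance(u, v):
    """Toy cheap distance between two vectors."""
    return sum(abs(x - y) for x, y in zip(u, v))

expensive_calls = []

def toy_alignment_score(a, b):
    """Toy stand-in for a structural aligner; records how often it is called."""
    expensive_calls.append((a, b))
    return -abs(a - b)

# Toy database: "structures" are single numbers so the example stays runnable.
database = [("p1", [1, 0], 10.0), ("p2", [5, 5], 50.0),
            ("p3", [1, 1], 11.0), ("p4", [9, 0], 90.0)]
hits = filter_and_refine([1, 0], 10.2, database, l1_distance, toy_alignment_score, k=2)
print(hits, len(expensive_calls))
```

The expensive comparison runs only k times, regardless of the database size; the cost of the filter step depends on how the vectors are indexed.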
FragBag is a fast and accurate filter method, which relies
on a discrete representation of protein structures using
fragment libraries (Budowski-Tal, et al., 2010). FragBag
represents structures succinctly as a bag of words (BOW)
of their backbone fragments. The BOW is a vector whose
entries count the number of times each library fragment
approximates a segment in the protein backbone. For
example, using a library of 400 fragments, each 11
residues long, a structure is modeled by a 400-long vector
in which each entry counts the number of times a specific
library fragment best approximates a backbone segment.
Using FragBag vectors, one can approximate the similarity
between two structures by the similarity between their
corresponding vectors.
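As a minimal sketch of the bag-of-words idea, the code below counts fragment indices into a vector and compares two such vectors. The function names and the toy library size of 4 are hypothetical, and we assume the cosine distance as the vector similarity (the distance this paper uses later for the maps); other vector distances could be substituted.

```python
import math
from collections import Counter

def fragbag_vector(fragment_indices, library_size):
    """Count how often each library fragment index occurs along the backbone."""
    counts = Counter(fragment_indices)
    return [counts.get(k, 0) for k in range(library_size)]

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical fragment-index strings for three structures (library of 4 fragments).
vec_a = fragbag_vector([0, 0, 1, 2], library_size=4)
vec_b = fragbag_vector([0, 1, 0, 2], library_size=4)  # same fragments, different order
vec_c = fragbag_vector([3, 3, 3, 3], library_size=4)  # no fragments in common with a

print(vec_a, vec_b, vec_c)
print(cosine_distance(vec_a, vec_b), cosine_distance(vec_a, vec_c))
```

Because the vector only counts occurrences, the order of fragments along the chain is ignored: the first two structures get identical vectors.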
FragBag is more accurate than other publicly-available
filters (Carugo and Pongor, 2002; Rogen and Fain, 2003;
Zotenko, et al., 2006). More importantly, FragBag detects
homologues as reliably as two highly trusted structural
alignment methods, STRUCTAL (Subbiah, et al., 1993)
and CE (Shindyalov and Bourne, 1998), yet runs several
orders of magnitude faster. Also, FragBag is rather robust:
its performance is only mildly affected by the
parameterization of the fragment library (Budowski-Tal, et
al., 2010) and by fraglet assignment errors (data not
shown). Interestingly, the success of FragBag reveals
something about the distribution of protein structures in
Nature, as it exploits the property that protein domain
structures that have similar local composition tend to be
globally similar.
FragBag: A fast and accurate filter for structure search
based on fragment descriptors
In protein structural search of the PDB, the query is a
structure, specified by a chain in R3, and the goal is to find
all PDB proteins whose structure is sufficiently similar to it
(allowing insertions/deletions and modifications). More
specifically, we focus on the task of identifying proteins
with similar structures yet non-similar sequences, since
identifying proteins whose sequences and structures are
similar is far easier. Namely, we would like to quickly and
accurately compare a single query structure vs. a large set
Figure 1: Sequence diversity maps of protein structure
space (from two views); the SCOP-class maps are shown
on the left for comparison. We see that sequence diversity
is far higher in the all-beta regions than it is in the all-alpha regions.
functional and sequence diversity, which are defined at
each point of structure space through a whole collection of
structures in the vicinity of that point.
By coloring the maps according to the values of these
properties, we can visualize their distribution across
structure space. Osadchy and Kolodny (2011) study
functional diversity in structure space. Functional diversity
measures the variability of function in the structural
vicinity of each protein. Notice that to measure the
functional diversity in a meaningful manner, we must map
redundant, and hence fairly large, datasets. There, they
show that protein structure space has a functionally diverse
core and that diversity drops toward the periphery of the
space; furthermore, this is a fundamental characteristic
pattern of structure space (Osadchy and Kolodny, 2011).
The highly diverse core of structure space includes mainly
alpha/beta domains, which were suggested by phylogenetic
analysis as most ancient (Winstanley, et al., 2005).
This observation has practical value for protein function
prediction based on structural similarity: it suggests that if
the protein lies in the periphery of structure space, then its
neighbors have relatively few functions that need to be
considered. If, on the other hand, the protein lies in the
functionally diverse core, then its neighbors have jointly
many functions to consider. To demonstrate that this is
indeed the case, Osadchy and Kolodny (2011) analyze
Watson et al.’s protein function predictions from structural
similarity (Watson, et al., 2005), and show that indeed,
they were more successful in predicting function from
structure for proteins lying in less diverse regions of
structure space.
Data-mining protein structure space using its three-dimensional maps
Using protein fragment libraries, we can efficiently
calculate maps of protein structure space (Osadchy and
Kolodny, 2011). Maps are low (2 or 3) dimensional
visualizations of structure space. In these maps, we hope
to identify meaningful patterns, or to data-mine, protein
structure space. Maps of structure space were originally
studied by Orengo et al., (Orengo, et al., 1993), Holm and
Sander (Holm and Sander, 1996), and then by Kim and
colleagues (Hou, et al., 2005; Hou, et al., 2003; Sims, et
al., 2005). In these maps, protein structures are drawn as
points (in 2 or 3 dimensions), so that the distance between
any two points depends on the structural similarity of the
proteins they represent. This provides a comprehensive
visualization of structure space, which is not constrained
by a hierarchical system such as the Structural
Classification of Proteins (SCOP) (Murzin, et al., 1995).
The maps in the above-mentioned studies were calculated
using a computational procedure called multidimensional
scaling (MDS) (Tenenbaum, et al., 2000). To calculate
these maps for N protein structures, one first needs to
calculate all NxN structural similarities amongst them
(using any structural alignment method). Then, MDS
converts this NxN matrix to a 3xN (or 2xN) matrix of the 3 (or 2)
dimensional coordinates for these N proteins.
Unfortunately, the calculation requires an eigenvector
decomposition of the NxN matrix, thus restricting N, the
number of protein structures in a map.
Instead, one can rely on the FragBag model and calculate
low dimensional maps of structure space far more
efficiently (Osadchy and Kolodny, 2011). To calculate
these alternative maps for N protein structures, one needs
to calculate the FragBag vectors of these proteins: each
structure is a vector of size L (where L is the number of
fragments in the library); these protein structures can be
thus viewed as points in an L dimensional space. Here, we
first normalize each FragBag vector, so the squared Euclidean
distance in L dimensions is a constant factor of the cosine
distance between the FragBag vectors. To project these to
a lower dimension, we can then use Principal Component
Analysis (PCA). The maps generated by PCA and MDS
are the same (up to a reflection and rotation of the
entire space) if the distances in the MDS matrix are the
Euclidean distances between the vectors in the PCA
matrix. However, the PCA calculation is far more efficient
because it only requires an eigenvalue decomposition of an
L × L matrix (L = 400 in our case), regardless of the
database size (N).
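The projection can be sketched with the standard library alone. This is an illustrative sketch, not the code used to produce the maps: the function names and the toy 4-dimensional vectors are hypothetical, and we use power iteration with deflation to find the leading eigenvectors of the L × L covariance matrix, where a production implementation would more likely call a linear algebra library.

```python
import math

def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def covariance(rows):
    """Center the rows and return (centered rows, L x L covariance matrix)."""
    n, dim = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(dim)]
           for i in range(dim)]
    return centered, cov

def top_eigenvector(matrix, iterations=200):
    """Power iteration; assumes the start vector is not orthogonal to the answer."""
    dim = len(matrix)
    vec = normalize([1.0 / (i + 1) for i in range(dim)])
    for _ in range(iterations):
        vec = normalize([sum(matrix[i][j] * vec[j] for j in range(dim))
                         for i in range(dim)])
    eigval = sum(vec[i] * matrix[i][j] * vec[j]
                 for i in range(dim) for j in range(dim))
    return eigval, vec

def pca_map(fragbag_vectors, n_components=3):
    """Project unit-normalized FragBag vectors onto their top principal components."""
    rows = [normalize(v) for v in fragbag_vectors]
    centered, cov = covariance(rows)
    dim = len(cov)
    components = []
    for _ in range(n_components):
        eigval, vec = top_eigenvector(cov)
        components.append(vec)
        # Deflation: remove the found component so the next one can emerge.
        cov = [[cov[i][j] - eigval * vec[i] * vec[j] for j in range(dim)]
               for i in range(dim)]
    return [[sum(c[j] * comp[j] for j in range(dim)) for comp in components]
            for c in centered]

# Hypothetical FragBag vectors for four structures, library of 4 fragments.
vectors = [[2, 0, 1, 0], [2, 0, 1, 0], [0, 3, 0, 1], [1, 1, 1, 1]]
coords = pca_map(vectors, n_components=2)
for point in coords:
    print([round(x, 3) for x in point])
```

Only the L × L covariance matrix is decomposed, so the cost of the eigenvector step does not grow with the number of structures N; N enters only through the single pass that builds the covariance and the final projection.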
Studying Sequence Diversity
Figure 1 shows three dimensional maps of protein structure
space, colored by the sequence diversity in this space. To
calculate sequence diversity, we align the sequence (using
BLAST) of each protein in the dataset (of over 31000
domains) to 20 (randomly sampled) proteins whose
structures lie in its near vicinity in structure space (within
0.0005 in the three dimensional space); then, we average
the sequence similarity to these proteins. The average
sequence similarity can be used as a measure of the
sequence diversity in this vicinity. We color-code the
point by the sequence diversity: cases of low diversity (i.e.,
sequence similarity is high) are colored in red and those
with high diversity (i.e., sequence similarity is low) are
colored in light blue. This reveals another fundamental
characteristic of structure space: sequence diversity in the
all-beta regions is far higher than the sequence diversity in
the all-alpha regions.
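The diversity calculation described above can be sketched as follows. The function name, the toy coordinates and the lookup table of similarities are hypothetical; in particular, `similarity` stands in for a BLAST-derived sequence similarity, which we do not compute here, and the all-pairs neighbor search is quadratic and would need a spatial index for tens of thousands of domains.

```python
import math
import random

def sequence_diversity(coords, similarity, radius, n_samples=20, seed=0):
    """Average sequence similarity of each protein to sampled map neighbors.

    coords: map coordinates, one 3D point per protein.
    similarity: callable (i, j) -> sequence similarity (placeholder for BLAST).
    Returns one value per protein; None if it has no neighbor within radius.
    A low average similarity corresponds to high sequence diversity.
    """
    rng = random.Random(seed)
    result = []
    for i, p in enumerate(coords):
        neighbors = [j for j, q in enumerate(coords)
                     if j != i and math.dist(p, q) <= radius]
        if not neighbors:
            result.append(None)
            continue
        sampled = rng.sample(neighbors, min(n_samples, len(neighbors)))
        result.append(sum(similarity(i, j) for j in sampled) / len(sampled))
    return result

# Toy map with two close points and one isolated point.
coords = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0)]
toy_similarity = {(0, 1): 0.9, (1, 0): 0.9}
averages = sequence_diversity(coords, lambda i, j: toy_similarity[(i, j)], radius=0.5)
print(averages)
```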
Studying Functional Diversity
Using fragment libraries and PCA, we can map a very
large set of >30,000 protein structures. Rather than
studying single structures, we focus on properties such as
each 11 residues long. We see that the filter based on local
structure is the top-performer, and it identifies almost all
alignments that match similar sub-structures (e.g.,
GDT_TS greater than 0.8 or RMSD less than 2.5). As a
side note we mention that filtering by percent similarity
outperforms filtering by E-value.
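A post-filter of this kind can be sketched as below. The text does not spell out how local structure agreement is scored, so the measure used here (the fraction of aligned residue pairs that are assigned the same library fragment index) is our assumption for illustration only, as are the function names, the alignment layout and the threshold of 0.7.

```python
def local_structure_agreement(frags_a, frags_b, aligned_pairs):
    """Fraction of aligned residue pairs assigned the same fragment index.

    Assumption: an illustrative agreement measure, not the one behind Figure 2.
    """
    matches = sum(1 for i, j in aligned_pairs if frags_a[i] == frags_b[j])
    return matches / len(aligned_pairs)

def post_filter(alignments, threshold):
    """Keep alignments whose local structure agreement reaches the threshold."""
    return [a["name"] for a in alignments
            if local_structure_agreement(a["frags_a"], a["frags_b"], a["pairs"]) >= threshold]

# Two hypothetical alignments; fragment indices per residue, then aligned pairs.
alignments = [
    {"name": "aln1", "frags_a": [3, 3, 7, 1], "frags_b": [3, 5, 7, 1],
     "pairs": [(0, 0), (1, 1), (2, 2), (3, 3)]},
    {"name": "aln2", "frags_a": [3, 3, 7, 1], "frags_b": [2, 2, 2, 1],
     "pairs": [(0, 0), (1, 1), (2, 2), (3, 3)]},
]
kept = post_filter(alignments, threshold=0.7)
print(kept)
```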
Here, the filter is based on the true local structure of the
matched residues. Nonetheless, this is a meaningful
observation, because, as we survey above, local structure
can be predicted from sequence. Hence, we propose to use
local structure predictions to improve the accuracy of
sequence alignment methods.
Since we suggest a
particular filter, this suggests an optimization function for
designing a local structure predictor, as well as a
meaningful evaluation protocol.
We expect the
performance of the local structure filter to deteriorate when
using predictions; however, since the performance is very
good, we believe that this may still prove to be a valuable
direction.
Local structure can be used to identify
meaningful sequence alignments
Figure 2: global structure similarity of sequence
alignments found by HHSearch and PSI-BLAST, compared
to those of filtered sets of alignment based on local
structure similarity, percent similarity and E-values. We
see that local structure can help identify the alignments
that match globally similar sub-structures.
Future Directions
The tools that we describe above are structure-based, and
use protein structure descriptors that are based on fragment
libraries. Given the actual structure, it is straightforward
to describe it in terms of a small set of fragments. Less
straightforward is calculating this description from the
protein sequence. Thus, we need to develop ways to
predict the fragment approximations from the protein
sequence. This will allow us to extend these tools to search
and data-mine the far larger sequence databases.
Fortunately, predicting local structure is easier than
predicting full structures, and there is ample evidence that
it is very feasible. Importantly, to be used in these
applications, the local structure prediction scheme needs to
be optimized in terms of the average (local) geometric
similarity of the predictions with respect to the true
structure. We believe that using protein fragment libraries
will enable us to leverage protein structural information to
create more sensitive tools to study protein sequences.
Next, we consider a seemingly different topic and study
how well we can identify which sequence alignments
match regions in the proteins that are structurally similar.
This topic is closely related, as in this case too, we use
fragment libraries to improve our computational tools. To
do this, we measure the global geometric similarity of the
sub-structures matched by sequence alignments. More
specifically, we consider a representative set of (over
6,600) PDB proteins, and align all vs. all sequences using
two leading sequence alignment methods: PSI-BLAST
(Altschul, et al., 1997) and HHSearch (Soding, 2005).
Since we know the structures of the proteins in our dataset,
we can also measure the global geometric similarity of the
sub-structures matched by the alignment; we use two
measures: RMSD and GDT_TS.
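For reference, the two measures can be sketched as follows. This is a simplified illustration: we assume the two matched sub-structures are already superimposed and listed residue by residue, whereas the full GDT_TS procedure also searches over superpositions for each cutoff. The cutoffs of 1, 2, 4 and 8 Å are the standard ones, the score is reported on a 0 to 1 scale, and the function names and toy coordinates are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between matched, pre-superimposed atoms."""
    squared = [math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b)]
    return math.sqrt(sum(squared) / len(squared))

def gdt_ts(coords_a, coords_b, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average, over the cutoffs, of the fraction of matched atoms within each cutoff.

    Simplification: uses a single fixed superposition for all cutoffs.
    """
    dists = [math.dist(p, q) for p, q in zip(coords_a, coords_b)]
    fractions = [sum(d <= c for d in dists) / len(dists) for c in cutoffs]
    return sum(fractions) / len(cutoffs)

# Four matched atoms with deviations of 0.5, 1.5, 3 and 10 (toy values).
sub_a = [(0.0, 0.0, 0.0)] * 4
sub_b = [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(rmsd(sub_a, sub_b), gdt_ts(sub_a, sub_b))
```

The toy example shows why the two measures complement each other: a single badly placed atom dominates the RMSD, while GDT_TS only counts it as one atom outside every cutoff.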
Figure 2 shows histograms of the geometric similarity
scores RMSD and GDT_TS for the set of all sequence
alignments found by the two sequence aligners, along with
several filtered subsets of these alignments. An ideal filter
will maintain only the alignments that match geometrically
similar sub-structures (i.e., of low RMSD, or high
GDT_TS). Notice that this filter is applied after the
alignments are found (hence denoted a post-filter), which
is different from one in the filter-and-refine paradigm
mentioned above, that is applied before actually aligning
the proteins. We consider three filters: (1) by the sequence
similarity of the aligned residues (using BLOSUM62 and a
threshold value of 30%), (2) by the sequence alignment E-value (and a threshold of $10^{-20}$), (3) by the local structure
agreement of matching fragments. The local structure
agreement filter is based on the library of 400 fragments,
References
Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a
new generation of protein database search programs, Nucl. Acids
Res., 25, 3389-3402.
Aung, Z. and Tan, K.-L. (2007) Rapid retrieval of protein
structures from databases, Drug Discovery Today, 12, 732-739.
Benros, C., et al. (2006) Assessing a novel approach for
predicting local 3D protein structures from sequence, Proteins,
62, 865-880.
Berman, H.M., et al. (2000) The Protein Data Bank, Nucleic
Acids Res, 28, 235-242.
Brin, S. and Page, L. (1998) The Anatomy of a Large-Scale
Hypertextual Web Search Engine, Computer Networks and ISDN
Systems, 30, 107-117.
Budowski-Tal, I., Nov, Y. and Kolodny, R. (2010) FragBag, an
accurate representation of protein structure, retrieves structural
neighbors from the entire PDB quickly and accurately, Proc Natl
Acad Sci U S A, 107, 3481-3486.
Bystroff, C. and Baker, D. (1998) Prediction of local structure in
proteins using a library of sequence-structure motifs, J Mol Biol,
281, 565-577.
Bystroff, C., Thorsson, V. and Baker, D. (2000) HMMSTR: a
hidden Markov model for local sequence-structure correlations in
proteins, J Mol Biol, 301, 173-190.
Carugo, O. and Pongor, S. (2002) Protein fold similarity
-
comparison, Journal of Molecular Biology, 315, 887-898.
Chivian, D., et al. (2005) Prediction of CASP-6 structures using
automated Robetta protocols, Proteins: Structure, Function, and
Bioinformatics, 9999, NA.
Choi, I.G., Kwon, J. and Kim, S.H. (2004) Local feature
frequency profile: a method to measure structural similarity in
proteins, Proc Natl Acad Sci U S A, 101, 3797-3802.
De Brevern, A.G., et al. (2007) "Pinning strategy": a novel
approach for predicting the backbone structure in terms of protein
blocks from sequence, J Biosci, 32, 51-70.
de Brevern, A.G., Etchebest, C. and Hazout, S. (2000) Bayesian
probabilistic approach for predicting backbone structures in terms
of protein blocks, Proteins, 41, 271-287.
de Brevern, A.G., et al. (2002) Extension of a local backbone
description using a structural alphabet: a new approach to the
sequence-structure relationship, Protein Sci, 11, 2871-2886.
Etchebest, C., et al. (2005) A structural alphabet for local protein
structures: improved prediction methods, Proteins, 59, 810-827.
Faraggi, E., et al. (2009) Predicting Continuous Local Structure
and the Effect of Its Substitution for Secondary Structure in
Fragment-Free Protein Structure Prediction, Structure (London,
England : 1993), 17, 1515-1527.
Holm, L. and Sander, C. (1996) Mapping the protein universe,
Science, 273, 595-603.
Hou, J., et al. (2005) Global mapping of the protein structure
space and application in structure-based inference of protein
function, Proc Natl Acad Sci U S A, 102, 3651-3656.
Hou, J., et al. (2003) A global representation of the protein fold
space, Proc Natl Acad Sci U S A, 100, 2386-2390.
Hunter, C.G. and Subramaniam, S. (2003) Protein local structure
prediction from sequence, Proteins: Structure, Function, and
Bioinformatics, 50, 572-579.
Kolodny, R., et al. (2002) Small libraries of protein fragments
model native protein structures accurately, J Mol Biol, 323, 297-307.
Kolodny, R., Koehl, P. and Levitt, M. (2005) Comprehensive
Evaluation of Protein Structure Alignment Methods: Scoring by
Geometric Measures, Journal of Molecular Biology, 346, 1173-1188.
Krissinel, E. and Henrick, K. (2004) Secondary-structure
matching (SSM), a new tool for fast protein structure alignment
in three dimensions, Acta Crystallogr D, 60, 2256-2268.
Murzin, A.G., et al. (1995) SCOP: a structural classification of
proteins database for the investigation of sequences and
structures, J Mol Biol, 247, 536-540.
Offmann, B., Tyagi, M., and de Brevern, A.G. (2007) Local
Protein Structures, Current Bioinformatics, 2, 165-202.
Orengo, C.A., et al. (1993) Identification and classification of
protein fold families, Protein Engineering, 6, 485-500.
Osadchy, M. and Kolodny, R. (2011) Maps of protein structure
space reveal a fundamental relationship between protein structure
and function, Proceedings of the National Academy of Sciences,
108, 12301-12306.
Pei, J. and Grishin, N.V. (2004) Combining evolutionary and
structural information for local protein structure prediction,
Proteins: Structure, Function, and Bioinformatics, 56, 782-794.
Rogen, P. and Fain, B. (2003) Automatic classification of protein
structure by using Gauss integrals, Proc Natl Acad Sci U S A,
100, 119-124.
Rost, B. (2001) Review: Protein Secondary Structure Prediction
Continues to Rise, Journal of Structural Biology, 134, 204-218.
Sander, O., Sommer, I. and Lengauer, T. (2006) Local protein
structure prediction using discriminative models, BMC
Bioinformatics, 7, 14.
Shindyalov, I.N. and Bourne, P.E. (1998) Protein structure
alignment by incremental combinatorial extension (CE) of the
optimal path, Protein Eng, 11, 739-747.
Simons, K.T., et al. (1999) Ab initio protein structure prediction
of CASP III targets using ROSETTA, Proteins: Structure,
Function, and Genetics, 37, 171-176.
Sims, G.E., Choi, I.G. and Kim, S.H. (2005) Protein
conformational space in higher order phi-Psi maps, Proc Natl
Acad Sci U S A, 102, 618-621.
Soding, J. (2005) Protein homology detection by HMM-HMM
comparison, Bioinformatics, 21, 951-960.
Subbiah, S., Laurents, D.V. and Levitt, M. (1993) Structural
similarity of DNA-binding domains of bacteriophage repressors
and the globin core, Curr Biol, 3, 141-148.
Tenenbaum, J.B., Silva, V.d. and Langford, J.C. (2000) A Global
Geometric Framework for Nonlinear Dimensionality Reduction,
Science, 290, 2319-2323.
Watson, J.D., Laskowski, R.A. and Thornton, J.M. (2005)
Predicting protein function from sequence and structural data,
Curr Opin Struct Biol, 15, 275-284.
Winstanley, H.F., Abeln, S. and Deane, C.M. (2005) How old is
your fold?, Bioinformatics, 21, 449-458.
Yang, A.-S. and Wang, L.-y. (2003) Local structure prediction
with local structure-based sequence profiles, Bioinformatics, 19,
1267-1274.
Zotenko, E., O'Leary, D. and Przytycka, T. (2006) Secondary
structure spatial conformation footprint: a novel method for fast
protein structure comparison and classification, BMC Structural
Biology, 6, 12.