Task 1. Classify protein sequences within a given motif.
Objective: to define a sequence space compatible with a given protein structure
and understand the structure of this space with respect to mutations (inverse
folding problem or protein sequence design).
Create three independent sequence spaces for the target protein.
1.1. Sequence-derived space. Identify proteins with sequence homology to the
target protein by searching nonredundant databases such as SwissProt
and TrEMBL [Bairoch, 2000 #22] using PSI-BLAST [Altschul, 1997 #23].
As a variation of the search strategy, in parallel with using the native
sequence of the target protein as a query, use a sequence constructed from
the original one with a simplified 12-letter amino acid alphabet, which has
been shown to produce correct foldable structures in more than 90% of
examined proteins [Murphy, 2000 #10]. Generate a sequence-based multiple
alignment and transform it into a profile, i.e. a position-specific scoring
matrix (PSSM). The profile is generated using the fold and function
assignment system (FFAS) algorithm [Rychlewski, 2000 #19]. Also generate
Hidden Markov Models (HMMs), both directly from the alignment and by using
training algorithms [Eddy, 1998 #4].
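To make the alignment-to-profile step concrete, here is a minimal Python sketch of a log-odds PSSM built from a toy alignment. It is not the FFAS or PSI-BLAST procedure: the function name build_pssm, the uniform background frequency, the flat pseudocount, and the toy alignment are all assumptions made for illustration, and real tools add sequence weighting and empirical background frequencies.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {amino acid: log2-odds score} dict per alignment column.

    Assumptions (illustrative only): uniform background frequencies,
    a flat pseudocount, no sequence weighting, gaps are ignored.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        # Count only standard residues; gap characters are skipped.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


# Hypothetical four-sequence alignment, including one gap.
toy_alignment = ["MKVL", "MKIL", "MRVL", "M-VL"]
pssm = build_pssm(toy_alignment)
```

Positive scores mark residues seen more often than the background expectation at that column; negative scores mark residues seen less often.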
1.2. Structure-derived space. Sequences of proteins whose structures are
homologous to the target protein are extracted from the fold classification
based on structure-structure alignment of proteins database (FSSP)
[Holm, 1994 #17; Holm, 1998 #18] and the structural classification of
proteins database (SCOP) [Murzin, 1995 #24] using the SAP algorithm for
structure alignment [Orengo, 1992 #25]. We will also make use of the
Conserved Key Amino Acid Positions Database (CKAAP) [Li, 2002 #12]
for structure-structure alignment of nonhomologous but structurally
conserved sequences. Structural alignments are used to derive
structure-based multiple alignments, which are then transformed into a
profile.
1.3. Designed space. A set of sequences is designed for a given target
structure using all-atom models and a physical energy function. The
scoring energy function is derived from estimates of the physical forces
that stabilize native protein structures and includes van der Waals
interactions, electrostatics, and an environment free energy [Koehl, 1999
#20]. (Good description of the procedure in [Koehl, 2002 #2]) (There is
another interesting method of defining an “environmental scoring function”
by training on a set of different proteins with known structures [Chang,
2001 #13]. It is supposedly easier than defining a scoring function based
on pairwise interactions between amino acids; we can briefly discuss it).
A rotamer library and the fixed-backbone approximation will be used as
simplifying assumptions [Ponder, 1987 #26]. At later stages, backbone
flexibility may be incorporated into the design algorithms.
Stochastic algorithms, including Monte Carlo (MC) methods [Metropolis,
1953 #27] and Genetic Algorithms (GA) [Holland, 1993 #28], will be used
for sequence optimization because they cope better with problems of
significant combinatorial complexity than deterministic algorithms such as
Self-Consistent Mean Field (SCMF) or Dead End Elimination (DEE) methods
[Voigt, 2000 #29], although a combination of these approaches may later be
used. (Good description of GA in [Raha, 2000 #11]). The designed sequences
are tested for specificity to their target backbone (they should be
incompatible with competing folds) by computer threading using GenThreader
[Jones, 1992 #21]. The designed sequences produce a multiple sequence
alignment that describes the sequence space compatible with the target
structure.
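The Monte Carlo part of the design step can be sketched as a Metropolis search over sequences on a fixed backbone. Everything specific in this Python sketch is an assumption for illustration: toy_energy is a hypothetical burial-based score standing in for the all-atom energy function, the buried/exposed pattern is made up, and no rotamers are modeled.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AVILMFWC")


def toy_energy(sequence, buried):
    """Hypothetical stand-in for a physical energy function (lower is better).

    Rewards hydrophobic residues at buried positions and polar residues
    at exposed positions; penalizes the opposite.
    """
    energy = 0.0
    for aa, is_buried in zip(sequence, buried):
        matches = (aa in HYDROPHOBIC) == is_buried
        energy += -1.0 if matches else 1.0
    return energy


def metropolis_design(buried, steps=5000, temperature=0.5, seed=0):
    """Search sequence space with single-site mutations and Metropolis acceptance."""
    rng = random.Random(seed)
    current = [rng.choice(AMINO_ACIDS) for _ in buried]
    current_e = toy_energy(current, buried)
    best, best_e = list(current), current_e

    for _ in range(steps):
        pos = rng.randrange(len(current))
        old = current[pos]
        current[pos] = rng.choice(AMINO_ACIDS)
        new_e = toy_energy(current, buried)
        delta = new_e - current_e
        # Metropolis criterion: always accept downhill or neutral moves,
        # accept uphill moves with probability exp(-delta / T).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current_e = new_e
            if current_e < best_e:
                best, best_e = list(current), current_e
        else:
            current[pos] = old  # reject: restore the previous residue

    return "".join(best), best_e


# Hypothetical burial pattern for an 8-residue fixed backbone.
buried_pattern = [True, False, False, True, True, False, True, False]
designed, energy = metropolis_design(buried_pattern)
```

Running the search with different seeds yields different low-energy sequences, and collecting those is one simple way to populate the designed alignment.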
1.3.1. Mike, I was thinking of the applicability of methods described in [Kleinberg,
1999 #7] for sequence design, but so far I’m having a problem with it – it
deals extensively with the Grand Canonical model, which uses the H/P
alphabet. At the end of the paper he infers that the same methods could
be used for larger alphabets, but it’s not clear to me that it’s
well-developed at all. Correct me if I’m wrong. It would be nice though if one
could design an “evolutionary trajectory” between connected sequences –
another way to generate more diverse sequence spaces.
1.4. Combine these three sets of sequences together and carry out an analysis
that would create a 3D profile (matrix) that reflects a probability of having a
particular amino acid (or actually a sequence of amino acids) in a
particular position in the structure. I guess this is where we can use
machine learning techniques (e.g. support vector machines, not that I
know anything about that), since we can use this set of sequences as a
training set. Then, once the model is trained, we can ask what kind of
mutations we can introduce into our target sequence in such a way that it
will still be compatible with the protein fold.
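As a placeholder for the trained model, the Python sketch below pools the three spaces, estimates per-position amino acid probabilities, and lists the substitutions that stay above a probability cutoff. This is a simple frequency model, not an SVM and not a 3D profile; the function names, the pseudocount, the 0.1 cutoff, and the toy sequences are all assumptions, and the three alignments are assumed to share one column numbering.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_probabilities(spaces, pseudocount=0.5):
    """Pool several aligned sequence sets into per-position probabilities."""
    pooled = [seq for space in spaces for seq in space]
    length = len(pooled[0])
    probs = []
    for col in range(length):
        counts = Counter(seq[col] for seq in pooled if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        probs.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return probs


def tolerated_mutations(target, probs, min_prob=0.1):
    """Return {position: [substitutions]} whose probability is >= min_prob."""
    result = {}
    for pos, (native, column) in enumerate(zip(target, probs)):
        allowed = sorted(
            aa for aa, p in column.items() if p >= min_prob and aa != native
        )
        if allowed:
            result[pos] = allowed
    return result


# Hypothetical aligned sequences for the three spaces.
sequence_space = ["MKVL", "MRVL"]
structure_space = ["MKIL", "MKVL"]
designed_space = ["MRIL", "MKLL"]

probs = position_probabilities([sequence_space, structure_space, designed_space])
mutations = tolerated_mutations("MKVL", probs)
```

Positions are zero-based here. A trained classifier would replace position_probabilities, but the query in tolerated_mutations would look much the same.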
Task 2. Identify key amino acid positions responsible for characterization
of protein function and behavior.
2.1. We will make use of PSSMs and HMMs that we calculated for each of the
three sequence spaces. Structure and sequence-based ones were
derived from natural sequences, while the designed ones were… well
designed. All of them supposedly represent identical protein folds and are
in essence probabilities of having a particular AA in a certain position in
the linear sequence (here no 3D structure is taken into account). By
comparing those profiles with each other (standard profile-profile
alignment, such as the one used in FFAS), two important pieces of
information could be extracted. First, residues that are vitally important for
protein structure are expected to be well conserved between all three
spaces. Second, it has been shown that residues that are conserved in
native proteins due to the functional constraints (for instance, some
surface positions that form a charged surface that interacts with another
protein) cannot be predicted by sequence prediction algorithms [Raha,
2000 #11] since no functional data is used for the modeling. Therefore,
positions that are highly conserved in sequence and structure profiles but
are different in the designed one are likely to be functionally important.
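The comparison rule in 2.1 can be sketched in Python as a per-column test on the three alignments. This is only an illustration under stated assumptions: the columns are assumed to be already aligned across the three spaces with no all-gap columns, conservation is measured by Shannon entropy with an arbitrary 0.5-bit cutoff, and the function names and toy alignments are hypothetical.

```python
import math
from collections import Counter


def column_profile(alignment, col):
    """Residue frequencies in one column, ignoring gaps."""
    counts = Counter(seq[col] for seq in alignment if seq[col] != "-")
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def entropy(profile):
    """Shannon entropy in bits; 0 means a fully conserved column."""
    return -sum(p * math.log2(p) for p in profile.values() if p > 0)


def consensus(profile):
    """Most frequent residue in the column."""
    return max(profile, key=profile.get)


def classify_positions(sequence_aln, structure_aln, designed_aln, max_entropy=0.5):
    """Label each column as structural, functional candidate, or variable."""
    labels = []
    for col in range(len(sequence_aln[0])):
        seq_p = column_profile(sequence_aln, col)
        str_p = column_profile(structure_aln, col)
        des_p = column_profile(designed_aln, col)

        natural_conserved = (
            entropy(seq_p) <= max_entropy
            and entropy(str_p) <= max_entropy
            and consensus(seq_p) == consensus(str_p)
        )
        if not natural_conserved:
            labels.append("variable")
        elif entropy(des_p) <= max_entropy and consensus(des_p) == consensus(seq_p):
            # Conserved in all three spaces: likely needed for the fold.
            labels.append("structural")
        else:
            # Conserved in nature but not recovered by design.
            labels.append("functional candidate")
    return labels


# Hypothetical alignments sharing the same four columns.
sequence_aln = ["GDKA", "GDKS", "GDKT"]
structure_aln = ["GDKL", "GDKA", "GDKV"]
designed_aln = ["GLEA", "GLES", "GVEA"]

labels = classify_positions(sequence_aln, structure_aln, designed_aln)
```

In this toy case the first column comes out as structural, the second and third as functional candidates, and the last as variable.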
Phase I proof-of-principle should be demonstrated for a family of well-known
proteins for which sufficient data, including numerous amino acid sequences and
associated function, salient sequence positions, and perhaps 3D structure, are
accessible to the principal investigator and available in the open literature to
benchmark the obtained results. Although the Phase I effort may not require the
full integration of 3D structure information into the sequence information, the
encoding scheme to the machine-learning algorithm should provide the
mechanisms for demonstration of such capability in the Phase II effort.
This can be accomplished by analyzing one of the sets of proteins from Pfam DB.
We can select a family with known structure and functions. This should not be a
problem.
All of the above I can expand significantly (methods in detail) except for 1.4. So
this is just a backbone – tell me if it’s making any sense. I will also write
Introduction and Possible uses of this technology besides the use in bioagent
detection.