Task 1. Classify protein sequences within a given motif.

Objective: to define the sequence space compatible with a given protein structure and to understand how this space is organized with respect to mutations (the inverse folding, or protein sequence design, problem). Create three independent sequence spaces for the target protein.

1.1. Sequence-derived space. Identify proteins with sequence homology to the target protein by searching nonredundant databases such as SwissProt and TrEMBL [Bairoch, 2000 #22] using PSI-BLAST [Altschul, 1997 #23]. As a variation of the search strategy, run a parallel query with the target sequence re-encoded in a simplified 12-letter amino acid alphabet, which has been shown to produce correct foldable structures in more than 90% of examined proteins [Murphy, 2000 #10]. Generate a sequence-based multiple alignment and transform it into a profile, i.e. a position-specific scoring matrix (PSSM). The profile is generated using the fold and function assignment system (FFAS) algorithm [Rychlewski, 2000 #19]. Also generate Hidden Markov Models (HMMs), both directly from the alignment and by using training algorithms [Eddy, 1998 #4].

1.2. Structure-derived space. Sequences of proteins whose structures are homologous to the target protein are extracted from the FSSP database (fold classification based on structure-structure alignment of proteins) [Holm, 1994 #17; Holm, 1998 #18] and the SCOP database (structural classification of proteins) [Murzin, 1995 #24] using the SAP algorithm for structure alignment [Orengo, 1992 #25]. We will also make use of the Conserved Key Amino Acid Positions Database (CKAAP) [Li, 2002 #12] for structure-structure alignment of nonhomologous but structurally conserved sequences. The structural alignments are used to derive a structure-based multiple alignment, which is then transformed into a profile.

1.3. Designed space.
A set of sequences is designed for a given target structure using all-atom models and a physical energy function. The scoring function is derived from estimates of the physical forces that stabilize native protein structures and includes van der Waals interactions, electrostatics, and an environment free energy [Koehl, 1999 #20]. (Good description of the procedure in [Koehl, 2002 #2].) (There is another interesting method of defining an “environmental scoring function” by training on a set of different proteins with known structures [Chang, 2001 #13]. It is supposedly easier than defining a scoring function based on pairwise interactions between amino acids; we can briefly discuss it.)

A rotamer library and the fixed-backbone approximation will be used as simplifying assumptions [Ponder, 1987 #26]. At later stages, backbone flexibility may be incorporated into the design algorithms. Stochastic algorithms, including Monte Carlo (MC) methods [Metropolis, 1953 #27] and Genetic Algorithms (GA) [Holland, 1993 #28], will be used for sequence optimization because they cope better with problems of significant combinatorial complexity than deterministic algorithms such as the Self-Consistent Mean Field (SCMF) or Dead End Elimination (DEE) methods [Voigt, 2000 #29], although a combination of these approaches may be used later. (Good description of GA in [Raha, 2000 #11].)

The designed sequences are tested for specificity to their target backbone (they should be incompatible with competing folds) by computer threading using GenThreader [Jones, 1992 #21]. The designed sequences yield a multiple sequence alignment that describes the sequence space compatible with the target structure.

1.3.1. Mike, I was thinking of the applicability of the methods described in [Kleinberg, 1999 #7] for sequence design, but so far I’m having a problem with it: it deals extensively with the Grand Canonical model, which uses the H/P alphabet.
At the end of the paper he infers that the same methods could be used for larger alphabets, but it’s not clear to me that this is well-developed at all. Correct me if I’m wrong. It would be nice, though, if one could design an “evolutionary trajectory” between connected sequences, as another way to generate more diverse sequence spaces.

1.4. Combine these three sets of sequences and carry out an analysis that creates a 3D profile (matrix) reflecting the probability of having a particular amino acid (or actually a sequence of amino acids) at a particular position in the structure. I guess this is where we can use machine learning techniques (e.g. support vector machines, not that I know anything about that), since we can use this set of sequences as a training set. Once the model is trained, we can ask what kinds of mutations we can introduce into our target sequence such that it remains compatible with the protein fold.

Task 2. Identify key amino acid positions responsible for characterization of protein function and behavior.

2.1. We will make use of the PSSMs and HMMs that we calculated for each of the three sequence spaces. The structure- and sequence-based ones were derived from natural sequences, while the designed ones were… well, designed. All of them supposedly represent the same protein fold and are in essence probabilities of having a particular amino acid at a certain position in the linear sequence (no 3D structure is taken into account here). By comparing these profiles with each other (standard profile-profile alignment, as in FFAS [Rychlewski, 2000 #19]), two important pieces of information can be extracted. First, residues that are vitally important for protein structure are expected to be well conserved across all three spaces.
Second, it has been shown that residues conserved in native proteins because of functional constraints (for instance, surface positions that form a charged surface interacting with another protein) cannot be predicted by sequence prediction algorithms [Raha, 2000 #11], since no functional data is used in the modeling. Therefore, positions that are highly conserved in the sequence and structure profiles but differ in the designed one are likely to be functionally important.

Phase I proof-of-principle should be demonstrated for a family of well-known proteins for which sufficient data, including numerous amino acid sequences and associated function, salient sequence positions, and perhaps 3D structure, are accessible to the principal investigator and available in the open literature to benchmark the obtained results. Although the Phase I effort may not require the full integration of 3D structure information into the sequence information, the encoding scheme to the machine-learning algorithm should provide the mechanisms for demonstration of such capability in the Phase II effort. This can be accomplished by analyzing one of the sets of proteins from the Pfam DB. We can select a family with known structure and functions. This should not be a problem.

All of the above I can expand significantly (methods in detail) except for 1.4. So this is just a backbone; tell me if it’s making any sense. I will also write the Introduction and Possible uses of this technology besides the use in bioagent detection.
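
To make the comparison in 2.1 more concrete, here is a minimal Python sketch of how positions could be sorted into "structurally important" and "likely functional". It rests on several assumptions that are not part of the plan above: the three alignments are taken to be already column-matched (same length, same positions), per-column Shannon entropy stands in for a real profile-profile alignment, and the toy alignments, the 0.5-bit threshold, and the function names are all made up for illustration.

```python
import math
from collections import Counter

def column_entropy(alignment, col):
    """Shannon entropy (bits) of one alignment column, ignoring gaps."""
    residues = [seq[col] for seq in alignment if seq[col] != "-"]
    if not residues:
        return math.log2(20)  # all-gap column: treat as maximally variable
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def consensus(alignment, col):
    """Most frequent non-gap residue in a column, or None if the column is all gaps."""
    residues = [seq[col] for seq in alignment if seq[col] != "-"]
    return Counter(residues).most_common(1)[0][0] if residues else None

def classify_positions(seq_aln, struct_aln, design_aln, max_entropy=0.5):
    """Split columns into (structural, functional) index lists.

    structural: conserved to the same residue in all three spaces.
    functional: conserved in both natural spaces, but variable or
                conserved to a different residue in the designed space.
    max_entropy is an illustrative threshold, not a calibrated value.
    """
    structural, functional = [], []
    for col in range(len(seq_aln[0])):
        natural_conserved = (
            column_entropy(seq_aln, col) <= max_entropy
            and column_entropy(struct_aln, col) <= max_entropy
            and consensus(seq_aln, col) == consensus(struct_aln, col)
        )
        if not natural_conserved:
            continue
        design_agrees = (
            column_entropy(design_aln, col) <= max_entropy
            and consensus(design_aln, col) == consensus(seq_aln, col)
        )
        (structural if design_agrees else functional).append(col)
    return structural, functional

# Toy, hand-made alignments (hypothetical data, five columns each).
seq_aln = ["GKDLE", "GKDIE", "GKDLQ", "GKDVE"]
struct_aln = ["GKDLE", "GKDME", "GKDLE", "GKDLD"]
design_aln = ["GADLE", "GSDIK", "GTDLE", "GEDVR"]

structural, functional = classify_positions(seq_aln, struct_aln, design_aln)
print(structural)  # [0, 2]
print(functional)  # [1]
```

In the toy data, columns 0 and 2 are identical in all three spaces, while column 1 is an invariant K in the natural alignments but is freely substituted in the designed one, so it is the only position flagged as a functional candidate. With real PSSMs or HMMs the entropy would be computed from the profile probabilities directly, and the threshold would need to be chosen on the benchmark family.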