Solution to Prediction Problem

Gina Cannarozzi

Here is my prediction of the secondary structure for the family of pathogenesis-related proteins, TPX proteins and venom allergens. The following rules can be found in the tutorial but are repeated here.

Parse rules (step 2):
  2i    gap
  2ii   apc proline
  2iii  distributed parse
  2iv   apc glycine
  2v    combination parse
  2vi   distributed combination parse
  2vii  string parse

Interior algorithms (step 4):
  4i    interior algorithm 1
  4ii   interior algorithm 2
  4iii  interior algorithm 3
  4iv   interior algorithm 4
  4v    interior algorithm 5

Surface algorithms (step 5):
  5i    surface algorithm 1
  5ii   surface algorithm 2
  5iii  surface algorithm 3

To start, look at the section between 24 and 40. The gaps at 23 (rule 2i) and 41 (rule 2i) parse this section. I see no parses such as conserved or distributed prolines. The presence of the residues DNG in one sequence at positions 27-29 is a secondary parse by rule 2vii. We can choose whether or not to use it.

I will use a five-state prediction of surface and interior. The five states are: strong interior (I), weak interior (i), don’t know (.), weak surface (s), strong surface (S).

Predict the segment from 24-40

My predictions of surface and interior for residues 24-40 are:

  No.  Pos.  State  Basis                       Rule
   1   24    I      all hydrophobic             4ii
   2   25    .      don’t know
   3   26    I      conserved hydrophobic       4i
   4   27    s                                  5ii
   5   28    S      variable with hydrophilic   5i
   6   29    S      variable with hydrophilic   5i
   7   30    I      all hydrophobic             4ii
   8   31    I      hydrophobic with CHQST
   9   32    .      don’t know
  10   33    .      don’t know
  11   34    I      conserved hydrophobic       4iii
  12   35    I                                  4iii
  13   36    s      varying with hydrophilic    5ii
  14   37    I      hydrophobic split           4iii
  15   38    I      hydrophobic                 4iv
  16   39    s      varying hydrophilic         5ii
  17   40    S      varying hydrophilic         5i

Now I can put this string (I.IsSSII..IIsIIsS) on a helical wheel and see if it forms a helix. I use:

http://www.site.uottawa.ca/~turcotte/resources/HelixWheel/

In the pull-down menu, you can choose to color surface and interior residues red and blue, respectively.
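If the web tool is not at hand, the same check can be roughed out in a few lines of code. The sketch below is not part of the tutorial: it places each position of the string on a wheel at 100 degrees per residue (the usual alpha-helix spacing) and sums the positions as vectors, using numeric weights for the five states that are my own assumption. The function names wheel_angles and moment are hypothetical.

```python
import math

# Assumed numeric weights for the five states (an illustration, not from the tutorial).
WEIGHTS = {"I": 1.0, "i": 0.5, ".": 0.0, "s": -0.5, "S": -1.0}


def wheel_angles(pattern, step_degrees=100.0):
    """Angle of each position on the wheel; 100 degrees per residue models an alpha helix."""
    return [(index * step_degrees) % 360.0 for index in range(len(pattern))]


def moment(pattern, step_degrees=100.0):
    """Length of the weighted vector sum over all positions, divided by the pattern length."""
    x = y = 0.0
    for symbol, angle in zip(pattern, wheel_angles(pattern, step_degrees)):
        weight = WEIGHTS[symbol]
        x += weight * math.cos(math.radians(angle))
        y += weight * math.sin(math.radians(angle))
    return math.hypot(x, y) / len(pattern)


pattern = "I.IsSSII..IIsIIsS"  # alignment positions 24-40

# List each position with its state and its angle on the wheel.
for position, (symbol, angle) in enumerate(zip(pattern, wheel_angles(pattern)), start=24):
    print(position, symbol, f"{angle:.0f}")

# Compare helix spacing (100 degrees) with strict alternation (180 degrees).
print("moment at 100 degrees:", round(moment(pattern, 100.0), 2))
print("moment at 180 degrees:", round(moment(pattern, 180.0), 2))
```

A clearly larger value at 100 degrees than at 180 degrees means the interior positions gather on one face of the wheel, which is what we look for by eye when calling a helix. The weights are arbitrary, so treat the numbers as a rough guide only.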
[Figure: helical wheel for alignment positions 24-40; image not available]

This is clearly a helix. As it turns out from the structure, alignment numbers 25 and 26 are a short beta strand. We might have found this if we had used the secondary parse.

The next segment to predict is alignment numbers 45-52. The parse at 53 is a gap.

  Pos.  State  Basis                                               Rule
  45    .      don’t know; conserved C could be a disulfide bond
  46    S      varying in two groups with hydrophilics             5i
  47    I      all hydrophobic                                     4ii
  48    S      varying with hydrophilics
  49    I                                                          4iii
  50    .      don’t know                                          5iii
  51    s                                                          5i
  52    s                                                          5i

Use the string .SISI.ss

[Figure: plot for the string .SISI.ss; image not available]

One SISI pattern is enough to call it a strand.

Predict the segment from 56-61.

Gaps provide the parses here. The conserved G at position 55 is a parse by rule 2iv.

  Pos.  State  Rule
  56    S      5iii
  57    S      5iii
  58    I      4iv
  59    I      4v
  60    s
  61    i

This is too short to be a helix. I predict a strand because there is one I-S-I pattern, it has a lot of interior residues, and the strength of the predictions is high compared to the last segment.

The section between 62 and 70 is a parse because of the gaps.

Predict the segment from 71-88.

Note that residues 86-88 have more Ps, Gs and DSN. These are secondary parses, and there are a lot of them, so I will only predict from 71-85. My SI predictions are:

  No.  Pos.  State  Basis                      Rule
   1   71    S      varying with hydrophilic   5iii
   2   72    s                                 5i
   3   73    I      hydrophobic                4v
   4   74    I      hydrophobic                4iv
   5   75    s      hydrophilic varying        5i
   6   76    s                                 5i
   7   77    . I                               4i
   8   78    .
   9   79    s                                 5i
  10   80    S                                 5iii
  11   81    S                                 5i
  12   82    S                                 5i
  13   83    s                                 5i
  14   84    I      conserved hydrophobic
  15   85    .
  16   86    .

The string is ssiissi.sssssi..

[Figure: helical wheel for the string ssiissi.sssssi..; image not available]

This segment is also a helix. The slight overlap at the 4th and 11th members of the segment is normal. The helices may be slightly rotated from one family member to the next.

88-97 is full of gaps and will be considered a parse. Two positions don’t have gaps, but we can’t predict anything here other than a parse.

Predict the segment from 97 to 117.
This is a long segment. Are there any possible parses to break it up? The almost conserved G at 111 is a possibility. It is strengthened by the presence of more Gs at 113. Keep this in mind. Another possibility is the NS at position 106.

The alignment positions 98 (H), 99 (Y), 100 (T), 101 (Q), 103 (V), and 104 (W) are all completely conserved. If you compare this conservation with the rest of the alignment, you can see that the conservation here is much higher. This indicates that this might be the active site of the protein, meaning that these residues participate in the chemical reaction that is catalyzed by this protein. Some amino acids are found more frequently in active sites because they have side chains that can participate in chemical reactions. These are: CDEHKNQRSTY. So 98-101 is probably an active site.

Around active sites it is difficult to predict the secondary structure because the conservation there is due to the active site and not to the features that tell us which residues are surface and which are interior. So I will predict from 102 to 117.

  Pos.  State  Basis / rule
  102   I      hydrophobic
  103   I      hydrophobic
  104   I      hydrophobic conserved
  105   S      hydrophilic 5ii
  106   s      5iii
  107   .
  108   .
  109   S      hydrophilic variable 5i
  110   I      hydrophobic 4ii
  111   I      2iv (possible parse)
  112   .      don’t know; impossible to tell; could be C in disulfide bond
  113   I      (variable in n subgroups with no DEKRENDCHQST; see slides)
  114   .
  115   I
  116   S
  117   .      don’t know

The string is IIIss..Sii.I.IS.

Looking at the possibilities with both secondary parses, I have the following:

  With the NS parse:    IIIS  and  S..Sii.I.IS.
  With the G.G parse:   IIIss..Sii  and  .i.IS.

On a helical wheel with the G.G parse I get

[Figures: helical wheel plots for the G.G parse; images not available]

which could be a helix but is not a strong signal.
The i.iS. at alignment positions 113-116 could be a beta strand showing an alternating pattern of surface and interior residues. From the structure, alignment positions 98-103 are a helix and 109-117 are a strand. The active site residues have made it difficult to predict the helix in the right place.

Using the NS parse, I have IIIS, which I would predict to be a buried strand based on length and strength of interior and surface characteristics. The other segment looks like this on a helical wheel:

[Figure: helical wheel for the other segment; image not available]

which could also be a helix. Areas around active sites are hard to predict because the reasons why certain amino acids are accepted at these positions differ from those we use to predict.

Predict the segment from 123 to 129.

I get III.SI., which is too short to be a helix and very structured, so I predict a strand here.

Solution:

The solution to this problem can be found in the literature. The structure of protein a in the alignment is published in the Journal of Molecular Biology (1997) 266, 576-593. Protein a is the pathogenesis-related protein P14a. The NMR structure can be found in the Protein Data Bank with entry number 1CFE. Here is the sequence.

>1CFE:_|PDBID|CHAIN|SEQUENCE
QNSPQDYLAVHNDARAQVGVGPM
SWDANLASRAQNYANSRAGDCNLIHSGAGENLAKGGGDFTGRAAVQLWVSE
RPSYNY
ATNQCVGGKKCRHYTQVVWRNSVRLGCGRARCNNGWWFISCNYDPVGNWIG
QRPY

The structure can be found in Figure 5a of this paper. Notice that the numbering in the paper is the numbering in the sequence of the protein discussed in the paper, not that of our alignment. Since adding gaps changes the sequence numbering, you should add to your alignment the sequence numbers of this protein. The first M in sequence a in our alignment is the 23rd amino acid in the sequence of that protein. So the numbers in the paper can be found in our alignment by counting each amino acid (not gap) from 23 in sequence one. For example, M is 23, S is 24, W is 25, D is 26, and so on.
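The counting can also be written down as a short routine. This is a minimal sketch and not something from the paper or the tutorial: the function name map_alignment_to_sequence is hypothetical, the gap characters are assumed to be "-" or ".", and the toy row is made up for illustration (it is not a row of our alignment).

```python
def map_alignment_to_sequence(aligned_row, first_residue_number,
                              first_column=1, gap_chars="-."):
    """Return {alignment column: sequence number} for the non-gap columns of one row."""
    mapping = {}
    seq_number = first_residue_number
    for offset, symbol in enumerate(aligned_row):
        if symbol in gap_chars:
            continue  # a gap takes up an alignment column but no sequence position
        mapping[first_column + offset] = seq_number
        seq_number += 1
    return mapping


# Toy row (hypothetical): two gap columns inserted after "MSWD".
# I assume the row starts at alignment column 24 and that M is residue 23.
toy_row = "MSWD--AN"
mapping = map_alignment_to_sequence(toy_row, first_residue_number=23, first_column=24)

for column in sorted(mapping):
    print(column, toy_row[column - 24], mapping[column])
```

Every gap column widens the difference between alignment number and sequence number by one, which is why the offset is not the same along the whole alignment.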
The A in front of the first gap is 51 in the paper, and the G after the first gap at alignment position 55 is sequence position 52 in the paper.

This paper also tries to identify active site residues by looking for “highly conserved solvent-accessible residues without an obvious role in the architecture of the protein.” They identify His48, Ser49 and His93 as being completely conserved (in our alignment Ser49 is not completely conserved) and in close proximity in three-dimensional space, making them potential active site residues.

It is also interesting to notice that if you look at Figure 5a and Figure 6a in the paper, you can see that strand D is buried in the middle of the structure. The prediction for strand D, which runs from sequence position 117-124 (alignment position 123-130), is largely interior, which tells you that this strand is buried. Strand B (sequence number 53-58, alignment number 56-61) is also buried. The strand that I missed from sequence number 104-111 (alignment number