schleiff_supp

advertisement
Electronic Supplementary Material
Schleiff et al. (2003) Protein Sci. 12(4)
Experimental procedure
Analysis of the proteins using ChloroP and Target P
The genomic derived sequences of the A. thaliana genome as deposited in the file named
ATH1.pep
located
at
the
TIGR
FTP-Server
(ftp://FTP.tigr.org:21/pub/data/a_thaliana/ath1/SEQUENCES/ATH1.pep) were divided into blocks
of
50
proteins
and
interactively
transferred
to
the
ChloroP
program
(http://www.cbs.dtu.dk/services/ChloroP/). The same procedure was used to analyse the proteins by
the program TargetP (http://www.cbs.dtu.dk/services/TargetP/) with the only difference, that files
containing up to 1500 sequences were submitted.
Description of the used program units
The software presented in this article consists of five short command line programs written
in ANSI-C. All programs have a huge amount of source in common, which will enable us in the
near future to combine all routines in one program controlled by a set of parameters. After a
description of the source code used by all of these programs special features of each part of the
program are summarised. The programs or its later version, a source code description and the
instruction for the use of the program will be provided upon request.
Every program checks the first 24 characters of each line of a FASTA file designed as
ATH1.pep to dissect description-lines containing digits and protein sequences consisting only of
letters and *, which are subsequently stored in two separate arrays. The end of a protein is
determined by the identification of the description-line of the following sequence. Then, all nonamino acid characters were removed from the sequence. Calculated data for protein sequences with
annotated amino acids named “X“ are written to a separate file. The length of the cleaned sequence
is used to generate dynamic arrays intended for data storage. For every amino acid the barrel
probability values were used as presented in {(Wimley 2002), the pKa values and the molecular
mass were assigned and each stored in arrays in the order of the respective amino acid sequence.
Subsequently, exact strand score (EBS), hairpin score (HS) and barrel score (BBS) were
calculated and resulting data were saved in numbered files containing up to 50 sequences according
to their BBS. Sequences too short for hairpin analysis were stored in a separate file. After writing
the data to files the description-line of the following protein still remaining in the input buffer is
now transferred to its previously cleared target array and the process starts for the next sequence. At
the end of the ATH1.pep file all files are closed. Some of the programs use an additional file
containing a certain subset of proteins from the source file ATH1.pep. Calculations will only be
performed in the described way if the sequence identified by TIGR-ID, BAC-Locus and Chr-Locus
is also present in the additional file.
In the following, the special features of the different programs are explained. Porgram 1
determines the EBS, the amount of predicted strands, the HS, the BBS, the sequence length, the
isoelectric point and the molecular mass for every protein sequence and writes the results to an
output file. Program 2 creates two lists sorting the proteins either according to the BBS or based on
the amount of predicted sheets, which also contain the TIGR-ID, BAC-Locus and Chr-Locus.
Program 3 screens for continuous blocks with anEBS greater than two, determines the amount of
those peaks and their number of the sheet scores and finally forms the quotient between these
values. The output file also contains in addition to the description-line sequence, the EBS, HS and
BBS. In program 4, the isoelectric point of the protein sequence will be calculated. If the isoelectric
point is greater or equal to pH 7 the data will be written to an output file according to their BBS.
Subsequently, program 5 creates files with up to 1500 proteins consisting of the TIGR-ID, BACLocus, Chr-Locus and the respective protein sequences by selecting all proteins with an BBS above
0.7. These sequences are sent to the TargetP-Server for identification of signal peptides for other
cell compartments than chloroplasts.
2
Supplementary Figures
Fig. A: Peptide sequences of the protein in Ref. No. 23 (compare Fig. 3 and Table 1)
Peptide sequences and their nominal masses and charges obtained by massspectrometric analysis of
Ref. No. 23 are shown. * no amino acid could be clearly assigned.
Fig. B: Selection of proteins by the barrel score.
The barrel score (BBS) of the genomic derived sequences were calculated using the exact
strand score (E1 and E2) and sorted to classes between integer values. The total number (as
indicated) of the sequences of one class is given in A. The percent amount of proteins with a value
higher then a given BBS is given in B.
Fig. C: Distribution of sequences sorted by their isoelectric point
All sequences before (grey bars) and after (white bars) manual selection were sorted according to
the calculated pI.
Fig. D: Distribution of sequences sorted by their amino acid number
All sequences before (grey bars) and after (white bars) manual selection were sorted according to
the number of amino acids. Before manual selection a bi-Gaussian distribution was observed
(dashed line). After manual selection a mono-Gaussian distribution was observed (solid line).
Fig. E. The alignment between Toc75-V and the close homologue AL133315.
Black underlined letters indicate identical and grey underlined letters homologue amino acids. The
grey boxes indicate the transmembrane segments 7 to 18 identified in Toc75-V as shown in Figure
5.
3
Table A. Results transmembrane domain prediction
Several on the internet available programs were used to predict the transmembrane helices of the
non-barrel proteins as described. Each region predicted by at least three programs to form a
membrane helix is listed. The probability is given as certain (a value above the threshhold used by
the prediction program) or putative (a value close to the threshhold used by the prediction program).
GENE NUMBER
TM REGION (AA)
DAS
HMMTOP
TMPRED
TOPPRED
102 - 118
putative
certain
putative
certain
125 - 142
putative.
certain
certain
putative
24 - 38
certain
not
certain
certain
418 - 434
certain
certain
certain
putative
449 - 446
putative
certain
certain
putative
497 - 514
certain
certain
certain
certain
529 - 546
putative
certain
certain
certain
553 - 570
certain
certain
certain
certain
637 - 655
putative
certain
putative
not
4 - 23
certain
certain
certain
certain
216 - 235
putative
certain
putative
putative
343 - 362
putative
certain
certain
putative
451 - 470
certain
certain
certain
certain
18 - 31
certain
not
certain
certain
414 – 433
putative
not
certain
certain
575 – 593
certain
not
certain
putative
At5g05000
263 - 286
certain
certain
certain
putative
At1g02280
267 - 286
certain
not
putative
putative
At2g28900
At2g01320
At3g06510
At5g09420
4
Download