Electronic Supplementary Material Schleiff et al. (2003) Protein Sci. 12(4) Experimental procedure Analysis of the proteins using ChloroP and Target P The genomic derived sequences of the A. thaliana genome as deposited in the file named ATH1.pep located at the TIGR FTP-Server (ftp://FTP.tigr.org:21/pub/data/a_thaliana/ath1/SEQUENCES/ATH1.pep) were divided into blocks of 50 proteins and interactively transferred to the ChloroP program (http://www.cbs.dtu.dk/services/ChloroP/). The same procedure was used to analyse the proteins by the program TargetP (http://www.cbs.dtu.dk/services/TargetP/) with the only difference, that files containing up to 1500 sequences were submitted. Description of the used program units The software presented in this article consists of five short command line programs written in ANSI-C. All programs have a huge amount of source in common, which will enable us in the near future to combine all routines in one program controlled by a set of parameters. After a description of the source code used by all of these programs special features of each part of the program are summarised. The programs or its later version, a source code description and the instruction for the use of the program will be provided upon request. Every program checks the first 24 characters of each line of a FASTA file designed as ATH1.pep to dissect description-lines containing digits and protein sequences consisting only of letters and *, which are subsequently stored in two separate arrays. The end of a protein is determined by the identification of the description-line of the following sequence. Then, all nonamino acid characters were removed from the sequence. Calculated data for protein sequences with annotated amino acids named “X“ are written to a separate file. The length of the cleaned sequence is used to generate dynamic arrays intended for data storage. For every amino acid the barrel probability values were used as presented in {(Wimley 2002), the pKa values and the molecular mass were assigned and each stored in arrays in the order of the respective amino acid sequence. Subsequently, exact strand score (EBS), hairpin score (HS) and barrel score (BBS) were calculated and resulting data were saved in numbered files containing up to 50 sequences according to their BBS. Sequences too short for hairpin analysis were stored in a separate file. After writing the data to files the description-line of the following protein still remaining in the input buffer is now transferred to its previously cleared target array and the process starts for the next sequence. At the end of the ATH1.pep file all files are closed. Some of the programs use an additional file containing a certain subset of proteins from the source file ATH1.pep. Calculations will only be performed in the described way if the sequence identified by TIGR-ID, BAC-Locus and Chr-Locus is also present in the additional file. In the following, the special features of the different programs are explained. Porgram 1 determines the EBS, the amount of predicted strands, the HS, the BBS, the sequence length, the isoelectric point and the molecular mass for every protein sequence and writes the results to an output file. Program 2 creates two lists sorting the proteins either according to the BBS or based on the amount of predicted sheets, which also contain the TIGR-ID, BAC-Locus and Chr-Locus. Program 3 screens for continuous blocks with anEBS greater than two, determines the amount of those peaks and their number of the sheet scores and finally forms the quotient between these values. The output file also contains in addition to the description-line sequence, the EBS, HS and BBS. In program 4, the isoelectric point of the protein sequence will be calculated. If the isoelectric point is greater or equal to pH 7 the data will be written to an output file according to their BBS. Subsequently, program 5 creates files with up to 1500 proteins consisting of the TIGR-ID, BACLocus, Chr-Locus and the respective protein sequences by selecting all proteins with an BBS above 0.7. These sequences are sent to the TargetP-Server for identification of signal peptides for other cell compartments than chloroplasts. 2 Supplementary Figures Fig. A: Peptide sequences of the protein in Ref. No. 23 (compare Fig. 3 and Table 1) Peptide sequences and their nominal masses and charges obtained by massspectrometric analysis of Ref. No. 23 are shown. * no amino acid could be clearly assigned. Fig. B: Selection of proteins by the barrel score. The barrel score (BBS) of the genomic derived sequences were calculated using the exact strand score (E1 and E2) and sorted to classes between integer values. The total number (as indicated) of the sequences of one class is given in A. The percent amount of proteins with a value higher then a given BBS is given in B. Fig. C: Distribution of sequences sorted by their isoelectric point All sequences before (grey bars) and after (white bars) manual selection were sorted according to the calculated pI. Fig. D: Distribution of sequences sorted by their amino acid number All sequences before (grey bars) and after (white bars) manual selection were sorted according to the number of amino acids. Before manual selection a bi-Gaussian distribution was observed (dashed line). After manual selection a mono-Gaussian distribution was observed (solid line). Fig. E. The alignment between Toc75-V and the close homologue AL133315. Black underlined letters indicate identical and grey underlined letters homologue amino acids. The grey boxes indicate the transmembrane segments 7 to 18 identified in Toc75-V as shown in Figure 5. 3 Table A. Results transmembrane domain prediction Several on the internet available programs were used to predict the transmembrane helices of the non-barrel proteins as described. Each region predicted by at least three programs to form a membrane helix is listed. The probability is given as certain (a value above the threshhold used by the prediction program) or putative (a value close to the threshhold used by the prediction program). GENE NUMBER TM REGION (AA) DAS HMMTOP TMPRED TOPPRED 102 - 118 putative certain putative certain 125 - 142 putative. certain certain putative 24 - 38 certain not certain certain 418 - 434 certain certain certain putative 449 - 446 putative certain certain putative 497 - 514 certain certain certain certain 529 - 546 putative certain certain certain 553 - 570 certain certain certain certain 637 - 655 putative certain putative not 4 - 23 certain certain certain certain 216 - 235 putative certain putative putative 343 - 362 putative certain certain putative 451 - 470 certain certain certain certain 18 - 31 certain not certain certain 414 – 433 putative not certain certain 575 – 593 certain not certain putative At5g05000 263 - 286 certain certain certain putative At1g02280 267 - 286 certain not putative putative At2g28900 At2g01320 At3g06510 At5g09420 4