Supplementary Text A H3: The DNA sequence data allowed clarification of some of the Leu/Ile residue assignments in the peptide sequences (Peptides 2,8, 10,11); these two residues have the same mass, and thus cannot be resolved by mass spectrometry (Table 1). The DNA data also allowed the occasional correction of the mass spectrometry data. For peptide 8, residue 4 is a Lys, rather than a Gln; these residues have nearly identical masses and can difficult to distinguish in mass spectrometry. For peptide 11, the N-terminal residues VQD, which were the least confident assignments from mass spectrometry, were LDN in the cDNA-derived sequence. An alternative explanation for such discrepancies would be the presence of multiple isoforms for the protein. H8: The cDNA-predicted amino acid sequence contained both of the H8 peptides identified by mass spectrometry, and allowed correct assignment of the Leu/Ile residues as Leu. It also allowed the revision of the sequence for Peptide 1, where an extra Thr is present at residue 3. An alternative explanation for such discrepancies would be the presence of multiple isoforms for the protein. H9: The predicted N-terminal sequence from the cDNA clone differs slightly from Peptide 1 (Table 1) and the Edman data, with the cloned sequence suggesting that residue 5 may be a Ser and that residue 8 may be a Cys. Peptide 3 could also be identified in the cDNA-derived sequence, which allowed its Leu and Ile residues to be resolved from one another. 1 Supplementary Text B. Ambiguities in peptide sequencing by mass spectrometry Phenylalanine (F) may be a methionine (M) adduct, methionine sulphoxide, which has almost the same mass. Likewise, glutamine (Q) may be lysine (K), as both residues have very similar masses. Trypsin cleaves C-terminal to arginine (R) and lysine (K) residues, but either can be present internally in a tryptic peptide if digestion is incomplete, or definitely if one of these residues is followed immediately by a proline residue. 2 Supplementary Table S1. Amino acid composition of cDNA-predicted polypeptide segments compared with actual composition of H. dofleinii bands. Band % lengtha Asx Ser Glx Gly His Arg Thr Ala Pro Tyr Val Met Lys Ile Leu Phe H3 H8 H9 cDNAb 40% AAAc cDNAb 20% AAAc cDNAb,d 100% AAAc 10.5 7.3 7.7 8.9 3.2 3.2 5.7 10.9 3.6 4.5 6.5 2.8 6.9 2.8 8.9 3.6 8.7 11.6 13.3 16.8 1.0 5.2 5.2 7.0 4.9 2.5 5.5 0.4 4.9 3.8 7.0 2.4 12.2 3.0 16.6 7.6 1.5 3.0 3.0 4.5 3.0 3.0 6.1 1.5 4.5 9.1 13.6 4.5 9.7 10.2 15.0 13.6 0.7 5.5 4.9 6.8 6.3 3.0 5.5 0.4 5.0 3.5 7.5 2.5 17.9 8.3 6.2 6.2 2.8 4.8 1.4 9.0 4.8 6.2 7.6 0 5.5 4.1 4.8 4.1 15.0 8.0 11.6 10.5 1.3 4.4 4.6 8.0 3.1 2.4 5.7 5.0 7.0 4.8 5.8 2.6 Correlation with composition of H. dofleinii band (R)e and statistical significance (P)f R (cDNA) 0.82 0.62 0.65 P (cDNA) < 0.0001 0.011 0.006 R (SP ave) 0.89 0.88 0.89 P (SP ave) < 0.0001 < 0.0001 < 0.0001 a % length is an estimate of the completeness of the clone, i.e. the number of amino acid residues predicted from the cDNA segment as a percentage of the total number of residues in the best-matching S. purpuratus protein (H3, H8) or in the complete H. dofleinii ORF (H9). b Amino acid compositions of cDNA-predicted polypeptides were calculated using http://web.expasy.org/protparam/ c Amino acid analysis data of Peng et al. (2011). d Excluding signal peptide. e Spearman correlation coefficients (R-values) for linear regression of amino acid composition of each H. dofleinii band against that of the cognate cDNA-derived polypeptide segment (“cDNA”) or against the average composition of the SwissProt database (“SP ave”; Tompa 2002), calculated using GraphPad InStat v3.01. f Significance (two-tailed non-parametric P-values) of the correlation of amino acid composition of each H. dofleinii band with that of the cognate cDNA-derived polypeptide segment (“cDNA”) or with the average composition of the SwissProt database (“SP ave”; Tompa 2002), calculated using GraphPad Instat v3.01. 3 Supp. Table S2. Amino acid composition of best-match S. purpuratus proteins compared with actual composition of H. dofleinii bands. Band a Type Species Asx Ser Glx Gly His Arg Thr Ala Pro Tyr Val Met Lys Ile Leu Phe H2 H3 H6 H7 H8 H9 Peptide/nitrate transporter Spb Hd Transketolase FBP aldolase Transaldolase-like Exportin-6-like C-type lectin 6.9 8.3 6.5 6.6 1.3 3.4 10.8 9.0 4.4 4.2 7.7 3.1 7.6 9.7 5.1 9.7 9.8 16.2 16.1 1.1 3.6 3.9 7.8 3.2 1.8 4.6 n.d. 8.8 4.1 7.0 2.4 b b b Sp Hd Sp Hd Sp Hd Sp Hd Spb Hd 10.1 6.3 6.9 8.2 2.1 4.3 5.5 13.0 4.5 2.7 8.7 1.6 5.8 5.9 7.6 3.7 8.7 11.6 13.3 16.8 1.0 5.2 5.2 7.0 4.9 2.5 5.5 0.4 4.9 3.8 7.0 2.4 9.9 4.4 11.3 8.5 1.4 3.8 6.8 12.3 4.7 3.3 7.4 0.8 6.0 4.4 9.6 2.7 12.1 7.0 13.2 7.8 1.3 4.9 5.1 9.2 3.9 2.4 6.4 3.2 8.3 5.6 7.3 2.5 10.3 6.7 12.1 4.8 3.3 4.5 10.6 2.4 2.4 7.3 3.0 10.0 6.4 10.3 2.7 13.1 6.5 12.7 8.1 n.d. 4.1 5.0 9.4 3.2 1.9 6.4 3.8 9.4 5.3 8.3 3.0 8.5 8.8 11.8 3.8 2.7 3.8 4.9 4.4 4.4 2.5 5.8 4.4 4.7 8.2 13.4 6.8 9.7 10.2 15.0 13.6 0.7 5.5 4.9 6.8 6.3 3.0 5.5 0.4 5.0 3.5 7.5 2.5 11.3 13.8 8.1 7.5 4.4 1.3 6.9 3.1 3.8 5.0 5.7 1.9 3.8 3.8 5.7 3.8 15.0 8.0 11.6 10.5 1.3 4.4 4.6 8.0 3.1 2.4 5.7 5.0 7.0 4.8 5.8 2.6 Correlation with composition of H. dofleinii band (R) and statistical significance (P) 0.39 0.83 0.89 0.94 R (Echin) 0.15 < 0.0001 < 0.0001 < 0.0001 P (Echin) 0.90 0.89 0.94 0.92 R (SwPrt ave) < 0.0001 < 0.0001 < 0.0001 < 0.0001 P (SwPrt ave) a b 0.45 0.083 0.88 < 0.0001 0.50 0.047 0.89 < 0.0001 H4 is omitted because no amino acid analysis data is available for this band. Amino acid compositions of S. purpuratus proteins (“Sp”) were calculated from translated ORF sequences using http://web.expasy.org/protparam/, amino acid analysis data for H. dofleinii proteins (Hd) are from Peng et al. (2011). 4 Supp. Fig. S1. SignalP v4.1 predicition for cDNA-derived protein corresponding to H9. The server is at http://www.cbs.dtu.dk/services/SignalP/. Parameters are calculated as follows: C-score (raw cleavage site score) The output from the CS networks, which are trained to distinguish signal peptide cleavage sites from everything else. Note the position numbering of the cleavage site: the C-score is trained to be high at the position immediately after the cleavage site (the first residue in the mature protein). S-score (signal peptide score) The output from the SP networks, which are trained to distinguish positions within signal peptides from positions in the mature part of the proteins and from proteins without signal peptides. Y-score (combined cleavage site score) A combination (geometric average) of the C-score and the slope of the S-score, resulting in a better cleavage site prediction than the raw C-score alone. This is due to the fact that multiple high-peaking Cscores can be found in one sequence, where only one is the true cleavage site. The Y-score distinguishes between C-score peaks by choosing the one where the slope of the S-score is steep. The graphical output from SignalP (above) shows the three different scores, C, S and Y, for each position in the sequence. Parameter summary: Measure Position Value max. C 21 0.548 max. Y 21 0.699 max. S 17 0.935 mean S 1-20 0.891 D 1-20 0.803 Cutoff 0.450 Signal peptide? YES 5 Cleavage site between pos. 20 and 21: NRA-ED D=0.803 D-cutoff=0.450 Networks=SignalP-noTM In the parameter summary, the maximal values of the three scores are reported. In addition, the following two scores are shown: mean S The average S-score of the possible signal peptide (from position 1 to the position immediately before the maximal Y-score). D-score (discrimination score) A weighted average of the mean S and the max. Y scores. This is the score that is used to discriminate signal peptides from non-signal peptides. For non-secretory proteins all the scores represented in the SignalP output should ideally be very low (close to the negative target value of 0.1). 6 SpEchL2 TfCL Hd band 9 Consensus SpEchL2 TfCL Hd band 9 Consensus SpEchL2 TfCL Hd band 9 Consensus SpEchL2 TfCL Hd band 9 Consensus ----MFNKIVTILVIASSAVMLPVLPGCQAGGCGCPPLWTAFQNNCYRYFSVKNITWHGA 56 MLLFLFLFGLALGAVAPSGVDQEHIEALLQ--VNCPLLWISFNNHCYKYVSNR-MTWVDA 57 -----MASKFAVLLLLYTMVVVNRAEDKVS--FGCPKDWYAYNDNCYHYNAERRFTSTSG 53 : .:: : : * .** * :::::**:* : : :* .. CYkff ktW dA EMHCSGFSVPCSDVDSTISLGHLTSIHSKEEMTFLSVLYESIRSKVVTSTTYVWIGLHDQ 116 ELHCVSQDA------------NLVSIHSLEEHNFVKALIK----KSDITEERTWIGLSDI 101 ASYCRDHGA------------KLLYISSIEEFKFAGAIASS--RDADVLIPSLHIGLNNN 99 :* . .. :* * * ** .* .: . . *** : e fC LVSI S eE Fl WIGL TTEDSWEWSDGSSL----DYEIWEPGQPSSHNGNQDCVMFFSSNKYKWNDLACDSDGDTA 172 HKEGTWMWSDGSKV----DFVTWNQGQPDNHLANENCVHTNYYIAKKWNDALCSELYAFV 157 AREGDFVWSDGKKLSQAPNVVVWEPNQPNDLGHNQNCVVYRINN-YNVNDAPCSYVAGVF 158 *. : ****..: : *: .**.. *::** : ** *. w WsDGs y nW gePnn e CV l g Wnd C fI YHGSAYVCKLPQW 185 CQSRTVFS----- 165 CKKPRAN------ 165 : Ck Supp. Fig. S2. Multiple sequence alignment of C-lectins with the polypeptide from H. dofleinii band 9 cDNA. TfCL is a C-type lectin from the fish Trachidermus fasciatus (AFW17073.1); SpEchL2 is echinoidin-like isoform 2 from the holothurian S. purpuratus (XP_003726044.1); Consensus shows conserved motifs taken from the sequence logo for Ctype lectin class 2 (http://prosite.expasy.org/cgi-bin/prosite/sequence_logo.cgi?ac=PS50041), to which these sequences conform better than to class 1; capitals indicate more strongly conserved residues. The alignment was performed using Clustal-W2 at http://www.ebi.ac.uk; * identical : highly conserved . conserved. The invariant cysteines are highlighted in cyan (the pair that form one disulphide bond) and magenta (the pair that form the other). Predicted ligand-binding residues for the two known proteins (taken from their NCBI RefSeq entries) are shown in red. 7 Yeast TK Hd band 3 Yeast TK Hd band 3 Yeast TK Hd band 3 Yeast TK Hd band 3 Yeast TK Hd band 3 Yeast TK Hd band 3 1 MTQFTDIDKLAVSTIRILAVDTVSKANSGHPGAPLGMAPAAHVL-WSQMR :...:.:|..::::.:.|.||||...:..|....|| :..|: 1 --------EDVANKLREDSIESTTAAGSGHPTTCMSAAEVMSVLFFHTMK 49 42 50 MNPTNPDWINRDRFVLSNGHAVALLYSMLHLTGYDLSIEDLKQFRQLGSR .....|.....|||::|.|||..:||:.....|. ..:|:||..|::.|. 43 YKVDVPKDPANDRFIMSKGHAAPILYAAWAEAGL-FPVENLKNLRKIDSD 99 100 TPGHPEFELPGVEVTTGPLGQGISNAVGMAMAQANLAATYNKPGFTLSDN ..|||...|..|:|.||.||||:|...||| |.......:|. 92 LEGHPTPRLSFVDVATGSLGQGLSVGAGMA---------YTGKYLDKADY 149 150 YTYVFLGDGCLQEGISSEASSLAGHLKLGNLIAIYDDNKITIDGATSISF .||..||||...||...||.:.|.:.||.||:||:|.|::.....||:.. 133 RTYCLLGDGESAEGSVWEAMAFASYYKLDNLVAIFDVNRLGQSQPTSLQH 199 91 132 182 200 D-EDVAKRYEAYGWEVLYVENGN--EDLAGIAKAIAQAKLSKDKPTLIKM | |....|.||:|:.. ||.:|: ||| |||:..|...||||:.|.. 183 DMETYRLRCEAFGFNT-YVVDGHSVEDL---AKALHDASTVKDKPSCILA 246 247 TTTIGYGSLHAGSHSV---HGAPLKADDVKQLKSKFGFNPDKSFVVPQEV… .|..|.|: .|...: || 229 KTYKGKGA--KGIEDLEGWHG----------------------------- 293 228 247 Supp. Fig. 3. Sequence alignment of yeast transketolase (partial) with polypeptide from H. dofleinii band 3 cDNA. “Yeast TK” is the N-terminal portion of Saccharomyces cerevisiae transketolase (AAA35168.1) with catalytically essential residues highlighted in red and residues involved in cofactor or metal binding highlighted in cyan. These are as identified in http://www.uniprot.org/uniprot/P23254, Singleton et al. [1996, Biochemistry 35, 15865-15869], Wikner et al. [1997 Eur. J. Biochem. 233, 750-755] and Gerhardt et al. [2003, Plant Physiology 132, 1941–1949]. Sequence alignment is by EMBOSS Needle at http://www.ebi.ac.uk; key: “|” identical, “:” highly conserved; “.” conserved. 8