Online Resource 1. Phylogenetic analysis of soybean bZIP proteins

MATERIALS AND METHODS
To provide an evolutionary framework for the discussion of protein functions, phylogenetic hypotheses were inferred by Bayesian inference (BI) and maximum likelihood (ML) using BEAST v1.7.2 (Huelsenbeck and Ronquist, 2001) and GARLI 2.0 (Zwickl, 2006), respectively.

First, a search for bZIP protein sequences in Glycine max was performed in GenBank (http://blast.ncbi.nlm.nih.gov), and 148 sequences were selected (Online Resource 2). Additionally, we selected 10 bZIP protein sequences related to pathogen response (ACT66299_TabZIP; AAX20030_CAbZIP1; CAA71687_GHBF1; AAL27150_BZI1; O22763_AtbZIP10; AAN61914_PPI1; BAG24402_SBZ1; NP_001234596_SlAREB1; P43273_TGA2; Q39234_TGA3). The selected sequences were aligned using MAFFT version 7 (Katoh et al., 2002), and JTT+G (Jones et al., 1992) was identified by ProtTest 3.2.1 (Darriba et al., 2011) as the best-fit amino acid substitution model according to the AIC and BIC criteria.

The BI phylogenetic trees were calculated using the Bayesian Markov chain Monte Carlo (MCMC) method with 5 × 10^7 generations and a sampling frequency of one tree every 10^4 generations, using the JTT+G substitution model. The convergence of the parameters was analyzed in TRACER v1.5.0 (http://beast.bio.ed.ac.uk/tracer), and the chain reached a stationary distribution after 5 × 10^5 generations. A total of 1% of the generated trees was discarded as burn-in to produce the consensus Bayesian phylogenetic tree.

The JTT+G substitution model was also selected in the GARLI settings (datatype = aminoacid; ratematrix = jones; statefrequencies = jones; ratehetmodel = gamma; numratecats = 4; invariantsites = none), and the statistical support of the ML phylogenetic trees was calculated from 10^3 bootstrap replicates.
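As a quick consistency check on the MCMC settings reported above, the following Python sketch computes how many trees the chain samples and how many of them fall within the burn-in. The function and variable names are illustrative only and are not part of BEAST or TRACER; the sketch also assumes that the initial state of the chain is not counted as a sample.

```python
# Values taken from the text; the names are illustrative assumptions.
TOTAL_GENERATIONS = 5 * 10**7   # length of the MCMC chain
SAMPLE_EVERY = 10**4            # one tree sampled every 10^4 generations
BURNIN_GENERATIONS = 5 * 10**5  # generations needed to reach stationarity


def mcmc_sample_counts(total, every, burnin):
    """Return (sampled trees, burn-in trees, burn-in fraction).

    Assumes the initial state (generation 0) is not counted as a sample.
    """
    sampled = total // every
    discarded = burnin // every
    return sampled, discarded, discarded / sampled


sampled, discarded, fraction = mcmc_sample_counts(
    TOTAL_GENERATIONS, SAMPLE_EVERY, BURNIN_GENERATIONS
)
print(f"{sampled} trees sampled, {discarded} discarded ({fraction:.0%})")
```

With the values from the text, 5000 trees are sampled and 50 are discarded, which agrees with the reported 1% burn-in.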
The 50% majority-rule consensus ML phylogenetic tree of all bootstrap replicates was summarized using SumTrees from DendroPy 3.8.0 (Sukumaran et al., 2010).

To compare the clusters in the phylogenetic tree with protein functions, a search for conserved domains was performed in the Pfam database (Sonnhammer et al., 1997). Only domains predicted at the 1% level of significance were considered further.

RESULTS
In the phylogenetic trees (Fig. S1), the ten bZIP classes previously described for Arabidopsis (Jakoby et al. 2002) were recovered as monophyletic clades supported by moderate values of posterior probability (PP; Bayesian tree) and bootstrap value (BV; ML tree), with PP > 85 and BV > 50; most of the 119 proteins included in these clades share the same Pfam domains (http://pfam.sanger.ac.uk/). Based on these data, we also proposed an additional group of nine related bZIP proteins that share a set of four distinct domains (Fig. S1). Beyond these eleven groups, 20 (13.5%) bZIP proteins remained ungrouped (Fig. S1).

After distributing all soybean bZIP proteins annotated in GenBank into 11 classes, we sought candidate sequences among the bZIP proteins characterized as responsive to pathogens (Fig. S1) that were also differentially expressed during Asian soybean rust (ASR). To identify the bZIP transcription factors involved in the response to Phakopsora pachyrhizi, we used data from subtractive libraries containing clones from the subtraction of inoculated vs. mock plants of the resistant soybean cultivar PI561356, deposited in the database of the GENOSOJA consortium (http://bioinfo.cnpso.embrapa.br/genosoja/; Benko-Iseppon et al. 2012). We verified the differential expression of several members of plant transcription factor families involved in plant defense, such as the MYB, WRKY, AP2/ERF, NAC and bZIP families, during ASR infection in PI561356 plants (unpublished data).
Among the members of the bZIP family, we chose four distinct soybean genes with high values of fragments per kilobase of exon per million fragments mapped (FPKM; above 500) for further functional studies (GenBank accession numbers ABI34659, NP_001237027, XP_003543312 and XP_003525005). These four bZIP members are highlighted in the tree (Fig. S1). In the phylogenetic analyses, two of the selected bZIP proteins grouped in class E and were used to analyze the response to infection by P. pachyrhizi (Fig. 1).

Class E includes TabZIP (ACT66299), a single bZIP protein that has been functionally characterized in the response to infection by a wheat rust fungus, Puccinia striiformis f. sp. tritici (Zhang et al. 2009). The selected soybean proteins (XP_003543312 and XP_003525005) are similar to Arabidopsis bZIP proteins from class E, which have not been assigned a defined biological function (Jakoby et al. 2002). Thus, we selected these two proteins for functional studies and to analyze their expression during infection by P. pachyrhizi. The proteins were named GmbZIPE1 (XP_003543312) and GmbZIPE2 (XP_003525005) because of their similarity to Arabidopsis proteins in class E (Jakoby et al. 2002).

The two other selected bZIP members (ABI34659 and NP_001237027) grouped in class C. Class C contained the highest number of bZIP transcription factors characterized as responsive to pathogens (Fig. S1). ABI34659, identified in GenBank as GmbZIP105, and NP_001237027, identified as GmbZIP62, show strong similarity to members of class C that are responsive to pathogens (Fig. 1), such as G/HBF-1 and the soybean SBZ1 (Fig. S1; Jakoby et al. 2002).

GmbZIP62 was previously characterized as responsive to abiotic stress (Liao et al.
2008); its overexpression in Arabidopsis increased tolerance to drought, salinity and freezing (Liao et al. 2008). Thus, GmbZIP62 may be a general stress response factor in plants, including stress caused by pathogens, a feature already described for other bZIP proteins (Lee et al. 2006; Orellana et al. 2010).

GmbZIP62, GmbZIP105, GmbZIPE1 and GmbZIPE2 were grouped in different clades, reflecting structural differences that may be accompanied by functional differences among these proteins (Fig. S1).

DISCUSSION
The phylogenetic analysis showed the great structural and functional diversity of the bZIP family of transcription factors. A previous study (Jakoby et al. 2002) proposed the separation of Arabidopsis bZIPs into 10 classes or groups. In soybean, the same 10 groups were used to classify the bZIP proteins according to their structural similarity to the Arabidopsis proteins (Liao et al. 2008). The phylogenetic analysis performed in this study revealed at least 11 classes of soybean bZIP proteins, which indicates the existence of other functional classes in addition to those previously described by Jakoby and coworkers (2002). Twenty of the 148 analyzed soybean proteins were not placed in any of the 11 groups, most likely because their complete sequences are lacking in the GenBank database or because their functional domains were not characterized as bZIP domains, although these proteins have been predicted to be bZIP family proteins in previous studies (Liao et al. 2008).

A new class is also proposed, formed by proteins homologous to HB-PHD family proteins in Arabidopsis (Ariel et al. 2007). The homeobox domain (HB) is a conserved motif of 60 amino acids in transcription factors found in all eukaryotic organisms (Ariel et al. 2007).
This motif folds into a three-helix structure that is capable of interacting specifically with the target DNA (Ariel et al. 2007). The PHD finger domain, a Cys4-His-Cys3 zinc finger, is found in many regulatory proteins from plants and animals and is often associated with transcriptional regulation by chromatin modification (Halbach et al. 2000). In transcription factors containing a homeobox domain, the PHD finger is combined with a leucine zipper in an upstream position (Halbach et al. 2000). Together, these domains form a highly conserved region of 180 amino acids called ZIP/PHDf, and the transcriptional activity of the PHD finger domain has been shown to be masked when it lies within this longer region (Halbach et al. 2000). Interestingly, little is known about the region proximal to the basic leucine zipper domain, and these proteins have not been described in the literature as proteins containing a bZIP domain. In addition to the protein domains described for HB-PHD proteins, the 9 proteins of this group also have MEKHLA domains (MEKHLA being a conserved sequence of amino acids), which are similar to the PAS domain (Per, Arnt and Sim proteins). In eukaryotes, the PAS domain acts as a signal detector in signaling pathways (Dunham et al. 2003). The proteins of this group also contain a START domain (StAR protein-related lipid-transfer), which has a lipid-binding function and can bind to cholesterol, phospholipids and sphingolipids (Ponting et al. 1999), suggesting that these proteins may be anchored to the cell membrane and act as signal receptors in plant cells.

Analysis of the protein domain composition of each soybean bZIP protein revealed structural diversity among the monophyletic groups (Fig. S1).
Members of groups A, C, E, F, I and S contained signature basic leucine zipper (bZIP) domains, and the domains identified as bZIP domains differ structurally within the classification proposed by Jakoby and coworkers (2002). For example, the bZIP_C domain, found in 19 of the 20 members of group C, differs from the bZIP_2 and bZIP_1 domains by having a leucine zipper that contains nine heptad (seven-residue) repeats (Jakoby et al. 2002). Unlike the members of these groups, members of the other groups have several distinct domains that reflect the characteristics of their functional roles. Most members of group D (17) also have a DOG1 domain (Delay of germination protein 1), which is related to the control of seed development (Jakoby et al. 2002), while two members have an HSF (heat shock factor) domain, a binding domain related to heat shock promoters (Clos et al. 1990). Seven members of group G have a domain called MFMR (multifunctional mosaic region), which has a crucial role in the activation of transcription (Jakoby et al. 2002). Five members of group H have a RING-type zinc finger domain that most likely functions in protein-protein interactions (Halbach et al. 2000). Although 20 proteins were not placed in any of the 11 monophyletic groups, 16 of them have domains of unknown function (DUF) that have structural similarity to the bZIP domain (DUF630, DUF632 and DUF1664), while 3 others exhibited unrelated domains (Fig. S1). Ten of the grouped proteins showed no relevant domains (Fig. S1).

Fig. S1 Phylogeny of soybean bZIP proteins. The majority-rule consensus tree was obtained by Bayesian MCMC coalescent analysis of 148 sequences of bZIP proteins (see methods above). The posterior probability values (PP) (expressed as probabilities) calculated using the best trees found by MrBayes are shown beside each node.
The second value (underlined) corresponds to the bootstrap values (BV) (expressed as probabilities) that define the clusters in the maximum likelihood tree. The proteins selected for study are shown in bold and highlighted in yellow (GmbZIPE1, XP_003543312; GmbZIPE2, XP_003525005; GmbZIP105, ABI34659; GmbZIP62, NP_001237027).

References
Ariel FD, Manavella PA, Dezar CA, Chan RL (2007) The true story of the HD-Zip family. Trends Plant Sci. 12(9):419-426.
Benko-Iseppon AM, Nepomuceno AL, Abdelnoor RV (2012) GENOSOJA - the Brazilian soybean genome consortium: high throughput omics and beyond. Genetics and Molecular Biology. 35(1, Suppl. 1):i-iv.
Clos J, Westwood JT, Becker PB, Wilson S, Lambert K, Wu C (1990) Molecular cloning and expression of a hexameric Drosophila heat shock factor subject to negative regulation. Cell. 63(5):1085-1097.
Darriba D, Taboada GL, Doallo R, Posada D (2011) ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics. 27:1164-1165.
Dunham CM, Dioum EM, Tuckerman JR, Gonzalez G, Scott WG, Gilles-Gonzalez MA (2003) A distal arginine in oxygen-sensing heme-PAS domains is essential to ligand binding, signal transduction, and structure. Biochemistry. 42(25):7701-7708.
Halbach T, Scheer N, Werr W (2000) Transcriptional activation by the PHD finger is inhibited through an adjacent leucine zipper that binds 14-3-3 proteins. Nucleic Acids Res. 28(18):3542-3550.
Huelsenbeck JP, Ronquist F (2001) MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics. 17:754-755.
Jakoby M et al (2002) bZIP transcription factors in Arabidopsis. Trends Plant Sci. 7:106-111.
Katoh K et al (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30:3059-3066.
Lee SJ et al (2002) PPI1: A novel pathogen-induced basic region-leucine zipper (bZIP) transcription factor from pepper. Mol Plant Microbe Interact. 15:540-548.
Liao Y et al (2008) Soybean GmbZIP44, GmbZIP62 and GmbZIP78 genes function as negative regulator of ABA signaling and confer salt and freezing tolerance in transgenic Arabidopsis. Planta. 228:225-240.
Orellana S et al (2010) The transcription factor SlAREB1 confers drought, salt stress tolerance and regulates biotic and abiotic stress-related genes in tomato. Plant Cell Environ. 33:2191-2208.
Ponting CP, Aravind L (1999) START: a lipid-binding domain in StAR, HD-ZIP and signalling proteins. Trends Biochem Sci. 24(4):130-132.
Sukumaran J, Holder MT (2010) DendroPy: a Python library for phylogenetic computing. Bioinformatics. 26:1569-1571.
Zhang Y et al (2009) Cloning and characterization of a bZIP transcription factor gene in wheat and its expression in response to stripe rust pathogen infection and abiotic stresses. Physiological and Molecular Plant Pathology. 73:88-94.
Zwickl D (2006) Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. The University of Texas, Austin, TX.