Supporting information: Ab initio simulation of a 57-residue protein in explicit solvent reproduces the native conformation in the lowest free-energy cluster

Jinzen Ikebe,1 Daron M. Standley,2 Haruki Nakamura,3 and Junichi Higo3

1 Graduate School of Frontier Biosciences, Osaka University, Open Laboratories for Advanced Bioscience and Biotechnology, 6-2-3 Furuedai, Suita, Osaka, 565-0874, Japan
2 Systems Immunology Lab, Immunology Frontier Research Center (IFReC), Osaka University, Suita, Osaka, 565-0871, Japan
3 Institute for Protein Research, Osaka University, Suita, Osaka, 565-0871, Japan

Table: Terms and quantities used for analyses
------------------------------------------------------------------------------------------------
Terms/quantities           Meaning
------------------------------------------------------------------------------------------------
H1 and H2                  Regions of residues 2-21 and 26-47, respectively
Core-region                Region of residues 2-47
300 K dataset              Thermodynamic ensemble of structures at 300 K
Pseudo distance            Structural dissimilarity of the core-region between structures
$D_{\mathrm{NTV}}$         Pseudo distance between a cluster and the native structure
Clustering                 Average linkage clustering method to classify structures
MDS                        Multidimensional scaling to generate a conformational space
NOE distance               Upper bound of atomic distance converted from an NOE signal
Reproduction ratio         Ratio of reproduced NOE pairs to the experimentally obtained NOE pairs
  of NOE distances
$\Delta R_{\mathrm{NOE}}$  Tolerance to judge if a computed NOE distance ($R_{\mathrm{NOE}}^{\mathrm{calc}}$) agrees with
                           an experimental NOE distance ($R_{\mathrm{NOE}}^{\mathrm{exp}}$):
                           $R_{\mathrm{NOE}}^{\mathrm{calc}} \le R_{\mathrm{NOE}}^{\mathrm{exp}} + \Delta R_{\mathrm{NOE}}$
$\mathrm{RMSD}_{\mathrm{core}}$  Root mean square deviation of the backbone (N, C$\alpha$, and C atoms)
                           for the core-region between a cluster and the native structure
Q value                    Reproduction rate of the inter-residual contacts in a sampled structure
Region(9-13)               Region of residues 9-13, which is a part of the H1 region
$N_{\mathrm{hc1}}$         Number of residue-residue hydrophobic contacts between Region(9-13)
                           and the other protein regions at 300 K
$N_{\mathrm{hc2}}$         Number of residue-residue hydrophobic contacts between Region(9-13)
                           and H2 at 300 K
$N_{\mathrm{wat}}$         Number of water molecules in the vicinity of Region(9-13) at 300 K
------------------------------------------------------------------------------------------------

Multidimensional scaling

To visualize relations among the obtained clusters, we constructed a conformational space with the multidimensional scaling (MDS) method.1,2 Given $N$ clusters, MDS assigns a location to each cluster in an $N$-dimensional (i.e., full-dimensional) hyperspace that expresses the structural dissimilarities among the clusters, as follows. First, we defined a matrix $B$, for which an element $b_{\mu\nu}$ is the inner product between the position vectors of clusters $\mu$ and $\nu$ in the hyperspace:

$$ b_{\mu\nu} = \frac{1}{2}\left[ d(\mu,o)^2 + d(\nu,o)^2 - d(\mu,\nu)^2 \right], \qquad (1) $$

where $o$ represents the origin point set arbitrarily in the hyperspace, and $d(\mu,o)$, $d(\nu,o)$, and $d(\mu,\nu)$ denote pseudo distances between two of the three ($\mu$, $\nu$, and $o$). The pseudo distance $d(i,j)$ between two structures $i$ and $j$ is defined in equation 5 of the main text. The inter-cluster pseudo distance $d(\mu,\nu)$ is defined as an average over the pseudo distances between structures $i$ and $j$ belonging to clusters $\mu$ and $\nu$, respectively. To compute $d(\mu,o)$, we suppose a cluster consisting of only the origin point. When the origin is set to the geometrical center of the $N$ clusters, equation 1 is transformed as follows:

$$ b_{\mu\nu} = \frac{1}{2}\left[ \frac{1}{N}\sum_{r=1}^{N} d(\mu,r)^2 + \frac{1}{N}\sum_{s=1}^{N} d(s,\nu)^2 - \frac{1}{N^2}\sum_{r=1}^{N}\sum_{s=1}^{N} d(s,r)^2 - d(\mu,\nu)^2 \right]. \qquad (2) $$

Here we assume a matrix $X$, for which the element $x_{\mu k}$ is the $k$th coordinate of the $\mu$th cluster in the $N$-dimensional hyperspace. Then, the matrix $B$ is decomposed as $B = XX^{t}$, where $X^{t}$ is the transpose of $X$. To determine $X$, we performed an eigenvalue decomposition:

$$ B = O D^{2} O^{t} = (OD)(OD)^{t} = XX^{t}, \qquad (3) $$

where $D^{2}$ is a diagonal matrix, of which the $m$th diagonal element is the $m$th eigenvalue of $B$, and $O$ is a matrix whose column $m$ is the $m$th eigenvector of $B$. From equation 3, $X$ is defined as the product of $O$ and $D$.
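As an illustration of equation 2, the following minimal Python sketch builds the inner-product matrix from a matrix of pairwise distances by double centering. It is not the code used in this study: the function names and the toy 2D points are assumptions made for the example, and plain Euclidean distances stand in for the pseudo distances. The eigenvalue decomposition of equation 3 would follow with any standard linear algebra routine and is not shown here.

```python
import math


def euclidean_distances(points):
    """Pairwise Euclidean distances (a stand-in for the pseudo distances)."""
    return [[math.dist(p, q) for q in points] for p in points]


def gram_from_distances(d):
    """Equation 2: inner-product matrix B with the origin at the centroid."""
    n = len(d)
    sq = [[d[m][v] ** 2 for v in range(n)] for m in range(n)]
    # (1/N) * sum over r of d(mu, r)^2, one value per row mu
    row_mean = [sum(sq[m]) / n for m in range(n)]
    # (1/N) * sum over s of d(s, nu)^2, one value per column nu
    col_mean = [sum(sq[s][v] for s in range(n)) / n for v in range(n)]
    # (1/N^2) * double sum of d(s, r)^2
    grand_mean = sum(row_mean) / n
    return [
        [0.5 * (row_mean[m] + col_mean[v] - grand_mean - sq[m][v])
         for v in range(n)]
        for m in range(n)
    ]


# Toy input (hypothetical): four "clusters" placed at known 2D positions.
points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
B = gram_from_distances(euclidean_distances(points))
for row in B:
    print(["%.3f" % value for value in row])
```

For Euclidean input, each element of the returned matrix equals the dot product of the two position vectors measured from the centroid, which is the property required of $b_{\mu\nu}$ in equation 1.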
Finally, the $\mu$th row of $X$ represents the coordinates of cluster $\mu$ in the $N$-dimensional hyperspace. The three components associated with the three largest eigenvalues are the coordinates of the cluster projected onto the 3D subspace that captures the dominant features of the cluster distribution.

Q value

The reproduction rate (Q value) of native inter-residual contacts (native contacts) was calculated as follows. When the minimum heavy atomic distance $r_{ij}$ between residues $i$ and $j$ ($|i - j| \ge 3$) was smaller than 6.0 Å in an NMR model, the residue pair was noted as a candidate for a native contact residue (NCR) pair. The threshold of 6.0 Å was taken as the sum of two heavy atomic van der Waals radii (≈ 2.0 Å each) and a constant value of 2.0 Å, which is smaller than the diameter of a water molecule (≈ 3.0 Å). Then, candidates detected in more than two-thirds of the 20 NMR models (i.e., in 14 or more) were registered as NCR pairs. The number of NCR pairs in the core-region ($N_{\mathrm{NCR}}$) was 143. When $r_{ij}$ for an NCR pair in a sampled structure was smaller than 6.5 Å, we judged that the native contact was reproduced in the sampled structure. The threshold of 6.5 Å is still small enough to exclude a water molecule from the inter-residual zones. Finally, Q was defined as $Q = N_{\mathrm{snap}} / N_{\mathrm{NCR}}$, where $N_{\mathrm{snap}}$ is the number of native contacts reproduced in the sampled structure. We use the Q value as a measure of reproduction of the native topology.

Correlation between charged residues

We analyzed the role of the electrostatic interactions between the H1 and H2 regions in the folding of EPRS-R1. Recall that H1 and H2 are the N- and C-terminal helical regions, respectively, in the native structure (Figure 1a in the main text). There are fifteen charged residues in the two regions. First, we assigned a charged site to each charged residue: the N$\zeta$ position for Lys, and the midpoint of the two N$\eta$, O$\delta$, and O$\varepsilon$ atoms for Arg, Asp, and Glu, respectively.
Second, we calculated the geometrical centers of H1 and H2, denoted as $C_N$ and $C_C$, respectively, by using the heavy-atomic positions. Then, we computed two unit vectors, $\mathbf{u}_N$ and $\mathbf{u}_C$, from a charged site to $C_N$ and $C_C$, respectively, and computed the inner product ($IP = \mathbf{u}_N \cdot \mathbf{u}_C$) of the two unit vectors. Figure 1 illustrates the procedure.

Figure 1. Schematic drawing to explain an inner product ($IP$). Gray and black cylinders are H1 and H2, respectively. White circles $C_N$ and $C_C$ represent the geometrical centers of H1 and H2, respectively. The filled circle represents the charged site of a charged residue, which belongs to either H1 or H2. Arrows represent the unit vectors $\mathbf{u}_N$ and $\mathbf{u}_C$ defined from the charged site to $C_N$ and $C_C$, respectively. Then, the inner product is defined as $IP = \mathbf{u}_N \cdot \mathbf{u}_C$. The two panels show situations with $IP$ of opposite sign ($IP < 0$ and $IP > 0$).

When the charged site is located in the zone between H1 and H2, the inner product takes on a negative value. We then defined the distance between H1 and H2 as

$$ d_{NC} = \frac{1}{N_N N_C} \sum_{i=1}^{N_N} \sum_{j=1}^{N_C} d(i,j), \qquad (4) $$

where $N_N$ (or $N_C$) is the number of C$\alpha$ atoms in H1 (or H2), and $d(i,j)$ is the distance between the $i$th and $j$th C$\alpha$ atoms belonging to H1 and H2, respectively. The $C_N$-$C_C$ distance or the H1-H2 minimum distance could serve as alternative measures of the proximity of H1 and H2. However, $d_{NC}$ is an appropriate quantity because it integrates both the overall and local proximities. A conformation is thus characterized by a $d_{NC}$ value and fifteen inner products. Then, given a conformational ensemble, we analyzed the relation between $d_{NC}$ and the averaged inner products, $\langle IP \rangle_{d_{NC}}$, taken over conformations that have a particular value of $d_{NC}$.

Figures 2a-d plot the inner products $\langle IP \rangle_{d_{NC}}$ for the fifteen ionic charged sites, averaged over conformations in the 300 K dataset at a given distance $d_{NC}$. The inner products were predominantly positive for all of the fifteen sites.
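The quantities plotted in Figure 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis code of this study: all function names are hypothetical, coordinates are passed as plain (x, y, z) tuples, and grouping conformations into bins of width `bin_width` is an assumed way of selecting conformations "at a given distance".

```python
import math


def centroid(points):
    """Geometrical center of a set of 3D positions."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def unit_vector(src, dst):
    """Unit vector pointing from src to dst."""
    v = [b - a for a, b in zip(src, dst)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def inner_product(site, h1_heavy_atoms, h2_heavy_atoms):
    """IP = u_N . u_C for one charged site."""
    u_n = unit_vector(site, centroid(h1_heavy_atoms))
    u_c = unit_vector(site, centroid(h2_heavy_atoms))
    return sum(a * b for a, b in zip(u_n, u_c))


def mean_helix_distance(h1_ca, h2_ca):
    """d_NC: average over all C-alpha pairs, one atom from each helix."""
    total = sum(math.dist(a, b) for a in h1_ca for b in h2_ca)
    return total / (len(h1_ca) * len(h2_ca))


def average_ip_by_distance(samples, bin_width=1.0):
    """Average IP over (d_NC, IP) samples grouped into d_NC bins (assumed binning)."""
    sums = {}
    for d_nc, ip in samples:
        key = int(d_nc // bin_width)
        total, count = sums.get(key, (0.0, 0))
        sums[key] = (total + ip, count + 1)
    # Each bin is reported by its lower edge.
    return {key * bin_width: total / count
            for key, (total, count) in sorted(sums.items())}
```

With two toy helices lying parallel to the z axis, a site placed between them gives an inner product of -1 and a site placed outside both gives +1, which matches the sign convention described in the Figure 1 caption.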
If the two helical regions were stabilized by electrostatic interactions between oppositely charged amino acids in the two regions, the inner products for the sites involved would be predominantly negative. However, such correlations were not observed. Thus, we conclude that the electrostatic interactions between the charged residues do not contribute significantly to the attraction between the H1 and H2 regions.

Figure 2. Inner products $\langle IP \rangle_{d_{NC}}$. The four panels show $\langle IP \rangle_{d_{NC}}$ for ionic charged sites, averaged over conformations that have a particular value of $d_{NC}$: (a) positively charged sites in the H1 region, (b) negatively charged sites in H1, (c) positively charged sites in the H2 region, and (d) negatively charged sites in H2.

Helix-helix dipole interaction

It is known that an α-helix has an electric dipole moment resulting from local dipoles pointing along peptide planes,3 and that proteins exploit these macroscopic dipoles in order to stabilize folds, bind ligands, and form selective channels.4,5 The native structural topology of the current protein, EPRS-R1, is a typical fold with two anti-parallel α-helices, which may be stabilized by helical dipoles. Thus, although there is no significant ionic charge distribution that would directly support the folding of EPRS-R1, as shown above, the approach of the two anti-parallel helices may be influenced by the anti-parallel helical dipoles.

Dependency of $N_{\mathrm{clust}}$ on $D_{\mathrm{thre}}$

The resultant number of clusters $N_{\mathrm{clust}}$ depends on the threshold $D_{\mathrm{thre}}$ used to stop the merger step, as explained in the main text. Figure 3 plots the relation between $D_{\mathrm{thre}}$ and $N_{\mathrm{clust}}$. We observed two inflection points in the dependency of $N_{\mathrm{clust}}$ on $D_{\mathrm{thre}}$: $D_{\mathrm{thre}} = 20$ and 32. Thus, the structural clustering of the 300 K dataset can be executed with either of the two thresholds. As reported in the main text, we obtained 20 structural clusters with $D_{\mathrm{thre}} = 32$, where the largest cluster accounted for 34% of the 300 K dataset.
In contrast, we obtained 107 clusters with $D_{\mathrm{thre}} = 20$, where the largest cluster accounted for 5%. Thus, $D_{\mathrm{thre}} = 20$ provides no major cluster that determines the main feature of the cluster distribution. We used $D_{\mathrm{thre}} = 32$ for the clustering in this study. We also examined other clustering methods, namely the single-linkage and complete-linkage methods. Both gave results qualitatively similar to the current one when the $D_{\mathrm{thre}}$ value was adjusted.

Figure 3. Dependency of $N_{\mathrm{clust}}$ on $D_{\mathrm{thre}}$. Red open circles are raw data from the clustering, and blue broken lines are lines fitted to the raw data.

References

1. Yeh, I.C., Lee, M.S., and Olson, M.A. (2008) Calculation of protein heat capacity from replica-exchange molecular dynamics simulations with different implicit solvent models. J Phys Chem 112: 15064-15073.
2. Cheung, M.S., Garcia, A.E., and Onuchic, J.N. (2002) Protein folding mediated by solvation: water expulsion and formation of the hydrophobic core occur after the structural collapse. Proc Natl Acad Sci USA 99: 685-690.
3. Wada, A. (1976) The alpha-helix as an electric macro-dipole. Adv Biophys: 1.
4. Murata, K., Mitsuoka, K., Hirai, T., Walz, T., Agre, P., Heymann, J.B., Engel, A., and Fujiyoshi, Y. (2000) Structural determinants of water permeation through aquaporin-1. Nature 407: 599-605.
5. Nakamura, H. (1996) Roles of electrostatic interaction in proteins. Q Rev Biophys 29: 1-90.