Supplementary Information for: Extracting representative structures from protein conformational ensembles Alberto Perez1, Arijit Roy1, Koushik Kasavajhala2, Amy Wagaman4, Ken A. Dill1,2,3, Justin L. MacCallum1,5* 1 Laufer Center for Physical and Quantitative Biology, Stony Brook University of Chemistry, Stony Brook University 3 Department of Physics, Stony Brook University 4 Department of Mathematics and Statistics, Amherst College 5 Department of Chemistry, University of Calgary 2 Department * To whom correspondence should be addressed. Email: justin.maccallum@me.com Supplementary Tables Table S1: This table reports if the differences observed between averaging methods are significant, as judged by the two-sided sign test with a p-value of 0.01. A ‘+’ indicates a result is significant. A ‘-‘ indicates it is not. For each entry, the first symbol indicates RMSD, the second indicates GDT-HA, and the third indicates CAD-score. GÅ Gauss1 Native Refinement MD + +++ ++- CartAvg +++ vs Centroid NetAvg +++ + +++ -++ vs Centroid CartAvg +++ + ----vs NetAvg 1 For Gauss models, only RMSD was compared Table S2: Experimental structures used to generate the Native MD dataset. PDB ID 3pnx 3npp 2kyy 3n70 2kyw 3nrf 3na2 PDB ID 3n6y 3nhv 3ni8 3noh 3voq 3nkl 3nrl Table S3: Applications of NetAvg and CartAvg to NMR ensembles. Each entry is the backbone RMSD (Å) compared to the corresponding x-ray structure. * Corresponds to structures solved with solid state NMR. X-ray Pdb ID 2zw1 NMR Pdb ID 3gb1 First model 0.67 Average 0.72 CartAvg 0.64 NetAvg 0.64 1ubq 1d3z 2jzz* 2l3z* 0.53 2.21 3.22 0.95 2.62 2.72 0.56 2.06 2.32 0.53 2.05 2.32 6lyz 1e8l 1.46 1.49 1.41 1.36 Supplementary Methods Details of the NetAvg procedure The algorithm of NetAvg begins by iterative superposition of all structures to the mean structure (described above). Next, each atom is considered individually: 1. Create a graph where each position sampled by the atom is a node. There will be ððððððð nodes. 2. Compute the distance between all pairs of nodes as the Euclidean distance between the corresponding atom positions. 3. Connect each node to its nearest k neighbors with directed edges, where k is a parameter of the algorithm. We use ð = 32 in this work. 4. If two nodes are connected by edges in both directions, replace these edges with an undirected edge. Otherwise, remove any connections between the two nodes. 5. Find the largest connected component of the graph. 6. Find the node in the largest connected component with the highest eigenvector centrality1. 7. Compute the average position for the atom as the mean position of this node and all of the nodes connected to it (up to k other nodes). All graph computations were performed using the “networkX” python package. The graph is built by connecting each node with its k-nearest neighbors. We found a value of ð = 32 worked well for a variety of systems (data not shown). If k becomes too small, the resulting averages become very noisy and unreliable. As k becomes large, all nodes become connected and the resulting average is the same as the CartAvg. The eigenvector centrality calculation was performed using the “eigenvector_centrality_numpy” function from networkX. Pruning Protocol Given an ensemble, can we produce a more robust description of it by reducing the number of outlier structures? To answer this question we perform a two stage protocol: (A) we first cluster our ensemble by using self-similarity (measured by RMSD) and (B) remove structures from the cluster that are outliers. For the clustering we use standard clustering protocols such as the ones implemented in ptraj2. For the pruning step we have to define 3 parameters: ï· ï· Dimension for Multidimensional Scaling (MDS) projection to low dimensional space. Trim factor (trim): either a multiplication factor (for dim>1) or % of points you want to remove (for dim=1). ï· NN: number of neighbors that need to occur in neighborhood of point in the low-dimensional space for the point to be retained in the cluster (only used for dim>1). For each cluster, the distance matrix for the subset of observations in the cluster is extracted and used to project the points onto a lower dimension space using MDS. For projections to 1D, the trim factor is used to prune a percentage of points from each end of the projection. For example, a trim factor of 10 would prune 5 percent of the observations at each end based on their coordinates in the projected space. For projections to 2D or higher, the distance matrix for the subset of observations in the cluster is computed, the points projected down, and the standard deviation of all the pairwise distances for points within the cluster is computed (SD). The trim factor (trim) is multiplied by SD, and an epsilon ball with epsilon = trim*SD is placed around each point in the low dimension space. Points must have NN neighbors within their trim*SD ball or they are pruned out of the cluster (where NN is a parameter specifying the threshold number of neighbors to keep a point in the dataset). The algorithm returns the subset of the original cluster that was retained after pruning. From this subset we can then use both methods described in the main text to average and minimize; we call this new methods with pruning PNetAvg and PCartAvg. Figure S4 shows the effect pruning had in the most populated clusters of a subset of five proteins from CASP 10 (PDB IDs:4fr9, 4fj6, 4epz, and CASP IDs: T0709, Tr750 (not yet released in PDB)). Despite our limited success with the pruning algorithm, we feel there might be some value in exploring more options in the future. Supplementary Figures Figure S1: Both NetAvg and CartAvg improve over the medoid structure on the GÅ model systems. Overall, NetAvg performs better than CartAvg. Figure S2: Both NetAvg and CartAvg improve over the medoid structure on the Native MD dataset. Figure S3: NetAvg and CartAvg give slight improvements on the Refinement dataset. Figure S4: Both PCartAvg and PNetAvg show no clear improvement over the NetAvg and CartAvg methods previously described. Supplementary References 1. 2. Newman, M. The mathematics of networks. The new palgrave encyclopedia of economics (2008). Roe, D. R. & Cheatham, T. E., III. PTRAJ and CPPTRAJ: Software for Processing and Analysis of Molecular Dynamics Trajectory Data. J Chem Theory Comput 9, 3084–3095 (2013).