Automated robust and accurate assignment of protein resonances for solid state NMR

Jakob Toudahl Nielsen*, Natalia Kulminskaya, Morten Bjerring and Niels Chr. Nielsen

Supplementary Material

Experimental procedures

MAS solid-state NMR spectra of the three samples GB1, ubiquitin, and CsmA were obtained. The GB1 and ubiquitin samples consist of hydrated microcrystals, while the CsmA sample was a truly heterogeneous preparation as described elsewhere (Kulminskaya et al. 2012). All spectra were recorded using a Bruker 700 MHz Avance II spectrometer at 12 kHz spinning using standard 2.5 mm (GB1) and 4 mm (ubiquitin and CsmA) triple-resonance Bruker probes. For heteronuclear transfers, 5-8 ms optimized adiabatic DCP elements were used, while 13C-13C transfers were carried out using 20 ms DARR. Ramped cross-polarization from 1H to 15N or 13C was, in all experiments, carried out at optimized Hartmann-Hahn conditions. For all spectra, 80-100 kHz SPINAL-64 1H decoupling was applied during direct and indirect acquisition periods, and acquisition times of around 30 ms were used. The DARR spectra were acquired with 400 points and 200 ppm spectral widths in the indirect dimensions, while the indirect 15N dimensions in the 3D experiments were covered with 38 ppm/24 points (GB1), 40 ppm/28 points (ubiquitin), and 32 ppm/24 points (CsmA). The indirect Cα dimensions used 34 ppm/44 points (GB1), 28 ppm/48 points (ubiquitin), and 26 ppm/42 points (CsmA), and for C′ the numbers were 16 ppm/32 points (GB1), 16 ppm/28 points (ubiquitin), and 12 ppm/28 points (CsmA), except for the CAN(CO)CX spectrum of ubiquitin, where 32 and 64 points were used for the N and Cα dimensions, respectively. For GB1, the DARR spectrum was obtained with 16 scans per increment, and the 3D experiments used 24 (NCACX), 72 (NCOCX), and 128 (CONCA) scans.
For ubiquitin, the number of scans was 88 (DARR), 152 (NCACX), 160 (NCOCX), 224 (CONCA), and 80 (CAN(CO)CX), and the CsmA spectra used 96 (DARR), 180 (NCACX), 240 (NCOCX), and 288 (CONCA) scans.

Peak picking was performed manually in all spectra using Sparky (Goddard and Kneller). Sparky was also used to pick all peaks in the spectra of GB1 by an automatic procedure. First, all peaks were picked automatically in the 2D DARR spectrum, and the diagonal peaks and the peaks corresponding to spinning side bands were removed. Second, all 3D spectra were picked by the automatic “restricted peak picking” procedure in Sparky, which allows peaks to be picked within a ± 0.3 ppm range from reference coordinates; the automatically picked peaks in the 2D DARR spectrum served as reference coordinates for the carbon dimensions. For CONCA, the shared N and Cα shifts from the (automatically picked) NCACX spectrum were used as reference.

Algorithms used in GAMES_ASSIGN

Algorithm 1.3: Guided completion of peak clusters

The third step in phase I aims at completing the peak clusters. Here we define a peak cluster, $C$, as the union of subsets, $s_i$, of peaks from the different experiments, $i$:

$$C = \bigcup_{i=1}^{n} s_i$$

where some of the sets might be empty. Depending on the experiment type, there is an expectation for the number of peaks in the subset: either $|s_i| = 1$, as is the case for the subset corresponding to CONCA, or $|s_i| \geq 1$, as for NCACX (where CX denotes an unspecified carbon), or $|s_i| = 1$ or 0, as for NCACB, which can have zero peaks for a glycine. In the first step of phase I, as described above, a peak is only appended to a peak cluster, and two peak clusters are only merged, if the above expectations of peak counts in subsets are obeyed in the resulting new peak cluster. This means that two clusters are not merged if they both have an NCACB peak. This step and A1.2 aim at completing the clusters, i.e. all experiment subsets should be non-empty except for the case with expectation $|s_i| = 1$ or 0.
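As an illustration of the subset-count expectations above, the following minimal Python sketch checks whether two peak clusters may be merged and whether a cluster is complete. The dictionary layout, the `EXPECTED` table, and the function names `can_merge` and `is_complete` are assumptions made for this example; they are not taken from GAMES_ASSIGN.

```python
# Hypothetical table: experiment -> (minimum, maximum) number of peaks
# allowed in the corresponding subset s_i; None means "no upper limit".
EXPECTED = {
    "CONCA": (1, 1),     # |s_i| = 1
    "NCACX": (1, None),  # |s_i| >= 1
    "NCACB": (0, 1),     # |s_i| = 1 or 0 (no peak for glycine)
}


def can_merge(cluster_a, cluster_b):
    """Return True if the union of two clusters respects the upper limits.

    Each cluster is a dict mapping an experiment name to a set of peak ids.
    """
    for experiment, (_, upper) in EXPECTED.items():
        merged = cluster_a.get(experiment, set()) | cluster_b.get(experiment, set())
        if upper is not None and len(merged) > upper:
            return False
    return True


def is_complete(cluster):
    """Return True if every subset that must be non-empty has a peak."""
    for experiment, (lower, _) in EXPECTED.items():
        if len(cluster.get(experiment, set())) < lower:
            return False
    return True


a = {"CONCA": {"c1"}, "NCACB": {"b1"}}
b = {"NCACX": {"x1", "x2"}, "NCACB": {"b2"}}
c = {"NCACX": {"x1", "x2"}}

print(can_merge(a, b))  # both clusters carry an NCACB peak
print(can_merge(a, c))  # no expectation is violated
```

In this sketch, `a` and `b` are rejected because the merged cluster would hold two NCACB peaks, whereas `a` and `c` can be merged and the result is complete even though a cluster without an NCACB peak would also be accepted.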
The object of A1.3 is, for a given peak cluster with a missing experiment (an empty experiment subset $s_i$), to search for a peak, $p_{in}$, from experiment $i$ that matches the cluster optimally. The energy of pairing is defined as the normal pairing energy between the peak $p_{in}$ and another peak, $p_j \in s_j$, $j \neq i$, from the peak cluster, namely the one sharing the most axes with $p_{in}$. Lines are often broad in solid-state NMR spectra, and consequently the peaks overlap. Therefore, for this step, clusters are not merged even if $p_{in}$ is already present in another cluster, so that a peak can be present in two or more spin systems at the same time.

Typing energy for non-specifically assigned atoms

The definition of the typing energy for non-specifically assigned atoms (CX), $T_N = T_{N0} + T_{NR}$, is visualized in Fig. 3 in the main text. First, $T_{N0}$ is calculated by looping through the resonances, $j$ (see Fig. 3, solid arrows), for each order in the SSY separately, and finding the reference atoms, $K$, that best explain each resonance:

$$T_{N0} = \sum_{j} \min_{k} t_{jk}, \quad \text{and} \quad K = \bigcup_{j} \operatorname{argmin}_{k} t_{jk}$$

where $t_{jk}$ is defined above. Furthermore, an additional contribution ($T'_j = \ln(0.3)$) is added to $T_{N0}$ if the same reference atom is assigned to different resonances in the SSY. Second, the contribution $T_{NR}$ is calculated by looping through the remaining expected atoms, $k \notin K$, i.e. expected atoms with no assignments (see Fig. 3, dotted arrows):

$$T_{NR} = \sum_{k \notin K} \min_{j} t_{jk}$$

For this calculation, it is tested for each atom which resonance in the SSY matches best. Again, as for the specific atoms, if a matching resonance is not found (see Fig. 3 in the main text, dotted arrow marked with a dollar sign), a penalty value $t_{jk} = e_{miss}$ is used to account for a missing peak (increasingly smaller values were used for $e_{miss}$ for carbons further down the side chain). Finally, the two contributions are summed to define the total typing energy for the non-specifically assigned resonances, $T_N = T_{N0} + T_{NR}$.
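The two contributions to the typing energy can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the GAMES_ASSIGN implementation: the matrix layout `t[j][k]` (resonance j against reference atom k), the use of `math.inf` to mark a pair without a match, the per-atom list `e_miss`, and adding the extra term once for every repeated use of a reference atom are all choices made for this example.

```python
import math


def typing_energy(t, e_miss, duplicate_term=math.log(0.3)):
    """Return (T_N, K) for one residue order of a spin system.

    t[j][k]  : energy of resonance j matched to reference atom k
               (math.inf marks "no match"; an assumption of this sketch)
    e_miss[k]: penalty used when no resonance matches reference atom k
    """
    n_atoms = len(e_miss)

    # T_N0: every resonance picks the reference atom that explains it best.
    t_n0 = 0.0
    chosen = []
    for row in t:
        k_best = min(range(n_atoms), key=lambda k: row[k])
        t_n0 += row[k_best]
        if k_best in chosen:
            # Same reference atom used by an earlier resonance (T'_j term).
            t_n0 += duplicate_term
        chosen.append(k_best)
    K = set(chosen)

    # T_NR: every expected atom not in K picks its best matching resonance,
    # or the missing-peak penalty if no resonance matches at all.
    t_nr = 0.0
    for k in range(n_atoms):
        if k in K:
            continue
        best = min((row[k] for row in t), default=math.inf)
        t_nr += e_miss[k] if math.isinf(best) else best

    return t_n0 + t_nr, K


# Two resonances, three expected atoms; the third atom has no match.
t = [[0.5, 2.0, math.inf],
     [3.0, 1.0, math.inf]]
energy, K = typing_energy(t, e_miss=[4.0, 3.0, 2.0])
print(energy, K)
```

For this toy input the first sum picks atoms 0 and 1 (0.5 + 1.0), and the unmatched third atom contributes its missing-peak penalty of 2.0.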
Note that finding the optimal pairing of assignments between members of the two sets is itself a QAP. Our solution to the problem is, however, much faster than a full QAP optimization and often finds the optimal solution.

Algorithm 3.2: Missing atom assignments

The next step identifies the still non-assigned resonances in the protein by looking for missing atom resonances in each of the SSYs. For each missing atom assignment, the possible pairings with peaks are evaluated and subjected to the MENOVAR algorithm. The possibility of assigning the atom resonance as missing in the data is included with a probability that is higher the farther the atom is from the backbone. The probabilities used here are $p_{miss}$ = 0.001, 0.01, 0.1, 0.5, 0.7, and 0.9 for β, γ, δ, ε, ζ, and η carbons, respectively, and 0.001 for backbone atoms. If the candidate matching peak has already been assigned in phase II, the probabilities for pairing are derived from the histogram statistics as above. Otherwise, the energy related to the pairing is defined as the sum of the coordinate-database value matching energy contribution, $t_{jk}$, and the resonance-coordinate match energy, $E_c$, as defined above in reference to an atom entry in an assigned SSY (Eq. 2).

Algorithm 3.3: Assigning non-assigned peaks

In the third step, the opposite procedure of 3.2 is followed: peaks that were still not assigned in the first two steps are matched with the already assigned SSYs. The energy for pairing is the resonance-coordinate match energy, with the exception that the contribution, $e_n$, from a certain dimension, $n$, of the peak, in the case that the corresponding experiment has no specific atom related (as is the case for the last dimension for e.g.
NCACX peaks), is defined as:

$$e_n = \min_{k} f_k(\delta_{peak})$$

where $\delta_{peak}$ is the chemical shift for the aligned axis of the peak, $k$ loops through all side chain carbons in the residue, and

$$f_k(\delta_{peak}) = \log(\sigma_k) + \frac{1}{2}\left(\frac{\delta_{peak} - \omega_k}{\sigma_k}\right)^2$$

if carbon atom $k$ is already assigned in the SSY, where $\omega_k$ is the SSY resonance assignment and $\sigma_k$ is the estimated uncertainty for the experiment for the axis. Alternatively, if the carbon atom is not assigned yet,

$$f_k(\delta_{peak}) = \log(s_k) + \frac{1}{2}\left(\frac{\delta_{peak} - d_k}{s_k}\right)^2$$

where $d_k$ and $s_k$ are the average and standard deviation, respectively, of the database chemical shifts.

Quality assessment of resonance assignments

A resonance assignment is a collection of sequence-specifically assigned spin systems (SSYs). Each such assigned SSY represents a collection of assigned atoms. The assignments of the atoms are derived from the assigned peaks in the SSY. At the end of the algorithm, we merge the values from neighboring SSYs into the first SSY. That is, for an SSY assigned to residue $i$, SSY($i$), we merge values for SSY($i$+1) with residue order Δ = −1 onto SSY($i$) with residue order Δ = 0 and, similarly, SSY($i$−1) with Δ = 0 onto SSY($i$) with Δ = −1. GAMES_ASSIGN produces an ensemble of solutions for each individual assignment (20 solutions here). Based on this ensemble, we create one single merged SSY (mSSY) for each residue position by including all peak assignments from the 20 individual SSYs (including neighbor assignments as described above). If a peak is present multiple times in the mSSY, the corresponding chemical shift value is repeated as many times in the calculation of the median chemical shift and the standard deviation. Based on this mSSY, we define a residue-based quality parameter, $Q_{res}$, from all assigned values:

$$Q_{res}(i) = 0.5\left(6.5 - R(i) - \chi(i) + \sqrt{T(i)}\right) - \ln\left(1.02 - f_{mult}(i)\right) + 0.2\,S(i)$$

where $T(i)$ and $S(i)$ are the SSY typing and SSY quality energies as defined above and in the main text Eqs. 3,4 and Eq.
7, respectively, $R(i)$ is the rank of the mSSY, defined as the number of assigned backbone atoms (i.e. a maximum of 5) + 0.7 if Cβ is assigned and + 0.6 if the residue is Gly, and $\chi(i)$ is the completeness of the mSSY, defined as a Boolean value that is 1 if all atoms are assigned, including all side chain atoms, and 0 if not. Finally, $f_{mult}(i)$ is the fraction of peaks in the mSSY that are also present in another mSSY for another residue position. Smaller values of $Q_{res}$ correspond to better validated assignments for the mSSY; values around 2.5 are typical. Furthermore, each assigned atom, $m$, in residue $i$ has a set of chemical shift values, $\delta_{merge}(m,i)$, from the corresponding mSSY, which will typically contain ca. 200 values. This set is used to define an atom-based estimated precision,

$$q_{atom}(m,i) = s\left(\delta_{merge}(m,i)\right) \cdot 200/N$$

where $s$ denotes the sample standard deviation from the mean in the combined set and $N$ denotes the number of assignments in the set. A phenomenological quality assessment parameter for a specific atom is calculated as the inverse of the clipped product between $Q_{res}(i)$ and $q_{atom}(m,i)$:

$$Q_a = \frac{7.5}{c\left(q_{atom}(m,i)\right) \cdot c\left(0.1\,Q_{res}(i)\right)}, \quad \text{where} \quad c(x) = \begin{cases} x & \text{if } 0.1 < x < 1.0 \\ 0.1 & \text{if } x \leq 0.1 \\ 1.0 & \text{if } x \geq 1.0 \end{cases}$$

Assignments with $Q_a$ > 80 are considered validated; otherwise, the assignment is deemed tentative.

Deconvolution of side chain resonance chemical shift sets

Algorithm 1.4 in the main text describes how a group of aligned peaks associated into clusters is interpreted as dipeptide spin systems (SSYs). Since many ssNMR experiments have a dimension with an unspecific atom type, the peak group will contain a merged set of subsets of chemical shifts for different side chain carbon atoms. The algorithm used here aims to deconvolute this merged set and divide the values into disjoint subsets corresponding to the individual side chain atoms. The algorithm is initialized by sorting all chemical shift values numerically.
The values are looped through in order of increasing chemical shift. For each new value, $\delta_i$, it is decided whether the value is added to the current group (side chain carbon), if it is close enough to the minimum value in the current group, $\delta_0$, or whether a new group is initialized with starting value $\delta_0 = \delta_i$. This is accomplished by evaluating a test probability:

$$p = \exp\left(-\frac{(\delta_0 - \delta_i)^2}{2\sigma^2}\right), \quad \sigma = \lambda\sqrt{1/N_{group} + 1/2}$$

where $N_{group}$ is the number of chemical shifts in the current group and $\lambda$ is the estimated uncertainty of the peak positions. A random number between 0 and 1 is drawn, and if this number is less than $p$, the chemical shift is added to the current group; otherwise a new group is initialized.

Figure Legends

Figure S1: Example of peak-picked data showing spectra used for the assignments; picked peaks are marked with black crosses. a Aliphatic region of the DARR spectrum of GB1 acquired with a mixing time of 20 ms. b A representative plane with fixed 15N chemical shift of the NCACX spectrum for ubiquitin.

Figure S2: Visualization of the simulated spectra with increasing line width for GB1. An excerpt of the 13C-13C 2D DARR spectrum for GB1 is shown. a, b, c, d and e show the simulated spectrum for Gaussian line widths in both dimensions of σ = 0.3, 0.4, 0.5, 0.7 and 1.0 ppm (Eq. 13), respectively, corresponding to FWHM = 2.355σ. Green and black dots denote picked peaks due to merged signals; black dots highlight significantly shifted signals. Red and magenta dots denote non-overlapped peaks, where shifted peak positions are highlighted in magenta. All other simulation parameters are fixed and set to default values (Nielsen et al. 2014). The experimental spectrum is shown in f for comparison.
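The grouping procedure from the section on deconvolution of side chain resonance chemical shift sets can be sketched in Python as below. This is a minimal illustration rather than the authors' code. The function name `deconvolute`, the argument `lam` (standing for the peak position uncertainty λ), and the optional `rng` argument are hypothetical. The sketch also assumes that σ = λ·sqrt(1/N_group + 1/2) and that a value joins the current group when the random draw falls below p, so that values close to the group minimum are likely to be grouped.

```python
import math
import random


def deconvolute(shifts, lam, rng=None):
    """Split a merged list of chemical shifts into groups (side chain carbons).

    shifts: chemical shift values in ppm (any order)
    lam   : estimated uncertainty of the peak positions in ppm
    rng   : optional random.Random instance, for reproducible runs
    """
    rng = rng or random.Random()
    groups = []
    for delta in sorted(shifts):
        if groups:
            current = groups[-1]
            delta0 = current[0]  # values are sorted, so this is the group minimum
            sigma = lam * math.sqrt(1.0 / len(current) + 0.5)
            p = math.exp(-((delta0 - delta) ** 2) / (2.0 * sigma ** 2))
            # Assumed acceptance rule: join the group when the draw is below p.
            if rng.random() < p:
                current.append(delta)
                continue
        # First value, or the test failed: start a new group with delta0 = delta.
        groups.append([delta])
    return groups


groups = deconvolute([60.0, 30.0, 30.0, 60.0], lam=0.3)
print(groups)
```

Identical values give p = 1 and always share a group, while values separated by many multiples of λ give a p that is numerically zero and always start a new group; for intermediate separations the outcome varies between runs unless a seeded `random.Random` is passed in.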