Automated robust and accurate assignment of protein
resonances for solid state NMR
Jakob Toudahl Nielsena*, Natalia Kulminskaya, Morten Bjerring and Niels Chr.
Nielsena
Supplementary Material
Experimental procedures
MAS solid-state NMR spectra of the three samples GB1, Ubiquitin, and CsmA
were obtained. The GB1 and ubiquitin samples consisted of hydrated microcrystals,
while the CsmA sample was a truly heterogeneous preparation as described
elsewhere (Kulminskaya et al. 2012). All spectra were recorded using a Bruker
700 MHz Avance II spectrometer at 12 kHz spinning using standard 2.5 mm (GB1)
and 4 mm (Ubiquitin and CsmA) triple resonance Bruker probes. For
heteronuclear transfers, 5-8 ms optimized adiabatic DCP elements were used,
while 13C-13C transfers were carried out using 20 ms DARR. Ramped
cross-polarization from 1H to 15N or 13C was, in all experiments, carried out at
optimized Hartmann-Hahn conditions. For all spectra, 80-100 kHz SPINAL-64 1H
decoupling was applied during direct and indirect acquisition periods, and
acquisition times of around 30 ms were used.
The DARR spectra were acquired with 400 points and 200 ppm spectral widths
in the indirect dimensions, while the indirect 15N dimensions in the 3D
experiments were covered with 38 ppm/24 points (GB1), 40 ppm/28 points
(ubiquitin), and 32 ppm/24 points (CsmA). The indirect Cα dimensions used
34 ppm/44 points (GB1), 28 ppm/48 points (Ubiquitin), and 26 ppm/42 points
(CsmA), and for C´ the numbers were 16 ppm/32 points (GB1), 16 ppm/28 points
(Ubiquitin), and 12 ppm/28 points (CsmA), except for CAN(CO)CX for Ubiquitin,
where 32 and 64 points were used for the N and Cα dimensions, respectively.
For GB1, the DARR spectrum was obtained with 16 scans per increment, and the
3D experiments used 24 (NCACX), 72 (NCOCX), and 128 (CONCA) scans. For
Ubiquitin, the number of scans was 88 (DARR), 152 (NCACX), 160 (NCOCX), 224
(CONCA) and 80 (CAN(CO)CX) and the CsmA spectra used 96 (DARR), 180
(NCACX), 240 (NCOCX), and 288 (CONCA) scans.
Peak picking was performed manually in all spectra using Sparky (Goddard and
Kneller). Sparky was also used to pick all peaks in the spectra of GB1 by an
automatic procedure. Firstly, all peaks were picked automatically in the 2D
DARR and the diagonal peaks and peaks corresponding to spinning side bands
were removed. Secondly, all 3D spectra were picked by the automatic “restricted
peak picking” procedure in Sparky, allowing peaks to be picked within ± 0.3 ppm
range from reference coordinates using the automatically picked peaks in the 2D
DARR as reference coordinates for the carbon dimensions. For CONCA, the
shared N and Cα shifts from the (automatically picked) NCACX spectrum were
used as reference.
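The restricted picking step can be pictured as a tolerance filter. The sketch below is not Sparky's implementation or API; it only illustrates the idea of keeping candidate 3D peaks whose carbon coordinates lie within ± 0.3 ppm of an automatically picked 2D DARR peak. The function name `restricted_pick`, the `axis_map` convention, and the example shifts are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Peak = Tuple[float, ...]  # chemical shifts in ppm, one per dimension


def restricted_pick(
    candidates: Sequence[Peak],
    references: Sequence[Peak],
    axis_map: Sequence[int],
    tol: float = 0.3,
) -> List[Peak]:
    """Keep candidates lying within +/- tol ppm of some reference peak.

    axis_map[r] is the candidate dimension that is compared with reference
    dimension r (a hypothetical convention for this sketch).
    """
    kept = []
    for cand in candidates:
        for ref in references:
            if all(abs(cand[axis_map[r]] - ref[r]) <= tol for r in range(len(ref))):
                kept.append(cand)
                break  # one matching reference is enough
    return kept


# Illustrative values only: one DARR reference peak and two (N, CA, CX) candidates.
darr_refs = [(55.0, 30.0)]
ncacx_candidates = [(120.0, 55.1, 30.2), (118.0, 56.0, 30.0)]
picked = restricted_pick(ncacx_candidates, darr_refs, axis_map=(1, 2))
```

Here only the first candidate survives, because the second deviates by 1.0 ppm in the CA dimension.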
Algorithms used in GAMES_ASSIGN
Algorithm 1.3: Guided completion of peak clusters
The third step in phase 1 aims at completing the peak clusters. Here we define
a peak cluster, C, as the union of subsets, s_i, of peaks from different
experiments, i:
$$C = \bigcup_{i=1}^{N} s_i$$
where some of the sets might be empty. Depending on the experiment type there
is an expectation for the number of peaks in the subset: either |s_i| = 1, as is
the case for the subset corresponding to CONCA, or |s_i| ≥ 1, as for NCACX
(where CX denotes an unspecified carbon), or |s_i| = 1 or 0, as for NCACB, which
can have zero peaks for a glycine. In the first step of phase I, as described
above, a peak is only appended to a peak cluster, and two peak clusters are only
merged, if the above expectations of peak counts in subsets are obeyed in the
resulting new peak cluster. This means, for example, that two clusters are not
merged if they both have an NCACB peak.
This step and A1.2 aim at completing the clusters, i.e. all experiment subsets
should be non-empty except for the case with expectation |s_i| = 1 or 0.
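The merge rule and the completeness criterion can be illustrated with a minimal sketch. The data layout (a dictionary from experiment name to a set of peak identifiers), the function names, and the peak labels are assumptions for illustration and are not taken from GAMES_ASSIGN; only the three count expectations come from the text.

```python
from typing import Dict, Optional, Set, Tuple

# Expected peak counts per experiment subset as (minimum, maximum);
# a maximum of None means "no upper bound".
EXPECTED: Dict[str, Tuple[int, Optional[int]]] = {
    "CONCA": (1, 1),     # exactly one peak
    "NCACX": (1, None),  # one or more peaks
    "NCACB": (0, 1),     # zero (e.g. glycine) or one peak
}

Cluster = Dict[str, Set[str]]  # experiment name -> set of peak identifiers


def merged(a: Cluster, b: Cluster) -> Cluster:
    """Union of two clusters, taken subset by subset."""
    return {exp: a.get(exp, set()) | b.get(exp, set()) for exp in set(a) | set(b)}


def can_merge(a: Cluster, b: Cluster) -> bool:
    """Merging is allowed only if no subset exceeds its maximum count."""
    for exp, peaks in merged(a, b).items():
        upper = EXPECTED[exp][1]
        if upper is not None and len(peaks) > upper:
            return False
    return True


def is_complete(cluster: Cluster) -> bool:
    """Complete means every subset reaches its minimum expected count."""
    return all(len(cluster.get(exp, set())) >= low for exp, (low, _) in EXPECTED.items())


a = {"NCACB": {"cb1"}, "NCACX": {"cx1"}}
b = {"NCACB": {"cb2"}}                     # a second, different NCACB peak
c = {"NCACX": {"cx2"}, "CONCA": {"conca1"}}
```

With these toy clusters, `a` and `b` cannot be merged because the result would hold two NCACB peaks, whereas `a` and `c` merge into a complete cluster.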
The object of A1.3 is, for a given peak cluster with a missing experiment (an
empty experiment subset s_i), to search for a peak, p_in, from experiment i
matching the cluster optimally. The energy of pairing is defined as the normal
pairing energy between the peak p_in and another peak, p_j ∈ s_j, j ≠ i, from
the peak cluster, namely the one sharing the most axes with p_in. The lines are
often broad in solid-state NMR spectra, and consequently the peaks overlap.
Therefore, for this step, clusters are not merged even if p_in is already
present in another cluster, so that a peak can be present in two or more spin
systems at the same time.
Typing energy for non-specifically assigned atoms
The definition of the typing energy for non-specifically assigned atoms (CX),
T_N = T_N0 + T_NR, is visualized in Fig. 3 in the main text. Firstly, T_N0 is
calculated by looping through the resonances (see Fig. 3, solid arrows) for each
order in the SSY separately, finding the reference atoms, K, which explain each
resonance the best:
$$T_{N0} = \sum_j \min_k t_{jk}, \quad \text{and} \quad K = \bigcup_j \operatorname*{arg\,min}_k t_{jk}$$
where t_jk is defined above. Furthermore, an additional contribution
(T'_j = ln(0.3)) is also added to T_N0 if the same reference atom is assigned to
different resonances in the SSY. Secondly, the contribution T_NR is calculated
by looping through the remaining expected atoms, k ∉ K, i.e. expected atoms with
no assignments (see Fig. 3, dotted arrows):
$$T_{NR} = \sum_{k \notin K} \min_j t_{jk}$$
For this calculation, each atom is tested to find the best matching resonance
in the SSY. Again, as for the specific atoms, if a matching resonance is
not found (see Fig. 3 main text, dotted arrow marked with a ‘$’), a penalty value
t_jk = e_miss is used to account for a missing peak (increasingly smaller values
were used for e_miss for carbons further down the side chain). Finally, the two
contributions are summed to define the total typing energy for the
non-specifically assigned resonances, T_N = T_N0 + T_NR. Note that finding the
optimal pairing of assignments between members of the two sets is a QAP itself.
Our solution to the problem is, however, much faster than a full QAP
optimization and often finds the optimal solution.
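The two-pass evaluation of the typing energy can be sketched as follows. This is a minimal illustration under stated assumptions, not the GAMES_ASSIGN code: the pairing energies are passed in as one dictionary per resonance, an atom that no resonance can match is represented by a missing entry, the extra term for a reused reference atom is supplied by the caller as `dup_penalty` and is added once for every resonance that reuses an already chosen atom, and all numerical values are invented for the example.

```python
import math
from typing import Dict, List, Set, Tuple


def typing_energy(
    t: List[Dict[str, float]],
    e_miss: Dict[str, float],
    dup_penalty: float,
) -> Tuple[float, Set[str]]:
    """Return (T_N0 + T_NR, K) for one residue order of an SSY.

    t[j][k]   : pairing energy of resonance j with reference atom k
    e_miss[k] : penalty for expected atom k when no resonance matches it
                (its keys define the set of expected atoms)
    """
    # Pass 1: each resonance picks the reference atom that explains it best.
    best_atoms = []
    t_n0 = 0.0
    for row in t:
        k_best = min(row, key=row.get)
        t_n0 += row[k_best]
        best_atoms.append(k_best)
    # Assumed convention: one extra term per resonance reusing a chosen atom.
    t_n0 += dup_penalty * (len(best_atoms) - len(set(best_atoms)))
    chosen = set(best_atoms)

    # Pass 2: each remaining expected atom picks its best matching resonance.
    t_nr = 0.0
    for k, penalty in e_miss.items():
        if k in chosen:
            continue
        best = min((row.get(k, math.inf) for row in t), default=math.inf)
        t_nr += best if math.isfinite(best) else penalty
    return t_n0 + t_nr, chosen


# Invented example: two resonances, three expected side chain carbons.
t_example = [{"CB": 0.2, "CG": 1.5}, {"CB": 0.4, "CG": 0.9}]
e_miss_example = {"CB": 3.0, "CG": 2.0, "CD": 1.0}
energy, K = typing_energy(t_example, e_miss_example, dup_penalty=0.5)
```

Both passes are simple greedy minimizations, which is why this kind of evaluation is cheap compared with searching over all pairings of resonances and atoms.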
Algorithm 3.2: Missing atom assignments
The next step identifies the still non-assigned resonances in the protein by
looking for missing atom resonances in each of the SSYs. For each missing atom
assignment, the possible pairings with peaks are evaluated and subjected to the
MENOVAR algorithm. The possibility of assigning the atom resonance as missing
in the data is included with a probability, which is higher the farther the atom
is from the backbone. The probabilities used here are p_miss = 0.001, 0.01, 0.1,
0.5, 0.7, 0.9 for α, β, γ, δ, ε, and ζ carbons, respectively, and 0.001 for
backbone atoms. If the candidate matching peak has already been assigned in
phase II, the probabilities for pairing are derived from the histogram
statistics as above. Otherwise, the energy related to the pairing is defined as
the sum of the coordinate-database value matching energy contribution, t_jk, and
the resonance-coordinate match energy, E_c, as defined above in reference to an
atom entry in an assigned SSY (Eq. 2).
Algorithm 3.3: Assigning non-assigned peaks
In the third step, the opposite procedure of 3.2 is followed: peaks that were
still not assigned in the first two steps are matched with the already assigned
SSYs. The energy for pairing is the resonance-coordinate match energy, with one
exception. When the corresponding experiment has no specific atom related to a
certain dimension, n, of the peak (as is the case for the last dimension of e.g.
NCACX peaks), the contribution, e_n, from that dimension is defined as:
$$e_n = \min_k q_k(\delta_{peak})$$
where δ_peak is the chemical shift for the aligned axis of the peak and k loops
through all side chain carbons in the residue, and
$$q_k(\delta_{peak}) = \log(\sigma_k) + \frac{1}{2}\left(\frac{\delta_{peak} - \omega_k}{\sigma_k}\right)^2$$
if carbon atom k is already assigned in the SSY, where ω_k is the SSY resonance
assignment and σ_k is the estimated uncertainty for the experiment for the axis.
Alternatively, if the carbon atom is not assigned yet,
$$q_k(\delta_{peak}) = \log(s_k) + \frac{1}{2}\left(\frac{\delta_{peak} - d_k}{s_k}\right)^2$$
where d_k and s_k are the average and standard deviation, respectively, of the
database chemical shifts.
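As a minimal sketch of this definition, the contribution e_n can be evaluated by taking the smaller of the Gaussian-type scores over all side chain carbons, using the assigned shift and experimental uncertainty when the carbon is assigned and the database mean and standard deviation otherwise. The `Carbon` record, the function names, and the numerical shifts are assumptions for illustration only.

```python
import math
from typing import Dict, NamedTuple, Optional, Tuple


class Carbon(NamedTuple):
    omega: Optional[float]  # assigned shift in the SSY (ppm), None if unassigned
    sigma: float            # estimated experimental uncertainty for the axis (ppm)
    d: float                # database average shift (ppm)
    s: float                # database standard deviation (ppm)


def q(delta_peak: float, center: float, width: float) -> float:
    """log(width) + 0.5 * ((delta_peak - center) / width)**2"""
    return math.log(width) + 0.5 * ((delta_peak - center) / width) ** 2


def dimension_energy(delta_peak: float, carbons: Dict[str, Carbon]) -> Tuple[float, str]:
    """Return (e_n, name of the minimizing carbon) for an unspecific dimension."""
    scores = {}
    for name, c in carbons.items():
        if c.omega is not None:
            scores[name] = q(delta_peak, c.omega, c.sigma)  # already assigned
        else:
            scores[name] = q(delta_peak, c.d, c.s)          # database statistics
    best = min(scores, key=scores.get)
    return scores[best], best


# Invented example values for two side chain carbons of one residue.
carbons = {
    "CB": Carbon(omega=30.0, sigma=0.3, d=31.0, s=1.5),
    "CG": Carbon(omega=None, sigma=0.3, d=27.0, s=1.2),
}
e_n, atom = dimension_energy(30.1, carbons)
```

In this toy case the peak at 30.1 ppm is attributed to the already assigned CB, since its narrow experimental width gives a much lower score than the broad database distribution of CG centered 3.1 ppm away.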
Quality assessment of resonance assignments
A resonance assignment is a collection of sequence-specifically assigned spin
systems (SSYs). Each such assigned SSY represents a collection of assigned atoms.
The assignments of the atoms are derived from the assigned peaks in the SSY. At
the end of the algorithm, we merge the values from neighboring SSYs into the
first SSY, i.e. for an SSY assigned to residue i, SSY(i), we merge values for
SSY(i+1) with residue order -1 onto SSY(i) with residue order 0 and, similarly,
SSY(i-1) with residue order 0 onto SSY(i) with residue order -1. GAMES_ASSIGN
produces an ensemble of solutions for each individual assignment (20 solutions
here). Based on this ensemble, we create one single merged SSY (mSSY) for each
residue position by including all peak assignments from the 20 individual SSYs
(including neighbor assignments as described above). If a peak is present
multiple times in the mSSY, the corresponding chemical shift value is repeated
as many times when calculating the median chemical shift and the standard
deviation. Based on this mSSY, we define a residue-based quality parameter,
Q_res, from all assigned values:
$$Q_{res}(i) = 0.5\left(6.5 - R(i) - \chi(i) + \sqrt{S(i)}\right) - \ln\left(1.02 - f_{mult}(i)\right) + 0.2\,T(i)$$
where T(i) and S(i) are the SSY typing and SSY quality energies as defined above
and in the main text Eqs. 3, 4 and Eq. 7, respectively. R(i) is the rank of the
mSSY, defined as the number of assigned backbone atoms (i.e. maximum of 5) + 0.7
if Cβ is assigned and + 0.6 if the residue is Gly. χ(i) is the completeness of
the mSSY, defined as a Boolean value, which is 1 if all atoms are assigned
including all side chain atoms, and 0 if not. Finally, f_mult(i) is the fraction
of peaks in the mSSY that are also present in another mSSY for another residue
position. Smaller values of Q_res correspond to better validated assignments for
the mSSY, where values around 2.5 are typical.
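A short sketch shows how the terms of the residue-based quality parameter combine. The function names and the input values are illustrative assumptions; the extra 0.7 in the rank is applied for the side chain carbon named in the definition of R(i), read here as Cβ, and S(i) is assumed to be non-negative so that its square root is defined.

```python
import math


def rank(n_backbone: int, cb_assigned: bool, is_gly: bool) -> float:
    """Rank R(i): assigned backbone atoms (at most 5) plus the two bonuses."""
    r = float(min(n_backbone, 5))
    if cb_assigned:
        r += 0.7  # assumption: the bonus carbon is C-beta
    if is_gly:
        r += 0.6
    return r


def q_res(r: float, chi: int, s: float, f_mult: float, t: float) -> float:
    """Q_res = 0.5(6.5 - R - chi + sqrt(S)) - ln(1.02 - f_mult) + 0.2 T."""
    return 0.5 * (6.5 - r - chi + math.sqrt(s)) - math.log(1.02 - f_mult) + 0.2 * t


# Invented inputs: fully assigned non-glycine residue, no shared peaks.
r_i = rank(n_backbone=5, cb_assigned=True, is_gly=False)
value = q_res(r_i, chi=1, s=4.0, f_mult=0.0, t=1.0)
# The same residue with half of its peaks shared with another position.
value_shared = q_res(r_i, chi=1, s=4.0, f_mult=0.5, t=1.0)
```

The offset 1.02 keeps the logarithm finite even when every peak is shared (f_mult = 1), and sharing peaks with another residue position raises Q_res, i.e. lowers the confidence.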
Furthermore, each assigned atom, m, in residue i has a set of chemical shift
values, δ_merge(m,i), from the corresponding mSSY, where there will typically be
ca. 200 values. This set is used to define an atom-based estimated precision,
$$q_{atom}(m,i) = s\left(\delta_{merge}(m,i)\right) \cdot 200/N$$
where s denotes the sample standard deviation from the mean in the combined set
and N denotes the number of assignments in the set. A phenomenological quality
assessment parameter for a specific atom is calculated as the inverse of the
product of the clipped values of 0.1 Q_res(i) and q_atom(m,i), scaled by 7.5:
$$Q_a = \frac{7.5}{\kappa\left(q_{atom}(m,i)\right) \cdot \kappa\left(0.1\,Q_{res}(i)\right)}, \quad \text{where } \kappa(x) = \begin{cases} x & \text{if } 0.1 < x < 1.0 \\ 0.1 & \text{if } x \le 0.1 \\ 1.0 & \text{if } x \ge 1.0 \end{cases}$$
Assignments with Q_a > 80 are considered validated; otherwise, the assignment
is deemed tentative.
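The atom-based parameter can be sketched in a few lines. The function names and the example shift sets are assumptions for illustration; the sketch uses the sample standard deviation from Python's `statistics` module, which requires at least two values.

```python
import statistics
from typing import Sequence


def kappa(x: float) -> float:
    """Clip x to the interval [0.1, 1.0]."""
    return min(1.0, max(0.1, x))


def q_atom(shifts: Sequence[float], n_ref: int = 200) -> float:
    """Sample standard deviation of the merged shifts, scaled by n_ref / N."""
    return statistics.stdev(shifts) * n_ref / len(shifts)


def q_a(shifts: Sequence[float], q_res_value: float) -> float:
    """Q_a = 7.5 / (kappa(q_atom) * kappa(0.1 * Q_res))."""
    return 7.5 / (kappa(q_atom(shifts)) * kappa(0.1 * q_res_value))


def is_validated(shifts: Sequence[float], q_res_value: float) -> bool:
    return q_a(shifts, q_res_value) > 80.0


# Invented example: 200 merged values scattered by about 0.1 ppm, Q_res = 2.5.
tight = [55.0] * 100 + [55.2] * 100
# Invented counter-example: only two, widely different values and a poor Q_res.
loose = [50.0, 60.0]
```

Because both factors are clipped to [0.1, 1.0], Q_a is bounded between 7.5 and 750, so the threshold of 80 requires that the product of the two clipped factors stays below about 0.094.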
Deconvolution of side chain resonance chemical shift sets
Algorithm 1.4 in the main text describes how a group of aligned peaks associated
into clusters is interpreted as dipeptide spin systems (SSYs). Since many
ssNMR experiments have a dimension with an unspecific atom type, the peak
group will contain a merged set of subsets of chemical shifts for different side
chain carbon atoms. The algorithm used here aims to deconvolute this merged
set and divide the values into disjoint subsets corresponding to the individual
side chain atoms. The algorithm is initialized by sorting all chemical shift
values numerically. The values are then looped through in order of increasing
chemical shift. For each new value, δ_i, it is considered whether to add this
value to the current group (side chain carbon), if it is close enough to the
minimum value in the current group, δ_0, or to initialize a new group with
starting value δ_0 = δ_i. This is accomplished by evaluating a test probability:
$$p = \exp\left(-\frac{(\delta_0 - \delta_i)^2}{2\sigma^2}\right), \quad \sigma = \frac{\lambda}{\sqrt{1/N_{group} + 1/2}}$$
where N_group is the number of chemical shifts in the current group and λ is the
estimated uncertainty of the peak positions. A random number between 0 and 1
is drawn, and if this number is less than p, the chemical shift is added to the
current group; otherwise a new group is initialized.
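The grouping loop can be sketched as below. This is an illustration under assumptions rather than the GAMES_ASSIGN code: the function name and inputs are hypothetical, and a value is taken to join the current group with probability p (that is, when the drawn random number falls below p), which is the reading under which shifts close to the group minimum are the ones that get merged.

```python
import math
import random
from typing import List, Optional, Sequence


def deconvolute(
    shifts: Sequence[float],
    lam: float,
    rng: Optional[random.Random] = None,
) -> List[List[float]]:
    """Split sorted chemical shifts into groups, one per side chain carbon.

    lam is the estimated uncertainty of the peak positions (ppm).
    """
    rng = rng or random.Random()
    groups: List[List[float]] = []
    for delta_i in sorted(shifts):
        if groups:
            current = groups[-1]
            delta_0 = current[0]  # minimum of the group, since input is sorted
            sigma = lam / math.sqrt(1.0 / len(current) + 0.5)
            p = math.exp(-((delta_0 - delta_i) ** 2) / (2.0 * sigma ** 2))
            if rng.random() < p:  # assumed rule: accept with probability p
                current.append(delta_i)
                continue
        groups.append([delta_i])  # start a new group with delta_0 = delta_i
    return groups


# Invented example: two well-separated carbons observed in several peaks.
example = deconvolute([60.0, 30.0, 30.0, 60.0, 30.0], lam=0.3, rng=random.Random(0))
```

For this input the outcome does not depend on the random draws, because p is exactly 1 for identical shifts and underflows to 0 for a 30 ppm gap; for partially overlapping shifts the grouping is stochastic and can differ between runs.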
Figure Legends
Figure S1:
Example of peak-picked data showing spectra used for the assignments; picked
peaks are marked with black crosses. a Aliphatic region of the DARR spectrum of
GB1 acquired with a mixing time of 20 ms. b A representative plane with fixed
15N chemical shift of NCACX for Ubiquitin.
Figure S2: Visualization of the simulated spectra with increasing line width for
GB1. An excerpt of the 13C-13C 2D DARR spectrum for GB1 is shown. a, b, c, d and e show
the simulated spectrum for Gaussian line widths in both dimensions of σ = 0.3, 0.4, 0.5,
0.7 and 1.0 ppm (Eq. 13), respectively, corresponding to FWHM = 2.355σ. Green and
black dots denote picked peaks due to merged signals; black dots highlight significantly
shifted signals. Red and magenta dots denote non-overlapped peaks, where shifted peak
positions are highlighted in magenta. All other simulation parameters are fixed and set
to default values (Nielsen et al. 2014). The experimental spectrum is shown in f for
comparison.