Supplemental Information
Improved Peak Detection and Deconvolution of Native Electrospray Mass Spectra
from Large Protein Complexes
Jonathan Lu†, Michael J. Trnka†, Soung-Hun Roh, Philip J. J. Robinson, Carrie Shiau,
Danica Galonic Fujimori, Wah Chiu, Alma L. Burlingame, and Shenheng Guan*
Figure S1. Schematic of Software Workflow.
[Figure S2 image labels: panels A, B, C; species 12mer and 10mer; RNA pol II 10mer (∆rpb4/7): measured 470,170 Da, theoretical 469,682 Da; RNA pol II 12mer: measured 514,728 Da, theoretical 514,154 Da; panel C: +2% DMSO.]
Figure S2. A) Native MS of RNA pol II in the absence of α-amanitin, showing charge
distributions for the decameric and dodecameric pol II complexes centered on the 49+ and
50+ charge states. B) The deconvoluted mass values match the expected values to within
about 0.1%. C) Addition of 2% DMSO to the sample adds ~15 protons, widens the
distribution of charges, and partially dissociates Rpb4 and Rpb7 from the
assembly.
Figure S3. Performance of different peak detection programs on spectrum of H3K9-Me1
mononucleosome analog. Top panel: the Find Peak Lists program in the Massign [1]
suite was applied with threshold of 6% of the maximum, a reduce factor of 1, and the
“rise stepwise” option enabled from 5% to 95% with stepsize of 10%. Massign detects
shoulder peaks at either lower or higher m/z, but not both, for the main charge
state. Second panel: Magtran [2] detects only the major charge state species regardless
of parameters. Third panel: Thermo Scientific Protein Deconvolution software, a MaxEnt
based method, detects the major species as well as the higher mass shoulder, but
misses the lower mass shoulder. Lower panel: PeakSeeker identifies three overlapping
species for each of the major charge states using the 2nd derivative processing method.
Figure S4. Performance of different deconvolution programs on spectrum of RNA Pol-II.
A) PeakSeeker’s output. PeakSeeker finds 8 different mass species corresponding to
decameric and dodecameric pol II complexes with up to 4 molecules of amanitin bound.
Massign performed similarly to PeakSeeker. B) Protein Deconvolution’s output. Protein
Deconvolution finds the primary mass species, but misses several less abundant
charge state envelopes. Magtran (not shown) produced no output.
Figure S5. Performance of different deconvolution programs on spectrum of human
TRiC. A) PeakSeeker’s output. PeakSeeker finds the major mass species of 948 kDa.
Massign was able to find the same mass as PeakSeeker due to its intensity modeling.
B) The same spectrum as annotated by Protein Deconvolution, showing the two most
abundant of ten species annotated in the mass range of the TRiC complex. C) Zero-charge
spectrum produced by Protein Deconvolution. Magtran (not shown) produced no
output.
Detailed Description of the Algorithm.
1: User-defined parameters
Options and parameters are defined in a single text file. Options include: choice of
peak detection method, option to display the detected peaks, manual or automatic
processing modes, choice of smoothing and background subtracting, and display
and output saving options. Parameters include: the signal-to-noise threshold, a
general charge range (default 1-100) and mass range (default 1-1000000 Da) for
the complex, a mass tolerance value (default 5 m/z), and peak detection values.
The user may also choose the maximum number of charge envelopes searched for
during each iteration (default 5) and adjust the parameters of the scoring
function.
2: Spectrum Preparation
2.A) Preprocessing
First, the mass spectrum is averaged across several scans in order to reduce noise.
The averaged spectrum is exported to a tab-delimited file. After PeakSeeker reads in
the spectrum, the program can optionally perform background subtraction or use a
noise filter. The background at each point is estimated as the minimum intensity of a
window centered at that point, with size defined by the user. This background signal
is then smoothed with a moving average, again with size defined by the user.
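The background estimate described above can be sketched in a few lines of Python. This is a minimal illustration rather than PeakSeeker's own code: the function names, the default window sizes (counted in data points and taken to be centered), and the clipping of negative results to zero are all assumptions.

```python
def estimate_background(intensities, min_window, smooth_window):
    """Sliding-window minimum followed by a moving average (hypothetical helper)."""
    n = len(intensities)
    half_min = min_window // 2
    # Background at each point: minimum intensity in a window centered on it.
    minima = [
        min(intensities[max(0, i - half_min): i + half_min + 1])
        for i in range(n)
    ]
    half_avg = smooth_window // 2
    background = []
    for i in range(n):
        # Smooth the raw minimum trace with a centered moving average.
        window = minima[max(0, i - half_avg): i + half_avg + 1]
        background.append(sum(window) / len(window))
    return background


def subtract_background(intensities, min_window=5, smooth_window=3):
    """Subtract the estimated background; clipping at zero is an assumption."""
    background = estimate_background(intensities, min_window, smooth_window)
    return [max(0.0, y - b) for y, b in zip(intensities, background)]


spectrum = [10.0, 10.0, 10.0, 50.0, 10.0, 10.0, 10.0]
corrected = subtract_background(spectrum, min_window=3, smooth_window=3)
```

In this toy spectrum the flat baseline of 10 is removed and only the peak remains.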
2.B) Filtering/Smoothing
The user can choose to use either a moving average or a Savitzky-Golay filter [3].
The moving average replaces each intensity point with the average of all of the
intensities within a user-defined window. The Savitzky-Golay filter, on the other
hand, fits a polynomial of a given order to the points within a window, and replaces
the data point at the center of that window with the intensity determined by that
polynomial. The Savitzky-Golay filter is applied as many times as the user prefers. It
has the advantage of preserving peak shape while reducing the high-frequency
noise in the spectrum.
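The two filters can be compared with a small sketch. To keep it dependency-free, the Savitzky-Golay example is restricted to one fixed case, a 5-point window with a quadratic polynomial, for which the convolution weights are (-3, 12, 17, 12, -3)/35. The function names and the choice to leave the two points at each edge unchanged are assumptions made for illustration; PeakSeeker lets the user choose the window and polynomial order.

```python
# Weights for a 5-point quadratic Savitzky-Golay smoother (sum = 35).
SG5_QUADRATIC = (-3.0, 12.0, 17.0, 12.0, -3.0)


def moving_average(y, window):
    """Replace each point with the mean of a centered window (truncated at edges)."""
    half = window // 2
    out = []
    for i in range(len(y)):
        chunk = y[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def savitzky_golay_5pt(y, passes=1):
    """Apply the 5-point quadratic filter `passes` times; edge points are kept as-is."""
    y = list(y)
    for _ in range(passes):
        smoothed = list(y)
        for i in range(2, len(y) - 2):
            acc = sum(c * y[i + k - 2] for k, c in enumerate(SG5_QUADRATIC))
            smoothed[i] = acc / 35.0
        y = smoothed
    return y


parabola = [float(x * x) for x in range(7)]
sg_result = savitzky_golay_5pt(parabola, passes=2)
ma_result = moving_average([0.0, 3.0, 0.0], window=3)
```

Because the filter fits a quadratic, it reproduces the parabola exactly, whereas the moving average flattens the apex of the three-point peak from 3.0 to 1.0. This is the shape-preserving behavior referred to in the text.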
3: Peak Detection
Peaks are detected using one of the following three methods. Method 1 simply
detects local maxima above a fixed signal-to-noise ratio threshold. This method is
used together with the filters and background subtraction described above.
Method 2, adapted from the Massign [1] peak detection algorithm, adjusts this
threshold to a factor of the intensity of the most recently detected peak to allow for
detection of “shoulder” peaks.
Method 3 uses a Continuous Wavelet Transform to address the problem of noisy
spectra. The Continuous Wavelet Transform convolves the spectrum with a Mexican
Hat wavelet across a range of likely peak widths. Narrow and high-frequency noise
is filtered out while true peaks register above a signal-to-noise ratio across multiple
widths [4]. The user can define the width range, signal-to-noise threshold, mass
tolerance (for peaks to be considered the same), and minimum length of the ridgeline.
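A simplified sketch of the wavelet approach follows. It convolves the spectrum with a Mexican Hat (Ricker) wavelet at several widths and keeps a position only if a local maximum recurs, within a tolerance, at a minimum number of widths. This counting rule is a stand-in for the full ridgeline tracking of ref. [4], and the names `cwt_peaks`, `min_coeff` (used here in place of a signal-to-noise threshold), and the truncation of the kernel at five widths are assumptions. Widths and tolerance are in data points.

```python
import math


def ricker(t, a):
    """Mexican Hat wavelet of width a evaluated at offset t."""
    norm = 2.0 / (math.sqrt(3.0 * a) * math.pi ** 0.25)
    x = t / a
    return norm * (1.0 - x * x) * math.exp(-0.5 * x * x)


def cwt_row(signal, a):
    """Wavelet coefficients at one width (kernel truncated at +/- 5a points)."""
    half = int(round(5 * a))
    kernel = [(k, ricker(k, a)) for k in range(-half, half + 1)]
    n = len(signal)
    row = []
    for i in range(n):
        acc = 0.0
        for k, w in kernel:
            j = i + k
            if 0 <= j < n:
                acc += w * signal[j]
        row.append(acc)
    return row


def cwt_peaks(signal, widths, tolerance, min_ridge_length, min_coeff):
    """Keep maxima of the narrowest width that recur at enough other widths."""
    maxima_per_width = []
    for a in widths:
        row = cwt_row(signal, a)
        maxima_per_width.append([
            i for i in range(1, len(row) - 1)
            if row[i] >= min_coeff and row[i] > row[i - 1] and row[i] >= row[i + 1]
        ])
    peaks = []
    for i in maxima_per_width[0]:
        length = sum(
            1 for maxima in maxima_per_width
            if any(abs(i - j) <= tolerance for j in maxima)
        )
        if length >= min_ridge_length:
            peaks.append(i)
    return peaks


# One Gaussian peak at index 50 plus alternating high-frequency noise.
signal = [math.exp(-((i - 50) ** 2) / 32.0) + 0.02 * (-1) ** i for i in range(100)]
found = cwt_peaks(signal, widths=[2, 3, 4, 5], tolerance=1,
                  min_ridge_length=3, min_coeff=0.3)
```

The alternating noise produces many local maxima in the raw signal, but it is much narrower than any wavelet width used, so only the true peak survives across widths.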
Each of these methods returns a list of peak m/z locations. Each peak’s range is
determined as existing between two local minima. This facilitates fitting of Gaussians
to capture the peak parameters.
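As an illustration of Method 1 and of the peak-range definition, the sketch below finds local maxima above a signal-to-noise threshold and then walks downhill from each apex to the neighboring local minima. The function names and the externally supplied `noise_level` are assumptions; how the noise level is estimated is not specified here.

```python
def find_local_maxima(intensities, noise_level, snr_threshold):
    """Indices of local maxima whose intensity / noise_level meets the threshold."""
    peaks = []
    for i in range(1, len(intensities) - 1):
        y = intensities[i]
        is_max = y > intensities[i - 1] and y >= intensities[i + 1]
        if is_max and y / noise_level >= snr_threshold:
            peaks.append(i)
    return peaks


def peak_range(intensities, peak_index):
    """Walk downhill on both sides of the apex to the nearest local minima."""
    left = peak_index
    while left > 0 and intensities[left - 1] < intensities[left]:
        left -= 1
    right = peak_index
    last = len(intensities) - 1
    while right < last and intensities[right + 1] < intensities[right]:
        right += 1
    return left, right


trace = [1.0, 2.0, 8.0, 3.0, 2.0, 6.0, 20.0, 7.0, 1.0, 1.5]
apexes = find_local_maxima(trace, noise_level=1.0, snr_threshold=5.0)
ranges = [peak_range(trace, i) for i in apexes]
```

Here the two peaks share the local minimum at index 4, which is the situation the overlap detection step treats as a single peak range.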
4: Peak Overlap Detection
To detect peak overlap, we take the second derivative across a peak range, defined
as the set of consecutive peaks which share local minima. This therefore accounts
for the overlap between adjacent peaks. The second derivative of a Gaussian
shaped peak consists of two zero-crossings surrounding a central minimum. Hence,
this feature can be used to derive the number of underlying peaks in the range. We
further define a threshold for the second derivative, so that minor shape anomalies
do not register. This threshold is defined as a factor of the median of negative
second derivative points.
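The idea can be demonstrated on a synthetic peak with a shoulder that is not a local maximum. The sketch below uses a three-point finite difference for the second derivative and counts its local minima that fall below a threshold set to a factor of the median of the negative second-derivative values. The function names, the finite-difference scheme, and the factor of 0.5 are assumptions chosen for this example.

```python
import math
import statistics


def second_derivative(y):
    """Three-point finite difference on evenly spaced points; endpoints set to 0."""
    d2 = [0.0] * len(y)
    for i in range(1, len(y) - 1):
        d2[i] = y[i - 1] - 2.0 * y[i] + y[i + 1]
    return d2


def find_underlying_peaks(y, factor):
    """Indices of second-derivative minima deeper than factor * median(negatives)."""
    d2 = second_derivative(y)
    negatives = [v for v in d2 if v < 0.0]
    if not negatives:
        return []
    threshold = factor * statistics.median(negatives)  # a negative number
    return [
        i for i in range(1, len(d2) - 1)
        if d2[i] < threshold and d2[i] < d2[i - 1] and d2[i] <= d2[i + 1]
    ]


# Main peak at index 40 with a half-height shoulder centered at index 50.
y = [
    math.exp(-((i - 40) ** 2) / 32.0) + 0.5 * math.exp(-((i - 50) ** 2) / 32.0)
    for i in range(100)
]
local_maxima = [i for i in range(1, 99) if y[i] > y[i - 1] and y[i] >= y[i + 1]]
underlying = find_underlying_peaks(y, factor=0.5)
```

The raw trace has a single local maximum, while the second derivative reveals two underlying components. The shoulder's minimum lands about one point away from its true center because the tail of the main peak distorts it, which is one reason the positions are only used as starting guesses for the subsequent fit.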
PeakSeeker then uses a Levenberg-Marquardt algorithm to fit Gaussian functions to
the peak range. The locations of the second derivative minima, the heights in the
peak range at those minima, and the full width at half maximum of the peak range
divided by the number of peaks serve as starting guesses for the centers, heights,
and widths of the Gaussians, respectively. This method improves over standard
peak detection methods by finding peaks that are not local maxima.
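The starting guesses described above can be assembled as follows. This sketch covers only the initial parameters, not the Levenberg-Marquardt fit itself, and the function name and the tuple layout (center, height, width) are assumptions.

```python
def initial_guesses(mz, intensity, minima_indices):
    """Starting (center, height, width) for each Gaussian in one peak range.

    minima_indices are the positions of the second-derivative minima; the
    width guess is the FWHM of the whole range divided by the number of peaks.
    """
    half_max = max(intensity) / 2.0
    above = [i for i, y in enumerate(intensity) if y >= half_max]
    fwhm = mz[above[-1]] - mz[above[0]]
    width = fwhm / len(minima_indices)
    return [(mz[i], intensity[i], width) for i in minima_indices]


mz_axis = [100.0 + i for i in range(9)]
heights = [0.0, 1.0, 4.0, 8.0, 10.0, 8.0, 4.0, 1.0, 0.0]
guesses = initial_guesses(mz_axis, heights, minima_indices=[4])
```

Note that the FWHM here is measured between sampled points at or above half maximum, without interpolation, which is adequate for a starting guess.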
Fitting becomes difficult if too many adjacent peaks are detected. We therefore divide
the peak range into sections of 6 peaks, with adjacent sections sharing one peak. In
each section, a half Gaussian is fit to the shared peak. We then average the parameters
of the two half-Gaussians to obtain the parameters of the shared peak.
Finally, an optional filter selects peaks based on their widths (FWHM).
5: Charge State Assignment
The centroids of every peak are calculated as a weighted average of the data points
in their range. Charge states are then assigned as follows:
i) Starting with the most intense peak in the spectrum in Automatic mode, or with
a user-selected peak in Manual mode, assign each possible charge state to the
peak and calculate all possible masses lying within the range specified in the
parameters.
ii) For each of the masses, calculate the theoretical m/z values of the other charge
states.
iii) Search for peaks at these m/z values by checking if their centroids fall within
the user-defined mass tolerance. Create a set of peaks matching each
molecular mass.
iv) Sort the masses by the number of matching peaks. Remove peak lists that contain
fewer than the number of peaks defined in the parameters file (at least 3).
v) Simulate charge envelopes for the 5 masses with the most matching peaks.
Each envelope is Gaussian-distributed over the charge domain because the
charging event is a random process. The Gaussian’s height, center, and width
are derived via a least-squares fit to the peak heights across the charge
domain. If multiple peaks were found for a theoretical m/z value, average
those peak heights. The starting guess for the center and height of the
envelope Gaussian is derived from the starting peak, and the starting guess for
the width is user-defined. If the Gaussian cannot be fit, go to the next mass.
vi) Simulate the individual peaks comprising the charge envelope in the m/z
domain. Each of these individual peaks is a Gaussian function whose height is
derived from the Gaussian describing the overall envelope (step v) and is
centered on the previously calculated m/z values. The width of each peak is
proportional to the width of the starting peak and inversely proportional to its
charge state.
vii) Score the deviation between each of the individual peaks in the simulated
charge envelope and the real signal:
$$S = a \times \frac{|mz - mz'|}{mz} + b \times \frac{|h - h'|}{h}$$
S is the score of the simulated peak’s fit to the actual peak, mz is the center of
the simulated peak, mz’ is the center of the real peak, h is the height of the
simulated peak, h’ is the height of the real peak, and a and b are user-defined
weights. In other words, the score of each peak is a linear combination of the
relative m/z error and the relative height error. If there are multiple peaks
corresponding to a simulated peak, the peak with the lowest score is chosen. The
weights can be adapted. For a spectrum with many overlapping peaks, one might prefer
relative height error over relative mass error, as the centroids of the overlapping
peaks may be closely spaced. For a spectrum with many peaks from different
components, one might prefer relative mass error over relative height error, as
unrelated peaks may still form Gaussian-shaped envelopes. The score of the
entire charge envelope is given by:
$$S_{env} = \frac{1}{n} \sum_{i=1}^{n} S_i \times h_i$$
S_env is the score of the charge envelope as a whole, n is the total number of
peaks, S_i is the score of peak i, and h_i is the height of peak i. In other words,
the score of the charge envelope is the average of each peak’s score, weighted
by each peak’s height. The peaks at the edges of the charge envelope have
more height error due to their small intensities; scaling their scores by their
height ensures that they do not override the whole envelope’s score.
viii) Choose the lowest-scoring charge state assignment in Automatic mode. In Manual
mode, present the user with a choice of the 5 best-scoring charge state assignments.
ix) Mark off the peaks that have been simulated, but do not subtract them out.
Then, for the most intense peak that has not yet been simulated, repeat the
search. Search for charge envelopes up to the limit defined by the user (default
is 5) or until no peaks remain. With the exception of the initial peak, each of
these charge envelopes is allowed to contain peaks that belong to another
charge envelope in order to account for overlap.
x) Fit a linear combination of charge envelopes to the spectrum. In this way, the
signals of the peaks in overlapping charge envelopes are apportioned among
the charge envelopes. At the same time, if too many envelopes are fitted
simultaneously the smaller envelopes may be overestimated.
xi) PeakSeeker returns a display with the original spectrum, the simulated charge
envelopes, and the sum of the simulated charge envelopes overlaid with the
original spectrum. It also returns a spectrum with the charge envelopes
subtracted out. Using these displays, the user can determine the goodness of
fit of the calculated envelopes, and whether there still remain undeconvoluted
masses. The subtracted spectrum is written to another file for further
deconvolution.
xii) Optionally repeat this cycle for another 5 envelopes.
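The core of steps i) to iv) and the scoring in step vii) can be sketched as below. This is an illustration under stated assumptions, not PeakSeeker's source: every charge is assumed to be one added proton (1.007276 Da), the function names are hypothetical, and absolute values are taken for the relative m/z and height errors so that a lower score always means a better fit.

```python
PROTON_MASS = 1.007276  # Da; assumes each charge is one added proton


def centroid(mz_values, intensities):
    """Intensity-weighted average m/z of the points in a peak range."""
    return sum(m * y for m, y in zip(mz_values, intensities)) / sum(intensities)


def theoretical_mz(mass, charge):
    return mass / charge + PROTON_MASS


def candidate_masses(peak_mz, charge_range=(1, 100), mass_range=(1.0, 1000000.0)):
    """Step i: every (mass, charge) that places the peak within the mass range."""
    candidates = []
    for z in range(charge_range[0], charge_range[1] + 1):
        mass = z * (peak_mz - PROTON_MASS)
        if mass_range[0] <= mass <= mass_range[1]:
            candidates.append((mass, z))
    return candidates


def matching_peaks(mass, centroids, charge_range=(1, 100), tolerance=5.0):
    """Steps ii-iii: map each charge to the centroids within tolerance (in m/z)."""
    matches = {}
    for z in range(charge_range[0], charge_range[1] + 1):
        target = theoretical_mz(mass, z)
        hits = [c for c in centroids if abs(c - target) <= tolerance]
        if hits:
            matches[z] = hits
    return matches


def peak_score(mz_sim, mz_real, h_sim, h_real, a=1.0, b=1.0):
    """Step vii: weighted sum of relative m/z error and relative height error."""
    return a * abs(mz_sim - mz_real) / mz_sim + b * abs(h_sim - h_real) / h_sim


def envelope_score(peak_scores, peak_heights):
    """Average of the per-peak scores, each weighted by its peak height."""
    n = len(peak_scores)
    return sum(s * h for s, h in zip(peak_scores, peak_heights)) / n


# Synthetic centroids for an 800 kDa complex observed at charge states 64-68.
centroids = [theoretical_mz(800000.0, z) for z in range(64, 69)]
start_peak = theoretical_mz(800000.0, 66)
ranked = sorted(
    ((len(matching_peaks(m, centroids, tolerance=2.0)), m, z)
     for m, z in candidate_masses(start_peak)),
    reverse=True,
)
# Step iv: discard candidates supported by fewer than 3 peaks.
kept = [(count, m, z) for count, m, z in ranked if count >= 3]
best_count, best_mass, best_charge = kept[0]
```

In this example the half-mass candidate (400 kDa at charge 33) also survives the 3-peak filter, because it explains every second peak. Ambiguities of this kind are what the envelope simulation and scoring in steps v) to viii) are meant to resolve.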
Table 1: Comparison of Algorithms for Deconvolution of Native Mass Spectra

| | MaxEnt | UniDec | ZSCORE | Massign | Automass | PeakSeeker |
|---|---|---|---|---|---|---|
| Deconvolution Type | Information Theory | Information Theory | Charge Assignment | Charge Assignment | Charge Assignment | Charge Assignment |
| Method | Maximum Entropy | Bayesian Deconvolution | Score charge states by total intensity | Simulate best-fit charge envelopes | Minimize standard deviation of mass | Simulate best-fit charge envelopes |
| Peak Detection | n/a | Intensity Threshold | Intensity Threshold | Fixed and Adjusted Threshold | Intensity Threshold | Intensity Threshold, or Fixed and Adjusted Threshold, or Continuous Wavelet Transform |
| Overlapping Peak Detection | n/a | None | None | Local Maxima | None | Local maxima and Second Derivative; least-squares fit to obtain underlying peak parameters |
| Peak Belonging to Multiple Envelopes | n/a | None | None | Apportion signal by “peeling” envelopes and refitting | Assign to one envelope by game theory | Apportion signal by least-squares fitting of combination of envelopes |
| Spectra of highly interleaved peaks? | Effective | Effective | Limited efficacy | Effective | Untested | Effective |
| Noisy Spectra? | Inconsistent Results | Untested | Limited efficacy | Effective | Untested | Effective |
| User Autonomy | Automated | Multiple Processing, Peak Detection and Deconvolution Options | Automated | Multiple Processing, Peak Detection and Deconvolution Options | Automated | Multiple Processing, Peak Detection and Deconvolution Options |
| Availability | Proprietary software | Free binary for Academic Use. | Proprietary software | Free binary for Academic Use. | Not distributed | Free and Open-Source |
| References | 6, 7 | 8 | 2 | 1 | 5, 9 | |
Supplemental References:
1. Morgner, N.; Robinson, C. V., Massign: An Assignment Strategy for Maximizing
Information from the Mass Spectra of Heterogeneous Protein Assemblies. Anal.
Chem. 2012, 84, 2939–2948. DOI: 10.1021/ac300056a.
2. Zhang, Z.; Marshall, A. G., A universal algorithm for fast and automated charge
state deconvolution of electrospray mass-to-charge ratio spectra. J. Am. Soc.
Mass Spectrom. 1998, 9, 225–233. DOI: 10.1016/S1044-0305(97)00284-5.
3. Savitzky, A.; Golay, M. J. E., Smoothing and Differentiation of Data by Simplified
Least Squares Procedures. Anal. Chem. 1964, 36, 1627–1639. DOI:
10.1021/ac60214a047.
4. Du, P.; Kibbe, W. A.; Lin, S. M., Improved peak detection in mass spectrum by
incorporating continuous wavelet transform-based pattern matching.
Bioinformatics 2006, 22, 2059–2065. DOI: 10.1093/bioinformatics/btl355.
5. Tseng, Y.-H.; Uetrecht, C.; Heck, A. J. R.; Peng, W.-P., Interpreting the Charge
State Assignment in Electrospray Mass Spectra of Bioparticles. Anal. Chem.
2011, 83, 1960–1968. DOI: 10.1021/ac102676z.
6. Ferrige, A. G.; Seddon, M. J.; Jarvis, S.; Skilling, J.; Aplin, R., Maximum entropy
deconvolution in electrospray mass spectrometry. Rapid Commun. Mass
Spectrom. 1991, 5, 374–377.
7. Ferrige, A. G.; Seddon, M. J.; Green, B. N.; Jarvis, S. A.; Skilling, J.; Staunton, J.,
Disentangling electrospray spectra with maximum entropy. Rapid Commun.
Mass Spectrom. 1992, 6, 707–711.
8. Marty, M. T.; Baldwin, A. J.; Marklund, E. G.; Hochberg, G. K. A.; Benesch, J. L. P.;
Robinson, C. V., Bayesian Deconvolution of Mass and Ion Mobility Spectra: From
Binary Interactions to Polydisperse Ensembles. Anal. Chem. 2015, 87, 4370–4376.
9. Tseng, Y.-H.; Uetrecht, C.; Yang, S.-C.; Barendregt, A.; Heck, A. J. R.; Peng, W.-P.,
Game-Theory-Based Search Engine to Automate the Mass Assignment in
Complex Native Electrospray Mass Spectra. Anal. Chem. 2013, 85, 11275–11283.