Supplemental Information

Improved Peak Detection and Deconvolution of Native Electrospray Mass Spectra from Large Protein Complexes

Jonathan Lu†, Michael J. Trnka†, Soung-Hun Roh, Philip J. J. Robinson, Carrie Shiau, Danica Galonic Fujimori, Wah Chiu, Alma L. Burlingame, and Shenheng Guan*

Figure S1. Schematic of Software Workflow.

[Figure S2 panel labels: RNA pol II 10mer (∆rpb4/7), measured 470,170 Da, theoretical 469,682 Da; RNA pol II 12mer, measured 514,728 Da, theoretical 514,154 Da; panel C, +2% DMSO.]

Figure S2. A) Native MS of RNA pol II in the absence of α-amanitin, showing charge distributions for decameric and dodecameric pol II complexes centered on the 49+ and 50+ charge states. B) The deconvoluted mass values match the expected values to within approximately 0.1%. C) Addition of 2% DMSO to the sample induces the addition of ~15 protons, a wider distribution of charges, and partial dissociation of Rpb4 and Rpb7 from the assembly.

Figure S3. Performance of different peak detection programs on the spectrum of an H3K9-Me1 mononucleosome analog. Top panel: the Find Peak Lists program in the Massign [1] suite was applied with a threshold of 6% of the maximum, a reduce factor of 1, and the “rise stepwise” option enabled from 5% to 95% with a step size of 10%. Massign detects shoulder peaks at either lower or higher m/z, but not both, for the main charge state. Second panel: Magtran [2] detects only the major charge state species regardless of parameters. Third panel: Thermo Scientific Protein Deconvolution software, a MaxEnt-based method, detects the major species as well as the higher mass shoulder, but misses the lower mass shoulder. Lower panel: PeakSeeker identifies three overlapping species for each of the major charge states using the second-derivative processing method.

Figure S4. Performance of different deconvolution programs on the spectrum of RNA pol II. A) PeakSeeker’s output.
PeakSeeker finds 8 different mass species corresponding to decameric and dodecameric pol II complexes with up to 4 molecules of α-amanitin bound. Massign performed similarly to PeakSeeker. B) Protein Deconvolution’s output. Protein Deconvolution finds the primary mass species, but misses several less abundant charge state envelopes. Magtran (not shown) produced no output.

Figure S5. Performance of different deconvolution programs on the spectrum of human TRiC. A) PeakSeeker’s output. PeakSeeker finds the major mass species of 948 kDa. Massign was able to find the same mass as PeakSeeker due to its intensity modeling. B) The same spectrum as annotated by Protein Deconvolution, showing the two most abundant of the ten species annotated in the mass range of the TRiC complex. C) Zero-charge spectrum produced by Protein Deconvolution. Magtran (not shown) produced no output.

Detailed Description of the Algorithm.

1: User-defined parameters

Options and parameters are defined in a single text file. Options include: the choice of peak detection method, whether to display the detected peaks, manual or automatic processing mode, the choice of smoothing and background subtraction, and display and output saving options. Parameters include: the signal-to-noise threshold, a general charge range (default 1-100) and mass range (default 1-1000000 Da) for the complex, a mass tolerance value (default 5 m/z), and peak detection values. The user may also choose the maximum number of charge envelopes searched for during each iteration (default 5) and may adjust the parameters of the scoring function.

2: Spectrum Preparation

2.A) Preprocessing

First, the mass spectrum is averaged across several scans in order to reduce noise. The averaged spectrum is exported to a tab-delimited file. After PeakSeeker reads in the spectrum, the program can optionally perform background subtraction or apply a noise filter.
The background at each point is estimated as the minimum intensity within a window centered at that point, with the window size defined by the user. This background signal is then smoothed with a moving average, again with a user-defined window size.

2.B) Filtering/Smoothing

The user can choose either a moving average or a Savitzky-Golay filter [3]. The moving average replaces each intensity point with the average of all of the intensities within a user-defined window. The Savitzky-Golay filter, on the other hand, fits a polynomial of a given order to the points within a window and replaces the data point at the center of that window with the intensity given by that polynomial. The Savitzky-Golay filter can be applied as many times as the user prefers. It has the advantage of preserving peak shape while reducing the high-frequency noise in the spectrum.

3: Peak Detection

Peaks are detected using one of the following three methods.

Method 1 simply detects local maxima above a fixed signal-to-noise ratio threshold. This method is intended for use together with the filters and background subtraction described above.

Method 2, adapted from the Massign [1] peak detection algorithm, adjusts this threshold to a factor of the intensity of the most recently detected peak to allow for detection of “shoulder” peaks.

Method 3 uses a Continuous Wavelet Transform to address the problem of noisy spectra. The Continuous Wavelet Transform convolves the spectrum with a Mexican Hat wavelet across a range of likely peak widths. Narrow, high-frequency noise is filtered out, while true peaks register above the signal-to-noise threshold across multiple widths [4]. The user can define the width range, the signal-to-noise threshold, the mass tolerance (for peaks to be considered the same), and the minimum length of the ridgeline.

Each of these methods returns a list of peak m/z locations. Each peak’s range is defined as the interval between the two local minima that flank it. This facilitates fitting of Gaussians to capture the peak parameters.
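As an illustration of the background estimation in 2.A and of peak detection Method 1, the following is a minimal Python sketch, not PeakSeeker's actual code. The function names, the window sizes, the noise level, and the synthetic spectrum are all assumptions made for this example, as is the choice to truncate windows at the spectrum edges and to clip negative intensities to zero.

```python
import math

def rolling_min(values, window):
    """Minimum within a centered window at each point (window truncated at the edges)."""
    half = window // 2
    n = len(values)
    return [min(values[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]

def moving_average(values, window):
    """Mean within a centered window at each point (window truncated at the edges)."""
    half = window // 2
    n = len(values)
    out = []
    for i in range(n):
        chunk = values[max(0, i - half):min(n, i + half + 1)]
        out.append(sum(chunk) / len(chunk))
    return out

def subtract_background(intensity, min_window, smooth_window):
    """Estimate the background as a smoothed rolling minimum and subtract it."""
    background = moving_average(rolling_min(intensity, min_window), smooth_window)
    # Clipping at zero is an assumption of this sketch.
    return [max(0.0, y - b) for y, b in zip(intensity, background)]

def detect_local_maxima(mz, intensity, noise, snr_threshold):
    """Method 1 (sketch): local maxima with intensity >= snr_threshold * noise."""
    peaks = []
    for i in range(1, len(intensity) - 1):
        y = intensity[i]
        if y > intensity[i - 1] and y >= intensity[i + 1] and y >= snr_threshold * noise:
            peaks.append(mz[i])
    return peaks

# Synthetic spectrum (hypothetical): flat baseline of 10 plus two Gaussian peaks.
mz = [9000.0 + 5.0 * i for i in range(201)]
raw = [
    10.0
    + 100.0 * math.exp(-0.5 * ((x - 9400.0) / 20.0) ** 2)
    + 60.0 * math.exp(-0.5 * ((x - 9600.0) / 20.0) ** 2)
    for x in mz
]

corrected = subtract_background(raw, min_window=81, smooth_window=21)
peaks = detect_local_maxima(mz, corrected, noise=1.0, snr_threshold=5.0)
print(peaks)
```

The minimum window must be wider than the widest peak, otherwise the peak itself is absorbed into the background estimate; here it spans 81 points (400 m/z) against peaks with a standard deviation of 20 m/z.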
4: Peak Overlap Detection

To detect peak overlap, we take the second derivative across a peak range, defined as the set of consecutive peaks that share local minima. This accounts for the overlap between adjacent peaks. The second derivative of a Gaussian-shaped peak consists of two zero-crossings surrounding a central minimum. Hence, this feature can be used to derive the number of underlying peaks in the range. We further define a threshold for the second derivative, so that minor shape anomalies do not register. This threshold is defined as a factor of the median of the negative second-derivative points.

PeakSeeker then uses a Levenberg-Marquardt algorithm to fit Gaussian functions to the peak range. The locations of the second-derivative minima, the heights in the peak range at those minima, and the full width at half maximum (FWHM) of the peak range divided by the number of peaks serve as starting guesses for the centers, heights, and widths of the Gaussians, respectively. This method improves over standard peak detection methods by finding peaks that are not local maxima.

Fitting becomes difficult if too many adjacent peaks are detected. In that case, we divide the peak range into sections of 6 peaks, with adjacent sections sharing one peak. We use a half-Gaussian to fit this shared peak in each section, and then average the parameters of the two half-Gaussians to obtain the peak parameters. Finally, an optional filter selects peaks based on their widths (FWHM).

5: Charge State Assignment

The centroid of every peak is calculated as a weighted average of the data points in its range. Charge states are then assigned as follows:

i) Starting with the most intense peak in the spectrum in Automatic mode, or with a user-selected peak in Manual mode, assign each possible charge state to the peak and calculate all possible masses lying within the range specified in the parameters.

ii) For each of the masses, calculate the theoretical m/z values of the other charge states.

iii) Search for peaks at these m/z values by checking whether their centroids fall within the user-defined mass tolerance. Create a set of peaks matching each molecular mass.

iv) Sort the masses by the number of matching peaks. Remove peak lists that contain fewer than the minimum number of peaks defined in the parameters file (at least 3).

v) Simulate charge envelopes for the 5 masses with the most matching peaks. This envelope is Gaussian-distributed over the charge domain because the charging event is a random process. The Gaussian’s height, center, and width are derived via a least-squares fit to the peak heights across the charge domain. If multiple peaks were found for a theoretical m/z value, average those peak heights. The starting guesses for the center and height of the envelope Gaussian are derived from the starting peak, and the starting guess for the width is user-defined. If the Gaussian cannot be fit, go to the next mass.

vi) Simulate the individual peaks comprising the charge envelope in the m/z domain. Each of these individual peaks is a Gaussian function whose height is derived from the Gaussian describing the overall envelope (step v) and which is centered on the previously calculated m/z value. The width of each peak is proportional to the width of the starting peak and inversely proportional to its charge state.

vii) Score the deviation between each of the individual peaks in the simulated charge envelope and the real signal:

$$S = a \times \frac{|mz - mz'|}{mz} + b \times \frac{|h - h'|}{h}$$

S is the score of the simulated peak’s fit to the actual peak, mz is the center of the simulated peak, mz’ is the center of the real peak, h is the height of the simulated peak, h’ is the height of the real peak, and a and b are user-defined weights. In other words, the score of each peak is a linear combination of the relative m/z error and the relative height error. If there are multiple real peaks corresponding to a simulated peak, the peak with the lowest score is chosen. The weights can be adapted.
For a spectrum with many overlapping peaks, one might weight relative height error more heavily than relative mass error, as the centroids of the overlapping peaks may be closely spaced. For a spectrum with many peaks from different components, one might weight relative mass error more heavily than relative height error, as unrelated peaks may still form Gaussian-shaped envelopes. The score of the entire charge envelope is given by:

$$S_{env} = \frac{1}{n} \sum_{i=1}^{n} S_i \times h_i$$

S_env is the score of the charge envelope as a whole, n is the total number of peaks, S_i is the score of peak i, and h_i is the height of peak i. In other words, the score of the charge envelope is the average of the peak scores, each weighted by its peak’s height. The peaks at the edges of the charge envelope have larger relative height errors due to their small intensities; scaling their scores by their heights ensures that they do not dominate the score of the whole envelope.

viii) In Automatic mode, choose the lowest-scoring charge state. In Manual mode, present the user with a choice of the 5 best-scoring charge state assignments.

ix) Mark off the peaks that have been simulated, but do not subtract them out. Then, for the most intense peak that has not yet been simulated, repeat the search. Search for charge envelopes up to the limit defined by the user (default 5) or until no peaks remain. With the exception of the initial peak, each of these charge envelopes is allowed to contain peaks that belong to another charge envelope, in order to account for overlap.

x) Fit a linear combination of charge envelopes to the spectrum. In this way, the signals of the peaks in overlapping charge envelopes are apportioned among the charge envelopes. However, if too many envelopes are fitted simultaneously, the smaller envelopes may be overestimated.

xi) PeakSeeker returns a display with the original spectrum, the simulated charge envelopes, and the sum of the simulated charge envelopes overlaid on the original spectrum.
It also returns a spectrum with the charge envelopes subtracted out. Using these displays, the user can determine the goodness of fit of the calculated envelopes, and whether there still remain undeconvoluted masses. The subtracted spectrum is written to another file for further deconvolution.

xii) Optionally repeat this cycle for another 5 envelopes.

Table 1: Comparison of Algorithms for Deconvolution of Native Mass Spectra

| | MaxEnt | UniDec | ZSCORE | Massign | Automass | PeakSeeker |
|---|---|---|---|---|---|---|
| Deconvolution Type | Information Theory | Information Theory | Charge Assignment | Charge Assignment | Charge Assignment | Charge Assignment |
| Method | Maximum Entropy | Bayesian Deconvolution | Score charge states by total intensity | Simulate best-fit charge envelopes | Minimize standard deviation of mass | Simulate best-fit charge envelopes |
| Peak Detection | n/a | Intensity Threshold | Intensity Threshold | Fixed and Adjusted Threshold | Intensity Threshold | Intensity Threshold, or Fixed and Adjusted Threshold, or Continuous Wavelet Transform |
| Overlapping Peak Detection | n/a | None | None | Local Maxima | None | Local maxima and Second Derivative; least-squares fit to obtain underlying peak parameters |
| Peak Belonging to Multiple Envelopes | n/a | None | None | Apportion signal by “peeling” envelopes and refitting | Assign to one envelope by game theory | Apportion signal by least-squares fitting of combination of envelopes |
| Spectra of highly interleaved peaks? | Effective | Effective | Limited efficacy | Effective | Untested | Effective |
| Noisy Spectra? | Inconsistent Results | Untested | Limited efficacy | Effective | Untested | Effective |
| User Autonomy | Automated | Multiple Processing, Peak Detection and Deconvolution Options | Automated | Multiple Processing, Peak Detection and Deconvolution Options | Automated | Multiple Processing, Peak Detection and Deconvolution Options |
| Availability | Proprietary software | Free binary for Academic Use | Proprietary software | Free binary for Academic Use | Not distributed | Free and Open-Source |
| References | 6, 7 | 8 | 2 | 1 | 5, 9 | |

Supplemental References:

1. Morgner, N.; Robinson, C. V., Massign: An Assignment Strategy for Maximizing Information from the Mass Spectra of Heterogeneous Protein Assemblies. Anal. Chem. 2012, 84, 2939–2948. DOI: 10.1021/ac300056a.
2. Zhang, Z.; Marshall, A. G., A universal algorithm for fast and automated charge state deconvolution of electrospray mass-to-charge ratio spectra. J. Am. Soc. Mass Spectrom. 1998, 9, 225–233. DOI: 10.1016/S1044-0305(97)00284-5.
3. Savitzky, A.; Golay, M. J. E., Smoothing and Differentiation of Data by Simplified Least Squares Procedures. Anal. Chem. 1964, 36, 1627–1639. DOI: 10.1021/ac60214a047.
4. Du, P.; Kibbe, W. A.; Lin, S. M., Improved peak detection in mass spectrum by incorporating continuous wavelet transform-based pattern matching. Bioinformatics 2006, 22, 2059–2065. DOI: 10.1093/bioinformatics/btl355.
5. Tseng, Y.-H.; Uetrecht, C.; Heck, A. J. R.; Peng, W.-P., Interpreting the Charge State Assignment in Electrospray Mass Spectra of Bioparticles. Anal. Chem. 2011, 83, 1960–1968. DOI: 10.1021/ac102676z.
6. Ferrige, A. G.; Seddon, M. J.; Jarvis, S.; Skilling, J.; Aplin, R., Maximum entropy deconvolution in electrospray mass spectrometry. Rapid Commun. Mass Spectrom. 1991, 5, 374–377.
7. Ferrige, A. G.; Seddon, M. J.; Green, B. N.; Jarvis, S. A.; Skilling, J.; Staunton, J., Disentangling electrospray spectra with maximum entropy. Rapid Commun. Mass Spectrom. 1992, 6, 707–711.
8. Marty, M. T.; Baldwin, A. J.; Marklund, E. G.; Hochberg, G. K. A.; Benesch, J. L. P.; Robinson, C. V., Bayesian Deconvolution of Mass and Ion Mobility Spectra: From Binary Interactions to Polydisperse Ensembles. Anal. Chem. 2015, 87, 4370–4376.
9. Tseng, Y.-H.; Uetrecht, C.; Yang, S.-C.; Barendregt, A.; Heck, A. J. R.; Peng, W.-P., Game-Theory-Based Search Engine to Automate the Mass Assignment in Complex Native Electrospray Mass Spectra. Anal. Chem. 2013, 85, 11275–11283.
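
Illustrative sketch for section 5 (Charge State Assignment). The following Python fragment is a minimal, hypothetical rendering of steps i) to iv) and of the scoring in step vii); it is not PeakSeeker's code. All function names are invented for this example. The conversion between mass and m/z assumes positive-ion charging by proton adducts, m/z = (M + z × 1.007276)/z, which the description above does not state explicitly, and the absolute values in the peak score are likewise an assumption. The peak list is synthetic, generated from the 470,170 Da mass reported in Figure S2.

```python
PROTON_MASS = 1.007276  # Da; assumption: charge carried by proton adducts

def mz_from_mass(mass, charge):
    """Theoretical m/z of a species of the given neutral mass and charge."""
    return (mass + charge * PROTON_MASS) / charge

def mass_from_mz(mz, charge):
    """Neutral mass implied by assigning `charge` to a peak at `mz`."""
    return charge * (mz - PROTON_MASS)

def candidate_assignments(start_mz, peak_mzs, charge_range=(1, 100),
                          mass_range=(1.0, 1_000_000.0), tolerance=5.0,
                          min_peaks=3):
    """Steps i-iv (sketch): try each charge on the starting peak, collect
    matching peaks for the implied mass, and rank masses by match count."""
    charges = range(charge_range[0], charge_range[1] + 1)
    results = []
    for z in charges:
        mass = mass_from_mz(start_mz, z)
        if not mass_range[0] <= mass <= mass_range[1]:
            continue
        matched = []
        for other_z in charges:
            target = mz_from_mass(mass, other_z)
            hits = [p for p in peak_mzs if abs(p - target) <= tolerance]
            if hits:
                matched.append((other_z, min(hits, key=lambda p: abs(p - target))))
        if len(matched) >= min_peaks:
            results.append((len(matched), z, mass, matched))
    results.sort(key=lambda r: r[0], reverse=True)
    return results

def peak_score(mz_sim, mz_real, h_sim, h_real, a=1.0, b=1.0):
    """Step vii (sketch): weighted sum of relative m/z and height errors."""
    return a * abs(mz_sim - mz_real) / mz_sim + b * abs(h_sim - h_real) / h_sim

def envelope_score(scores, heights):
    """Height-weighted average of the per-peak scores."""
    return sum(s * h for s, h in zip(scores, heights)) / len(scores)

# Synthetic peak list: a 470,170 Da species observed at charge states 47+ to 52+.
true_mass = 470170.0
peak_mzs = [mz_from_mass(true_mass, z) for z in range(47, 53)]
start_mz = mz_from_mass(true_mass, 49)  # stand-in for the most intense peak

ranked = candidate_assignments(start_mz, peak_mzs)
best_count, best_charge, best_mass, _ = ranked[0]
print(best_count, best_charge, round(best_mass, 1))
```

With the default 5 m/z tolerance, neighboring charge assignments (for example 48+ or 50+ on the starting peak) still match a few peaks by coincidence, which is why the ranking by match count in step iv) is followed by the envelope simulation and scoring in steps v) to vii) rather than being used on its own.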