Supporting Information
Materials and Methods
Datasets
D1, a complex sample dataset derived from the human liver, is used to evaluate the
sensitivity of peptide identification. The detailed generation process is described in
reference[1]. In brief, the sample was digested by trypsin, separated by strong cation
exchange chromatography, and analyzed using the LTQ-FT mass spectrometer
(Thermo Scientific, San Jose, CA). The raw data were converted to peak lists using
iPE-MMR[2]. iPE-MMR is a combined method for precursor mass refinement that
integrates DeconMSn[3], PE-MMR[4], and DtaRefinery[5] into an analysis pipeline
to calibrate the monoisotopic mass error and systematic mass error for tandem mass
spectrometric data. The software (version 1.2) was downloaded from
http://omics.pnl.gov/software/PEMMR.php. We used the default parameter settings in
iPE-MMR to perform the conversion. This ultimately resulted in 24,302 spectra. After
the calibration, the mean value of the precursor ion mass error distribution was
improved from 7.48±3.37 to 0.00±1.72 ppm.
D2, a protein standard dataset derived from a set of 48 human proteins (Sigma,
Universal Proteomics Standard Set UPS1), was previously used to demonstrate the
accuracy of MP[6]. The raw data, generated on an LTQ-FT mass spectrometer, were
converted into peak lists with BioWorks 3.2 (Thermo Scientific). The preprocessing
parameters were described in detail by Brosch et al.[6]. Note that systematic mass
errors of the precursor ions have already been eliminated[6]. The MS/MS spectra
(8191 spectra) were downloaded from
ftp://ftp.sanger.ac.uk/pub4/resources/software/mascotpercolator/.
Database construction
The human IPI database (version 3.63, 84,229 sequences) including common
external contaminants from the common Repository of Adventitious Proteins (cRAP,
http://www.thegpm.org/crap/index.html) was used as the target database for D1.
Another human IPI database (68,322 sequences) including 48 standard protein
sequences and common external contaminants from cRAP was used as the target
database for D2[6]. The database was provided by Brosch et al.[6], and was
downloaded from ftp://ftp.sanger.ac.uk/pub4/resources/software/mascotpercolator/.
The decoy databases used for D1 and D2 were constructed by randomizing the
protein sequences while maintaining the average amino acid composition (RND). The
decoy databases were generated by the Perl script decoy.pl, which is provided by
Matrix Science (http://www.matrixscience.com/help/decoy_help.html). For D2, three
decoy databases were generated to eliminate the statistical bias of q-value estimation.
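The idea behind the RND decoy design can be illustrated with a short Python sketch. It is not a port of decoy.pl: the function names, the seed parameter, and the choice to keep each decoy the same length as its target protein are assumptions made for illustration only.

```python
import random
from collections import Counter


def composition(proteins):
    """Relative frequency of each residue over all target sequences."""
    counts = Counter()
    for seq in proteins:
        counts.update(seq)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def random_decoys(proteins, seed=0):
    """One random decoy per target protein (hypothetical helper).

    Residues are sampled independently from the average amino acid
    composition of the target database; keeping the target length is
    an assumption of this sketch.
    """
    rng = random.Random(seed)
    freqs = composition(proteins)
    residues = list(freqs)
    weights = [freqs[aa] for aa in residues]
    return ["".join(rng.choices(residues, weights=weights, k=len(seq)))
            for seq in proteins]


targets = ["MAGKLLPR", "STVK"]          # toy sequences, not real proteins
decoys = random_decoys(targets, seed=1)
print(decoys)
```

Calling the function with three different seeds would give three independent decoy databases, which mirrors the three RND databases generated for D2.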
MS/MS database searching
D1 and D2 were searched with MASCOT (version 2.2) using the following
parameters: precursor mass tolerance, 20 ppm; monoisotopic mass for precursor
ions; product ion mass tolerance, 0.5 Da; variable modifications,
carbamidomethylation of cysteine, deamidation of asparagine or glutamine, and
oxidation of methionine; maximum missed cleavages, 2. Each dataset was searched
against the target and decoy databases separately, using the trypsin and the
semi-trypsin enzyme settings in turn.
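As a small worked example of the 20 ppm precursor tolerance, the following Python sketch converts a mass difference to ppm and checks it against the tolerance. The function names and the sign convention (observed minus calculated) are assumptions of this sketch, not taken from MASCOT.

```python
def ppm_error(observed_mass, calculated_mass):
    """Relative mass error in parts per million (observed minus calculated)."""
    return (observed_mass - calculated_mass) / calculated_mass * 1e6


def within_tolerance(observed_mass, calculated_mass, tol_ppm=20.0):
    """True if the absolute ppm error does not exceed the tolerance."""
    return abs(ppm_error(observed_mass, calculated_mass)) <= tol_ppm


# A 0.015 Da deviation on a 1000 Da peptide is about 15 ppm.
print(ppm_error(1000.015, 1000.0))
print(within_tolerance(1000.015, 1000.0))
```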
QC methods
Two QC methods built on the same Percolator technology[7] were compared in this
study: MP[6-7] and an in-house developed tool, PepDistiller.
MP: MASCOT Percolator (version 1.09, packaged with Percolator version
1.12)[6-7] was used for the comparison. MP was specially designed to achieve
maximum target hits above a user-specified FDR threshold (e.g., 1%) through an
iterative support vector machine (SVM) classifier, i.e. Percolator[7]. The iterative
procedure is implemented by selecting a subset of high-confidence target PSMs from
the previous iteration to serve as a positive training set and half of the decoy PSMs to
serve as a negative training set for training an SVM and re-ranking the entire set of
PSMs in the next iteration. After several iterations, the number of target PSMs above
the FDR threshold converges. The final SVM is then applied to the entire set of target
PSMs and the remaining half of the decoy PSMs to obtain unbiased q-value
estimations. MP extracts PSM features from MASCOT dat files as an input feature
vector to Percolator. The semi-supervised characteristic of Percolator also makes it
adaptive to datasets from different experimental conditions, such as different samples
and mass spectrometers. The features applied in MP can be found in the
'config.properties' file in the software package, and the detailed definitions of the
features are described in ref. [6].
PepDistiller: PepDistiller was designed as a quality control method to distill
high-confidence peptide identifications. It includes all of the features used in MP
and executes Percolator (version 1.12) to discriminate between the correct and
incorrect matches. However, there are two major improvements: (1) in addition to the
feature set used in MASCOT Percolator, the number of tryptic termini (NTT) was added
to PepDistiller to improve the performance of peptide identifications obtained from
semi-tryptic search results (Table S1); (2) the refined FDR estimation method
proposed by Navarro et al.[8] was integrated into PepDistiller to accurately
determine the confidence of peptide identifications. PepDistiller is written in
Perl and can be downloaded from
http://bioinfo.hupo.org.cn/tools/PepDistiller.
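The NTT feature can be illustrated with a minimal Python sketch. It is not PepDistiller's Perl code: the function name and the use of '-' to mark a protein terminus are hypothetical, and the rule that trypsin does not cleave before proline is ignored for simplicity.

```python
def count_tryptic_termini(peptide, prev_residue, next_residue):
    """Number of tryptic termini (0, 1 or 2) of a peptide in its protein.

    prev_residue / next_residue are the flanking residues in the protein,
    with '-' standing for a protein terminus (a convention of this sketch).
    """
    # N-terminus is tryptic if the preceding residue is K or R,
    # or if the peptide starts at the protein N-terminus.
    n_term = prev_residue in {"K", "R", "-"}
    # C-terminus is tryptic if the peptide ends in K or R,
    # or if it ends at the protein C-terminus.
    c_term = peptide[-1] in {"K", "R"} or next_residue == "-"
    return int(n_term) + int(c_term)


# Semi-tryptic example: preceded by L (non-tryptic), ends in K (tryptic).
# The following residue "A" is a made-up placeholder.
print(count_tryptic_termini("QNTDGNNNDAWAK", "L", "A"))
```

In a semi-tryptic search the feature takes the values 1 or 2 for most candidates, which gives Percolator one more dimension in which to separate correct from incorrect matches.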
Method of FDR calculation
Based on the target-decoy strategy, several methods have been used to calculate the
FDR[7-11]. Because Percolator is specially designed for separate searches, two FDR
estimation methods designed for separate searches were considered in this study. One
is the method applied in Percolator that we termed the PIT-fixed FDR estimation,
which is described as follows.
Denote the scores of target PSMs as $t_1, t_2, \ldots, t_{m_t}$ and the scores of decoy
PSMs as $d_1, d_2, \ldots, d_{m_d}$. Here, $m_t$ is the number of target PSMs, and $m_d$ is
the number of decoy PSMs. For a given threshold $s$, in separate searches (SS), the FDR
is calculated as follows[6-7]:
$$E\{\mathrm{FDR}_{\mathrm{PIT}}(s)\} = \pi_0 \, \frac{m_t}{m_d} \, \frac{|\{d_i \ge s,\ i = 1, 2, \ldots, m_d\}|}{|\{t_i \ge s,\ i = 1, 2, \ldots, m_t\}|} \qquad (1),$$
where $\pi_0$ (PIT) is the estimated proportion of target PSMs that are incorrect, which
can also be calculated by Qvality[12].
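Formula (1) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Percolator's implementation: the function name is hypothetical, the PIT is passed in as a plain number instead of being estimated, "above the threshold" is read as greater than or equal to s, and returning 0 when no target PSM passes is a convention chosen here.

```python
def pit_fixed_fdr(target_scores, decoy_scores, threshold, pi0=1.0):
    """PIT-fixed FDR estimate at a score threshold, following formula (1)."""
    n_target = sum(1 for t in target_scores if t >= threshold)
    n_decoy = sum(1 for d in decoy_scores if d >= threshold)
    if n_target == 0:
        return 0.0  # convention of this sketch: nothing accepted, no FDR
    size_ratio = len(target_scores) / len(decoy_scores)  # m_t / m_d
    return pi0 * size_ratio * n_decoy / n_target


# Toy scores (made up): 3 targets and 1 decoy pass the threshold of 30.
targets = [50, 40, 30, 20, 10]
decoys = [35, 15, 12, 8, 5]
print(pit_fixed_fdr(targets, decoys, threshold=30, pi0=0.8))
```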
The other method, named the refined FDR estimation, was proposed by Navarro et
al.[8], and integrated into PepDistiller. The refined FDR is simply calculated as
follows:
$$E\{\mathrm{FDR}_{\mathrm{Refined}}(s)\} = \frac{d_o + 2 d_b}{d_b + t_b + t_o} \qquad (2),$$
where $d_o$ (decoy only) is the number of PSMs with scores above the threshold $s$ in
the decoy database (DDB) but not in the target database (TDB). Analogously, $t_o$
(target only) is the number of PSMs with scores above the threshold $s$ in TDB but not
in DDB. $d_b$ (decoy better) is the number of PSMs with scores above $s$ in both TDB
and DDB but with better scores in DDB, and $t_b$ (target better) is the number of PSMs
with scores above $s$ in both TDB and DDB but with better scores in TDB.
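The counting behind formula (2) can be sketched in Python as follows. The input layout (one pair of best target score and best decoy score per spectrum, with None for a missing match), the function name, and the handling of ties (counted as target better) are assumptions of this sketch and are not specified in the text.

```python
def refined_fdr(psm_pairs, threshold):
    """Refined FDR at a threshold from per-spectrum (target, decoy) scores."""
    d_o = t_o = d_b = t_b = 0
    for target, decoy in psm_pairs:
        t_pass = target is not None and target >= threshold
        d_pass = decoy is not None and decoy >= threshold
        if t_pass and d_pass:
            if decoy > target:
                d_b += 1      # both pass, decoy scores better
            else:
                t_b += 1      # both pass, target scores better (ties too)
        elif d_pass:
            d_o += 1          # only the decoy match passes
        elif t_pass:
            t_o += 1          # only the target match passes
    denominator = d_b + t_b + t_o
    if denominator == 0:
        return 0.0            # convention of this sketch
    return (d_o + 2 * d_b) / denominator


# Toy pairs (made up): t_o = 2, d_b = 1, d_o = 1, t_b = 1 at threshold 30.
pairs = [(60, 10), (45, 20), (35, 40), (12, 33), (25, 8), (31, 30)]
print(refined_fdr(pairs, threshold=30))
```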
After FDRs were estimated and given a PSM with score s, the q-value associated
with it was calculated using formula (3) as follows:
$$q(s) = \min_{x \le s} E\{\mathrm{FDR}(x)\} \qquad (3),$$
where FDR could be either the PIT-fixed FDR or the refined FDR.
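The minimum in formula (3) can be computed with a running minimum over thresholds sorted in ascending order. The Python sketch below assumes that the candidate thresholds are the observed scores and that higher scores are better; the function name and the toy FDR values are made up for illustration.

```python
def q_values(scores, fdr_at):
    """Map each distinct score s to q(s) = min over x <= s of FDR(x).

    scores : iterable of PSM scores (higher is better)
    fdr_at : callable returning the estimated FDR at a threshold
    """
    q = {}
    running_min = float("inf")
    # Walking upward from the lowest threshold, the running minimum at s
    # covers exactly the thresholds x <= s.
    for x in sorted(set(scores)):
        running_min = min(running_min, fdr_at(x))
        q[x] = running_min
    return q


# Toy FDR curve (made up). The FDR at 20 is higher than at 10,
# so the q-value at 20 falls back to the smaller value.
toy_fdr = {10: 0.30, 20: 0.35, 30: 0.10, 40: 0.12, 50: 0.0}
print(q_values(toy_fdr.keys(), toy_fdr.get))
```

The resulting q-values never increase as the score increases, even where the raw FDR estimate does.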
Given a threshold s, the actual FDR is calculated as FP/(FP+TP), where FP is the
number of false positive hits with scores above s, and TP is the number of true
positive hits with scores above s that belong to the standard proteins in the sample.
After the actual FDR was calculated for each threshold, the actual q-value was
calculated in the same manner as formula (3). For D2, after each target-decoy
database search, the estimated q-values were plotted against the actual q-values to
reveal the accuracy of estimated q-values. The three scatter plots were then smoothed
with the local regression method LOESS[13].
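The actual FDR and the comparison between estimated and actual q-values can be sketched as follows. The function names are hypothetical, and restricting the root mean square error (RMSE) to pairs whose estimated q-value lies in [0, 0.06] is an assumption about how the interval reported in Table S3 is applied. LOESS smoothing is not reproduced here.

```python
import math


def actual_fdr(scores, is_standard, threshold):
    """FP / (FP + TP) among PSMs scoring at or above the threshold.

    is_standard[i] is True when PSM i maps to a standard protein (TP).
    """
    passing = [ok for s, ok in zip(scores, is_standard) if s >= threshold]
    if not passing:
        return 0.0  # convention of this sketch
    false_positives = sum(1 for ok in passing if not ok)
    return false_positives / len(passing)


def q_value_rmse(estimated, actual, upper=0.06):
    """RMSE between paired q-values with estimated q in [0, upper]."""
    pairs = [(e, a) for e, a in zip(estimated, actual) if 0.0 <= e <= upper]
    if not pairs:
        raise ValueError("no q-value pairs inside the interval")
    return math.sqrt(sum((e - a) ** 2 for e, a in pairs) / len(pairs))


# Toy data (made up): one false positive among three PSMs passing 30.
print(actual_fdr([50, 40, 30, 20], [True, True, False, True], 30))
print(q_value_rmse([0.01, 0.03, 0.2], [0.02, 0.01, 0.5]))
```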
Table S1. Features used in PepDistiller to represent PSMs. In total, 17 features are
applied in PepDistiller as an input feature vector to Percolator, of which 16 are
consistent with the features used in MP. These features can be divided into three
groups: 1-5 represent features related to the mass error of parent and product ions;
6-10 represent features related to the properties of identified peptides; and 11-17
represent features related to the quality of PSMs.
ID | Feature | Definition
1 | DeltaM | Difference between the calculated and observed peptide mass (in Dalton and ppm)
2 | AbsDeltaM | Absolute value of DeltaM (in Dalton and ppm)
3 | IsoDeltaM | Isotopic error corrected DeltaM (in Dalton and ppm)
4 | FragDelaM_meidan | Median of fragment ions deltaM (in Dalton and ppm)
5 | FragDelaM_iqr | Interquartile range (IQR) of fragment ions deltaM (in Dalton and ppm)
6 | MrCalc | Calculated monoisotopic mass of identified peptide
7 | Charge | Charge state of parent ion
8 | MC | Number of missed tryptic cleavages
9a | NTT | Number of tryptic termini (for fully-tryptic search, it equals 0)
10 | VarMods | Number of modified sites / number of modifiable sites
11 | IonScore | MASCOT ion score
12 | dIonsScore | Difference between ion scores of the best and second-best non-isobaric match
13 | FractionsIonsMarchedB1-Y2 | Fractions of matched ions (per ion series)
14 | TotInt | The sum of all ions intensities (log)
15 | IntMatchedTot | The sum of all matched ions intensities (log)
16 | RelIntMatchedTot | IntMatchedTot / TotInt
17 | RelIntMatchedB1-Y2 | Relative intensity matched (per ion series)
a NTT is not included in the feature set of MP but is incorporated into PepDistiller.
Table S2. Homology FP matches with q-values lower than 0.01 generated from the
standard dataset (D2) semi-tryptic search results.
Spectrum | Peptide Sequence | Ion Score | ppm | Homology Sequence | Standard Protein
FTHPS_2007Sept07-01.6149.6149.2.dta | DVTVLQNTDGNNNDAWAK | 109.47 | 1.51 | DVTVLQNTDGNNNEAWAK | TRFL_HUMAN
FTHPS_2007Sept07-01.1562.1562.2.dta | QNTDGNNNDAWAK | 76.57 | 1.61 | DVTVLQNTDGNNNEAWAK | TRFL_HUMAN
FTHPS_2007Sept07-01.6320.6320.2.dta | DVTVLQNTDGNNNDAWAK | 71.08 | -1.15 | DVTVLQNTDGNNNEAWAK | TRFL_HUMAN
FTHPS_2007Sept07-01.4711.4711.2.dta | DTPSLEDEAAGHVTQAR | 62.76 | 3.66 | GTTAEEAGIGDTPSLEDEAAGHVTQEP | TAU_HUMAN
FTHPS_2007Sept07-01.6152.6152.3.dta | TTAEEAGIGDTPSLEDEAAGHVTQAR | 52.43 | 2.13 | GTTAEEAGIGDTPSLEDEAAGHVTQEP | TAU_HUMAN
FTHPS_2007Sept07-01.5625.5625.3.dta | GIGDTPSLEDEAAGHVTQAR | 39.32 | 0.19 | GTTAEEAGIGDTPSLEDEAAGHVTQEP | TAU_HUMAN
FTHPS_2007Sept07-01.5759.5759.3.dta | GIGDTPSLEDEAAGHVTQAR | 38.54 | -0.18 | GTTAEEAGIGDTPSLEDEAAGHVTQEP | TAU_HUMAN
FTHPS_2007Sept07-01.6186.6186.3.dta | GTTAEEAGIGDTPSLEDEAAGHVTQAR | 32.96 | 1.69 | GTTAEEAGIGDTPSLEDEAAGHVTQEP | TAU_HUMAN
FTHPS_2007Sept07-01.5147.5147.3.dta | GIGDTPSLEDEAAGHVTQAR | 32.14 | 0.27 | GTTAEEAGIGDTPSLEDEAAGHVTQEP | TAU_HUMAN
FTHPS_2007Sept07-01.11299.11299.2.dta | GAELVDALQFVCGDR | 88.86 | 2.93 | GGELVDTLQFVCGDR | IGF2_HUMAN
FTHPS_2007Sept07-01.10877.10877.2.dta | AELVDALQFVCGDR | 84.53 | 1.44 | GGELVDTLQFVCGDR | IGF2_HUMAN
FTHPS_2007Sept07-01.10699.10699.2.dta | GAELVDALQFVCGDR | 75.50 | 0.12 | GGELVDTLQFVCGDR | IGF2_HUMAN
FTHPS_2007Sept07-01.10727.10727.2.dta | GAELVDALQFVCGDR | 51.87 | 1.97 | GGELVDTLQFVCGDR | IGF2_HUMAN
Table S3. Root mean square error (RMSE) between the estimated and actual q-values
in the interval of [0, 0.06] generated by two different FDR calculation methods and
six different decoy designs.
FDR Type | Decoy Design | q-value RMSE (1e-2), Mean | q-value RMSE (1e-2), σ
PIT-fixed | RND | 0.78 | 0.02
PIT-fixed | SHF | 0.76 | 0.05
PIT-fixed | REV | 0.67 | \
PIT-fixed | RNDTP | 0.73 | 0.01
PIT-fixed | SHFTP | 0.72 | 0.05
PIT-fixed | REVTP | 0.72 | \
Refined | RND | 0.73 | 0.06
Refined | SHF | 0.68 | 0.02
Refined | REV | 0.56 | \
Refined | RNDTP | 0.60 | 0.01
Refined | SHFTP | 0.65 | 0.02
Refined | REVTP | 0.65 | \
[Figure S1 plots. Panel A: Target PSMs against q-value for MP, PD with FDRPIT, and PD
with FDRRefined. Panel B: Actual q-value against Estimated q-value for FDRRefined
(RND, NoEnzy) and FDRPIT (RND, NoEnzy), with the reference line Y=X.]
Figure S1. (A) Comparison of the sensitivity of MASCOT Percolator (MP) and
PepDistiller (PD) for MASCOT no-enzyme search results. Two FDR estimation methods
were applied in PepDistiller: the PIT-fixed FDR (FDRPIT) and the refined FDR
(FDRRefined). The number of target PSMs was plotted against each q-value threshold.
The D1 dataset was used to test the sensitivity. (B) Evaluation of the accuracy of
the FDR estimations generated by the refined and PIT-fixed methods. The no-enzyme
search results of the standard dataset D2 were used. PepDistiller was used for
filtering. The dataset was searched against three RND decoy databases to eliminate
the statistical bias of q-value estimation. The curves were smoothed with the
local regression method LOESS.
[Figure S2 plots. Panels A1-E1: Target PSMs against q-value for MP, PD with FDRPIT,
and PD with FDRRefined. Panels A2-E2: Actual q-value against Estimated q-value for
FDRRefined and FDRPIT, with the reference line Y=X. Panel pairs A to E correspond to
the RNDTP, SHF, SHFTP, REV, and REVTP decoy designs, respectively.]
Figure S2. (A1-E1) Comparison of the sensitivity of MASCOT Percolator (MP) and
PepDistiller (PD) for MASCOT semi-tryptic search results using five kinds of decoy
databases. Two FDR estimation methods were applied in PepDistiller: the PIT-fixed
FDR (FDRPIT) and the refined FDR (FDRRefined). The number of target PSMs was plotted
against each q-value threshold. The D1 dataset was used to test the sensitivity.
(A2-E2) Evaluation of the accuracy of the FDR estimations generated by the refined
and PIT-fixed methods. The semi-tryptic search results of the standard dataset D2
were used. Five kinds of decoy databases were applied. PepDistiller was used for
filtering.
The five decoy design methods are RNDTP, SHF, SHFTP, REV, and REVTP. (1) RNDTP
randomizes the amino acids of in silico tryptic peptides, except the tryptic
cleavage sites (K and R), using a uniform-distribution random number generator,
while preserving the average amino acid composition, the protein length, and the
tryptic cleavage sites and their positions in the target proteins; (2) SHF uses the
Fisher-Yates shuffle algorithm[14] to uniformly shuffle the amino acids without
introducing skewness[15]; (3) SHFTP shuffles in silico tryptic peptides as the SHF
method does, while preserving the average amino acid composition, the protein
length, and the tryptic cleavage sites (K and R) and their positions[16]; (4) REV
reverses the target protein sequences[17]; (5) REVTP reverses in silico tryptic
peptides while preserving the average amino acid composition, the protein length,
and the tryptic cleavage sites (K and R) and their positions. REV was generated by
the Perl script decoy.pl provided by Matrix Science. The others were generated by
in-house developed Perl scripts, which can be downloaded from
http://bioinfo.hupo.org.cn/tools/PepDistiller. When using RNDTP, SHF and SHFTP, the
dataset was searched against three decoy databases generated by each method to
eliminate the statistical bias of q-value estimation. The curves were smoothed with
the local regression method LOESS.
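Four of the decoy designs described above (REV, REVTP, SHF and SHFTP) can be sketched in Python. This is an illustration only, not the decoy.pl or in-house Perl scripts: the function names are hypothetical, in silico digestion simply cuts after every K or R (no proline rule, no missed cleavages), and RNDTP is omitted.

```python
import random
import re


def tryptic_pieces(seq):
    """Split a protein after every K or R (simplified in silico digest)."""
    return re.findall(r"[^KR]*[KR]|[^KR]+$", seq)


def reverse_protein(seq):
    """REV: reverse the whole protein sequence."""
    return seq[::-1]


def shuffle_protein(seq, rng):
    """SHF: shuffle all residues (random.shuffle is a Fisher-Yates shuffle)."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def _transform_pieces(seq, transform):
    """Apply transform to each tryptic piece, keeping a final K/R in place."""
    out = []
    for piece in tryptic_pieces(seq):
        if piece[-1] in "KR":
            out.append(transform(piece[:-1]) + piece[-1])
        else:
            out.append(transform(piece))  # C-terminal piece without K/R
    return "".join(out)


def reverse_tryptic(seq):
    """REVTP: reverse each tryptic peptide, cleavage sites stay fixed."""
    return _transform_pieces(seq, lambda body: body[::-1])


def shuffle_tryptic(seq, rng):
    """SHFTP: shuffle each tryptic peptide, cleavage sites stay fixed."""
    return _transform_pieces(seq, lambda body: shuffle_protein(body, rng))


protein = "MAGKLLPRSTV"                      # toy sequence, not a real protein
print(reverse_protein(protein))
print(reverse_tryptic(protein))
print(shuffle_tryptic(protein, random.Random(1)))
```

In the peptide-level designs the K and R residues stay at their original positions, so the decoy database yields the same number of tryptic peptides, with the same lengths, as the target database.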
Figure S3. Comparison of the processing time consumed by PepDistiller (PD) and
Mascot Percolator (MP) on dataset D2.
References
[1] Zhang, J., Li, J., Liu, X., Xie, H., et al., A nonparametric model for quality control
of database search results in shotgun proteomics. BMC Bioinformatics 2008, 9, 29.
[2] Jung, H.-J., Purvine, S. O., Kim, H., Petyuk, V. A., et al., Integrated
Post-Experiment Monoisotopic Mass Refinement: An Integrated Approach to
Accurately Assign Monoisotopic Precursor Masses to Tandem Mass Spectrometric
Data. Analytical Chemistry 2010, 82, 8510-8518.
[3] Mayampurath, A. M., Jaitly, N., Purvine, S. O., Monroe, M. E., et al., DeconMSn:
a software tool for accurate parent ion monoisotopic mass determination for tandem
mass spectra. Bioinformatics 2008, 24, 1021-1023.
[4] Shin, B., Jung, H.-J., Hyung, S.-W., Kim, H., et al., Postexperiment Monoisotopic
Mass Filtering and Refinement (PE-MMR) of Tandem Mass Spectrometric Data
Increases Accuracy of Peptide Identification in LC/MS/MS. Mol Cell Proteomics
2008, 7, 1124-1134.
[5] Petyuk, V. A., Mayampurath, A. M., Monroe, M. E., Polpitiya, A. D., et al.,
DtaRefinery, a software tool for elimination of systematic errors from parent ion mass
measurements in tandem mass spectra data sets. Mol Cell Proteomics 2010, 9,
486-496.
[6] Brosch, M., Yu, L., Hubbard, T., Choudhary, J., Accurate and sensitive peptide
identification with Mascot Percolator. J Proteome Res 2009, 8, 3176-3181.
[7] Kall, L., Canterbury, J. D., Weston, J., Noble, W. S., MacCoss, M. J.,
Semi-supervised learning for peptide identification from shotgun proteomics datasets.
Nat Methods 2007, 4, 923-925.
[8] Navarro, P., Vazquez, J., A refined method to calculate false discovery rates for
peptide identification using decoy databases. J Proteome Res 2009, 8, 1792-1796.
[9] Elias, J. E., Gygi, S. P., Target-decoy search strategy for increased confidence in
large-scale protein identifications by mass spectrometry. Nat Methods 2007, 4,
207-214.
[10] Kall, L., Storey, J. D., MacCoss, M. J., Noble, W. S., Assigning significance to
peptides identified by tandem mass spectrometry using decoy databases. J Proteome
Res 2008, 7, 29-34.
[11] Hather, G., Higdon, R., Bauman, A., von Haller, P. D., Kolker, E., Estimating
false discovery rates for peptide and protein identification using randomized databases.
Proteomics 2010, 10, 2369-2376.
[12] Kall, L., Storey, J. D., Noble, W. S., Non-parametric estimation of posterior error
probabilities associated with peptides identified by tandem mass spectrometry.
Bioinformatics 2008, 24, i42-48.
[13] Cleveland, W. S., Devlin, S. J., Locally weighted regression: an approach to
regression analysis by local fitting. Journal of the American Statistical Association
1988, 83, 596-610.
[14] Fisher, R., Yates, F., Statistical tables for biological, agricultural and medical
research, Oliver & Boyd, London 1948, 26-27.
[15] Klammer, A. A., MacCoss, M. J., Effects of modified digestion schemes on the
identification of proteins from complex mixtures. J Proteome Res 2006, 5, 695-700.
[16] Zhang, J., Li, J., Xie, H., Zhu, Y., He, F., A new strategy to filter out false positive
identifications of peptides in SEQUEST database search results. Proteomics 2007, 7,
4036-4044.
[17] Moore, R. E., Young, M. K., Lee, T. D., Qscore: an algorithm for evaluating
SEQUEST database search results. J Am Soc Mass Spectrom 2002, 13, 378-386.