Supplementary Information
Workflows for automated downstream data analysis and visualization in
large-scale computational mass spectrometry
Stephan Aiche1,*, Timo Sachsenberg2, Erhan Kenar3, Mathias Walzer2, Bernd
Wiswedel4, Theresa Kristl5, Matthew Boyles6, Albert Duschl6, Christian G. Huber5,
Michael R. Berthold7, Knut Reinert1, and Oliver Kohlbacher2
1 Department of Mathematics and Computer Science, Freie Universitaet Berlin, Germany
2 Applied Bioinformatics, Center for Bioinformatics, Quantitative Biology Center, and Dept. of Computer Science, University of Tübingen, Germany
3 Quantitative Biology Center (QBiC), University of Tübingen, Germany
4 KNIME.com AG, Zurich, 8005, Switzerland
5 Division of Chemistry and Bioanalytics, Department of Molecular Biology, University of Salzburg, Austria
6 Division of Allergy and Immunology, Department of Molecular Biology, University of Salzburg, Austria
7 Chair for Bioinformatics and Information Mining, Department of Computer and Information Science, University of Konstanz, Germany
*Corresponding Author: Stephan Aiche, Takustr. 9, 14195 Berlin, Germany, tel.: +49
30 838 75137, fax: +49 30 838 75218, email: stephan.aiche@fu-berlin.de
Example 1 (TMT Quantitation):
Experimental procedure: Human lung adenocarcinoma epithelial cells (A549) were
treated with 10 µg/ml (2.1 µg/cm² exposure surface) nano-copper oxide (CuO) or left
untreated (three biological replicates). Cells were harvested at 6 different time points
(0, 1, 3, 6, 12 and 24 hours after exposure to CuO or medium only controls) and
subsequently lysed, reduced (4.55 mM tris(2-carboxyethyl)phosphine hydrochloride
solution, Sigma Aldrich, St. Louis, MO, USA) and alkylated (8.70 mM
iodoacetamide, Sigma Aldrich). The tryptic peptides (Trypsin, Promega, Madison, WI,
USA) were labeled with 6-plex isobaric tags (TMT 6-plex, Thermo Fisher Scientific)
and afterwards separated by capillary ion-pair reversed-phase HPLC (U3000 nano
HPLC Unit from Dionex, Germering, Germany) in a 150 x 0.20 mm i.d. monolithic
column (produced in-house according to ref. [1], commercially available as
ProSwift™ columns from Thermo Scientific) with a 5 h 0-40% acetonitrile gradient at
1.0 µL/min followed by linear ion trap-Orbitrap mass spectrometry (LTQ Orbitrap XL
from Thermo Scientific). Each sample was measured three times using two exclusion
lists, which allowed more proteins to be identified than with a single
measurement.
Workflow details: The workflow takes three inputs: the sequence database
(UniProt, human), including the reversed sequences for the decoy search; a text file
in CSV (comma-separated values) format describing the experimental layout, which
gives for each experiment the condition (control or treated), the replicate number, and
the association between each TMT channel and its respective time point; and the
18 mzML files containing the unprocessed MS data.
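A layout file of this kind could be read as in the following minimal sketch. The column names (file, condition, replicate, and one column per TMT 6-plex reporter channel), the file names, and the channel-to-time-point assignment are illustrative assumptions; the layout format actually used by the workflow is not specified here.

```python
import csv
import io

# Hypothetical layout content; column names and file names are assumptions.
LAYOUT_CSV = """file,condition,replicate,126,127,128,129,130,131
ctrl_r1_m1.mzML,control,1,0,1,3,6,12,24
cuo_r1_m1.mzML,treated,1,0,1,3,6,12,24
"""

# Nominal reporter ion masses of the TMT 6-plex channels.
TMT_CHANNELS = ("126", "127", "128", "129", "130", "131")


def read_layout(text):
    """Parse the layout text into one dictionary per mzML file."""
    experiments = []
    for row in csv.DictReader(io.StringIO(text)):
        if row["condition"] not in ("control", "treated"):
            raise ValueError(f"unknown condition: {row['condition']!r}")
        experiments.append({
            "file": row["file"],
            "condition": row["condition"],
            "replicate": int(row["replicate"]),
            # Maps each TMT channel to its time point in hours.
            "channel_to_hours": {ch: int(row[ch]) for ch in TMT_CHANNELS},
        })
    return experiments


layout = read_layout(LAYOUT_CSV)
```

Keeping the channel-to-time-point mapping per file, rather than globally, would allow the label assignment to differ between experiments.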
The workflow starts by analyzing all mzML files independently, performing
identification and TMT reporter ion extraction on each. Initially, each file is split
into two smaller files: one containing only the precursor and CID scans, the other
containing only the precursor and HCD (higher-energy collisional dissociation)
scans. The CID scans are used for identification with OMSSA [2] and X!Tandem
[3]. The HCD scans are used for reporter ion extraction (including an isotopic
impurity correction as described in [4]) and, again, for identification with OMSSA [2]
and X!Tandem [3]. The four individual identification results are subsequently
combined by the ConsensusID [5] approach. Afterwards, the identification results are
mapped to the quantitative information extracted from the TMT reporter channels.
The resulting data is converted into a format readable by the R package isobar [6].
The workflow then uses the KNIME R nodes to apply isobar to the data extracted
from the MS data with respect to the given experimental design. After initial
quantitation of the individual experiments, the data is combined at the protein level
(i.e., results from the different experiments are grouped by protein), and several
plots are generated to allow visual inspection of the results and quality control.
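The initial splitting of each file into a CID part and an HCD part can be illustrated with the following sketch. It assumes that the "precursor scans" are the MS1 survey scans and uses a hypothetical in-memory Scan record instead of real mzML parsing; it is not the implementation used by the workflow nodes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Scan:
    """Hypothetical minimal scan record (not an mzML data structure)."""
    scan_id: int
    ms_level: int
    activation: Optional[str] = None  # "CID" or "HCD" for MS2 scans, None for MS1


def split_by_activation(scans: List[Scan]) -> Tuple[List[Scan], List[Scan]]:
    """Return (CID part, HCD part); MS1 scans are kept in both parts."""
    cid_part: List[Scan] = []
    hcd_part: List[Scan] = []
    for scan in scans:
        if scan.ms_level == 1:
            cid_part.append(scan)
            hcd_part.append(scan)
        elif scan.activation == "CID":
            cid_part.append(scan)
        elif scan.activation == "HCD":
            hcd_part.append(scan)
    return cid_part, hcd_part


run = [Scan(1, 1), Scan(2, 2, "CID"), Scan(3, 2, "HCD"), Scan(4, 1), Scan(5, 2, "CID")]
cid_run, hcd_run = split_by_activation(run)
```

The scan order is preserved in both parts, so each fragment scan still follows the survey scan it was triggered from.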
Example 2 (Label-free quantitation of metabolites):
The label-free quantitation workflow for metabolites demonstrates differential
analysis of small molecules as used in biomarker discovery. Two spike-in conditions
from a previously characterized dilution series of isotopically labeled compounds (0.5
mg/l and 10.0 mg/l against a male blood background) were measured using UPLC-MS.
For details on the sample, sample processing, and data acquisition, we refer to [6].
The input of the workflow comprised six mzML files (two conditions, each measured
in triplicate). Mass trace detection of eluting small molecules was performed with
the FeatureFinderMetabo [6] node. In the subsequent nodes, retention time shifts
between measurements were corrected (MapAlignerPoseClustering [7]) and features
occurring at the same mass-to-charge ratio and retention time were linked between
MS runs (FeatureLinkerUnlabeledQT [8]). Identifications were assigned by
AccurateMassSearch querying against the HMDB database (extended with the
isotopically labeled spike-in compounds). The results were converted into the mzTab
data format [7] which was parsed using the SmallMoleculeMzTabReader node to
convert it into a KNIME table. The column containing the chemical formula was then
converted into a KNIME-compatible molecule type (Molecule Type Cast) to allow
interoperability with other KNIME extensions (e.g., cheminformatics packages). In the
lower part of the workflow, statistical analysis was performed with the integrated R
framework. First, quantification values were quantile-normalized [9]. Multiple
hypothesis testing (two-sided t-test) and a minimum fold change of two were used to
determine differentially quantified features. P-values were corrected for FDR control
using the Benjamini-Hochberg procedure. The quantification results passing the FDR
threshold of 5% were then joined with the identification results into a table for manual
inspection. Because the chemical formulas had been converted to a KNIME-compatible
molecule type, their molecular structures could be visualized with cheminformatics
packages (e.g., CDK).
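The FDR correction and fold-change filter described above can be sketched as follows. The workflow itself performs these steps in R; this standalone Python version only illustrates the logic, and the function names and the symmetric treatment of the fold change (at least two-fold up or down) are assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone in the rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def differential_features(p_values, fold_changes, fdr=0.05, min_fold=2.0):
    """Indices of features passing the FDR threshold and the fold-change filter."""
    q_values = benjamini_hochberg(p_values)
    return [
        i
        for i, (q, fc) in enumerate(zip(q_values, fold_changes))
        if q <= fdr and (fc >= min_fold or fc <= 1.0 / min_fold)
    ]


# Made-up example values: raw p-values and fold changes (ratios) per feature.
example_p = [0.01, 0.04, 0.03, 0.005]
example_fc = [4.0, 1.5, 0.25, 1.0]
selected = differential_features(example_p, example_fc)
```

Here fold changes are ratios between the two conditions, so a value of 0.25 counts as a four-fold decrease and passes the filter.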
Example 3 (Quality Control):
The workflow shown is a comprehensive qcML workflow that uses many of the more
advanced features of qcML. In addition to the qcML file, it also generates a
report in PDF format.
The workflow starts with two input nodes, providing the mzML files that should be
analyzed and the sequence database used for identification.
The first step in the workflow is the preprocessing node, which performs feature
detection and identification, each resulting in a file that can be used to assess the
quality of the experiment.
The QCCalculator will take the two generated files and the original mzML file to
calculate basic statistics and aggregate the quality data needed later on to apply
more advanced quality metrics. Everything will be stored in a qcML file, which makes
the needed data easy to access. More data will be added before the extended file is
handed over to the next step.
The ID Ratio Meta-node will create a plot of the measured spectra vs. the identified
spectra in an m/z vs. RT map. The plot will be included in the qcML file and, in
addition, sent to the reporting tool.
In the Mass Accuracy Meta-node, the accuracy of the measurement will be analyzed
by reference to the identifications. The calculated median deviation and the
corresponding plot will be added to the qcML file. In addition the mass accuracy over
the elution time is plotted.
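As an illustration of the mass accuracy metric, the following sketch computes the deviation of each identified precursor in parts per million (ppm) and takes the median. The choice of ppm as the unit and the input format (pairs of observed and theoretical m/z) are assumptions; the values are made up.

```python
from statistics import median


def ppm_error(observed_mz, theoretical_mz):
    """Relative deviation of the observed m/z from the theoretical m/z in ppm."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def median_ppm(pairs):
    """Median ppm deviation over (observed m/z, theoretical m/z) pairs."""
    return median(ppm_error(obs, theo) for obs, theo in pairs)


# Made-up identifications: (observed m/z, theoretical m/z).
identifications = [(500.001, 500.0), (1000.0, 1000.0), (799.9992, 800.0)]
median_deviation = median_ppm(identifications)
```

The median is less sensitive to a few wrong identifications than the mean, which is one plausible reason to report it as the summary value.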
In the following Fractional Mass Meta-node, an external reference file
(theoretical_masses.txt) of theoretical masses is used to plot the experimentally
acquired fractional masses on top of the theoretically possible ones.
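For illustration, the fractional mass of a compound can be taken as the decimal part of its mass, with the integer part as the nominal mass. The sketch below assumes this definition and uses made-up mass values; it does not read theoretical_masses.txt, whose format is not described here.

```python
import math


def fractional_mass(mass):
    """Split a mass into (nominal mass, fractional mass)."""
    fraction, nominal = math.modf(mass)
    return int(nominal), fraction


# Made-up experimental masses in Da.
experimental_masses = [1046.542, 785.842, 1570.677]
points = [fractional_mass(m) for m in experimental_masses]
```

Each resulting pair is one point in a plot of fractional mass against nominal mass.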
The last Meta-node plots the total ion current of the experiment over time.
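The total ion current (TIC) of a spectrum is the sum of its peak intensities. A minimal sketch of a TIC trace over retention time is given below, assuming each spectrum is available as a pair of retention time and intensity list; this input format is hypothetical.

```python
def tic_trace(spectra):
    """Return (retention time, TIC) pairs sorted by retention time.

    spectra: iterable of (retention_time, intensities) pairs.
    """
    return sorted((rt, sum(intensities)) for rt, intensities in spectra)


# Made-up spectra: (retention time in seconds, peak intensities).
example_spectra = [(2.0, [10.0, 5.0]), (1.0, [1.0, 2.0, 3.0])]
trace = tic_trace(example_spectra)
```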
If additional data is available in tabular format, it can be added to the qcML
file as well. In this case, the injection times of the machine are added to the qcML
file. Subsequently, some of the computed quality metrics from the qcML file are
extracted in the following Meta-node (Extraction of QP details) to make them
accessible to the KNIME reporting engine.
The QCShrinker finally removes verbose and redundant data.
In addition to the qcML file, the workflow will export numerous plots and quality
metrics to the KNIME reporting engine. After the workflow has been executed, one can
open the report view in KNIME to create a PDF file containing the plots and quality
values.
References
[1] Premstaller, A., Oberacher, H., Huber, C. G., High-performance liquid
chromatography-electrospray ionization mass spectrometry of single- and
double-stranded nucleic acids using monolithic capillary columns. Anal Chem 2000, 72,
4386-4393.
[2] Geer, L. Y., Markey, S. P., Kowalak, J. A., Wagner, L., et al., Open mass
spectrometry search algorithm. J Proteome Res 2004, 3, 958-964.
[3] Craig, R., Beavis, R. C., TANDEM: matching proteins with tandem mass spectra.
Bioinformatics 2004, 20, 1466-1467.
[4] Bielow, C., 2012, p. 147 S.
[5] Nahnsen, S., Bertsch, A., Rahnenführer, J., Nordheim, A., Kohlbacher, O.,
Probabilistic consensus scoring improves tandem mass spectrometry peptide
identification. J Proteome Res 2011, 10, 3332-3343.
[6] Breitwieser, F. P., Muller, A., Dayon, L., Kocher, T., et al., General statistical
modeling of data from protein relative expression isobaric tags. J Proteome Res
2011, 10, 2758-2766.
[7] Griss, J., Jones, A. R., Sachsenberg, T., Walzer, M., et al., The mzTab Data
Exchange Format: communicating MS-based proteomics and metabolomics
experimental results to a wider audience. Molecular & Cellular Proteomics 2014.