Spec - LabKey Server

Enhanced MS2 SEQUEST Pipeline and Visualizations
Peter Hussey
LabKey Software
Draft 2.1: December 22, 2010
Introduction
This document describes several enhancements to the MS2 pipeline implementation
that uses the SEQUEST scoring algorithm. These enhancements are funded by Novartis
AG and will be developed as part of the LabKey Server open source project.
In this project, the sequest.exe utility will be run as a "remote task runner" in Enterprise
Pipeline terminology, similar to a conversion server. We will also support indexed fasta
files as created by the makedb.exe utility. These two changes from the current
SequestQueue implementation should result in significantly improved throughput of
the analysis queue, enough to keep up with the rate of raw file generation by the
Orbitrap mass spec machine used by Novartis.
Other enhancements will include
• A mechanism that can be incorporated into a batch file that notifies LabKey
Server of a specific file posted to a pipeline directory. The mechanism initiates
processing on the file using a specified protocol.
• A Novartis-specific mechanism for using the CSV file generated by the Orbitrap
operator to specify the sample handling and the intended search protocol for
the set of raw files in a pipeline directory.
• New visualizations available from Run Details and Run Compare views.
This project will support the sequest.exe and makedb.exe utilities that come with
Proteome Discoverer 1.1 from Thermo Fisher, running on Windows 7 Professional.
Key Scenarios
All scenarios come from the User Requirement Specification created by Novartis.
1. Fasta file indexing. The user will specify via a checkbox in the search parameters
dialog (default to on) to use an indexed search.
a. Note: the creation of the index will become an automatic step
performed by the sequest remote task runner if the appropriate index for
a given search is not already present. Indexes will not be created or
shown in the protein admin console as described previously—they will
essentially be private to the sequest task runner.
2. Support for running a sequest search
a. using an indexed fasta file, if selected by the user
b. Logging of the sequest step at an appropriate level of detail.
c. Upon successful execution of the protocol, clean up the sequest work
directory containing the .DTA and .OUT files and the detailed sequest log
file.
3. Automated pipeline processing initiation based on a batch file executed to signal
LabKey Server and a csv file included with the raw files to be processed, giving
the identity of the sample and the name of a protocol specifying processing
directives.
4. Support for cancelling a sequest processing job that has not been started yet or
has been converted to mzXML but not yet sent to the sequest task runner.
5. New visualization of peptide coverage of the search engine protein within a run.
6. New visualization of peptide set overlaps between runs using a Venn diagram.
7. Merging of result files for Mudpit experiments that generate multiple raw files
for a single sample, as directed by a column in the .csv file for run name. If all
sample names in the csv are the same, the set of raw files identified in that csv
will be merged after scoring into single pep.xml and prot.xml result files. If all
of the sample names are different, the pep.xml and prot.xml files will be
kept separate. Any combination of some but not all duplicate run names within
a csv will be flagged as an error.
8. Support for running a conversion server on the same machine as the sequest
scoring, with the whole process from .raw file to loading .pep.xml and .prot.xml
files described by one protocol.
9. Support for setting up and verifying the basic configuration of the sequest
remote task runner via the Admin Console site settings, replacing the current
SequestQueue configuration with equivalent functionality.
More details can be found in the Novartis URS document. [1]
Fasta Indexes
Fasta indexes are generated by the makedb.exe utility. Building an index runs the
in-silico digestion process and model spectra generation ahead of time, so each index
is enzyme-specific. The results are stored in a file which is used by the
sequest.exe utility to make the searching of acquired spectra much faster. The other
tools in the TPP pipeline do not take advantage of indexed fasta files, so they need to be
handed the unindexed fasta path.
The configuration parameters for makedb come from a combination of a
makedb.params file and command line arguments. The file will be generated using
mostly the same tandem.xml values as are used in generating sequest.params.
Parameter File values
[1] Can we publish this at a public URL along with this spec?
The makedb.params file has two sections: one marked by [MAKEDB] at the top (in place of
[SEQUEST]), and a [STATIC MODIFICATIONS] section, which does not appear to be marked
separately in a sequest.params file. The source of the parameter values for makedb is
mostly the same as existing parameter handling for sequest.
Each entry below gives the sequest and/or makedb .params property, followed by the
tandem.xml GROUP and NAME that supply its value, the Default, the Priority [2], and Notes.

n/a
  GROUP, NAME: pipeline, use_index (NEW)
  Default: 0 = no
  Priority: R
  Notes: Alters scoring job to use an index, creating it first if it doesn't exist.

database_name (makedb); same as sequest: first_database_name
  GROUP, NAME: pipeline, index_name (NEW)
  Default: empty
  Priority: N
  Notes: If use_index=1, an empty value translates to the default name; see the
  index name section. A non-empty value should give the full final file name
  including extensions.

enzyme_info (makedb); same as sequest: enzyme_info
  GROUP, NAME: protein, cleavage_site
  Default: [RK]|{P}
  Priority: R
  Notes: Translated to enzyme_info the same way as it is today for sequest.params.
  The two values to test are:
    [X]|[X] translates to nonspecific 0 0 - -
    [RK]|{P} translates to trypsin 1 1 KR P

min_peptide_mass (makedb)
  GROUP, NAME: spectrum, minimum parent m+h
  Default: 350
  Priority: R
  Notes: Minimum length of parent peptide to consider.

max_peptide_mass (makedb)
  GROUP, NAME: spectrum, maximum parent m+h
  Default: 5000
  Priority: R
  Notes: Maximum length of parent peptide to consider.

max_num_internal_cleavage_sites; same as sequest
  GROUP, NAME: scoring, maximum_missed_cleavage_sites
  Default: 2
  Priority: N
  Notes: Maximum value is 5; for enzyme search. Novartis does not currently use
  enzymatic digestion.

use_mono/avg_masses (makedb); same as sequest: mass_type_parent
  GROUP, NAME: sequest, mass_type_parent
  Default: 1 [3]
  Priority: R
  Notes: 0=average masses, 1=monoisotopic masses.

[STATIC MODIFICATIONS]
  GROUP, NAME: residue, (all)
  Default: n/a
  Priority: N?
  Notes: Handle the same as for sequest.params; unclear whether Novartis uses
  static mods.

NOT CONFIGURABLE (written to param file as fixed values)

protein_or_nucleotide_dbase (makedb)
  GROUP, NAME: n/a
  Default: 0
  Notes: Not settable, written as 0 automatically.

nucleotide_reading_frames (makedb)
  GROUP, NAME: n/a
  Default: 0
  Notes: Not settable, written as 0 automatically.

CONFIGURED AT TASK RUNNER INSTANCE

sort_directory (makedb)
  Default: Empty
  Priority: P
  Notes: If empty, defaults to <databaseroot>\_temp

sort_program (makedb)
  Default: Empty
  Priority: P
  Notes: = sort.exe in the same directory as sequest.exe [4]

intermediate_directory (makedb)
  Default: Empty
  Priority: P
  Notes: If empty, becomes <databaseroot>\_intermediate

[2] Priority codes: R = required by client; N = "nice to have" generalization;
P = one proposed implementation of a requirement; B = current behavior, removing
could be a backward compatibility issue.
[3] The default value of mass_type_parent is incorrectly given as 0 in the docs; it should be 1.
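As a rough illustration of how the parameter values above could be turned into a makedb.params file, here is a minimal Python sketch. The "name = value" line layout is an assumption modeled on sequest.params, the function name render_makedb_params is hypothetical, and database_name is left out because the spec does not pin down how it is derived. Only the two cleavage_site translations named above are included.

```python
# Hypothetical sketch: render makedb.params text from tandem.xml-style values.
# The "name = value" layout is an assumption, not confirmed makedb syntax.

# The two cleavage_site values the spec names as test cases.
ENZYME_INFO = {
    "[X]|[X]": "nonspecific 0 0 - -",
    "[RK]|{P}": "trypsin 1 1 KR P",
}


def render_makedb_params(tandem, database_root, static_mods=None):
    """Return makedb.params text; `tandem` maps "group, name" keys to values."""
    cleavage = tandem.get("protein, cleavage_site", "[RK]|{P}")
    if cleavage not in ENZYME_INFO:
        raise ValueError(f"no enzyme_info translation for {cleavage!r}")
    lines = [
        "[MAKEDB]",
        f"enzyme_info = {ENZYME_INFO[cleavage]}",
        f"min_peptide_mass = {tandem.get('spectrum, minimum parent m+h', '350')}",
        f"max_peptide_mass = {tandem.get('spectrum, maximum parent m+h', '5000')}",
        "max_num_internal_cleavage_sites = "
        + tandem.get("scoring, maximum missed cleavage sites", "2"),
        f"use_mono/avg_masses = {tandem.get('sequest, mass_type_parent', '1')}",
        # Fixed values: not settable, always written as 0.
        "protein_or_nucleotide_dbase = 0",
        "nucleotide_reading_frames = 0",
        # Task-runner defaults used when the configured values are empty.
        f"sort_directory = {database_root}\\_temp",
        f"intermediate_directory = {database_root}\\_intermediate",
        "[STATIC MODIFICATIONS]",
    ]
    for name, value in (static_mods or {}).items():
        lines.append(f"{name} = {value}")
    return "\n".join(lines) + "\n"
```

Rejecting an unknown cleavage_site, rather than guessing a translation, reflects the concern elsewhere in this spec that a parameter mismatch produces invalid results without an error.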
Command line syntax for sequest pipeline tools
Makedb
MAKEDB v.5 (rev. 3), (C) 1998-2009
Molecular Biotechnology, Univ. of Washington, J.Eng/S.Morgan/J.Yates
Licensed to Thermo Fisher Scientific Inc.
makedb usage: [options]]
options = -Dstring where string specifies the fasta database to be indexed
-Ostring where string specifies the indexed database to be created
-Pstring where string specifies an alternate parameter file name
-Enumber where number specifies the minimum sequence length
-Snumber where number specifies the maximum sequence length
-F[-/+] Enable(+)/Disable(-) the use of multiple temp files. Default is enable
-U[-/+] Enable(+)/Disable(-) the unique sequence sort. Default is disabled
-I Display additional info about the indexing process
-H display the help
Mzxml2search
mzXML2search (TPP v4.3 JETSTREAM rev 1, Build 200909211148 (MSVC))
Usage: mzXML2search [options] *.mzXML
options = -dta or -mgf or -pkl or -xdta or -odta or -ms2 output format (default dta)
-F<num> where num is an int specifying the first scan
-L<num> where num is an int specifying the last scan
-C<n1>[-<n2>] "force charge(s)": where n1 is an integer
specifying the precursor charge state (or possible
charge range from n1 to n2 inclusive) to use; this option
forces input scans to be output with the user-specified
charge (or charge range)
-c<n1>[-<n2>] "suggest charge(s)": for scans which do not have a
precursor charge (or charge range) already determined in the
input file, use the user-specified charge (or charge range)
for those scans. Input scans which already have defined
charge (or charge range) are output with their original,
unchanged values.
-B<num> where num is a float specifying minimum MH+ mass, default=600.0 Da
-T<num> where num is a float specifying maximum MH+ scan, default=4200.0 Da
-P<num> where num is an int specifying minimum peak count, default=5
-I<num> where num is a float specifying minimum threshold for peak intensity, default=0.01
-M<n1>[-<n2>]where n1 is an int specifying MS level to export (default=2)
and n2 specifies an optional range of MS levels to export
-A<str> where str is the activation method, "CID" (default) or "ETD"
if activation method not in scans of mzXML file, this option is ignored
-h use hydrogen mass for charge ion (default is proton mass)
Sequest.exe
sequest usage: [options] [dtafiles]
options = -Dstring Where string specifies the database to be searched
-Pstring Where string specifies an alternate parameter file name
(sequest.params is the default parameters file)
-A Select correlation algorithm (+ = fft / - = cross product).
(Bioworks uses -A- by default)
-F Process dta files in the provided directory
-I Index the database in memory before searching ( + = on / - = off
-R Reads the dta list from a text file
-U Use a unified search file
-checklicense Display the Sequest license dialog
-H display the help
For example: sequest *.dta
Out2xml
out2xml(TPP v4.3 JETSTREAM rev 1, Build 200909211148 (MSVC))
usage: Out2XML <path to directory with out files> <# of top hits to report [1, 10]> (OPTIONS)
OPTIONS:
-m: use monoisotopic precursor weight (default: setting specified in sequest.params)
-a: use average precursor weight (default: setting specified in sequest.params)
-M: maldi mode
-all: output all peptides, don't filter out X containing peptides
-pI: compute and report peptide pI's
-P<path to -including- sequest.params file>: (default) <path to directory with out files>/sequest.params
-E<enzyme>
Where <enzyme> is:
trypsin - Cut: KR, No Cut: P, Sense: C-term (default)
ralphtrypsin - Cut: STKR, No Cut: P, Sense: C-term
stricttrypsin - Cut: KR, No Cut: none, Sense: C-term
argc - Cut: R, No Cut: P, Sense: C-term
aspn - Cut: D, No Cut: none, Sense: N-term
chymotrypsin - Cut: YWFM, No Cut: P, Sense: C-term
cnbr - Cut: M, No Cut: P, Sense: C-term
elastase - Cut: GVLIA, No Cut: P, Sense: C-term
formicacid - Cut: D, No Cut: P, Sense: C-term
gluc - Cut: DE, No Cut: P, Sense: C-term
gluc_bicarb - Cut: E, No Cut: P, Sense: C-term
iodosobenzoate - Cut: W, No Cut: terminal, Sense: C-term
lysc - Cut: K, No Cut: P, Sense: C-term
lysc-p - Cut: K, No Cut: none, Sense: C-term
lysn - Cut: K, No Cut: none, Sense: N-term
lysn_promisc - Cut: KASR, No Cut: none, Sense: N-term
nonspecific - Cut: all, No Cut: none, Sense: N/A
pepsina - Cut: FL, No Cut: terminal, Sense: C-term
protein_endopeptidase - Cut: P, No Cut: terminal, Sense: C-term
staph_protease - Cut: E, No Cut: terminal, Sense: C-term
tca - Cut: KR, No Cut: P, Sense: C-term
    - Cut: YWFM, No Cut: P, Sense: C-term
    - Cut: D, No Cut: none, Sense: N-term
trypsin/cnbr - Cut: KR, No Cut: P, Sense: C-term
trypsin_gluc - Cut: DEKR, No Cut: P, Sense: C-term
trypsin_k - Cut: K, No Cut: P, Sense: C-term
trypsin_r - Cut: R, No Cut: P, Sense: C-term
Command line usage in sequest pipeline
This is based on batch files that Novartis is currently using. Each program is listed
with the command line generated in the pipeline.
msconvert
msconvert <rawfile_directory\>*.raw --mzXML   [5]
makedb
makedb -O<absolute path to index to be created>
(the path does not include ".hdr" or other file type extensions)
Example:
makedb -OC:\database\_indexed\ipi.HUMAN.v3.70_Betv1a_2010_04_06_v2_NoEnzymeProtocol
mzxml2search
MzXML2Search.exe -O<mzXML_directory\><run_name> -dta <mzXML_directory\><mzXML_BaseFileName>.mzXML
(postprocess)
rem following creates the list of .dta files needed by sequest.exe for batch operation
dir /B <mzXML_directory\><run_name>\*.dta > <mzXML_directory\><run_name>\DtaFiles.txt
<generate sequest.params into <mzXML_directory\><run_name> directory>
sequest.exe
sequest.exe -R<mzXML_directory\><run_name>\DtaFiles.txt -F<mzXML_directory\><run_name> X   [6]
(postprocess)
rem regenerate <mzXML_directory\><run_name>\sequest.params pointing to non-indexed fasta   [7]
cd <mzXML_directory\>
ren <run_name> <mzXML_BaseFileName>
out2xml
Out2XML.exe <mzXML_BaseFileName> 1 -E<enzyme name> -all
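The command lines above could be assembled along the following lines. This is a minimal Python sketch under stated assumptions: build_commands and its argument names are hypothetical, it covers only the options shown in this section, and it returns argument lists (suitable for something like subprocess.run) rather than executing anything.

```python
from pathlib import PureWindowsPath


def build_commands(mzxml_dir, run_name, base_name, enzyme, index_path=None):
    """Return the argument lists for one run, in execution order.

    Postprocess steps (writing DtaFiles.txt, generating sequest.params,
    renaming the run directory) are outside this sketch.
    """
    mzxml_dir = PureWindowsPath(mzxml_dir)
    run_dir = mzxml_dir / run_name
    commands = []
    if index_path is not None:
        # Index path is passed without ".hdr" or any other extension.
        commands.append(["makedb", f"-O{index_path}"])
    commands.append([
        "MzXML2Search.exe",
        f"-O{run_dir}",
        "-dta",
        str(mzxml_dir / f"{base_name}.mzXML"),
    ])
    commands.append([
        "sequest.exe",
        f"-R{run_dir / 'DtaFiles.txt'}",
        f"-F{run_dir}",
        "X",  # trailing placeholder, present in the batch files this is based on
    ])
    commands.append(["Out2XML.exe", base_name, "1", f"-E{enzyme}", "-all"])
    return commands
```

Building argument lists instead of one command string sidesteps quoting problems for paths that contain spaces.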
Command line parameters in the sequest pipeline
As shown in the help screen printouts above, all of these tools have an extensive set of
optional command line options. In a prior release we added an
"mzxml2search" group and an "out2xml" group of tandem.xml parameters to set these
options when the tools are executed in the pipeline. But the list of these optional
params is already out of date with the versions of the tools we ship. I propose that we
stop adding individual tandem.xml parameters for optional cmd line parameters and
instead just add one generic "cmd_options" property per tool for handling all of the
options we don't set automatically. Also, we would not test these options or their
compatibility with each other or with the required parameters we pass. We test only
that the cmd_options mechanism can set at least one option.
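A minimal Python sketch of the proposed generic mechanism, using mzxml2search as the example. The function name is hypothetical, and splitting the cmd_options string on whitespace is an assumption (it would not handle option values that contain spaces). The user's options are passed through untested, between the options the pipeline generates and the mzXML file name(s).

```python
def mzxml2search_command(generated_options, mzxml_files, cmd_options=""):
    """Build the MzXML2Search argument list.

    generated_options: options the pipeline sets itself (e.g. -O, -dta).
    cmd_options: the raw "mzxml2search, cmd_options" string from tandem.xml,
    copied through without validation.
    """
    extra = cmd_options.split()  # assumption: options are whitespace-separated
    return ["MzXML2Search.exe", *generated_options, *extra, *mzxml_files]
```

For example, a cmd_options value of "-I0.05 -M2" would add the minimum-intensity and MS-level options from the help output above without the pipeline needing a dedicated tandem.xml parameter for either.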
[5] Msconvert is currently not working on the LK-SEQUEST machine; it complains about a missing DLL.
[6] sequest.exe has some strange Windows behavior that these parameters are set to work around. If you
launch sequest with just the -F parameter and *.dta as the file list (as was done with earlier versions of
sequest.exe) you get a Windows dialog prompt. The -R parameter points to a file containing a list of dta
files to process. -R successfully avoids the Windows prompt and might be useful someday for dealing out
different file lists to different processes. The final "X" character avoids a bug that appears to be
caused by a command line with no file list.
[7] This step makes sure that out2xml has the protein sequences and description strings from the unindexed
fasta.
The net use of these tandem.xml properties to set command options is shown below.
All behavior is as currently implemented except for those marked NEW.
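| Program | Parameters we set | Who sets | How set (parameters in tandem.xml) | Priority | Notes |
|---|---|---|---|---|---|
| makedb | -O | pipeline; required | specified name in tandem.xml or generated default (see Index Names) | R | |
| makedb | all others | user, optional | sequest, makedb_cmd_options (NEW) | N | copied as specified after -O |
| mzxml2search | -O, -dta, mzXML file names | pipeline, required | based on directory conventions | R | |
| mzxml2search | -B | pipeline, required | spectrum, minimum parent m+h | B | values translated from tandem.xml (existing behavior) |
| mzxml2search | -T | pipeline, required | spectrum, maximum parent m+h | B | values translated from tandem.xml (existing behavior) |
| mzxml2search | -P | pipeline, required | spectrum, minimum peaks | B | values translated from tandem.xml (existing behavior) |
| mzxml2search | -F | pipeline, required | mzxml2search, first scan | B | values translated from tandem.xml (existing behavior) |
| mzxml2search | -L | pipeline, required | mzxml2search, lastscan | B | values translated from tandem.xml (existing behavior) |
| mzxml2search | -C | pipeline, required | mzxml2search, charge | B | values translated from tandem.xml (existing behavior) |
| mzxml2search | -c | pipeline, required | mzxml2search, charge defaults | B | values translated from tandem.xml (existing behavior) |
| mzxml2search | -h | pipeline, required | mzxml2search, hydrogen mass | B | values translated from tandem.xml (existing behavior) |
| mzxml2search | all options except required | users, optional | mzxml2search, cmd_options (NEW) | N | simply pasted in between the ones we currently generate and the mzXML file name(s) |
| sequest | -F, -R | pipeline, required | based on directory conventions | R | set by sequest task runner; see footnote [6] for why we set these |
| sequest | all other options | users, optional | sequest, cmd_options (NEW) | N | some may conflict with what we set |
| out2xml | <path to out files dir> | pipeline, required | directory conventions | R | |
| out2xml | # of hits to report | pipeline, required | out2xml, top hits (default 1) | B | always required on cmd line |
| out2xml | -m or -a; -E<enzyme> | pipeline, required | sequest, mass_type_parent; protein, cleavage_site | B | we pass via sequest.params |
| out2xml | -all | pipeline, optional | out2xml, all | B | not sure if this works, says 0 is default but -all is passed |
| out2xml | -M | user, optional | out2xml, maldi mode | B | |
| out2xml | -pI | user, optional | out2xml, pI | B | |
| out2xml | all others | users, optional | out2xml, cmd_options (NEW) | N | |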
Specifying creation and use of indexes; index names
The "use_index" property is exposed in the UI as a checkbox and defaults (in
default.xml) to false. The effect of this property being set to true will be to put a
process node ahead of the sequest search step that is effectively "CheckCreateIndex". It
will look for the existence of an index file in its expected <database
location>["\_indexed"] directory. The name it will look for is either specified in the
index_name parameter or a generated default if index_name is empty. If no file is
found, the CheckCreateIndex looks for a marker file matching the index name but with
an additional "._inprocess" extension, indicating an index build in progress. If neither
file is found, a makedb job to create the index is started and a marker file written.
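The CheckCreateIndex decision described above can be sketched as follows. This is a minimal Python illustration, not the planned implementation: the function name, the returned status strings, and the start_makedb callback are all hypothetical. The sketch creates the marker file before starting the job, and uses exclusive-create mode so that two task runners checking at the same moment cannot both start a build; that ordering is a design choice of the sketch.

```python
from pathlib import Path


def check_create_index(index_file, start_makedb):
    """Return "ready", "building", or "started" for the given index file path.

    start_makedb: hypothetical callback that launches the makedb job.
    """
    index_file = Path(index_file)
    marker = index_file.with_name(index_file.name + "._inprocess")
    if index_file.exists():
        return "ready"
    index_file.parent.mkdir(parents=True, exist_ok=True)
    try:
        # Mode "x" fails if the marker already exists, so the existence
        # check and the marker creation happen in a single step.
        marker.open("x").close()
    except FileExistsError:
        return "building"
    start_makedb(index_file)
    return "started"
```

Removing the marker file once makedb finishes (or fails) is left outside this sketch.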
We expect the generated name to be the default usage. The generated name pattern is
<fasta file name>_<protocol_name>.hdr
Any characters in the protocol name that are illegal in file names are replaced by
underscores. The ".hdr" extension is the one that will be written to the sequest.params
file for the sequest.exe task step.
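A minimal Python sketch of the generated default name. The set of characters treated as illegal is an assumption (the characters Windows rejects in file names, plus control characters), and default_index_name is a hypothetical helper; the caller is assumed to pass the fasta file name in whatever form the pipeline settles on.

```python
import re

# Assumption: characters Windows does not allow in file names, plus control characters.
_ILLEGAL_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def default_index_name(fasta_file_name, protocol_name):
    """Build <fasta file name>_<protocol_name>.hdr with illegal characters replaced."""
    safe_protocol = _ILLEGAL_CHARS.sub("_", protocol_name)
    return f"{fasta_file_name}_{safe_protocol}.hdr"
```

With these assumptions, a protocol named "No/Enzyme:V1" run against "ipi.HUMAN.v3.70" would yield "ipi.HUMAN.v3.70_No_Enzyme_V1.hdr".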
We are building the index name based on the fasta and the search protocol name. This
means that two different protocols will not use the same index by default. This will
avoid the mistake of a mismatch between the search parameters and the index creation
parameters. For example if a search protocol reused an index but specified a cleavage
enzyme or a set of static modifications that were different from the values set when the
index was generated, sequest would produce invalid results but not generate an error.
Creation of a fasta index is an expensive operation, however, and there are several
types of changes to a protocol that would remain compatible with an existing index. To
handle these cases, the protocol author can set the "sequest, index_name" value
explicitly. The user can find out an index name to use by checking the experiment graph
for the fasta index data object.
Manual launching of sequest search jobs
Changes to sequest search dialog
• Check box added after selection of fasta file. Label "Use Database Index"
• Add a radio button or drop-down list to govern the setting of the pipeline,
data_type parameter. The group label should be "Process multiple files as" and
the options should read
o Separate Runs. Each file is scored and loaded as a separate run.
o Combined Run. Each file is scored and then the results are combined
into a single run. Used for MudPIT experiments where files represent
fractions of a single sample.
o Both
• Add feature to save a protocol without running it, for use with automatic
initiation mechanism
Launching jobs from Pipeline/File manager
In the File Manager, users will be able to select one or more .raw files or one or more
.mzXML files to initiate a search manually. This will work the same way as it does today,
with the additional feature of the multiple file handling as described above. If only one
file is selected then it is scored as a separate run only.
Automated launching of sequest search jobs
Automatic launching of search jobs as the files are available will be handled using the
following mechanisms:
SearchProtocols .csv file
A file containing the processing instructions for a set of raw files. The searchprotocols
file will be in CSV format with the following contents:
| Column Name | Column value | Example |
|---|---|---|
| FileName | the mzXML file name, same as the raw file that will be posted, but with .mzXML suffix in place of .raw | theFileName.mzXML |
| Path | Path to the directory where the raw files will be posted [8] | C:\pipe_root\project1 |
| InstrumentMethod | machine method used in generating the raw file | Method A |
| Position | ordinal row id within the file | 1 |
| ProtocolToRun | the name of the saved search protocol in LabKey that will be called to process the file | SequestNoEnzymeV09 |
| Sample | a string value that identifies a sample | S11 |
| LabkeyFolder | the path to the LabKey project and folder where the raw files will be processed and loaded | /Project1/SearchFolder2 |
The file may also contain additional columns defined by the user.
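To make the file layout concrete, here is a minimal Python sketch that reads a searchprotocols file and checks for the columns listed above. It assumes the first row is a header row carrying those column names, which the spec does not state explicitly; read_search_protocols is a hypothetical name. Extra user-defined columns are kept as-is.

```python
import csv
import io

REQUIRED_COLUMNS = [
    "FileName", "Path", "InstrumentMethod", "Position",
    "ProtocolToRun", "Sample", "LabkeyFolder",
]


def read_search_protocols(text):
    """Parse searchprotocols CSV text into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")
    return list(reader)
```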
FileUploaded Notification
LabKey will provide a mechanism that can be called from a batch file on the Orbitrap
console. The event signals the Pipeline manager that a new file has been written
completely to some folder under the pipeline directory. The parameters passed in this
notification call are the same as the properties FileName, Path, ProtocolToRun, and
LabkeyFolder described above.
[8] Is Path the directory specified as the pipeline sees it, or is there some path translation?
The two types of files we expect in the drop directory will have the following argument
values in their notifications
FileUploaded for .raw file

| Parameter Name | Parameter value | Example Value |
|---|---|---|
| FileName | the raw file name, with .raw suffix | theFileName.raw |
| Path | Path to the directory where the file was posted | C:\pipe_root\project1 |
| ProtocolToRun | the name of the saved protocol in LabKey that will be called to process the file | msConvertProtocol |
| LabkeyFolder | the path to the LabKey project and folder where the file will be processed and loaded | /Project1/SearchFolder2 |
FileUploaded for .csv file

| Parameter Name | Parameter value | Example |
|---|---|---|
| FileName | the full csv file name | SearchParams.csv |
| Path | Path to the directory where the file was posted | C:\pipe_root\project1 |
| ProtocolToRun | the assay definition name of a DescribeSample assay | DescribeAndCallSearch |
| LabkeyFolder | the path to the LabKey project and folder where the file will be processed and loaded | /Project1/SearchFolder2 |
When LabKey receives a notification, it will run the specified protocol on the filename in
the notification, as if the user had selected the single file in the file manager and had
selected ImportData and chosen the ProtocolToRun. The notification mechanism need
only handle one file per notification.
When the CSV file notification comes in, the DescribeSample assay is loaded with the
csv file as the data values. In addition, the ProtocolToRun is called with the set of
FileNames listed in the CSV. If all of the Sample values are the same, the
ProtocolToRun is called once with n files selected. If any of the sample values are
different, the ProtocolToRun is called n times with one file selected each time.
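The grouping rule in the paragraph above can be sketched as follows. This is a hedged Python illustration; plan_protocol_calls and its strict flag are hypothetical. By default it follows this paragraph literally (any difference means one call per file). With strict=True it also applies the Key Scenarios rule that a mix of duplicated and distinct sample names is an error.

```python
def plan_protocol_calls(rows, strict=False):
    """Return a list of file groups; ProtocolToRun is called once per group.

    rows: dictionaries with at least "FileName" and "Sample" keys.
    """
    files = [row["FileName"] for row in rows]
    samples = [row["Sample"] for row in rows]
    distinct = len(set(samples))
    if distinct == 1:
        return [files]  # one call with all n files selected
    if strict and distinct != len(samples):
        raise ValueError("some but not all sample names are duplicated")
    return [[name] for name in files]  # n calls, one file each
```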
The mzXML files may not all be available at the time the CSV arrives. They may either be
still not uploaded from the MS instrument, or they may still be undergoing conversion.
The search run(s) kicked off by the csv upload must handle this condition and be able to
start the work it can start (e.g. sequest search for a file already through conversion) and
wait for the files listed in the csv but not yet available. The end of a conversion job
(task?) must feed directly into a waiting search job without requiring another external
notification.
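One way to picture the waiting behavior is a small tracker that a search run keeps for the files named in the csv. The Python sketch below is illustrative only: PendingSearch is a hypothetical class, and how a finished conversion task would actually reach it inside the pipeline is not specified here.

```python
class PendingSearch:
    """Tracks which files listed in the csv are still awaited by a search run."""

    def __init__(self, expected_files):
        self.waiting = set(expected_files)
        self.started = []

    def file_available(self, name):
        """Record a converted file; return True if its search can start now."""
        if name not in self.waiting:
            return False  # not part of this run, or already started
        self.waiting.discard(name)
        self.started.append(name)
        return True

    @property
    def complete(self):
        """True once every file listed in the csv has arrived."""
        return not self.waiting
```

A conversion task that finishes would call file_available directly, so no second external notification is needed; a combined (MudPIT) run would wait for complete before merging results.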
Configuring and Verifying Sequest Pipeline in the Admin Console
tbd
Visualizations
Peptide coverage map
Peptide run comparison pie chart
Summary of proposed changes
| Category | Change/Feature | Priority [9] | Notes |
|---|---|---|---|
| generating parameters and command line options | translate new & existing entries in tandem.xml | | See Priority column in table |
| fasta index | allow explicit setting of index name | N | useful to override our assumptions about whether a given index matches a protocol |
| manual launch | UI for use_index | R | |
| manual launch | UI for setting pipeline, datatype parameter | N | |
| manual launch | same protocol does Mudpit/individual files based on # of files selected | P | need some mechanism for csv to ask for MudPIT or separate file scoring; we said based on sample names being all identical, not changing protocol name |
| manual launch | Save protocol | R | used for generating a protocol name to go in csv |
| automated launch | csv file is saved and contents associated with an MS2 run | P | Seems like the right way to link csv properties to MS2 results |
| automated launch | csv file is treated as a sample prep data file | P | |
| automated launch | loading of csv as sample prep assay triggers search job submission | P | correspondence between File Manager selections and csv columns seems simpler to test and generally useful |
| automated launch | FileUploaded event | R | equivalence to UI upload + import seems generally useful |
| automated launch | handling different sequences of raw and csv files getting posted or missing post | R | |

[9] Priority codes: R = required by client; N = "nice to have" generalization of client requirement;
P = proposed implementation of a client requirement