Enhanced MS2 SEQUEST Pipeline and Visualizations

Peter Hussey
LabKey Software
Draft 2.1: December 22, 2010

Introduction

This document describes several enhancements to the MS2 pipeline implementation that uses the SEQUEST scoring algorithm. These enhancements are funded by Novartis AG and will be developed as part of the LabKey Server open source project.

In this project, the sequest.exe utility will be run as a "remote task runner" in Enterprise Pipeline terminology, similar to a conversion server. We will also support indexed fasta files as created by the makedb.exe utility. These two changes from the current SequestQueue implementation should significantly improve the throughput of the analysis queue, enough to keep up with the rate of raw file generation by the Orbitrap mass spec machine used by Novartis.

Other enhancements will include:

- A mechanism that can be incorporated into a batch file that notifies LabKey Server of a specific file posted to a pipeline directory. The mechanism initiates processing on the file using a specified protocol.
- A Novartis-specific mechanism for using the CSV file generated by the Orbitrap operator to specify the sample handling and the intended search protocol for the set of raw files in a pipeline directory.
- New visualizations available from Run Details and Run Compare views.

This project will support the sequest.exe and makedb.exe utilities that come with Proteome Discoverer 1.1 from Thermo Fisher, running on Windows 7 Professional.

Key Scenarios

All scenarios come from the User Requirement Specification created by Novartis.

1. Fasta file indexing. The user will specify via a checkbox in the search parameters dialog (default to on) whether to use an indexed search.
   a. Note: the creation of the index will become an automatic step performed by the sequest remote task runner if the appropriate index for a given search is not already present.
      Indexes will not be created or shown in the protein admin console as described previously; they will essentially be private to the sequest task runner.
2. Support for running a sequest search
   a. Using an indexed fasta file, if selected by the user.
   b. Logging of the sequest step at an appropriate level of detail.
   c. Upon successful execution of the protocol, cleanup of the sequest work directory containing the .DTA and .OUT files and the detailed sequest log file.
3. Automated pipeline processing initiation based on a batch file executed to signal LabKey Server and a csv file included with the raw files to be processed, giving the identity of the sample and the name of a protocol specifying processing directives.
4. Support for cancelling a sequest processing job that has not been started yet, or that has been converted to mzXML but not yet sent to the sequest task runner.
5. New visualization of peptide coverage of the search engine protein within a run.
6. New visualization of peptide set overlaps between runs using a Venn diagram.
7. Merging of result files for MudPIT experiments that generate multiple raw files for a single sample, as directed by a column in the .csv file for run name. If all sample names in the csv are the same, the set of raw files identified in that csv will be merged after scoring into single pep.xml and prot.xml result files. If all of the sample names are different, the pep.xml and prot.xml files will be kept separate. Any combination of some but not all duplicate run names within a csv will be flagged as an error.
8. Support for running a conversion server on the same machine as the sequest scoring, with the whole process from .raw file to loading .pep.xml and .prot.xml files described by one protocol.
9. Support for setting up and verifying the basic configuration of the sequest remote task runner via the Admin Console site settings, replacing the current SequestQueue configuration with equivalent functionality.
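The merge rule in scenario 7 can be sketched as a small decision function. This is only an illustration under stated assumptions: the names `Grouping` and `decide_grouping` are hypothetical and not part of LabKey Server, and treating a single listed file as a separate run is an assumption borrowed from the manual-launch behavior described later in this document.

```python
from enum import Enum


class Grouping(Enum):
    """Hypothetical labels for the two outcomes of the merge rule."""
    MERGED = "merged"      # one combined pep.xml / prot.xml for the whole csv
    SEPARATE = "separate"  # one pep.xml / prot.xml per raw file


def decide_grouping(sample_names):
    """Apply the scenario 7 rule to the sample names listed in one csv.

    All names identical -> MERGED; all names distinct -> SEPARATE;
    some but not all duplicated -> error.
    """
    names = [name.strip() for name in sample_names]
    if not names:
        raise ValueError("the csv lists no files")
    if len(names) == 1:
        # Assumption: a single file is always scored as a separate run.
        return Grouping.SEPARATE
    distinct = set(names)
    if len(distinct) == 1:
        return Grouping.MERGED
    if len(distinct) == len(names):
        return Grouping.SEPARATE
    raise ValueError("some but not all sample names are duplicated")
```

For instance, `decide_grouping(["S11", "S11", "S11"])` selects the merged path, while `["S11", "S11", "S12"]` is rejected as an error.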
More details can be found in the Novartis URS document. [1]

Fasta Indexes

Fasta indexes are generated by the makedb.exe utility. An index runs the in-silico digestion process and model spectra generation ahead of time, and is therefore enzyme-specific. The results are stored in a file that the sequest.exe utility uses to make the searching of acquired spectra much faster. The other tools in the TPP pipeline do not take advantage of indexed fasta files, so they need to be handed the unindexed fasta path.

The configuration parameters for makedb come from a combination of a makedb.params file and command line arguments. The file will be generated using mostly the same tandem.xml values as are used in generating sequest.params.

[1] Can we publish this at a public URL along with this spec?

Parameter File values

The makedb.params file has two sections: one marked by [MAKEDB] at the top (in place of [SEQUEST]), and a [STATIC MODIFICATIONS] section, which does not appear to be marked separately in a sequest.params file. The source of the parameter values for makedb is mostly the same as the existing parameter handling for sequest.

| Sequest and/or makedb .params properties | GROUP | NAME | Default | Priority [2] | Notes |
|---|---|---|---|---|---|
| n/a | pipeline | use_index (NEW) | 0 = no | R | Alters scoring job to use an index, creating it first if it doesn't exist |
| database_name (makedb); same as sequest: first_database_name | pipeline | index_name (NEW) | empty | N | If use_index=1, an empty value translates to the default name; see index name section. A non-empty value should give the full final file name including extensions. |
| enzyme_info (makedb); same as sequest: enzyme_info | protein | cleavage_site | "[RK]|{P}" | R | Translated to enzyme_info the same way as it is today for sequest.params. The two values to test are: "[X]|[X]", which translates to "nonspecific 0 0 - -", and "[RK]|{P}", which translates to "trypsin 1 1 KR P" |
| min_peptide_mass (makedb) | spectrum | minimum parent m+h | 350 | R | Minimum mass of parent peptide to consider |
| max_peptide_mass (makedb) | spectrum | maximum parent m+h | 5000 | R | Maximum mass of parent peptide to consider |
| max_num_internal_cleavage_sites; same as sequest | scoring | maximum_missed_cleavage_sites | 2 | N | Maximum value is 5; for enzyme search; Novartis does not currently use enzymatic digestion |
| use_mono/avg_masses (makedb); same as sequest: mass_type_parent | sequest | mass_type_parent | 1 [3] | R | 0=average masses, 1=monoisotopic masses |
| [STATIC MODIFICATIONS] | residue | (all) | n/a | N? | Handle the same as for sequest.params; unclear whether Novartis uses static mods |
| NOT CONFIGURABLE (written to param file as fixed values) | | | | | |
| protein_or_nucleotide_dbase (makedb) | n/a | n/a | 0 | | Not settable, written as 0 automatically |
| nucleotide_reading_frames (makedb) | n/a | n/a | 0 | | Not settable, written as 0 automatically |
| CONFIGURED AT TASK RUNNER INSTANCE | | | | | |
| sort_directory (makedb) | | | Empty | P | If empty, defaults to <databaseroot>\_temp |
| sort_program (makedb) | | | Empty | P | = sort.exe in the same directory as sequest.exe [4] |
| intermediate_directory (makedb) | | | Empty | P | If empty, becomes <databaseroot>\_intermediate |

[2] Priority codes: R = required by client; N = "nice to have" generalization; P = one proposed implementation of a requirement; B = current behavior, removing could be a backward compatibility issue.
[3] The default value of mass_type_parent is incorrectly given as 0 in the docs; it should be 1.

Command line syntax for sequest pipeline tools

Makedb help output:

    MAKEDB v.5 (rev. 3), (C) 1998-2009 Molecular Biotechnology, Univ. of Washington, J.Eng/S.Morgan/J.Yates
    Licensed to Thermo Fisher Scientific Inc.
    makedb usage: [options]]
    options =
        -Dstring  where string specifies the fasta database to be indexed
        -Ostring  where string specifies the indexed database to be created
        -Pstring  where string specifies an alternate parameter file name
        -Enumber  where number specifies the minimum sequence length
        -Snumber  where number specifies the maximum sequence length
        -F[-/+]   Enable(+)/Disable(-) the use of multiple temp files. Default is enable
        -U[-/+]   Enable(+)/Disable(-) the unique sequence sort. Default is disabled
        -I        Display additional info about the indexing process
        -H        display the help

Mzxml2search help output:

    mzXML2search (TPP v4.3 JETSTREAM rev 1, Build 200909211148 (MSVC))
    Usage: mzXML2search [options] *.mzXML
    options =
        -dta or -mgf or -pkl or -xdta or -odta or -ms2  output format (default dta)
        -F<num>        where num is an int specifying the first scan
        -L<num>        where num is an int specifying the last scan
        -C<n1>[-<n2>]  "force charge(s)": where n1 is an integer specifying the precursor charge state (or possible charge range from n1 to n2 inclusive) to use; this option forces input scans to be output with the user-specified charge (or charge range)
        -c<n1>[-<n2>]  "suggest charge(s)": for scans which do not have a precursor charge (or charge range) already determined in the input file, use the user-specified charge (or charge range) for those scans. Input scans which already have defined charge (or charge range) are output with their original, unchanged values.
        -B<num>        where num is a float specifying minimum MH+ mass, default=600.0 Da
        -T<num>        where num is a float specifying maximum MH+ scan, default=4200.0 Da
        -P<num>        where num is an int specifying minimum peak count, default=5
        -I<num>        where num is a float specifying minimum threshold for peak intensity, default=0.01
        -M<n1>[-<n2>]  where n1 is an int specifying MS level to export (default=2) and n2 specifies an optional range of MS levels to export
        -A<str>        where str is the activation method, "CID" (default) or "ETD"
                       if activation method not in scans of mzXML file, this option is ignored
        -h             use hydrogen mass for charge ion (default is proton mass)

Sequest.exe help output:

    sequest usage: [options] [dtafiles]
    options =
        -Dstring  Where string specifies the database to be searched
        -Pstring  Where string specifies an alternate parameter file name (sequest.params is the default parameters file)
        -A        Select correlation algorithm (+ = fft / - = cross product). (Bioworks uses -A- by default)
        -F        Process dta files in the provided directory
        -I        Index the database in memory before searching ( + = on / - = off
        -R        Reads the dta list from a text file
        -U        Use a unified search file
        -checklicense  Display the Sequest license dialog
        -H        display the help
    For example: sequest *.dta

Out2xml help output:

    out2xml(TPP v4.3 JETSTREAM rev 1, Build 200909211148 (MSVC))
    usage: Out2XML <path to directory with out files> <# of top hits to report [1, 10]> (OPTIONS)
    OPTIONS:
        -m: use monoisotopic precursor weight (default: setting specified in sequest.params)
        -a: use average precursor weight (default: setting specified in sequest.params)
        -M: maldi mode
        -all: output all peptides, don't filter out X containing peptides
        -pI: compute and report peptide pI's
        -P<path to -including- sequest.params file>: (default) <path to directory with out files>/sequest.params
        -E<enzyme> Where <enzyme> is:
            trypsin - Cut: KR, No Cut: P, Sense: C-term (default)
            ralphtrypsin - Cut: STKR, No Cut: P, Sense: C-term
            stricttrypsin - Cut: KR, No Cut: none, Sense: C-term
            argc - Cut: R, No Cut: P, Sense: C-term
            aspn - Cut: D, No Cut: none, Sense: N-term
            chymotrypsin - Cut: YWFM, No Cut: P, Sense: C-term
            cnbr - Cut: M, No Cut: P, Sense: C-term
            elastase - Cut: GVLIA, No Cut: P, Sense: C-term
            formicacid - Cut: D, No Cut: P, Sense: C-term
            gluc - Cut: DE, No Cut: P, Sense: C-term
            gluc_bicarb - Cut: E, No Cut: P, Sense: C-term
            iodosobenzoate - Cut: W, No Cut: terminal, Sense: C-term
            lysc - Cut: K, No Cut: P, Sense: C-term
            lysc-p - Cut: K, No Cut: none, Sense: C-term
            lysn - Cut: K, No Cut: none, Sense: N-term
            lysn_promisc - Cut: KASR, No Cut: none, Sense: N-term
            nonspecific - Cut: all, No Cut: none, Sense: N/A
            pepsina - Cut: FL, No Cut: terminal, Sense: C-term
            protein_endopeptidase - Cut: P, No Cut: terminal, Sense: C-term
            staph_protease - Cut: E, No Cut: terminal, Sense: C-term
            tca - Cut: KR, No Cut: P, Sense: C-term
                - Cut: YWFM, No Cut: P, Sense: C-term
                - Cut: D, No Cut: none, Sense: N-term
            trypsin/cnbr - Cut: KR, No Cut: P, Sense: C-term
            trypsin_gluc - Cut: DEKR, No Cut: P, Sense: C-term
            trypsin_k - Cut: K, No Cut: P, Sense: C-term
            trypsin_r - Cut: R, No Cut: P, Sense: C-term

Command line usage in sequest pipeline

This is based on batch files that Novartis is currently using.
Program and command line generated in pipeline:

msconvert

    msconvert <rawfile_directory\>*.raw --mzXML [5]

makedb

    makedb -O<absolute path to index to be created>, not including ".hdr" or other file type extensions
    Example: makedb -OC:\database\_indexed\ipi.HUMAN.v3.70_Betv1a_2010_04_06_v2_NoEnzymeProtocol

mzxml2search

    MzXML2Search.exe -O<mzXML_directory\><run_name> -dta <mzXML_directory\><mxXML_BaseFileName>.mzXML

(postprocess)

    rem following creates the list of .dta files needed by sequest.exe for batch operation
    dir /B <mzXML_directory\><run_name>\*.dta > <mzXML_directory\><run_name>\DtaFiles.txt
    <generate sequest.params into <mzXML_directory\><run_name> directory>

sequest.exe

    sequest.exe -R<mzXML_directory\><run_name>\DtaFiles.txt -F<mzXML_directory\><run_name> X [6]

(postprocess)

    rem regenerate <mzXML_directory\><run_name>\sequest.params pointing to non-indexed fasta [7]
    cd <mzXML_directory\>
    ren <run_name> <mxXML_BaseFileName>

out2xml

    Out2XML.exe <mxXML_BaseFileName> 1 -E<enzyme name> -all

Command line parameters in the sequest pipeline

As shown in the help output above, all of these tools have an extensive set of command options that are not required. In a prior release we added an "mzxml2search" group and an "out2xml" group of tandem.xml parameters to set these options when the tools are executed in the pipeline. But the list of these optional parameters is already out of date with the versions of the tools we ship. I propose that we stop adding individual tandem.xml parameters for optional command line parameters and instead add one generic "cmd_options" property per tool for handling all of the options we don't set automatically. Also, we would not test these options or their compatibility with each other or with the required parameters we pass. We would test only that the cmd_options mechanism can set at least one option.

[5] Msconvert is currently not working on the LK-SEQUEST machine; it complains about a missing DLL.
[6] sequest.exe has some strange Windows behavior that these parameters are set to work around. If you launch sequest with just the -F parameter and *.dta as the file list (as was done with earlier versions of sequest.exe), you get a Windows dialog prompt. The -R parameter points to a file containing a list of dta files to process. -R successfully avoids the Windows prompt and might be useful someday for dealing out different file lists to different processes. The final "X" character avoids a bug that appears to be caused by a command line with no file list.
[7] This step makes sure that out2xml has the protein sequences and description strings from the unindexed fasta.

The net use of these tandem.xml properties to set command options is shown below. All behavior is as currently implemented except for those marked NEW.

Program makedb mzxml2search Parameters we set -O Who sets all others user, optional pipeline; required -O -dta mzXML file names -B -T -P -F -L -C -c -h sequest out2xml pipeline, required Priority R Notes N copied as specified after -O B values translated from tandem.xml (existing behavior) N set by sequest task runner R all options except required -F -R users, optional spectrum, minimum parent m+h spectrum, maximum parent m+h spectrum, minimum peaks mzxml2search, first scan mzxml2search, lastscan mzxml2search, charge mzxml2search, charge defaults mzxml2search, hydrogen mass mzxml2search, cmd_options (NEW) pipeline, required based on directory conventions R simply pasted in between the ones we currently generate and the mzXML file name(s) see footnote for why we set these all other options <path to out files dir> users, optional pipeline, required sequest, cmd_options (NEW) N some may conflict with what we set directory conventions R # of hits to report -m or –a E<enzyme> -all pipeline, required pipeline, required out2xml, top hits (default 1) B always required on cmd line, sequest, mass_type_parent protein, cleavage_site B we pass
via sequest.params pipeline, optional user, optional user, options out2xml, all B not sure if this works, says 0 is default but –all is passed out2xml, maldi mode out2xmls, pI out2xmls, cmd_options (NEW) B -M -pI all others users, optional How set (parameters in tandem.xml) specified name in tandem.xml or generated default (see Index Names) sequest, makedb_cmd_options (NEW) based on directory conventions N

Specifying creation and use of indexes; index names

The "use_index" property is exposed in the UI as a checkbox and defaults (in default.xml) to false. Setting this property to true puts a process node ahead of the sequest search step that is effectively "CheckCreateIndex". This node looks for an index file in its expected <database location>["\_indexed"] directory. The name it looks for is either the one specified in the index_name parameter or a generated default if index_name is empty. If no index file is found, CheckCreateIndex looks for a marker file matching the index name but with an additional "._inprocess" extension, indicating an index build in progress. If neither file is found, a makedb job to create the index is started and a marker file is written.

We expect the generated name to be the default usage. The generated name pattern is

    <fasta file name>_<protocol_name>.hdr

Any characters in the protocol name that are illegal in file names are replaced by underscores. The ".hdr" extension is the one that will be written to the sequest.params file for the sequest.exe task step.

We are building the index name based on the fasta and the search protocol name. This means that two different protocols will not use the same index by default, which avoids a mismatch between the search parameters and the index creation parameters.
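The default index name and the CheckCreateIndex decision described above can be sketched as follows. This is a minimal illustration, not the pipeline's implementation: all function names are hypothetical, dropping the fasta file's own extension is an assumption (it matches the shape of the makedb example elsewhere in this document), the set of illegal characters is the Windows file name rule, and what the node does while a build is in progress ("wait_for_build") is assumed rather than specified.

```python
import re
from pathlib import PureWindowsPath

# Characters Windows does not allow in file names, plus control characters.
_ILLEGAL_IN_FILE_NAME = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def default_index_name(fasta_path, protocol_name):
    """Build <fasta file name>_<protocol_name>.hdr (hypothetical helper)."""
    # Assumption: the fasta extension is dropped before the protocol name is appended.
    fasta_stem = PureWindowsPath(fasta_path).stem
    safe_protocol = _ILLEGAL_IN_FILE_NAME.sub("_", protocol_name)
    return f"{fasta_stem}_{safe_protocol}.hdr"


def resolve_index_name(fasta_path, protocol_name, index_name=""):
    """An explicit index_name wins; an empty value falls back to the default."""
    return index_name if index_name else default_index_name(fasta_path, protocol_name)


def check_create_index_action(index_exists, marker_exists):
    """Decide what the CheckCreateIndex node does, given what is on disk."""
    if index_exists:
        return "use_index"
    if marker_exists:
        # Assumption: a "._inprocess" marker means another job is building the index.
        return "wait_for_build"
    return "start_makedb"
```

Because the protocol name is part of the generated name, two protocols get distinct indexes unless index_name is set explicitly.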
For example, if a search protocol reused an index but specified a cleavage enzyme or a set of static modifications different from the values set when the index was generated, sequest would produce invalid results without generating an error.

Creation of a fasta index is an expensive operation, however, and there are several types of changes to a protocol that would remain compatible with an existing index. To handle these cases, the protocol author can set the "sequest, index_name" value explicitly. The user can find an index name to use by checking the experiment graph for the fasta index data object.

Manual launching of sequest search jobs

Changes to sequest search dialog

- Check box added after selection of fasta file. Label: "Use Database Index".
- Add a radio button or drop-down list to govern the setting of the pipeline, data_type parameter. The group label should be "Process multiple files as" and the options should read
  o Separate Runs. Each file is scored and loaded as a separate run.
  o Combined Run. Each file is scored and then the results are combined into a single run. Used for MudPIT experiments where files represent fractions of a single sample.
  o Both
- Add feature to save a protocol without running it, for use with the automatic initiation mechanism.

Launching jobs from Pipeline/File manager

In the File Manager, users will be able to select one or more .raw files or one or more .mzXML files to initiate a search manually. This will work the same way as it does today, with the additional feature of the multiple file handling described above. If only one file is selected, it is scored as a separate run only.

Automated launching of sequest search jobs

Automatic launching of search jobs as the files become available will be handled using the following mechanisms:

SearchProtocols .csv file

A file containing the processing instructions for a set of raw files.
The searchprotocols file will be in CSV format with the following contents:

| Column Name | Column value | Example |
|---|---|---|
| FileName | the mzXML file name, same as the raw file that will be posted, but with .mzXML suffix in place of .raw | theFileName.mzXML |
| Path | Path to the directory where the raw files will be posted [8] | C:\pipe_root\project1 |
| InstrumentMethod | machine method used in generating the raw file | Method A |
| Position | ordinal row id within the file | 1 |
| ProtocolToRun | the name of the saved search protocol in LabKey that will be called to process the file | SequestNoEnzymeV09 |
| Sample | a string value that identifies a sample | S11 |
| LabkeyFolder | the path to the LabKey project and folder where the raw files will be processed and loaded | /Project1/SearchFolder2 |

The file may also contain additional columns defined by the user.

FileUploaded Notification

LabKey will provide a mechanism that can be called from a batch file on the Orbitrap console. The event signals the Pipeline manager that a new file has been written completely to some folder under the pipeline directory. The parameters passed in this notification call are the same as the properties FileName, Path, ProtocolToRun, and LabkeyFolder described above.

[8] Is Path the directory specified as pipeline sees it? Or is there some path translation?
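The SearchProtocols file layout and the four notification properties described above can be illustrated with a short parsing sketch. The assumptions here are mine, not the spec's: that the file has a header row using the column names as written, that all seven listed columns are mandatory, and that `Operator` is merely an example of a user-defined extra column. The helper names are hypothetical.

```python
import csv
import io

# Columns listed in the spec; treating all of them as mandatory is an assumption.
REQUIRED_COLUMNS = [
    "FileName", "Path", "InstrumentMethod", "Position",
    "ProtocolToRun", "Sample", "LabkeyFolder",
]
# Properties the spec says are passed in a FileUploaded notification.
NOTIFICATION_FIELDS = ["FileName", "Path", "ProtocolToRun", "LabkeyFolder"]


def read_search_protocols(text):
    """Parse SearchProtocols csv text into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(text))
    present = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in present]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    return list(reader)


def notification_parameters(row):
    """Pick out the four properties shared with the FileUploaded notification."""
    return {field: row[field] for field in NOTIFICATION_FIELDS}


# Example row from the spec, plus one hypothetical user-defined column.
EXAMPLE_CSV = (
    "FileName,Path,InstrumentMethod,Position,ProtocolToRun,Sample,LabkeyFolder,Operator\n"
    "theFileName.mzXML,C:\\pipe_root\\project1,Method A,1,"
    "SequestNoEnzymeV09,S11,/Project1/SearchFolder2,jdoe\n"
)

rows = read_search_protocols(EXAMPLE_CSV)
params = notification_parameters(rows[0])
```

Extra user-defined columns pass through untouched in each row dictionary, so they remain available to whatever loads the csv afterwards.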
The two types of files we expect in the drop directory will have the following argument values in their notifications.

FileUploaded for .raw file

| Parameter Name | Parameter value | Example Value |
|---|---|---|
| FileName | the raw file name, with .raw suffix | theFileName.raw |
| Path | Path to the directory where the file was posted | C:\pipe_root\project1 |
| ProtocolToRun | the name of the saved protocol in LabKey that will be called to process the file | msConvertProtocol |
| LabkeyFolder | the path to the LabKey project and folder where the file will be processed and loaded | /Project1/SearchFolder2 |

FileUploaded for .csv file

| Parameter Name | Parameter value | Example |
|---|---|---|
| FileName | the full csv file name | SearchParams.csv |
| Path | Path to the directory where the file was posted | C:\pipe_root\project1 |
| ProtocolToRun | the assay definition name of a describe sample assay | DescribeAndCallSearch |
| LabkeyFolder | the path to the LabKey project and folder where the file will be processed and loaded | /Project1/SearchFolder2 |

When LabKey receives a notification, it will run the specified protocol on the filename in the notification, as if the user had selected the single file in the file manager, selected ImportData, and chosen the ProtocolToRun. The notification mechanism need only handle one file per notification.

When the CSV file notification comes in, the DescribeSample assay is loaded with the csv file as the data values. In addition, the ProtocolToRun is called with the set of FileNames listed in the CSV. If all of the Sample values are the same, the ProtocolToRun is called once with n files selected. If any of the sample values are different, the ProtocolToRun is called n times with one file selected each time.

The mzXML files may not all be available at the time the CSV arrives. They may not yet have been uploaded from the MS instrument, or they may still be undergoing conversion. The search run(s) kicked off by the csv upload must handle this condition and be able to start the work that can be started (e.g.
sequest search for a file already through conversion) and wait for the files listed in the csv but not yet available. The end of a conversion job (task?) must feed directly into a waiting search job without requiring another external notification.

Configuring and Verifying Sequest Pipeline in the Admin Console

tbd

Visualizations

Peptide coverage map

Peptide run comparison pie chart

Summary of proposed changes

Category generating parameters and command line options fasta index Change/Feature translate new & existing entries in tandem.xml Priority [9] Notes See Priority column in table allow explicit setting of index name N useful to override our assumptions about whether a given index matches a protocol manual launch UI for use_index UI for setting pipeline, datatype parameter same protocol does Mudpit/individual files based on # of files selected R N Save protocol R csv file is saved and contents associated with an MS2 run csv file is treated as a sample prep data file loading of csv as sample prep assay triggers search job submission FileUploaded event R handling different sequences of raw and csv files getting posted or missing post R automated launch P P P P need some mechanism for csv to ask for MudPIT or separate file scoring; we said based on sample names being all identical, not changing protocol name used for generating a protocol name to go in csv Seems like the right way to link csv properties to MS2 results correspondence between File Manager selections and csv columns seems simpler to test and generally useful equivalence to UI upload + import seems generally useful

[9] Priority codes: R = required by client; N = "nice to have" generalization of client requirement; P = proposed implementation of a client requirement.