11.1 Pipeline Integrations

Background

Starting in version 8.3, LabKey Server has featured a configurable pipeline, typically referred to as the Enterprise Pipeline. It can run user-defined executables (tasks) as part of a single workflow (pipeline). Tasks can be assigned to specific computing resources, including the web server, remote non-cluster computers, and clusters through Globus GRAM. In the 11.1 release, we will configure the pipeline and build out support for three customer workflows.

Sequest

Background

Our current Sequest integration was generously contributed by an external developer. Its implementation was patterned on the original approach for converting raw mass spec files into mzXML: it is a separate Tomcat webapp, and LabKey Server talks to it over HTTP to upload files for analysis and download the result files. The pipeline job runs inside the web server and polls the Sequest webapp to determine when the job is complete.

Approach

We will take the existing Sequest webapp code and convert it to run as an Enterprise Pipeline remote task. It will use the same file handling techniques as other Enterprise Pipeline jobs (expecting the files to be mounted as a file system instead of copied over HTTP), and it will no longer occupy a thread on the web server to poll the job status on the remote machine. We will continue to need custom Java code for this job type because we must generate a .params file based on the search protocol.

Feature Extractor

Background

The Katze lab also has a custom Tomcat webapp patterned on the original conversion server code. It runs an Agilent microarray tool called the feature extractor.

Approach

We will take the existing feature extractor webapp code and convert it to run as an Enterprise Pipeline remote task. It will use the same file handling techniques as other Enterprise Pipeline jobs (expecting the files to be mounted as a file system instead of copied over HTTP), and it will no longer occupy a thread on the web server to poll the job status on the remote machine. We can hopefully use the XML configuration already available to execute the tool, as sketched below.
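For illustration only, the kind of task registration we expect to write might look roughly like the following. This is not working configuration: the bean class, property names, location value, and ${...} substitution syntax are placeholders; the actual names would come from the pipeline's existing command/task support.

    <?xml version="1.0" encoding="UTF-8"?>
    <beans xmlns="http://www.springframework.org/schema/beans"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.springframework.org/schema/beans
                               http://www.springframework.org/schema/beans/spring-beans.xsd">
        <!-- Hypothetical task factory bean; class and property names are placeholders,
             not the real pipeline API. -->
        <bean id="featureExtractorTask" class="org.labkey.example.CommandTaskFactory">
            <!-- Run on the remote machine where the Agilent tool is installed. -->
            <property name="location" value="agilent-remote"/>
            <!-- Executable plus arguments; ${input.raw} and ${output.dir} stand in for
                 the job's input file and output directory. -->
            <property name="command" value="FeatureExtractor"/>
            <property name="arguments" value="-i ${input.raw} -o ${output.dir}"/>
        </bean>
    </beans>

The point of the sketch is only that the remote task is declared declaratively and assigned to a location, rather than being driven by a separate webapp that the web server polls over HTTP.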
MacCoss Pipeline Prototype

Background

The MacCoss lab is currently using a custom-developed Perl pipeline to process their mass spec data. They have their own suite of analysis tools, they run the analysis on a cluster, and they load the results into their own repository, msDAPL. They wish to transition away from their Perl scripts to the Enterprise Pipeline but continue to use their own repository. The first phase of this project is to prototype the pipeline so that LabKey Server can submit jobs to their cluster and usher the results through the entire tool suite.

Pipeline Overview

1. RAW files go into makeMS2.
2. Charge state determination (optional). Hardklor takes the cms1 file and produces a text file after looking for peptide isotope signatures; Bullseye takes the cms2 file and the Hardklor output as input and produces two cms2 files based on a good/bad split. Bad MS2 scans are thrown out, and Bullseye collapses the different isotopes of a peptide into a single corrected peak. This is similar/comparable to msprefix.
3. Search, with three options. Typically done as an unconstrained search, but a tryptic search is OK for demo purposes. The search is performed twice, once against the real database and once against a separate decoy database.
   a. Sequest (the only option for the proof of concept). Output is two .sqt text files, one from the real database and one from the decoy database.
   b. Library search (.tsv and SQLite output).
   c. Crux (can produce .sqt or other output types).
4. Convert the .sqt files to XML using sqt2pin, with the two .sqt files as inputs and a single output.
5. Percolator (takes the pin file as input, writes one XML file as output). This step combines multiple outputs (as in a fraction search).
6. Ping msDAPL to load the files: Percolator output, FASTA, and spectra.

Approach

We will be given access to a development machine that is inside the UW firewall and can submit jobs to the cluster. We will use a combination of XML configuration and custom Java code to run the various tools. Some tools take a parameters file that will need to be generated based on the input files and search configuration.

General Improvements

Although these improvements are not required to meet the scenarios described above, they would be greatly beneficial to administrators and users of the Enterprise Pipeline and should be implemented as time allows.

Improved Analysis Protocol Definition UI

Our native format for defining analysis protocols is the same as X!Tandem's, an XML file of the following format:

    <?xml version="1.0" encoding="UTF-8"?>
    <bioml>
        <note label="protein, cleavage site" type="input">[KR]|{P}</note>
        <note label="pipeline, data type" type="input">both</note>
        …
    </bioml>

We have a GUI that lets users specify a small subset of the supported parameters for MS2 jobs. All the other parameters, and all of the parameters for other job types, must be defined by the user in XML. A separate spec covers a rich user experience for editing these parameters in a generic, configurable way, but the addition of a simple table UI to set parameters would be an improvement. We'd have a two-tabbed editor: one tab would be the existing text area where the user enters the XML, and the other would be a simple two-column table UI that would let users add parameters as name-value pairs.

Improved Pipeline Configuration

Pipeline configuration is currently done with Spring XML files. These have the advantage of being administrator editable, but they are not simple or easy to author: you must know fully qualified Java class names and property names to be successful, and there is no XSD that enumerates the available options, so you must use an IDE like IntelliJ to get any kind of useful statement completion. We should support other tool configuration formats, like GenePattern's. Such a format may not allow for exactly the same expressive capabilities, but it covers many of the standard usages in a much easier format than our existing Spring format.
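As a purely illustrative sketch of the direction (the element and attribute names below are invented for this document; they are not GenePattern's actual manifest format and do not correspond to any existing LabKey schema), a flatter, schema-friendly descriptor for a command-line tool might look something like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Hypothetical simplified tool descriptor; every name here is a placeholder. -->
    <tool name="sqt2pin" location="cluster">
        <command>sqt2pin ${target.sqt} ${decoy.sqt} -o ${output.pin}</command>
        <input name="target.sqt" extension=".sqt"/>
        <input name="decoy.sqt" extension=".sqt"/>
        <output name="output.pin" extension=".pin"/>
    </tool>

Because the set of elements in such a format is small and fixed, it could ship with an XSD, giving administrators statement completion and validation without requiring any knowledge of the underlying Java classes.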