Spec - LabKey Server

11.1 Pipeline Integrations
Background
Starting in version 8.3, LabKey Server has featured a configurable pipeline, typically referred to as the
Enterprise Pipeline. It can run user-defined executables (tasks) as part of a single workflow (pipeline).
Tasks can be assigned to specific computing resources, including the web server, remote non-cluster
computers, and clusters through Globus GRAM.
In the 11.1 release, we will configure the pipeline and build out support for three customer workflows.
Sequest
Background
Our current Sequest integration was generously contributed by an external developer. Its
implementation was patterned on the original approach for converting raw mass spec files into mzXML.
It is a separate Tomcat webapp. LabKey Server talks to it over HTTP to upload files for analysis and
download the result files. The pipeline job runs inside the web server and polls the Sequest webapp to
determine when the job is complete.
Approach
We will take the existing Sequest webapp code and convert it to run as an Enterprise Pipeline remote
task. It will use the same file copying techniques as other Enterprise Pipeline jobs (expecting the files to
be mounted as a file system, instead of copied over HTTP). It will no longer occupy a thread on the web
server to poll the job status on the remote machine.
We will continue to need custom Java code for this job type because we will need to generate a .params
file based on the search protocol.
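As a rough sketch of what that generation could key off of, the search protocol might carry Sequest-scoped
entries in the existing bioml format, which the Java task would translate into the .params file. The
parameter names below are hypothetical placeholders, not a final naming convention:
<?xml version="1.0" encoding="UTF-8"?>
<bioml>
    <!-- Hypothetical Sequest-scoped parameters; the Java task would map each of
         these into the corresponding line of the generated .params file -->
    <note label="pipeline, database" type="input">Bovine_mini.fasta</note>
    <note label="sequest, peptide_mass_tolerance" type="input">2.0</note>
    <note label="sequest, maximum_missed_cleavages" type="input">2</note>
</bioml>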
Feature Extractor
Background
The Katze lab also has a custom Tomcat webapp patterned on the original conversion server code. It
runs an Agilent microarray tool called the feature extractor.
Approach
As with Sequest, we will take the existing feature extractor webapp code and convert it to run as an Enterprise Pipeline remote
task. It will use the same file copying techniques as other Enterprise Pipeline jobs (expecting the files to
be mounted as a file system, instead of copied over HTTP). It will no longer occupy a thread on the web
server to poll the job status on the remote machine.
We can hopefully use the XML configuration already available to execute the tool.
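If that configuration is sufficient, the invocation might be declared along the following lines. The bean
id, class, and property names are illustrative placeholders rather than the actual configuration schema:
<!-- Illustrative placeholder names only; not the actual configuration schema -->
<bean id="featureExtractorTask"
      class="org.labkey.example.pipeline.CommandTaskFactorySettings">
    <property name="statusName" value="FEATURE EXTRACTION"/>
    <property name="exePath" value="FeatureExtraction"/>
    <!-- input type is an assumption; the Agilent tool would dictate the real value -->
    <property name="inputExtension" value=".tif"/>
</bean>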
MacCoss Pipeline Prototype
Background
The MacCoss lab is currently using a custom-developed Perl pipeline to process their mass spec data.
They have their own suite of analysis tools. They are using a cluster to run the analysis, and have their
own repository, msDAPL, where the results are loaded. They wish to transition away from their Perl
scripts to the Enterprise Pipeline but continue to use their own repository.
The first phase of this project is to prototype the pipeline so that LabKey Server can submit jobs to their
cluster and usher the results through the entire tool suite.
Pipeline Overview
1. RAW goes into makeMS2, which produces the cms1/cms2 files consumed in the next step
2. Charge state determination (optional). Hardklor (takes the cms1 file, produces a text file after
looking for peptide isotope signatures) and Bullseye (takes the cms2 files and Hardklor output as input,
produces two cms2 files based on a good/bad split). Bad MS2 scans are thrown out. Bullseye collapses
the different isotope peaks of a peptide into a single corrected peak.
Similar/comparable to msprefix.
3. Search. Three options. Typically done as unconstrained, but OK to do as tryptic for demo
purposes. Performed twice, once with the real database and once with a separate decoy database.
a. Sequest (only this for the proof of concept). Output is two .sqt files (plain text), one from
the real database and one from the decoy.
b. Library (.tsv and SQLite output)
c. Crux (can produce .sqt or other output types)
4. Convert .sqt to XML using sqt2pin, with the two .sqt files as inputs and a single pin file as output
5. Percolator (takes the pin file as input, writes one XML file as output). This step combines multiple
outputs (à la fraction search)
6. Ping msDAPL to load the files: Percolator output, FASTA, spectra
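Taken together, the steps above amount to an ordered task sequence. The sketch below is purely
illustrative; the element and task names are hypothetical, not the actual configuration schema:
<!-- Illustrative only; the element and task names are hypothetical -->
<taskPipeline id="maccossMs2Pipeline">
    <task ref="makeMS2"/>        <!-- RAW to cms1/cms2 -->
    <task ref="hardklor"/>       <!-- optional: peptide isotope signatures from cms1 -->
    <task ref="bullseye"/>       <!-- optional: good/bad split of cms2 scans -->
    <task ref="sequestSearch"/>  <!-- run against both the real and decoy databases -->
    <task ref="sqt2pin"/>        <!-- two .sqt files in, one pin file out -->
    <task ref="percolator"/>     <!-- pin file in, one XML file out -->
    <task ref="msdaplLoad"/>     <!-- ping msDAPL to load the results -->
</taskPipeline>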
Approach
We will be given access to a development machine that is inside the UW firewall and can submit jobs to
the cluster.
We will use a combination of XML configuration and custom Java code to run the various tools. Some
tools take a parameters file that will need to be generated based on the input files and search
configuration.
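For instance, per-tool settings could be scoped within a single protocol file in the same bioml style used
elsewhere, with the Java code expanding each group into that tool's native parameters file or command
line. All of the parameter names here are hypothetical:
<bioml>
    <!-- Hypothetical per-tool parameters; the Java code would expand each prefix
         into that tool's native parameters file or command-line switches -->
    <note label="hardklor, instrument resolution" type="input">100000</note>
    <note label="bullseye, retention time window" type="input">1.0</note>
    <note label="percolator, q-value cutoff" type="input">0.01</note>
</bioml>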
General Improvements
Although these improvements are not required to meet the scenarios described above, they would be
greatly beneficial to administrators and users of the Enterprise Pipeline and should be implemented as
time allows.
Improved Analysis Protocol Definition UI
Our native format for defining analysis protocols is the same as X!Tandem’s, an XML file of the following
format:
<?xml version="1.0" encoding="UTF-8"?>
<bioml>
    <note label="protein, cleavage site" type="input">[KR]|{P}</note>
    <note label="pipeline, data type" type="input">both</note>
    …
</bioml>
We have a GUI that lets users specify a small subset of the supported parameters for MS2 jobs. All the
other parameters, and all of the parameters for other job types, must be defined by the user in XML.
A separate spec covers a rich user experience for editing these parameters in a generic, configurable
way, but the addition of a simple table UI to set parameters would be an improvement. We’d have a
two-tabbed editor. One would be the existing text area where the user enters the XML. The other would
be a simple two-column table UI that would let users add parameters as name-value pairs.
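For example, a row in that table with the name "protein, cleavage site" and the value "[KR]|{P}" would
simply round-trip to the corresponding <note> element shown above:
<note label="protein, cleavage site" type="input">[KR]|{P}</note>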
Improved Pipeline Configuration
Current XML pipeline configuration is done with Spring XML files. These have the advantage of being
administrator-editable, but they are not simple or easy to author. You must know fully qualified Java
class names and property names to be successful. There is no XSD that enumerates the available options
– you must use an IDE like IntelliJ to get any kind of useful statement completion.
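To illustrate the authoring burden, a task definition today looks roughly like the following; the class and
property names shown are representative placeholders rather than the exact LabKey classes:
<!-- Representative placeholders only; real definitions require the fully qualified class names -->
<bean id="mzxmlConversionTask"
      class="org.labkey.example.pipeline.cmd.ConvertTaskFactorySettings">
    <property name="cloneName" value="mzxmlConverter"/>
    <property name="commands">
        <list>
            <bean class="org.labkey.example.pipeline.cmd.CommandTaskFactorySettings">
                <property name="exePath" value="msconvert"/>
            </bean>
        </list>
    </property>
</bean>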
We should support other tool configuration formats, like GenePattern's. It may not allow for exactly the
same expressive capabilities, but it covers many of the standard usages with a much easier format than
our existing Spring format.