10.3 genotyping

Sequencing & Genotyping in LabKey 10.3

Author: Adam Rauch

Last updated: November 30, 2010

Background

The O’Connor lab at the University of Wisconsin has designed a sequencing workflow that runs samples through a Roche 454 sequencer and (in some cases) processes these runs with a custom genotyping analysis pipeline created in Galaxy, an open-source server that processes genomics data. LabKey Software is adding support for this workflow to LabKey Server beginning with the 10.3 release. Simon Lank is the designer of the genotyping workflow and our key contact with the O’Connor lab; he has been instrumental in providing details about their pipeline and genotyping in general, and has assisted greatly in the design of the new LabKey capabilities. Ben Bimber has also provided design input to the specification.

The sequencing workflow has a number of important characteristics and key terms that are described and defined below:

• A single run often includes sequences from multiple samples. Each sequence is biologically tagged with a 10-base barcode that identifies the sample it belongs to. After sequences are detected in the sample, the barcode is isolated, removed, and used to join the results with the corresponding sample information. The lab uses 153 unique barcodes (reused across runs) and, for simplicity, refers to each by its MID (Multiplex Identifier), an integer assigned to each barcode.

• Running a sample through the sequencer produces an SFF file, a large binary file. The SFF file is usually converted to a reads file, a TSV file in which each row includes a read name, sample MID, sequence, and quality score for that read.

• A sample library is the set of samples that were sequenced in a sequencing run.

• Every reads file and the corresponding metrics files produced by the instrument are imported into LabKey via an automated pipeline.

• In some experiments, genotyping analysis is performed on the sequencing data. An analysis can be performed on the entire sample library or on a subset of samples (e.g., samples of a particular species). The list of samples, the reads corresponding to those samples, and a subset of DNA reference sequences are sent to the Galaxy server, which executes the genotyping workflow.

• In short, the Galaxy genotyping workflow aligns the sequences using BLAT (an open-source alignment program) and produces a match table. Each row of this table includes a read name, an allele hit for that read, and the sequence length and directional information for that hit. Importantly, a single read can have multiple hits.
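As an illustration of the reads file described above, the following minimal Python sketch parses a small TSV and counts reads per MID. The column names (name, mid, sequence, quality), the space-separated quality values, and the use of an empty MID for an unmatched barcode are assumptions made for this example; the lab's actual file layout may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical reads file content; real column names and formats may differ.
SAMPLE_READS_TSV = (
    "name\tmid\tsequence\tquality\n"
    "read001\t1\tACGTACGT\t30 31 32 33 30 31 32 33\n"
    "read002\t2\tTTGCA\t20 21 22 23 24\n"
    "read003\t1\tGGCC\t35 35 35 35\n"
    "read004\t\tACGT\t10 10 10 10\n"  # barcode not matched to any MID
)


def parse_reads(handle):
    """Yield one dict per read; an empty MID field becomes None."""
    for row in csv.DictReader(handle, delimiter="\t"):
        yield {
            "name": row["name"],
            "mid": int(row["mid"]) if row["mid"] else None,
            "sequence": row["sequence"],
            "quality": [int(q) for q in row["quality"].split()],
        }


def reads_per_mid(reads):
    """Count reads for each MID (None collects reads with no MID)."""
    return Counter(read["mid"] for read in reads)


reads = list(parse_reads(io.StringIO(SAMPLE_READS_TSV)))
counts = reads_per_mid(reads)
print(counts)
```

Because the MID is just an integer key, joining reads to sample information reduces to a lookup on that value.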

A more detailed diagram of the genotyping pipeline, created by Simon, is available here.

The genotyping application runs on the same LabKey server and in the same project as the Electronic Health Record application; integration between these two applications is an important long-term goal. Galaxy runs on a separate server, but LabKey and Galaxy have full read/write access to the same network file system and the ability to communicate with each other directly via HTTP.

To support the O’Connor genotyping workflow, LabKey 10.3 adds four major new capabilities, which are detailed in the sections below.

Manage Reference Sequences and Samples

LabKey adds the ability to manage a master list of allele reference sequences and manage multiple sets of samples used in the workflow.

Manage Reference Sequences

The lab manages a master list of DNA sequences (currently about 16,000 sequences) and associated annotations (species, locus, geographic origin, relations, lineage, etc.) used in genotyping experiments. The lab staff populates a MySQL database with sequences obtained externally or discovered by the lab; they use a Lasso application written by Ben to manage this database. The sequences and their names change periodically and the database is updated fairly regularly (monthly or so). According to Simon, each sequence has a unique id (separate from the name and sequence) that does not change, although sequences occasionally disappear or get merged.

When defining an analysis, the lab selects a subset of sequences to match against (generally a few hundred sequences; for example, all sequences for a given species and locus). When results are uploaded into LabKey, they link to their sequences. For these reasons, LabKey maintains and periodically updates a copy of the reference sequence list from the MySQL database. This requires support for MySQL external schemas and a UI the lab can use to periodically update the sequences.

Managing the master list directly in LabKey is a long-term goal, but it’s not critical for 10.3.

Manage Samples

The lab needs to manage groups of samples and their properties via LabKey, later submitting one group with each run. Key properties for each sample include name, 10-base DNA barcode, and MID. For 10.3, the plan is to store sample information in generic lists that Simon defines and the lab populates; these lists are configured in the module admin UI to allow joining results with sample information.

Create Sequencing Runs and Import Reads

The first step is to create a new sequencing run record within LabKey. This might happen before, during, or after the sequencer has processed the sample; therefore, the run can be created at one point and tied to a specific reads file at a later point. The lab stores run metadata in a list they create. The list must include a run number and a sample library number; all other properties are defined by the lab.

LabKey provides two options for importing a sequencing run:

• Automated pipeline. In this scenario, a script on the instrument machine uses the LabKey API to import metrics files and determine the run number associated with the run. It copies the reads file and metrics files to a network file system accessible to LabKey. It then invokes the import reads action via HTTP, authenticating as an authorized user and specifying the reads file and run number as parameters. This mechanism allows for a fully automated sequencing pipeline.

• Manual import. An authorized user can use the pipeline browser to navigate to a reads file and import it manually. After selecting import, the user selects a run number from a drop-down list that displays all currently unassigned run metadata records.

After initiating the import via either approach, LabKey creates a new import run record (including path to the reads file, import user, import date, and a link to the metadata record) and imports the reads.
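The final step of the automated pipeline, invoking the import reads action over HTTP, could look roughly like the Python sketch below. It only builds the request and does not send it. The action name, URL layout, parameter names, container path, and the use of HTTP basic authentication are all hypothetical placeholders chosen for illustration; none of them are taken from the LabKey API.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical action name; the real LabKey action and URL layout may differ.
IMPORT_ACTION = "genotyping-importReads.view"


def build_import_request(base_url, container, reads_path, run_number, user, password):
    """Build (but do not send) a request asking the server to import a reads file."""
    # Parameter names "reads" and "run" are assumptions for this sketch.
    query = urlencode({"reads": reads_path, "run": run_number})
    url = f"{base_url.rstrip('/')}/{container.strip('/')}/{IMPORT_ACTION}?{query}"
    # Basic authentication is assumed here only to show "authenticating as an authorized user".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return Request(url, method="POST", headers={"Authorization": f"Basic {token}"})


request = build_import_request(
    base_url="https://labkey.example.org",
    container="ExampleProject/Genotyping",
    reads_path="/data/runs/206/reads.txt",
    run_number=206,
    user="pipeline-user",
    password="not-a-real-password",
)
print(request.get_method(), request.full_url)
```

An instrument-side script would pass the resulting request to urllib.request.urlopen once the reads and metrics files have been copied to the shared file system.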

Perform a Genotyping Analysis

Users can initiate genotyping analyses on sequencing runs. A genotyping analysis can be started at any time by clicking the “Add Analysis” button associated with each run. Multiple analyses can be performed on a single run. For convenience, a genotyping analysis can also be initiated as part of the manual import reads process.

After clicking “Add Analysis,” a page appears requesting three pieces of information:

1. Subset of DNA reference sequences to search. The user selects from a drop-down list containing all custom views that have been defined on the sequences table. The lab can define whatever filters are needed and name them appropriately. Leaving the drop-down blank selects all sequences.

2. Subset of samples to analyze. By default, all samples are analyzed. A subset of samples can be selected, in which case only the reads corresponding to those samples will be matched.

3. An optional description of the analysis.

Initiating multiple genotyping analyses on a single sequencing run, providing different sets of samples and reference sequences for each, provides the flexibility to genotype multiple species and loci within a single run.
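Restricting an analysis to a subset of samples amounts to keeping only the reads whose MID belongs to one of the selected samples. The Python sketch below shows one way this could work; the in-memory data layout and the species property are assumptions for illustration, not the LabKey implementation.

```python
# Hypothetical sample list; "species" stands in for any lab-defined property.
samples = [
    {"name": "animal-A", "mid": 1, "species": "rhesus"},
    {"name": "animal-B", "mid": 2, "species": "cynomolgus"},
    {"name": "animal-C", "mid": 3, "species": "rhesus"},
]

reads = [
    {"name": "read001", "mid": 1},
    {"name": "read002", "mid": 2},
    {"name": "read003", "mid": 3},
    {"name": "read004", "mid": None},  # barcode not matched to any MID
]


def reads_for_samples(reads, samples, keep_sample=None):
    """Return the reads belonging to the selected samples (all samples by default)."""
    selected = [s for s in samples if keep_sample is None or keep_sample(s)]
    selected_mids = {s["mid"] for s in selected}
    return [r for r in reads if r["mid"] in selected_mids]


rhesus_reads = reads_for_samples(reads, samples, lambda s: s["species"] == "rhesus")
all_sample_reads = reads_for_samples(reads, samples)
print([r["name"] for r in rhesus_reads])
```

Running a second analysis with a different predicate (and a different reference sequence subset) on the same run would cover another species or locus.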

Once submitted, LabKey uses the Galaxy web API to create a new data library on the Galaxy server and import the files to that data library. At that point, the user must manually import the files into their current history and start the workflow. (Unfortunately, the initial Galaxy API does not yet support automating these steps.) When the workflow completes, it automatically signals LabKey to import the results via a custom Galaxy task written in Java.

After confirming validity and completeness of the run, LabKey launches a pipeline job to load the results. The pipeline job loads the matches into a temporary table, then executes a single query that:

1. Groups reads by sample MID and computes total reads per sample, including non-matches; this total is used to calculate the percentage coverage for each allele grouping.

2. Groups matches by read, concatenating multiple matches per read into a comma-separated list of alleles, and computes direction information.

3. Joins sample names, reads, total reads, and matches, grouping the result by sample and allele grouping, then computes total matches, matches as a percent of total reads, average sequence length, and directional totals.

The result of this query is written to the permanent Matches and Alleles tables, associated with the run and linking alleles to their sequences and sample names to their samples. The temporary table is then deleted.
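The three aggregation steps can be sketched in plain Python as follows. This is not the SQL query LabKey runs; it is an illustrative in-memory version under stated assumptions: the field names are hypothetical, average sequence length is taken from the read's sequence, and direction totals are omitted because the source does not specify how direction is represented.

```python
from collections import Counter, defaultdict


def summarize_matches(samples, reads, matches):
    """Aggregate matches per (sample, allele grouping), following the three steps above."""
    # Step 1: total reads per MID, including reads with no matches.
    total_reads = Counter(r["mid"] for r in reads)

    # Step 2: collect every allele hit for each read.
    alleles_by_read = defaultdict(set)
    for m in matches:
        alleles_by_read[m["read"]].add(m["allele"])

    # Step 3: join reads to samples and group by (sample, comma-separated alleles).
    sample_by_mid = {s["mid"]: s["name"] for s in samples}
    groups = defaultdict(list)
    for r in reads:
        alleles = alleles_by_read.get(r["name"])
        if not alleles or r["mid"] not in sample_by_mid:
            continue  # unmatched read, or MID with no sample record
        key = (sample_by_mid[r["mid"]], ",".join(sorted(alleles)))
        groups[key].append(r)

    rows = []
    for (sample, allele_group), grouped in sorted(groups.items()):
        total = total_reads[grouped[0]["mid"]]
        rows.append({
            "sample": sample,
            "alleles": allele_group,
            "matches": len(grouped),
            "percent": 100.0 * len(grouped) / total,
            "avg_length": sum(len(r["sequence"]) for r in grouped) / len(grouped),
        })
    return rows


# Hypothetical example data.
samples = [{"name": "animal-A", "mid": 1}, {"name": "animal-B", "mid": 2}]
reads = [
    {"name": "r1", "mid": 1, "sequence": "ACGTACGT"},
    {"name": "r2", "mid": 1, "sequence": "ACGTAC"},
    {"name": "r3", "mid": 1, "sequence": "ACGT"},  # no allele match
    {"name": "r4", "mid": 2, "sequence": "ACGTACGTAC"},
]
matches = [
    {"read": "r1", "allele": "A*01"},
    {"read": "r1", "allele": "A*02"},
    {"read": "r2", "allele": "A*02"},
    {"read": "r2", "allele": "A*01"},
    {"read": "r4", "allele": "B*07"},
]

rows = summarize_matches(samples, reads, matches)
for row in rows:
    print(row)
```

Sorting the alleles before joining them means two reads with the same set of hits fall into the same allele grouping regardless of the order in which the hits were reported.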

Analyze Results

Reads can be viewed, filtered, sorted, and exported to standard formats (TSV and Excel) via the LabKey grid UI. Reads can be filtered or grouped by MID, including filtering for reads whose barcode could not be matched to an MID. Reads or filtered subsets of reads can be exported to FASTQ format for analysis in external tools.
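For reference, a FASTQ record consists of four lines: a header beginning with "@", the sequence, a "+" separator, and one quality character per base. The Python sketch below writes reads in that form, assuming each read carries a list of integer Phred quality scores and using the common Phred+33 encoding; how LabKey stores or encodes quality values is not stated in this document.

```python
def to_fastq(reads, offset=33):
    """Format reads as FASTQ text, encoding integer Phred scores with the given offset."""
    records = []
    for read in reads:
        if len(read["quality"]) != len(read["sequence"]):
            raise ValueError(f"quality/sequence length mismatch for {read['name']}")
        quality = "".join(chr(q + offset) for q in read["quality"])
        records.append(f"@{read['name']}\n{read['sequence']}\n+\n{quality}\n")
    return "".join(records)


# Hypothetical read used only to show the output shape.
example_reads = [{"name": "read001", "sequence": "ACGT", "quality": [30, 31, 32, 40]}]
fastq_text = to_fastq(example_reads)
print(fastq_text, end="")
```

The length check guards against writing a record whose quality string does not line up with its bases, which many downstream tools reject.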

Genotyping analyses can be viewed, filtered, sorted, and exported as well. Sample names and properties can be joined with matches. Individual alleles link to their sequence.

Custom queries can be written to view unmatched sequences (sequences corresponding to a genotyped sample that didn’t produce any allele matches). Quality control queries can be written to analyze metrics across time or determine the most effective MIDs.

Additional analysis features will be developed in future versions based on the lab’s requirements.
