Bioinformatics Resource Center (BRC) Operating Procedures and Policies
Please thoroughly review these policies and procedures. There are three sections:
 Standard Analysis of Next-Generation Sequences generated at UWBC
 User analyses conducted using UWBC hardware and/or software
 Custom sequence analyses performed by BRC
If you have any questions regarding this document or any other aspect of BRC services, please
contact us at brc@biotech.wisc.edu .
I. Standard Analysis of Next-Generation Sequences generated at UWBC
BRC provides standard data processing for next-generation (Roche GS-FLX and Illumina)
sequencing performed at the UWBC DNA Sequencing Facility. The cost for these standard
informatics services is included in the sequencing cost. It is strongly advised that you meet with
both the Next-gen sequencing and the BRC staff during the planning stage of your sequencing
runs. It is critical that you fill in the informatics section of the Illumina or Roche submission
forms. If you have any questions about the BRC part of the sequencing submission form please
contact BRC before your run is performed.
You will be notified by sequencing staff when your run is underway. BRC personnel will begin
informatics processing of the data as soon as the run has completed on the instrument. Usually
data will be available within 5 business days of the completion of the Illumina pipeline.
Standard processing differs between sequencing platforms, and the associated software is
rapidly changing. BRC uses the most up-to-date versions of standard software that can be
robustly integrated into the pipeline at the time service is provided.
Illumina data processing: The Illumina software pipeline CASAVA is used to generate
sequence data. If a suitable reference genome is available, the ELAND component of the
pipeline will be used to align the reads of your sample to that genome, largely as a means to
assess the quality of the run. Typically, alignments are performed using default settings unless
otherwise indicated on the BRC job submission form. Following the completion of the pipeline,
a standard quality control (QC) tool is used to produce a report containing diagnostic information
such as the number of reads, average base-calling quality and the presence of contaminants
such as Illumina adapters and other over-represented sequences. Using the QC report and a
summary HTML document produced by the pipeline, and in some cases other metrics, BRC and
Next-gen sequencing staff will jointly decide whether the run met quality standards.
If the run was successful, BRC will notify you directly that your sequences are ready. If the QC
report indicates a problem with your run, you will be notified by the Next-gen sequencing staff in
order to discuss the appropriate steps to take next.
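The kind of diagnostics named above (number of reads, average base-calling quality, over-represented sequences) can be illustrated with a minimal sketch. This is not the QC tool used by BRC. The function names are hypothetical, and the code assumes four-line FASTQ records with Illumina 1.3-style (offset 64) quality characters.

```python
from collections import Counter
from io import StringIO


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def qc_summary(handle, offset=64, top_n=3):
    """Return the read count, mean base quality and most frequent sequences."""
    n_reads = 0
    n_bases = 0
    quality_sum = 0
    seen = Counter()
    for _, sequence, quality in read_fastq(handle):
        n_reads += 1
        n_bases += len(quality)
        # Assumption: quality characters encode Phred scores with the given offset.
        quality_sum += sum(ord(char) - offset for char in quality)
        seen[sequence] += 1
    mean_quality = quality_sum / n_bases if n_bases else 0.0
    return {
        "reads": n_reads,
        "mean_quality": mean_quality,
        "most_common": seen.most_common(top_n),
    }


# Three toy reads: 'h' encodes quality 40 and 'T' encodes quality 20 at offset 64.
example = StringIO("@r1\nACGT\n+\nhhhh\n@r2\nACGT\n+\nTTTT\n@r3\nGGCC\n+\nhhhh\n")
summary = qc_summary(example)
print(summary["reads"])                   # 3
print(round(summary["mean_quality"], 2))  # 33.33
print(summary["most_common"][0])          # ('ACGT', 2)
```

A sequence that dominates the `most_common` list in real data, such as an adapter, is the sort of contaminant a QC report would flag.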
BRC will not automatically trim or remove contaminants such as the phiX-174 DNA that is used
as a control with Illumina sequencing. In the case of small RNA sequencing, and other very
short-sequence protocols, adapters will be removed. Such trimming will not be performed with
data generated by other protocols without explicit permission. Additionally, we will not trim the
end of reads to remove bases with low base-calling quality. Such value-added activities have to
be explicitly requested and are not covered by the charges assessed by the DNA Sequencing
Facility.
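For users who wish to trim low-quality read ends themselves, the following is a minimal sketch of one simple approach: removing trailing bases whose quality falls below a threshold. It is not a BRC procedure. The function name, the threshold of 20 and the offset of 64 (Illumina 1.3-style scores) are assumptions chosen for illustration.

```python
def trim_low_quality_tail(sequence, quality, min_quality=20, offset=64):
    """Drop bases from the 3' end while their quality is below min_quality."""
    keep = len(quality)
    while keep > 0 and ord(quality[keep - 1]) - offset < min_quality:
        keep -= 1
    return sequence[:keep], quality[:keep]


# 'h' = 40, 'T' = 20 and 'B' = 2 at offset 64, so only the two 'B' bases are removed.
print(trim_low_quality_tail("ACGTACGT", "hhhhhTBB"))  # ('ACGTAC', 'hhhhhT')
```

More elaborate schemes (sliding windows, running sums) exist; this version only shows the basic idea.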
Deliverables for Illumina GAIIx data: The DNA Sequencing Facility aims for a minimum of 12
million "passing filter" reads/lane for libraries prepared by UWBC. When you are notified that
your run has succeeded, BRC staff will provide instructions for downloading your sequences
along with the Illumina summary.htm file and the QC reports. It is possible to arrange for other
mechanisms of data delivery, for example on a portable hard drive or a USB thumb drive.
The default format for data delivery is FASTQ with Illumina 1.3-style quality scores. If
requested, sequences can also be provided in FASTA or ELAND-export format. It is also possible
to render ELAND alignments in ELAND “export”, “sorted” or SAM format. However, to avoid
additional service charges, the choice of formats other than FASTQ or ELAND-export should be
indicated on the bioinformatics section of the Illumina submission form. This is because the
Illumina pipeline has to be configured for alternate output formats before it is run. It is possible
to run additional ELAND jobs, at the standard BRC rate, to produce output in other formats, or
new alignments with different parameters, subject to the availability of your run folder. For this
reason, you must request additional ELAND jobs within four weeks of the run completion to
ensure the run folder is still available.
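Illumina 1.3-style quality scores are conventionally stored as ASCII characters with an offset of 64, whereas Sanger-style FASTQ uses an offset of 33. Many downstream tools expect the latter, so a conversion is often needed. The sketch below shows the arithmetic only. The function names are hypothetical and are not part of any BRC deliverable.

```python
def decode_phred64(quality):
    """Return the numeric Phred scores of an Illumina 1.3-style quality string."""
    return [ord(char) - 64 for char in quality]


def illumina13_to_sanger(quality):
    """Re-encode an offset-64 quality string with the Sanger offset of 33."""
    return "".join(chr(ord(char) - 64 + 33) for char in quality)


print(decode_phred64("hhTB"))        # [40, 40, 20, 2]
print(illumina13_to_sanger("hhTB"))  # II5#
```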
All files will be compressed in either gz or zip format. Please note that you typically will need a
minimum of 2 GB of disk space for each lane of compressed Illumina GAIIx data. Please make sure to look at
all data files within one week of receiving them and inform the BRC or DNA Sequencing staff of
any concerns.
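As a rough planning aid, the 2 GB per lane figure can be turned into a quick check of free disk space before downloading. This is a sketch under stated assumptions: the function names are hypothetical, and a gigabyte is taken to be 10^9 bytes.

```python
import shutil

GB_PER_LANE = 2        # minimum quoted above for compressed GAIIx data
BYTES_PER_GB = 10**9   # assumption: decimal gigabytes


def min_space_needed_gb(lanes):
    """Minimum disk space, in GB, suggested for the given number of lanes."""
    return lanes * GB_PER_LANE


def has_room_for(lanes, path="."):
    """Check whether the file system holding `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= min_space_needed_gb(lanes) * BYTES_PER_GB


print(min_space_needed_gb(8))  # 16
```

Uncompressed files will need considerably more space than this minimum.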
Reference genome selection: Reference genomes for ELAND should be specified on the
informatics section of your sequencing request submission form. If a suitable reference genome
is not available, then no alignment can be performed, although this will have no impact on the
sequence data that you receive. If a specified genome is widely used as a reference genome,
BRC may already have formatted a copy that is ready for use with ELAND. In some cases, it is
difficult to determine the exact genome that you desire. Please specify an unambiguous source
for each reference genome either in the form of a GenBank accession number or an exact URL.
Take care to specify which version of the genome you wish to use (e.g. mm8 or mm9 for
mouse). If you cannot find a genome for the exact strain represented by your samples, choose
the closest relative that can be found or specify “no reference” on your submission form.
In some cases the standard names of chromosomes (or contigs) in reference genomes are not
in a form that you find desirable (e.g. a GenBank accession number rather than the word
“chr1”). If you do not want to use a standard version of a genome you must indicate that on
your sample submission form and provide an alternative sequence file (in FASTA format) or
unambiguous download information.
Illumina and Roche GS-FLX data storage: We do not provide long-term data storage.
Deliverable sequence files will be kept on backup for 6 months and the entire sequencing run
folder will be archived for 4 weeks from the date you receive your data. Two weeks prior to
deleting a run folder, a reminder email will be sent to the individual who submitted the samples.
DISCLAIMER: This is an emergency backup only. If you would like a copy of your run folder,
you must provide a properly formatted hard drive to BRC staff within 3 weeks of receiving your
sequences. Typically, a capacity of at least 1 TB should be sufficient for Illumina run folders.
GS-FLX folders usually require on the order of 30-100 GB, depending on the job.
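The retention periods above can be laid out as dates with a short calculation. This is only a sketch: the function names are hypothetical, and it assumes that the 6-month backup period is counted from the date you receive your data, which the policy does not state explicitly.

```python
import calendar
from datetime import date, timedelta


def add_months(day, months):
    """Shift a date by whole calendar months, clamping to the month's last day."""
    index = day.month - 1 + months
    year = day.year + index // 12
    month = index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def retention_schedule(received):
    """Key dates for a data set received on `received`."""
    run_folder_deleted = received + timedelta(weeks=4)
    return {
        "hard_drive_due": received + timedelta(weeks=3),
        "reminder_sent": run_folder_deleted - timedelta(weeks=2),
        "run_folder_deleted": run_folder_deleted,
        # Assumption: the 6 months start on the date the data are received.
        "deliverables_backup_ends": add_months(received, 6),
    }


for label, day in retention_schedule(date(2011, 3, 1)).items():
    print(label, day.isoformat())
# hard_drive_due 2011-03-22
# reminder_sent 2011-03-15
# run_folder_deleted 2011-03-29
# deliverables_backup_ends 2011-09-01
```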
Roche GS-FLX data processing and Quality Control:
The DNA Sequencing Facility, in conjunction with BRC staff, will perform the basic data
processing for GS-FLX runs as part of the run cost. This includes signal processing,
base-calling and quality control. The data storage policies described in the previous
section apply to GS-FLX data.
II. Analyses Conducted by Users on UWBC Workstations
BRC and the UWBC provide hardware and software for users to conduct their own analyses. In
order to take advantage of these resources, users must request a Linux account at the BRC
computer lab. Account set-up costs $50.
The following policies and guidelines must be followed:
1. This resource is dedicated to users performing their own analyses. If users require any
training on the software, this will be charged at $50 per hour of staff time.
2. Users must use our Google calendar to schedule their sessions. Your reservation
should cover the run time for the analyses and not just the time you will be at the
workstation. BRC staff can work with you on particularly demanding jobs so that they
can be run outside of normal working hours.
3. Users need to inform us which of these categories best fits their data analyses:
A. Data generated solely at the UWBC DNA Sequencing Facility or the UWBC
Gene Expression Center.
B. Data generated at the UWBC that is necessarily integrated with data generated
elsewhere.
C. Data generated entirely outside of the UWBC.
If demand exceeds capacity, priority will be given to users in category A followed by B
and lastly C. BRC staff will proactively monitor usage and inform users in as timely a
manner as possible if access may be limited or delayed.
4. Users may not install software of any kind on any workstations associated with this
resource. If a user has a need for a particular software package they should contact
BRC staff.
5. We cannot provide long-term storage of data analysis files. Users will inform BRC
staff when their analyses have been completed and discuss plans for data retrieval
and/or storage. The user is expected to retrieve their data and plan to store the files at
their home location. Users will be notified of any analysis folders/files that have been
inactive for over 3 months. These will be deleted two weeks after the user has been
notified and has confirmed receipt of the message.
6. This resource is not available to clients unaffiliated with the UW-System. External and/or
non-academic users should work with BRC staff on a fee-for-service basis if they require
data analysis support. This includes non-UW-System clients whose data were generated
solely at the UWBC.
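The priority rule in item 3 (category A first, then B, then C) amounts to a simple ordering of pending requests. The sketch below is illustrative only: the names are hypothetical, and breaking ties by arrival order is an assumption that the policy does not state.

```python
PRIORITY = {"A": 0, "B": 1, "C": 2}  # lower number is served first


def order_by_priority(requests):
    """Sort (user, category) pairs by category, keeping arrival order within one."""
    # sorted() is stable, so requests in the same category stay in arrival order.
    return sorted(requests, key=lambda request: PRIORITY[request[1]])


pending = [("ana", "C"), ("bo", "A"), ("cy", "B"), ("di", "A")]
print(order_by_priority(pending))
# [('bo', 'A'), ('di', 'A'), ('cy', 'B'), ('ana', 'C')]
```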
Available software: The BRC computer lab has a growing collection of software available for
users to analyze Illumina and Roche GS-FLX data. The collection includes the CLC-Bio Genomics
Workbench, a comprehensive sequence analysis package with an easy-to-use graphical user
interface. Other popular tools include the mapping programs MAQ and Bowtie and de novo
assemblers such as Newbler, Velvet, and ABySS in addition to various toolboxes such as
Samtools, BedTools and the FastX toolkit. We can install additional analysis tools on request,
as long as they are freely available. It is the user's responsibility to underwrite the cost of
commercial or other non-free software that is not provided by BRC. There are currently no fees
associated with using any software packages. This may change in the future to cover upgrades
and new software purchases.
III. Custom sequence analyses performed by BRC
BRC staff can perform sequence analyses beyond those included as part of standard
next-generation sequence data processing. These may involve sequence data originating on or off
campus, generated using most sequencing platforms (e.g. Sanger, 454, Illumina). Users must
schedule a consultation with BRC to outline project specifics and formalize a plan by completing
a custom project submission form. Services are provided at a heavily subsidized rate of
$50/hour, billed in ¼ hour increments.
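To see how quarter-hour billing works out, here is a minimal sketch of the arithmetic. It assumes that partial increments are rounded up to the next quarter hour, which the policy does not state. The function name is hypothetical.

```python
from decimal import Decimal

HOURLY_RATE = Decimal("50")  # dollars per hour, as quoted above
INCREMENT_MINUTES = 15       # quarter-hour billing increment


def billable_amount(minutes_worked):
    """Charge in dollars for whole minutes worked, rounded up per increment."""
    # Assumption: any started quarter hour is billed in full.
    increments = (minutes_worked + INCREMENT_MINUTES - 1) // INCREMENT_MINUTES
    return HOURLY_RATE * increments * INCREMENT_MINUTES / 60


print(billable_amount(70))  # 62.5 (five quarter hours)
print(billable_amount(60))  # 50
```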
Individual projects will require different levels of effort, and often some initial work needs to be
done in order to decide how best to proceed. Where possible, BRC will attempt to provide a
non-binding estimate of the amount of billable time required for your project as well as an
estimate of the job's completion date.
Examples of analyses that BRC has performed or can perform:
De novo sequence assembly of bacterial-sized genomes
Read mapping
Reference assemblies to genomes of any size
SNP and INDEL detection
RNA-seq analysis
ChIP-seq analysis
Resequencing / Exon Sequencing
Small RNA analysis
Sorting reads based on custom (i.e. not standard Illumina) barcodes
Comparisons of genomes and/or assemblies
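As an illustration of one item in the list above, sorting reads by custom barcodes can be sketched as follows. This is not BRC's implementation. It assumes that the barcode sits at the start of each read, and the function names and the `max_mismatches` parameter are hypothetical.

```python
from collections import defaultdict


def mismatches(a, b):
    """Count differing positions; strings of unequal length never match."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def sort_by_barcode(reads, barcodes, max_mismatches=0):
    """Group (name, sequence) reads by the sample whose barcode starts the read."""
    bins = defaultdict(list)
    for name, sequence in reads:
        hits = [
            sample
            for sample, barcode in barcodes.items()
            if mismatches(sequence[:len(barcode)], barcode) <= max_mismatches
        ]
        if len(hits) == 1:
            sample = hits[0]
            # Strip the barcode so that only the insert sequence is kept.
            bins[sample].append((name, sequence[len(barcodes[sample]):]))
        else:
            # No match, or an ambiguous match to several barcodes.
            bins["unassigned"].append((name, sequence))
    return dict(bins)


barcodes = {"s1": "ACGT", "s2": "TTAG"}
reads = [("r1", "ACGTGGGG"), ("r2", "TTAGCCCC"), ("r3", "GGGGGGGG")]
print(sort_by_barcode(reads, barcodes))
# {'s1': [('r1', 'GGGG')], 's2': [('r2', 'CCCC')], 'unassigned': [('r3', 'GGGGGGGG')]}
```

Reads that match more than one barcode within the allowed mismatches are left unassigned rather than guessed.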