Bioinformatics Resource Center (BRC) Operating Procedures and Policies

Please thoroughly review these policies and procedures. There are three sections:

  Standard Analysis of Next-Generation Sequences generated at UWBC
  User analyses conducted using UWBC hardware and/or software
  Custom sequence analyses performed by BRC

If you have any questions regarding this document or any other aspect of BRC services, please contact us at brc@biotech.wisc.edu.

I. Standard Analysis of Next-Generation Sequences generated at UWBC

BRC provides standard data processing for next-generation (Roche GS-FLX and Illumina) sequencing performed at the UWBC DNA Sequencing Facility. The cost for these standard informatics services is included in the sequencing cost.

We strongly advise that you meet with both the Next-gen sequencing and the BRC staff during the planning stage of your sequencing runs. It is critical that you fill in the informatics section of the Illumina or Roche submission forms. If you have any questions about the BRC part of the sequencing submission form, please contact BRC before your run is performed.

You will be notified by sequencing staff when your run is underway. BRC personnel will begin informatics processing of the data as soon as the run has completed on the instrument. Usually data will be available within 5 business days of the completion of the Illumina pipeline.

Standard processing differs between sequencing platforms, and the associated software is rapidly changing. BRC uses the most up-to-date versions of standard software that can be robustly integrated into the pipeline at the time service is provided.

Illumina data processing: The Illumina software pipeline CASAVA is used to generate sequence data. If a suitable reference genome is available, the ELAND component of the pipeline will be used to align the reads of your sample to that genome, largely as a means to assess the quality of the run.
Typically, alignments are performed using default settings unless otherwise indicated on the BRC job submission form. Following the completion of the pipeline, a standard quality control (QC) tool is used to produce a report containing diagnostic information such as the number of reads, average base-calling quality and the presence of contaminants such as Illumina adapters and other over-represented sequences.

Using the QC report and a summary HTML document produced by the pipeline, and in some cases other metrics, BRC and Next-gen sequencing staff will jointly decide whether the run met quality standards. If the run was successful, BRC will notify you directly that your sequences are ready. If the QC report indicates a problem with your run, you will be notified by the Next-gen sequencing staff in order to discuss the appropriate steps to take next.

BRC will not automatically trim or remove contaminants such as the phiX-174 DNA that is used as a control with Illumina sequencing. In the case of small RNA sequencing, and other very short-sequence protocols, adapters will be removed. Such trimming will not be performed with data generated by other protocols without explicit permission. Additionally, we will not trim the end of reads to remove bases with low base-calling quality. Such value-added activities have to be explicitly requested and are not covered by the charges assessed by the DNA Sequencing Facility.

Deliverables for Illumina GAIIx data: The DNA Sequencing facility aims for a minimum of 12 million "passing filter" reads/lane for libraries prepared by UWBC. When you are notified that your run has succeeded, BRC staff will provide instructions for downloading your sequences along with the Illumina summary.htm file and the QC reports. It is possible to arrange for other mechanisms of data delivery, for example on a portable hard drive or a USB thumb drive. The default format for data delivery is FASTQ with Illumina 1.3-style quality scores.
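Illumina 1.3-style FASTQ files encode Phred quality scores as ASCII characters with an offset of 64, whereas Sanger-style FASTQ uses an offset of 33. Some downstream tools expect the Sanger encoding, so you may need to convert the delivered files. The following Python sketch illustrates the conversion. The function names are hypothetical and are not part of the BRC pipeline, and the sketch assumes simple four-line FASTQ records with no wrapped sequence lines.

```python
ILLUMINA13_OFFSET = 64  # ASCII offset used by Illumina 1.3-style quality strings
SANGER_OFFSET = 33      # ASCII offset used by Sanger-style quality strings


def illumina13_to_sanger(quality):
    """Re-encode one Phred+64 quality string as Phred+33."""
    converted = []
    for char in quality:
        score = ord(char) - ILLUMINA13_OFFSET
        if score < 0:
            # A character below '@' cannot be a valid Phred+64 symbol.
            raise ValueError("not an Illumina 1.3-style quality character: %r" % char)
        converted.append(chr(score + SANGER_OFFSET))
    return "".join(converted)


def convert_fastq_lines(lines):
    """Yield FASTQ lines with every fourth line (the quality line) re-encoded.

    Assumes each record occupies exactly four lines.
    """
    for index, line in enumerate(lines):
        text = line.rstrip("\n")
        if index % 4 == 3:
            text = illumina13_to_sanger(text)
        yield text
```

For example, the quality string "hhhh" (Phred score 40 at every base) becomes "IIII" after conversion. Check the documentation of your analysis software first, since some tools accept the Illumina 1.3 encoding directly.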
If requested, sequences can also be provided in FASTA or ELAND-export format. It is also possible to render ELAND alignments in ELAND “export”, “sorted” or SAM format. However, to avoid additional service charges, the choice of formats other than FASTQ or ELAND-export should be indicated on the bioinformatics section of the Illumina submission form. This is because the Illumina pipeline has to be configured for alternate output formats before it is run.

It is possible to run additional ELAND jobs, at the standard BRC rate, to produce output in other formats, or new alignments with different parameters, subject to the availability of your run folder. For this reason, you must request additional ELAND jobs within four weeks of the run completion to ensure the run folder is still available.

All files will be compressed in either gz or zip format. Please note that you will typically need a minimum of 2 GB of storage for each lane of compressed Illumina GAIIx data. Please make sure to look at all data files within one week of receiving them and inform the BRC or DNA Sequencing staff of any concerns.

Reference genome selection: Reference genomes for ELAND should be specified on the informatics section of your sequencing request submission form. If a suitable reference genome is not available, then no alignment can be performed, although this will have no impact on the sequence data that you receive. If a specified genome is widely used as a reference genome, BRC may already have formatted a copy that is ready for use with ELAND. In some cases, it is difficult to determine the exact genome that you desire. Please specify an unambiguous source for each reference genome either in the form of a GenBank accession number or an exact URL. Take care to specify which version of the genome you wish to use (e.g. mm8 or mm9 for mouse).
If you cannot find a genome for the exact strain represented by your samples, choose the closest relative that can be found or specify “no reference” on your submission form. In some cases the standard names of chromosomes (or contigs) in reference genomes are not in a form that you find desirable (e.g. a GenBank accession number rather than the word “chr1”). If you do not want to use a standard version of a genome, you must indicate that on your sample submission form and provide an alternative sequence file (in FASTA format) or unambiguous download information.

Illumina and Roche GS-FLX data storage: We do not provide long-term data storage. Deliverable sequence files will be kept on backup for 6 months and the entire sequencing run folder will be archived for 4 weeks from the date you receive your data. Two weeks prior to deleting a run folder, a reminder email will be sent to the individual who submitted the samples.

DISCLAIMER: This is an emergency backup only. If you would like a copy of your run folder, you must provide a properly formatted hard drive to BRC staff within 3 weeks of receiving your sequences. Typically, a capacity of at least 1 TB should be sufficient for Illumina run folders. GS-FLX folders usually require on the order of 30-100 GB, depending on the job.

Roche GS-FLX data processing and Quality Control: The DNA Sequencing Facility, in conjunction with BRC staff, will perform the basic data processing for GS-FLX runs as part of the run cost. This includes signal processing, basecalling and quality control for GS-FLX runs. The data storage policies mentioned in the previous section apply to GS-FLX data.

II. Analyses Conducted by Users on UWBC Workstations

BRC and the UWBC provide hardware and software for users to conduct their own analyses. In order to take advantage of these resources, users must request a Linux account at the BRC computer lab. Account set-up costs $50. The following policies and guidelines must be followed:

1.
This resource is dedicated to users performing their own analyses. Any training that users require on the software will be charged at $50 per hour of staff time.

2. Users must use our Google calendar to schedule their sessions. Your reservation should cover the run time for the analyses and not just the time you will be at the workstation. BRC staff can work with you on particularly demanding jobs so that they can be run outside of normal working hours.

3. Users need to inform us which of these categories best fits their data analyses:
   A. Data generated solely at the UWBC DNA Sequencing Facility or the UWBC Gene Expression Center.
   B. Data generated at the UWBC that is necessarily integrated with data generated elsewhere.
   C. Data generated entirely outside of the UWBC.
   If demand exceeds capacity, priority will be given to users in category A followed by B and lastly C. BRC staff will proactively monitor usage and inform users in as timely a manner as possible if access may be limited or delayed.

4. Users may not install software of any kind on any workstations associated with this resource. If a user has a need for a particular software package, they should contact BRC staff.

5. We cannot provide long-term storage of data analysis files. Users will inform BRC staff when their analyses have been completed and discuss plans for data retrieval and/or storage. The user is expected to retrieve their data and plan to store the files at their home location. Users will be notified of any analysis folders/files that have been inactive for over 3 months. These will be deleted two weeks after the notification has been sent and the user has confirmed that they received the message.

6. This resource is not available to clients unaffiliated with the UW-System. External and/or non-academic users should work with BRC staff on a fee-for-service basis if they require data analysis support. This includes non UW-System clients whose data was solely generated at the UWBC.
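Because analysis files that have been inactive for over 3 months are subject to deletion (policy 5 above), you may wish to review your own folders before a notification arrives. The following Python sketch lists files that have not been modified recently. It rests on two assumptions that are not part of the policy text: "inactive" is interpreted as "not modified", and "3 months" is approximated as 90 days. The function name and the example path are hypothetical.

```python
import os
import time

INACTIVE_DAYS = 90            # assumption: "over 3 months" approximated as 90 days
SECONDS_PER_DAY = 24 * 60 * 60


def find_inactive_files(root, inactive_days=INACTIVE_DAYS, now=None):
    """Return a sorted list of files under root not modified in inactive_days days."""
    if now is None:
        now = time.time()
    cutoff = now - inactive_days * SECONDS_PER_DAY
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                modified = os.path.getmtime(path)
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
            if modified < cutoff:
                stale.append(path)
    return sorted(stale)
```

Calling find_inactive_files("/path/to/your/analysis/folder") returns the candidate files, which you can then copy to storage at your home location.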
Available software: The BRC computer lab has a growing collection of software available for users to analyze Illumina and Roche GS-FLX data. The collection includes the CLC-Bio Genomics Workbench, a comprehensive sequence analysis package with an easy-to-use graphical user interface. Other popular tools include the mapping programs MAQ and Bowtie and de novo assemblers such as Newbler, Velvet, and ABySS, in addition to various toolboxes such as Samtools, BedTools and the FastX toolkit. We can install additional analysis tools on request, as long as they are freely available. It is the user's responsibility to underwrite the cost of commercial or other non-free software that is not provided by BRC.

There are currently no fees associated with using any software packages. This may change in the future to cover upgrades and new software purchases.

III. Custom sequence analyses performed by BRC

BRC staff can perform sequence analyses beyond those included as part of standard next-generation sequence data processing. These may include sequence data originating on or off campus, generated using most sequencing platforms (e.g. Sanger, 454, Illumina). Users must schedule a consultation with BRC to outline project specifics and formalize a plan by completing a custom project submission form.

Services are provided at a heavily subsidized rate of $50/hour, billed in ¼ hour increments. Individual projects will require different levels of effort, and often some initial work needs to be done in order to decide how best to proceed. Where possible, BRC will attempt to provide a non-binding estimate of the amount of billable time required for your project as well as an estimate of the job's completion date.
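To get a rough idea of what a custom project might cost, the $50/hour rate and ¼ hour billing increment can be combined as in the following Python sketch. The policy does not state how partial increments are rounded, so rounding up to the next quarter hour is an assumption here, and the function names are hypothetical. Any actual charge is determined by BRC.

```python
RATE_PER_HOUR = 50.0      # dollars per hour, from the policy
INCREMENT_MINUTES = 15    # billing increment of one quarter hour, from the policy


def billable_increments(minutes_worked):
    """Number of quarter-hour increments billed.

    Assumption: partial increments are rounded UP to the next quarter hour.
    """
    if minutes_worked < 0:
        raise ValueError("minutes_worked must not be negative")
    # Integer ceiling division without floating-point error.
    return -(-minutes_worked // INCREMENT_MINUTES)


def estimate_charge(minutes_worked):
    """Estimated charge in dollars for the given number of staff minutes."""
    rate_per_increment = RATE_PER_HOUR * INCREMENT_MINUTES / 60
    return billable_increments(minutes_worked) * rate_per_increment
```

Under these assumptions, 45 minutes of staff time is three increments ($37.50), and 50 minutes rounds up to four increments ($50.00).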
Examples of types of analyses that BRC has done or can do:

  De novo sequence assembly of bacterial-sized genomes
  Read mapping
  Reference assemblies to genomes of any size
  SNP and INDEL detection
  RNA-seq analysis
  ChIP-Seq analysis
  Resequencing / Exon Sequencing
  Small RNA analysis
  Sorting reads based on custom (i.e., not standard Illumina) barcodes
  Comparisons of genomes and/or assemblies