Dale Beach, Longwood University
Lisa Scheifele, Loyola University Maryland

Next-generation sequencing has revolutionized both biological research and clinical medicine, with sequencing of entire human genomes being used to predict drug responsiveness and to diagnose disease (for example, Choi 2009). In contrast to traditional Sanger sequencing, next-generation sequencing datasets have shorter read lengths and higher error rates. This creates challenges for downstream analysis: because so many reads are generated, even a small per-base error rate produces a large number of reads that contain errors. Indeed, Illumina MiSeq reads have an error rate of roughly 0.1% per base (Glenn 2011), yet this corresponds to only ~86% of 150 bp sequencing reads being entirely error-free (0.999^150 ≈ 0.86).

Figure: Sequencing error in reads (http://www.pnas.org/content/106/45/19096/F3.expansion.html; http://www.pnas.org/content/106/45/19096.full.pdf+html)

This module is designed for a genetics or molecular biology class. It will require 3 lecture/seminar class periods, with optional additional Linux-based lab activities.

Prior to beginning this module, students should be familiar with:
Sample preparation techniques for DNA sequencing
DNA replication and the enzymes that synthesize DNA
Nucleic acid and nucleotide structure

Data required: completed sequencing data for a small eukaryotic genome generated on an Illumina platform. If students will not be performing command-line programming themselves, this data should be analyzed with:
Jellyfish, to produce data on k-mer frequencies that students can use to generate a histogram in Excel
Quake, to perform error correction so that students can be provided with pre- and post-error-correction datasets

Module activities:
Initial evaluation of the quality of eukaryotic genome sequencing data
Implementation of error correction techniques
Comparison of the quality of sequencing data before and after error correction

At the completion of this module, students will be able to:
Describe the important differences between high-throughput and traditional (low-throughput) experiments
Explain the reasons for variations in the quality of high-throughput datasets
Utilize computational tools to quantify errors in sequencing data
Interpret the quality of a sequencing experiment and implement effective quality control measures

Software:
Excel or other analytical packages to create a k-mer frequency distribution
Galaxy to create a boxplot of PHRED33 scores
Optional: Quake and Jellyfish on a Linux system to generate k-mer data and perform error correction

This module will develop students' abilities to:
Apply the process of science
▪ Design experiments from methodological design through data analysis
▪ Analyze and interpret data
Use modeling and simulation
▪ Design experimental strategies and predict outcomes
Use quantitative reasoning
▪ Depict data using histograms and boxplots
▪ Interpret graphs and use the results of their analysis to modify error correction strategies

Class session 1:
Intro to sequencing history and platforms
Discuss typical sources of error in sequencing reads
Discuss sequence output formats and PHRED33 scores
Upload raw data to Galaxy
Optional: Quake in Linux to manipulate parameters and improve quality
Resource: http://www.nimr.mrc.ac.uk/mill-hillessays/bringing-it-all-back-home-nextgeneration-sequencing-technology-and-you#

Class session 2:
Introduce software packages that can be used to assess data quality
Demonstrate breaking sequencing reads into k-mers (a scripting sketch of this step follows the session outline below)
Use Excel or Jellyfish to create a k-mer graph
Use Excel or Jellyfish to create a k-mer graph following manipulation of error correction parameters (variations in k-mer size)
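For instructors who want to show students what Jellyfish is doing conceptually, the following is a minimal Python sketch of the k-mer counting step: it breaks each read into overlapping k-mers, counts them, and writes a coverage histogram that can be opened in Excel. The file names and the value of k are placeholders, and the sketch counts forward-strand k-mers only and is far too slow for real datasets; Jellyfish should be used for actual analyses.

```python
# Illustrative (not efficient) k-mer counting from a FASTQ file.
# "reads.fastq" and K = 15 are placeholder choices for this sketch.
from collections import Counter

K = 15  # k-mer size; vary this to explore the effect of k on the histogram

def read_fastq_sequences(path):
    """Yield the sequence line from each 4-line FASTQ record."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:                      # line 2 of each record is the sequence
                yield line.strip().upper()

kmer_counts = Counter()
for seq in read_fastq_sequences("reads.fastq"):
    for start in range(len(seq) - K + 1):
        kmer = seq[start:start + K]
        if "N" not in kmer:                     # skip k-mers containing ambiguous bases
            kmer_counts[kmer] += 1

# Histogram: how many distinct k-mers were observed 1x, 2x, 3x, ...
histogram = Counter(kmer_counts.values())
with open("kmer_histogram.csv", "w") as out:
    out.write("coverage,num_kmers\n")
    for coverage in sorted(histogram):
        out.write(f"{coverage},{histogram[coverage]}\n")
```

In the resulting histogram students should see two populations: a spike of very-low-coverage k-mers, most of which arise from sequencing errors, and a peak of higher-coverage k-mers that represent the true genome. This separation is the basis on which Quake distinguishes trusted from untrusted k-mers during error correction.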
Data product: k-mer frequency distribution

Class session 3:
Discussion of using PHRED33 scores to assess data quality
Create boxplots of PHRED33 scores in Galaxy for the raw data
Create boxplots of PHRED33 scores in Galaxy for the data after Quake correction; optionally, have students compare outcomes following Quake correction with different parameters (an optional scripting sketch for this comparison appears after the resource list below)

Data products: raw data; data after Quake correction

Discussion questions:
Why has next-generation sequencing technology led to a revolution in biology/medicine?
Discuss and predict how chemical and physical mechanisms lead to errors
Compare sequence improvement based on different parameters
How do software packages determine which base is in error and which is correct if sequencing reads conflict?
Why is it important to have a numerical measure of error in addition to the nucleotide sequence?

Assessment: This module will be performed as a team-based project, with students preparing and handing in a report at the end. Students will be able to:
Predict predominant types or sources of error based on experimental design and sequencing platform
Prepare a boxplot using Galaxy for an exemplary dataset and use the boxplot to evaluate the quality of the sequence data
Effectively improve the quality of any set of NGS reads prior to assembly

Resources and references:
https://banana-slug.soe.ucsc.edu/bioinformatic_tools:jellyfish
en.wikipedia.org/wiki/FASTQ_format
Kelley DR, Schatz MC, Salzberg SL. 2010. Quake: quality-aware detection and correction of sequencing errors. Genome Biology 11:R116. [Quake program]
Marçais G, Kingsford C. 2011. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27:764-770. [Jellyfish program]
http://res.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2581791/pdf/ukmss-2586.pdf
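Optional scripting extension (referenced from class session 3): the Galaxy boxplots summarize per-position base quality, and some students benefit from seeing how those numbers come directly from the FASTQ file. The sketch below, assuming placeholder file names for the raw and Quake-corrected reads, decodes the Phred+33 quality characters (ASCII value minus 33) and prints the median quality at each read position for both files so the two can be compared.

```python
# Decode Phred+33 quality strings from two FASTQ files and compare
# per-position median quality before and after error correction.
# File names are placeholders for this sketch.
import statistics

def per_position_qualities(path):
    """Return a list where element i holds every Phred score observed at read position i."""
    qualities = []
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 3:                      # line 4 of each record is the quality string
                for pos, char in enumerate(line.rstrip("\n")):
                    score = ord(char) - 33      # Phred+33 encoding
                    if pos == len(qualities):
                        qualities.append([])
                    qualities[pos].append(score)
    return qualities

raw = per_position_qualities("reads_raw.fastq")
corrected = per_position_qualities("reads_corrected.fastq")

print("position\tmedian_raw\tmedian_corrected")
for pos in range(min(len(raw), len(corrected))):
    print(f"{pos + 1}\t{statistics.median(raw[pos])}\t{statistics.median(corrected[pos])}")
```

Because Quake trims and discards low-quality reads, the corrected file may contain fewer and shorter reads; the comparison above is limited to positions present in both datasets, which is itself a useful discussion point when students interpret the Galaxy boxplots.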