Error Correction in High-Throughput Datasets

Dale Beach, Longwood University
Lisa Scheifele, Loyola University Maryland
Next-generation sequencing has revolutionized both biological research and clinical medicine, with sequencing of entire human genomes being used to predict drug responsiveness and to diagnose disease (for example, Choi 2009).
In contrast to traditional Sanger sequencing, next-generation sequencing datasets have shorter read lengths and higher error rates. This creates challenges for downstream analysis: because so many reads are produced, even a small error rate yields a large absolute number of reads that contain errors. Indeed, Illumina MiSeq produces reads with a per-base error rate of 0.1% (Glenn 2011), yet even this corresponds to only ~86% of 150 bp sequencing reads (0.999^150 ≈ 0.86) being entirely error-free.
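A quick back-of-the-envelope check of this arithmetic, written as a Python sketch (the 0.1% per-base error rate and 150 bp read length are the figures quoted above; treating errors as independent across positions is a simplifying assumption):

    # Probability that an entire read is error-free, assuming errors are
    # independent and uniform across positions (a simplification).
    per_base_error = 0.001   # 0.1% per-base error rate (Glenn 2011)
    read_length = 150        # typical Illumina MiSeq read length (bp)

    p_error_free = (1 - per_base_error) ** read_length
    print(f"P(read is error-free) = {p_error_free:.3f}")    # ~0.861

    # Of a million reads, how many are expected to contain at least one error?
    n_reads = 1_000_000
    print(f"Erroneous reads per million: {n_reads * (1 - p_error_free):,.0f}")

Roughly 139,000 of every million reads are expected to carry at least one error, which is why explicit error correction is worthwhile.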
Figure: sequencing errors in reads (see http://www.pnas.org/content/106/45/19096/F3.expansion.html and http://www.pnas.org/content/106/45/19096.full.pdf+html).

This module is designed for a genetics or molecular biology class. It will require 3 lecture/seminar class periods, with optional additional Linux-based lab activities.

Prior to beginning this module, students should be familiar with:
 Sample preparation techniques for DNA sequencing
 DNA replication and the enzymes that synthesize DNA
 Nucleic acid and nucleotide structure

Completed small eukaryotic genome data from an Illumina platform

If students will not be performing command-line programming themselves, this data should be analyzed in advance with:
 Jellyfish to produce data on k-mer frequencies that students can use to generate a histogram in Excel (or with the plotting sketch after this list)
 Quake to perform error correction so that students can be provided with pre- and post-error-correction datasets
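If the Jellyfish output is provided directly, a short script can stand in for Excel here. This is a minimal sketch assuming the two-column output of the `jellyfish histo` subcommand (occurrence count, then the number of distinct k-mers seen that many times) has been saved as kmer_histo.txt, a placeholder file name:

    # Plot the k-mer frequency distribution from `jellyfish histo` output.
    # Each line holds two numbers: multiplicity and the number of distinct
    # k-mers observed that many times.
    import matplotlib.pyplot as plt

    multiplicity, n_kmers = [], []
    with open("kmer_histo.txt") as fh:          # placeholder file name
        for line in fh:
            m, n = line.split()
            multiplicity.append(int(m))
            n_kmers.append(int(n))

    plt.bar(multiplicity, n_kmers)
    plt.yscale("log")          # low-multiplicity (error) k-mers dominate
    plt.xlabel("k-mer multiplicity (coverage)")
    plt.ylabel("number of distinct k-mers")
    plt.title("k-mer frequency distribution")
    plt.savefig("kmer_histogram.png")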

Initial evaluation of the quality of eukaryotic genome sequencing data

Implementation of error correction techniques

Comparison of the quality of sequencing data before and after error correction

At the completion of this module, students will be able to:
 Describe the important differences between high-throughput and traditional (low-throughput) experiments
 Explain the reasons for variations in the quality of high-throughput datasets
 Utilize computational tools to quantify errors in sequencing data
 Interpret the quality of a sequencing experiment and be able to implement effective quality control measures

Excel or other analytical packages to create a k-mer frequency distribution

Galaxy to create a boxplot of PHRED33 scores

Optional: Quake and Jellyfish on a Linux system to generate k-mer data and perform error correction

This module will develop students’ abilities to:
 Apply the process of science
▪ Design experiments from methodological design through data analysis
▪ Analyze and interpret data
 Use modeling and simulation
▪ Design experimental strategies and predict outcomes
 Use quantitative reasoning
▪ Depict data using histograms and boxplots
▪ Interpret graphs and use the results of their analysis to modify error-correction strategies

Intro to sequencing history and platforms

Discuss typical sources of error in sequencing reads

Discuss sequence output formats and PHRED33 scores (see the decoding sketch after this list)

Upload raw data to Galaxy

Optional: Quake in Linux to manipulate parameters and improve quality
http://www.nimr.mrc.ac.uk/mill-hillessays/bringing-it-all-back-home-nextgeneration-sequencing-technology-and-you#
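To make the PHRED33 encoding concrete: each character of a FASTQ quality string stores Q = (ASCII value) - 33, and Q relates to the per-base error probability p by Q = -10 * log10(p). A minimal Python sketch (the quality string itself is hypothetical):

    # Decode a PHRED33 quality string into scores and error probabilities.
    quality_string = "IIIHGF#"        # hypothetical quality line excerpt

    for char in quality_string:
        q = ord(char) - 33            # PHRED33 offset
        p = 10 ** (-q / 10)           # probability the base call is wrong
        print(f"{char!r}: Q={q:2d}, P(error)={p:.4f}")

Here 'I' decodes to Q=40 (a 1-in-10,000 error chance), while '#' decodes to Q=2, a call with roughly a 63% chance of being wrong.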

Introduce software packages that can be used to assess data quality

Demonstrate breaking sequencing reads into k-mers (see the counting sketch after this list)

Use Excel or Jellyfish to create a k-mer graph

Use Excel or Jellyfish to create a k-mer graph following manipulation of error-correction parameters (variations in k-mer size)

K-mer frequency distribution
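Breaking reads into k-mers and tallying their frequencies takes only a few lines of Python; this is conceptually what Jellyfish computes, minus its speed and memory efficiency. The reads and k value below are toy examples:

    # Count k-mer frequencies across a set of reads (toy-scale Jellyfish).
    from collections import Counter

    def kmers(read, k):
        """Yield every length-k substring (k-mer) of a read."""
        for i in range(len(read) - k + 1):
            yield read[i:i + k]

    reads = ["GATTACAGATTACA", "ATTACAGATTACAG"]   # hypothetical reads
    k = 5

    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))

    # High-count k-mers likely come from the true genome; k-mers seen only
    # once are candidate sequencing errors.
    for kmer, n in counts.most_common():
        print(kmer, n)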

Discussion of using PHRED33 scores to assess data quality

Create boxplots of PHRED33 scores in Galaxy for raw data

Create boxplots of PHRED33 scores in Galaxy for data post-Quake correction
 Can have students compare outcomes following Quake correction with different parameters (a sketch of the statistics behind these boxplots follows this list)

Raw Data
Data post-Quake correction
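For instructors who want to show what the Galaxy boxplots summarize, this sketch gathers the PHRED33 score at each read position across a FASTQ file and reports per-position quartiles. It assumes the standard four-line-per-record FASTQ layout; "reads.fastq" is a placeholder name:

    # Per-position quality statistics: the numbers behind a quality boxplot.
    import statistics

    per_position = {}                      # read position -> list of Q scores
    with open("reads.fastq") as fh:        # placeholder file name
        for i, line in enumerate(fh):
            if i % 4 == 3:                 # 4th line of each record = quality
                for pos, char in enumerate(line.rstrip("\n")):
                    per_position.setdefault(pos, []).append(ord(char) - 33)

    for pos in sorted(per_position):
        q1, median, q3 = statistics.quantiles(per_position[pos], n=4)
        print(f"position {pos + 1}: Q1={q1:.0f} median={median:.0f} Q3={q3:.0f}")

Running this on the raw and Quake-corrected files side by side gives the same pre/post comparison as the Galaxy boxplots.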

Why has next-generation sequencing technology led to a revolution in biology/medicine?

Discuss and predict how chemical and physical mechanisms lead to errors

Comparison of sequence improvement based on different parameters

How do software packages determine which base is in error and which is correct if sequencing reads conflict? (The sketch below illustrates the coverage-based intuition.)

Why is it important to have a numerical measure of error in addition to the nucleotide sequence?
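One way to frame the question of how software decides which base is wrong when reads conflict: coverage-based correctors such as Quake treat k-mers seen fewer times than a cutoff as "untrusted". A single miscalled base makes every k-mer covering it rare, so the overlap of the untrusted k-mers localizes the error. The sketch below captures only this intuition; the counts and cutoff are toy values, not Quake's actual statistical model:

    # Localize a likely error as the intersection of untrusted k-mer spans.
    def candidate_error_positions(read, k, kmer_counts, cutoff):
        """Return positions covered by every untrusted (low-coverage) k-mer."""
        spans = [set(range(i, i + k))
                 for i in range(len(read) - k + 1)
                 if kmer_counts.get(read[i:i + k], 0) < cutoff]
        if not spans:
            return []               # every k-mer is trusted: read looks clean
        return sorted(set.intersection(*spans))

    # Toy example: all four k-mers covering position 5 are rare.
    read = "GATTACAGAT"
    counts = {"GATT": 50, "ATTA": 48, "TTAC": 2, "TACA": 1,
              "ACAG": 2, "CAGA": 1, "AGAT": 47}     # hypothetical counts
    print(candidate_error_positions(read, k=4, kmer_counts=counts, cutoff=5))
    # -> [5]: the base at position 5 is the prime suspect for correction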

This module will be performed as a team-based project, with students preparing and handing in a report at the end. Students will be able to:
 Predict predominant types or sources of error based on experimental design and sequencing platform
 Prepare a boxplot using Galaxy for an example dataset and use the boxplot to evaluate the quality of the sequence data
 Effectively improve the quality of any set of NGS reads prior to assembly

https://banana-slug.soe.ucsc.edu/bioinformatic_tools:jellyfish

https://en.wikipedia.org/wiki/FASTQ_format

Kelley DR, Schatz MC, Salzberg SL. 2010. Quake: quality-aware detection and correction of sequencing errors. Genome Biology 11:R116. [Quake program]

Marcais G, Kingsford C. 2011. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27:764-770. [Jellyfish program]

http://res.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2581791/pdf/ukmss-2586.pdf