BCM-2002
Concepts and methods in
sequencing and genome assembly
B. Franz LANG, Département de Biochimie
Office: H307-15
E-mail: Franz.Lang@Umontreal.ca
Outline
1. Concepts in DNA and RNA sequencing
2. Sequencing technologies
3. Random genome sequencing, with/without cloning
4. Data formats of results – autoradiograms, traces, fastq and base call qualities
5. Sequencing and assembly artifacts
1. Concepts in DNA and RNA sequencing
Reminder
• DNA and RNA are polar (5’P; 3’OH), charged biopolymers, made
up of nucleotides.
• By convention, sequences are always written from 5’ (left) to 3’
(right); otherwise, the polarity has to be indicated.
• DNA usually occurs in double-stranded, antiparallel perfectly
base-paired form:
5’ AGCTATTGATTTCCTTGG 3’
3’ TCGATAACTAAAGGAACC 5’
• RNAs are most often single-stranded and may form
secondary and tertiary base-pairs (intra-molecular,
or with other molecules). Single-stranded DNA does
the same. For sequencing, DNAs and RNAs have to
be denatured and single-stranded, without structure.
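The antiparallel pairing in the duplex above can be checked with a few lines of Python. This is a minimal sketch; the function names `complement` and `reverse_complement` are illustrative choices, not part of the course material.

```python
# Watson-Crick pairing for DNA (A-T, G-C).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(seq):
    """Return the base-paired partner, read in the same left-to-right order."""
    return seq.upper().translate(COMPLEMENT)

def reverse_complement(seq):
    """Return the partner strand rewritten 5' to 3', as the convention requires."""
    return complement(seq)[::-1]

top = "AGCTATTGATTTCCTTGG"       # upper strand of the example, 5' -> 3'
print(complement(top))           # lower strand as drawn, 3' -> 5'
print(reverse_complement(top))   # the same lower strand written 5' -> 3'
```

The first call reproduces the lower strand exactly as drawn (3’ to 5’); the second gives the form in which that strand would normally be written.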
Principles; see also Maniatis (a popular biochemistry cook-book):
The first two sequencing techniques are the enzymatic synthesis method
of Sanger et al. (1977) and the chemical degradation method of Maxam
and Gilbert (1977).
Note that the Maxam and Gilbert method is slow, uses highly toxic and carcinogenic
substances, and is no longer used, except for special applications such as
mapping of protein binding to DNA. Next Generation Sequencing (NGS)
techniques have taken over for genome projects (see below). They do not
require electrophoresis but instead use various nanotechnological approaches.
By far the most popular technology at present is Illumina.
Principle:
Although very different in principle, both Maxam/Gilbert and Sanger
produce populations of (radio- or fluorochrome-) labeled oligonucleotides
that all start at the same site of a given DNA/RNA, and that end in a given
nucleotide (G,A,T/U,C) that is generated with a given sequencing
biochemistry (nucleotide-specific termination of DNA synthesis, or
nucleotide-specific cleavage; etc.).
[Figure: cleavage at a random meG site yields the ‘visible’ radioactive fragments]
Note that in any sequencing technology, only the labeled single-stranded DNAs or RNAs are sequenced; unlabeled material does not
matter. When several different molecules carry the same label,
they first need to be separated (e.g., by electrophoresis).
Electrophoretic separation, and detection principles:
These populations of oligonucleotides are then resolved by electrophoresis
under conditions that discriminate size differences at the single nucleotide
level (PAGE). When loaded into four adjacent lanes of a sequencing gel, the
order of nucleotides can be read directly from an image after visualizing the
radioactive or any other label (see below). When sequence reactions are
marked with four different fluorescent dyes (the current standard of Sanger
sequencing), these can be loaded on a single lane (or capillary), and read
automatically and continuously as different-wavelength light emissions,
generated by laser excitation.
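The logic of reading a four-lane gel can be sketched in Python. This is an idealized illustration under simple assumptions: every possible fragment length is present exactly once, and the names `sanger_ladder` and `read_gel` are invented for this example.

```python
def sanger_ladder(synthesized):
    """Simulate the four nucleotide-specific reactions.

    `synthesized` is the newly made strand (5' to 3') downstream of the
    common start site. For each nucleotide, return the lengths of the
    labeled fragments that end in that nucleotide (one gel lane each).
    """
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(synthesized, start=1):
        lanes[base].append(length)   # a fragment of this length ends in `base`
    return lanes

def read_gel(lanes):
    """Read the bands from the shortest to the longest fragment."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)

lanes = sanger_ladder("AGCTATTG")
print(lanes)            # fragment lengths found in each lane
print(read_gel(lanes))  # sequence recovered by ordering bands by size
```

Because electrophoresis resolves single-nucleotide size differences, sorting the bands by length across the four lanes is all that is needed to recover the order of nucleotides.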
Principles of RNA sequencing:
RNA is sequenced similarly to DNA: either directly by chemical methods
(Maxam-Gilbert-like, yet inefficient and slow), by a Sanger-like synthesis protocol
with reverse transcriptase (to produce cDNA sequence ladders), or after
conversion to cDNA by regular DNA sequencing procedures (Sanger or
NGS technologies).
RNA classes may be separated by size (micro RNAs, tRNAs, rRNAs …), or
eukaryotic mRNAs carrying a 3’ poly-A tail may be enriched by purification on an
oligo-dT column. That is, RNA sequencing may provide more information than
just the primary sequence.
Most RNAs have distinct start and processing sites. High volume RNA
sequencing (NGS, called RNA-seq) allows precise identification of starts and
stops, and measurement of relative quantities (i.e., quantitative mapping of
RNA 5’ and 3’ ends).
2. Sequencing technologies
2.1. Maxam and Gilbert (chemical)
• Requires large amounts of highly purified DNA fragments (e.g.,
restriction fragments).
• Single radioactive label, can be on double- or single-stranded DNA.
• Nucleotide-specific, partial chemical modification (random along DNA).
• Chemical cleavage at modified nucleotides.
• Denaturation (heat, formamide), to allow uniform electrophoresis of
single-stranded DNA molecules that are perfectly linear and without
secondary structure (otherwise, sequencing artifacts result).
• High-resolution slab gel PAGE, followed by autoradiography.
• Reading (up to a few hundred nt/reaction) usually by a human expert.
• Several days of labor with a few gel runs yield at best 10 kbp of sequence.
2.1. Maxam-Gilbert sequencing – summary
Slow, many DNA purification steps,
requires lots of DNA, toxic reagents,
no automation available, relatively
short reads of up to a few hundred nt.
2.2. Sanger (enzymatic synthesis)
• Unique start of sequencing ladder is determined by a sequencing primer,
hybridized to DNA or RNA. Purity of template is not an issue (!), a huge
advantage.
• DNA polymerase (reverse transcriptase for RNA) for primer elongation.
• Nucleotide-specific termination (random) with one of four dideoxynucleotides that are mixed with the four regular nucleotides.
• Label may be radioactive or a fluorescent dye, located on:
– the primer itself (e.g., 5’ P32; dye label added during primer synthesis);
– nucleotides incorporated during synthesis (e.g., P32, S35);
– dideoxynucleotides (different dyes emitting different colors;
single-lane or capillary sequencing is possible: the current standard).
• High-resolution slab gel or capillary electrophoresis.
• Autoradiography or automated reading of migrating fragments (laser,
with camera or diodes).
• Several days of labor may produce ~100 kbp of sequence. Robotic
procedures for template purification and sequencing reactions allow
scale-up.
2.2. Sanger (enzymatic synthesis), summary
In this example, the primer is labeled, therefore requiring gel separation in four lanes
2.3. 454 Technology – Roche GS FLX pyrosequencing
(several hundred MB per run; advantage: reads up to 1,000 nt)
Pyrosequencing is a method of DNA sequencing (determining the order of
nucleotides in DNA) based on the "sequencing by synthesis" principle. It differs
from Sanger sequencing, in that it relies on the detection of pyrophosphate release
on nucleotide incorporation, rather than chain termination with
dideoxynucleotides.
Originally the leader in NGS, this technology is less efficient and more error-prone
than Illumina, and will therefore be abandoned by the company within a few years!
[Figure slides: 454 Technology – Roche GS FLX; multiplex reaction in oil emulsion droplets]
• DNA polymerase incorporates the correct, complementary dNTPs onto the template. This incorporation releases
pyrophosphate (PPi) stoichiometrically.
• ATP sulfurylase quantitatively converts PPi to ATP in the presence of adenosine 5´ phosphosulfate. This ATP
fuels the luciferase-mediated conversion of luciferin to oxyluciferin, which generates visible light.
• Unincorporated nucleotides and ATP are degraded by apyrase, and the reaction can restart.
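Because PPi is released stoichiometrically, the light signal in each nucleotide flow is ideally proportional to the number of nucleotides incorporated. The following Python sketch simulates such an idealized, noise-free flowgram. The flow order `TACG`, the function names and the example sequence are assumptions made for illustration only.

```python
def flowgram(synthesized, flow_order="TACG", cycles=4):
    """Return (nucleotide, signal) pairs, one per nucleotide flow.

    `synthesized` is the strand built by the polymerase (5' to 3').
    The signal is idealized: one light unit per incorporated nucleotide,
    so a run of identical nucleotides gives one proportionally taller peak.
    """
    flows = []
    pos = 0
    for _ in range(cycles):
        for nt in flow_order:
            run = 0
            # Incorporate as long as the next required nucleotide matches the flow.
            while pos < len(synthesized) and synthesized[pos] == nt:
                run += 1
                pos += 1
            flows.append((nt, run))
    return flows

def call_bases(flows):
    """Translate signal heights back into a sequence."""
    return "".join(nt * signal for nt, signal in flows)

flows = flowgram("TTAGGGC", cycles=2)
print(flows)
print(call_bases(flows))
```

In a real instrument the peak heights are noisy, so distinguishing, for example, seven from eight identical nucleotides becomes unreliable; this is the source of the homopolymer errors discussed under artifacts.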
2.4. Illumina (several GB per run; reads up to 300 nt)
Base calling example for two clusters
2.5. ABI SOLiD – sequencing by ligation
(2,000 MB per run; but only 35 nt/read)
A library of DNA fragments, ligated to universal sequence adaptors, is attached to the
surface of magnetic beads (one fragment per bead). Emulsion PCR in microreactors amplifies the fragments, which are then covalently bound to a glass slide.
SOLiD technology applies a rather complicated ligation/cleavage procedure. Partially
degenerate, fluorescently labeled DNA octamers with dinucleotide sequence recognition cores
are hybridized to the template, and perfectly annealing sequences are ligated to the primer.
After imaging, unextended strands are capped and fluorophores are cleaved. Repeated cycles of
new priming, primer removal, and ligation eventually cover a stretch of 35 nt twice
(redundantly), which improves the accuracy of base calling. Yet the value of 35 nt reads
is dwindling in the face of other NGS technologies that produce longer reads almost every year
(e.g., Illumina promising 300 nt for 2014).
[Figure: first ligation cycle and cleavage, and so on for the following cycles]
2.6. Ion Torrent (100 MB + per run; up to 200 nt/read)
Incorporation of a deoxyribonucleotide triphosphate (dNTP) into a primed, growing DNA
strand involves the release of pyrophosphate and a hydrogen ion, which is measured on a
semiconductor chip.
Microwells, each containing one single-stranded template DNA molecule plus a DNA
polymerase, are sequentially flooded with A, C, G or T. An introduced dNTP is incorporated into
the growing complementary strand only if it is complementary to the next unpaired nucleotide
on the template strand. If several identical nucleotides follow each other, the signal
strength correlates with the number of identical incorporated nucleotides.
The series of electrical pulses is translated into a DNA sequence, without intermediate signal
conversion, the use of labeled nucleotides, or error-prone intermediate amplification steps.
However, the signal precision is lower than with 454, Illumina, and SOLiD technologies.
2.7. Pacific Biosciences (+/- 3,000 nt/read, up to 15,000)
The PacBio RS II is a single-molecule, real-time DNA sequencing system that provides the
longest read lengths of any available sequencing technology; however, compared with all
other NGS technologies, it has the lowest precision. Sequencing occurs on SMRT Cells,
each containing thousands of Zero-Mode Waveguides (ZMWs) in which polymerases are
immobilized. The ZMWs provide a way of directly watching DNA polymerase with a
high-resolution camera as it performs sequencing by synthesis (fluorescence
measurement; four different fluorochrome-labeled nucleotides).
The long read length is valuable for the assembly of genomes, in particular in regions
containing long sequence repeats that otherwise cause problems in genome assembly. In
addition, the system detects DNA base modifications using the kinetics of the polymerization
reaction during sequencing.
2. Sequencing technologies – comparison from 2012
Quail et al. BMC Genomics 2012, 13:341
3. Random genome sequencing comparison
3.1. Sanger, Maxam-Gilbert, with cloning: DNA is not amplified in vitro
and therefore has no DNA amplification artifacts. Each clone receives an original
piece of DNA in a plasmid that is multiplied by E. coli.
3.2. NGS procedures without cloning, using DNAs either attached to nanochips (microwells) or in oil drop emulsion.
454, Illumina, SOLiD: DNA is highly PCR-amplified. Errors may therefore come from PCR
amplification artifacts.
Pacific Biosciences and Ion Torrent technologies both read single molecules directly, without prior
PCR amplification. Their relatively high error rate is instead due to the imprecision of the signal
itself.
4. Data formats of results – autoradiograms, traces, fastq and base call qualities
Trace file typical for Sanger sequencing with base call qualities indicated by
the height of blue bars and Q numbers. The advantage of this format is easy
spotting of artifacts by a human expert. The typical NGS format (FastQ) only
reports the sequence plus the quality encoded in machine readable format.
4. Data formats of results – quality scores in fastq format
(Illumina example)
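To illustrate how base call qualities are encoded in FastQ, the sketch below decodes a quality string into Phred scores (Q = -10 · log10 of the error probability). It assumes the Phred+33 ASCII offset used by Sanger-style and current Illumina FastQ files; older Illumina pipelines used an offset of 64. The record shown is invented for illustration, as are the function names.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FastQ quality string into a list of Phred Q values."""
    return [ord(ch) - offset for ch in quality_line]

def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def parse_fastq(lines):
    """Yield (name, sequence, quality string) from four-line FastQ records."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FastQ record at line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual

# Invented example record: name, sequence, separator, qualities.
record = ["@read1", "AGCTATTG", "+", "IIII##5?"]

for name, seq, qual in parse_fastq(record):
    for base, q in zip(seq, phred_scores(qual)):
        print(name, base, q, error_probability(q))
```

Here `I` decodes to Q40 (one error in 10,000 calls) and `#` to Q2, the kind of low-quality call that a human expert would spot at a glance in a trace file but that must be filtered by software in FastQ data.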
5. Artifacts in sequencing and sequence assembly
The denatured DNA is not linear: it folds back on itself
and then migrates differently on the sequencing gel
(Sanger)
– reason: secondary structures, mainly in G+C-rich regions
– effect: ‘compression’ zones in the sequencing ladder
– solutions: (i) sequence the DNA in both directions, on the complementary
strands, since sequencing artifacts due to folding are not symmetric; (ii) for
Sanger sequencing, use nucleotide analogs that minimize secondary
structure folding, like deaza-NTP, deaza-dITP, or ITP (instead of
NTPs or dGTP, respectively)
Sequencing ladders terminate prematurely or contain ‘holes’
Reasons:
• sequencing reactions over-modified (M&G), or terminator
concentrations too high (Sanger);
• strong nucleotide bias, like long runs of A or T that cause many
polymerases to fall off the template (Sanger)
Uncertain number of identical nucleotides in a row
(homopolymers; > 6)
Reasons:
• Amplification errors by DNA polymerase (Illumina, 454)
• Signal ambiguity when estimating the number of identical nucleotides
from the height of a single signal (Illumina, very high error with 454)
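Regions at risk of this artifact can be flagged before trusting a read or an assembly. The sketch below lists all runs of more than six identical nucleotides; the function name `homopolymer_runs` and the test sequence are illustrative assumptions.

```python
from itertools import groupby

def homopolymer_runs(seq, min_length=7):
    """Return (start, base, length) for every run of identical bases
    that is at least `min_length` long (default: longer than 6)."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_length:
            runs.append((pos, base, length))   # 0-based start position
        pos += length
    return runs

# Invented test sequence with one run of 8 T and one run of 7 A.
seq = "ACG" + "T" * 8 + "AG" + "A" * 7 + "C"
print(homopolymer_runs(seq))
```

Positions reported this way are good candidates for verification with a second technology or with reads from the complementary strand.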
Readings that only partially fit a genome sequence (one of
the worst artifacts)
Reasons:
• Ligation of separate pieces into one fragment, during primer ligation
(applies to all technologies using primer ligation)
• Partial deletion of sequence during PCR at repeat sequence and folded
structures (all technologies using PCR amplification)
This is it, folks!