Next generation sequencing

Next generation sequencing refers to massively parallel DNA sequencing, which brings
the cost per base pair sequenced down by orders of magnitude relative to the previous
standard method (Sanger sequencing with four-color dye terminators, thermal cycling,
and capillary electrophoresis). Massively parallel sequencing generates 100 Mb to 1 Gb
of short sequence reads in a single experiment at a cost of ~$15,000. Third generation
sequencing methods are now approaching 3 Gb for $1000. Applications can be divided
into deep sequencing (measuring the level of gene expression by the number of
cDNAs from an RNA sample that align with each gene in the genome) and de novo
sequencing (sequencing a new genome). The practice of trying to assemble all the exons
of an organism (previously the role of EST sequencing) is called 'exomics'.
The equipment for next generation sequencing was initially expensive, so sequencing was
nearly always done by a company or a core facility that spread the purchase cost over
multiple users. Some third generation equipment is now cheap enough that individual labs
may purchase a 'personal' next generation sequencer. The nature of the experiment is to
generate a large amount of data in a single run. Hence, it is not useful for determinations
that require a small amount of data, e.g. verifying a cloned insert in a plasmid. The
economics often revolve around combining multiple experiments into a single sequencing
run. There are strategies, called 'barcoding', in which several samples are mixed together
in such a way that the data can be separated by sample after the run.
Deep Sequencing
In deep sequencing, the number of reads mapping to different genes is taken as a measure
of expression. The reads are searched against either the predetermined genome sequence
or a predetermined set of cDNA sequences to assign them to different transcription units.
Because of the very large number of reads to search, speed improvements over BlastN
have been sought to support these assignments. The relevant strategy is to recompose the
target sequence as an alphabetized list of oligos. The search then becomes an exercise in
looking up a sequence in an alphabetized list of sequences, which is a much faster
process than searching through the target sequence in its natural order. BlastN also does
such an operation, but doesn't save the alphabetized list of subsequences for the target
genome between searches. Applications that perform this operation are provided by each
of the major vendors. The following is a comparison of three open source third-party
applications that carry out this function.
From Horner et al., Brief Bioinform 11:181-197 (2009); mapping 4,604,890 reads versus
the human transcriptome allowing 2 mismatches.
                        Bowtie              PASS                SOAP
Reads mapped            4,168,549 (90.52%)  4,183,679 (90.85%)  4,058,196 (88.13%)
Reads mapped correctly  3,987,222 (86.59%)  3,259,096 (70.77%)  3,359,094 (78.36%)
Processor time          255.9s              1928.0s             78.6s
Another major savings in time is that there is no attempt to align with gaps. Reads that
fail to match within some criterion are simply discarded. Since the number of reads is
generally far in excess of what is needed for the experiment, the discarded reads are not
important.
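The alphabetized-oligo lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm used by Bowtie, PASS, or SOAP: the oligo length K, the function names, and the toy transcripts are all invented for the example, and only the first K bases of each read are used as a seed. The target is indexed once as a sorted list, each read is looked up by binary search, and candidate hits are checked without gaps; reads that fail are discarded.

```python
from bisect import bisect_left
from collections import Counter

K = 8  # oligo length for the index (an arbitrary choice for this sketch)

def build_index(transcripts):
    """Return an alphabetized list of (oligo, transcript name, offset) tuples."""
    index = []
    for name, seq in transcripts.items():
        for i in range(len(seq) - K + 1):
            index.append((seq[i:i + K], name, i))
    index.sort()
    return index

def lookup(index, oligo):
    """Binary-search the sorted index for every occurrence of one oligo."""
    hits = []
    pos = bisect_left(index, (oligo,))
    while pos < len(index) and index[pos][0] == oligo:
        hits.append(index[pos][1:])
        pos += 1
    return hits

def assign_read(index, transcripts, read, max_mismatches=2):
    """Seed with the read's first K bases, then verify the full read without gaps."""
    for name, offset in lookup(index, read[:K]):
        target = transcripts[name][offset:offset + len(read)]
        if len(target) == len(read):
            mismatches = sum(a != b for a, b in zip(read, target))
            if mismatches <= max_mismatches:
                return name
    return None  # unassigned reads are simply discarded

# Toy data (hypothetical sequences, for illustration only)
transcripts = {
    "geneA": "ATGGCGTACGTTAGCCTAGGATCCA",
    "geneB": "ATGTTTCCCGGGAAATTTCCCGGGA",
}
index = build_index(transcripts)
reads = ["GTACGTTAGCCTAG", "CCCGGGAAATTTCC", "CCCGGGAAATTACC", "TTTTTTTTTTTTTT"]
counts = Counter(assign_read(index, transcripts, r) for r in reads)
```

Here `counts` holds the number of reads assigned to each transcript, which is the quantity taken as the measure of expression. A production mapper would save the index to disk between searches and use more than one seed per read, so that a mismatch inside the seed does not cause the read to be lost.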
Depending on the cost, the sequence service provider may provide a guarantee. The
guarantee is usually that some number of reads is guaranteed to be assigned. Since
sequence reads that have errors beyond the assignment threshold are discarded, this is
also a de facto guarantee on the quality of the sequence. In order to get the guarantee,
there is usually some provision that the input RNA must be characterized up to some
standard.
Barcoding
In order to combine samples from multiple investigators, or in a deep sequencing
experiment from multiple cDNA samples, a procedure called "barcoding" is used. In all
the next generation sequencing methods used for deep sequencing and for which
barcoding can be used, the DNA is sheared down to some particular size, adapters are
ligated on, the molecules are diluted to one molecule per reaction unit, and then some
variety of PCR is used with primers complementary to the adapters to amplify enough
DNA within the reaction unit for the sequencing chemistry to be conducted. In
barcoding, one of the adapters used for PCR priming is made longer so that part of it
appears as the beginning of each sequence read. Adapters used for different samples
have different sequences in that part of the adapter. The software that processes the
individual sequence reads will strip these extra bases from the reads, but sort the reads in
different files based on the stripped sequence so that each read is associated with its
original sample. Barcoding requires a separate ligation reaction for each sample, and so
adds a cost per sample. That cost has been dropping, and is now in the $100-$200 range
for each sample.
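The sorting step performed by the read-processing software can be sketched as follows. This is a minimal illustration, not any vendor's implementation: the barcode sequences, their length, and the sample names are hypothetical, and exact matching is assumed (real software usually tolerates an error in the barcode).

```python
from collections import defaultdict

# Hypothetical barcodes for illustration; real adapter sequences come from the kit.
BARCODES = {"ACGT": "sample_1", "TGCA": "sample_2", "GATC": "sample_3"}
BARCODE_LEN = 4

def demultiplex(reads, barcodes=BARCODES, barcode_len=BARCODE_LEN):
    """Group reads by their leading barcode and strip the barcode from each read."""
    by_sample = defaultdict(list)
    for read in reads:
        tag, insert = read[:barcode_len], read[barcode_len:]
        sample = barcodes.get(tag, "unassigned")  # unknown tags go to their own bin
        by_sample[sample].append(insert)
    return dict(by_sample)

reads = ["ACGTTTGACCA", "TGCAGGGTTTA", "ACGTCCCAAAT", "NNNNACGTACG"]
binned = demultiplex(reads)
```

Each key of `binned` corresponds to one output file in the description above, and the stored reads no longer carry the barcode bases.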
De Novo Sequencing
In de novo sequencing, there is no pre-existing sequence to search against. So the reads
must be searched against each other to determine overlaps and join them into contiguous
segments of sequence (called 'contigs'). The alignment will be conducted progressively,
with each read that overlaps another being replaced by a contig sequence.
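The progressive joining of overlapping reads can be illustrated with a toy greedy assembler. This is a sketch under simplifying assumptions and not the method of any real assembler: reads are error-free, all come from the same strand, overlaps must be exact, and the function names and minimum overlap are invented for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly replace the best-overlapping pair of sequences with their merged contig."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # no remaining overlaps: these are the final contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]

reads = ["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]
contigs = greedy_assemble(reads)
```

With these three reads the result is a single contig, ATGGCGTACGTTAGC. Real assemblers must also handle sequencing errors, reverse-complement reads, and repeats, which is where the accuracy and paired end issues discussed in this section come in.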
For determination of accuracy, one might compare the individual reads to the consensus
of the contigs. However, some methods make the same mistakes reproducibly. Another
method is to include a segment of DNA with known sequence with which to measure the
error rate. If the sequencing contract includes a guarantee, you have to discuss with the
vendor exactly what is guaranteed. Many consider the guarantee to be satisfied if some
number of reads are aligned, and will balk at repeating the determination if you point out
that the assembled consensus for the included control template is full of errors.
A case study with de novo sequencing using 454 pyrosequencing.
From one vendor, sequencing a template of known sequence, we found that the errors
were related to redundancy. All errors were in the lengths of homopolymer runs. The error distribution
was consistent with returned quality scores (see below).
From a second vendor, again with a template of known sequence, we obtained an error
rate of ~ 1/3000 even at a redundancy over 200x. The error distribution was not
consistent with the returned quality scores. The moral of the story is that the quality of
next generation sequencing depends on redundancy, but also on who's doing it.
There are many facilities out there doing deep sequencing, which is highly tolerant of
errors. Those operators haven't been under pressure to optimize accuracy or check that
their returned quality scores are quantitatively accurate. Don't attempt to do de novo
sequencing in a facility with no track record for de novo 2nd generation sequencing.
Quality Scores: In addition to returning the sequence of one or more assembled
consensus sequences (referred to as 'contigs'), the vendor should return a file with a string
of quality scores representing the confidence of each base called in each contig. In
sequencing, quality scores are as follows:
 10: 1 error per 10 nt.
 20: 1 error per 100 nt.
 30: 1 error per 1,000 nt.
 40: 1 error per 10,000 nt.
 etc.
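The scale listed above is the standard Phred relation Q = -10 log10(P), where P is the probability that the base call is wrong. A short sketch of the conversion follows; the function names are chosen for this example only.

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with quality score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Quality score corresponding to an error probability p."""
    return -10 * math.log10(p)

def expected_errors(quality_scores):
    """Expected number of wrong bases in a contig, given its per-base scores."""
    return sum(phred_to_error_prob(q) for q in quality_scores)
```

For example, `phred_to_error_prob(4)` is about 0.4, and a score of 64 corresponds to fewer than one error per million bases. Summing the per-base probabilities over a contig gives an estimate to compare against the 1 per 100,000 to 1 per 1,000,000 standard, provided the returned scores are themselves accurate.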
The standard for acceptable de novo sequence is usually an aggregate of 1 error per
100,000 to 1 error per 1,000,000 nt. If the quality scores accurately reflect the likely
position of errors, one can 'finish' with primer directed Sanger sequence to confirm the
sequence at the sites of likely error. In 454 sequencing at adequate redundancy, the
quality scores will likely be at a maximum everywhere except that a base with a low
score will be inserted at the ends of long runs. This is a display showing inclusion of a
confirmatory Sanger sequence read with the 454 data to resolve the length of a run of A's.
The program is Consed (www.phrap.org/phredphrapconsed.html).
Assembly with in-house reads: Our in-house sequencing operation uses Beckman
equipment with on-board software for calling bases and assigning read quality values.
Phrap was implemented at the Bioinformatics center for assembly and calculation of
consensus quality values.
Consed was implemented for data display (www.phrap.org/phredphrapconsed.html).
In order to combine the 454 data with in-house reads, the 454 consensus sequences and
associated quality values were converted into the format of a single long sequence read
and presented to Phrap along with the in-house reads for computation of a joint quality
value on the overall consensus.
In this consed view into a region of ambiguous 454 sequence, the 454 consensus
(454contig00086) indicated the possibility of an 8th A in the run of A's. The quality
value was 4 (about a 40% chance of error), which is close to a coin toss. A confirmatory
read with Sanger sequencing
(the bottom read) was assembled confirming 7 A's in the run. The confirmatory run
exhibited some ambiguity as to whether there could be a T hiding in the run of A's, but it
definitely counted 7 A's. Phrap estimated the combined quality score for the sequence
AAAAAAAGTT to be 64 (less than one chance per million of an error).
Paired end sequencing
For de novo assembly of a sequence containing repetitive elements, it is necessary to
have some way to pair the sequence on the left and right of each repeat. In first
generation shotgun sequencing, this was done by cloning inserts of a defined size, and
keeping track of the sequence generated from sequencing primers from the left and right
end of each clone. In 454 sequencing, a paired end strategy is to ligate an adapter to the
sized sheared target DNA, ligate to circles, and then capture the circles by the adapter,
break and shear the circle down to size, and proceed to 454 style sequencing across the
adapter region. The Illumina sequencing strategy, which involves flipping the fragment
from one end to the other on the solid support during amplification, is easily modified to
sequence one end of a fragment, and then flip it to sequence the other end. The ABI SOLiD
system also has a paired end sequencing strategy.
Platforms
454 Life Sciences (Roche) has marketed two generations of sequencers, of which only the
latter model, GS FLX Titanium, is currently available. A depiction of the 454 chemistry
is shown below (taken from your core course materials):
First fragmented molecules of the target DNA are ligated to adapters so that they can be
subjected to PCR based on primers complementary to those adapters. The adapter-tagged
molecules are diluted so as to get one molecule attached per bead. The beads are then
placed in a confined environment so that PCR can be conducted with the result of
creating enough copies all attached to the same bead to support DNA sequencing.
The beads are deposited one per well in a microtiter plate with millions of wells, each of
which is independently monitored. During sequencing, the dNTPs are washed over the
plate one at a time. In each well where a nucleotide can add, a pyrophosphate is released
which is converted to ATP by reaction with adenosine phosphosulfate (APS). The ATP
is metabolized by luciferase to create a flash of light proportional to the amount of the
pyrophosphate released. The enzyme apyrase is present to destroy residual dNTP prior to
addition of the next dNTP. Because of the central role of pyrophosphate in this
chemistry, the process is called "pyrosequencing".
Modified from Agah et al., Nucleic Acids Res. 32: e166.
This results in a flowgram revealing the sequence as the identity and numbers of each
nucleotide that add as the reagents cycle through dATP, dCTP, dGTP, dTTP.
Fig. 3 from Margulies et al. [2005] Nature 437:376. A flowgram and its correspondence
to the actual sequence.
The weakness of the pyrosequencing approach is that there can be ambiguity in the height
of a flow bar resulting in a misreporting of the number of bases in a run. Margulies et al.
reported one residual error per 3000 bp in 20 fold redundant sequence. The errors at this
level of redundancy are essentially all frameshifts. The error per read is about 1/200 nt.
Read length is 200-400 nt.
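The way a flowgram is turned into sequence, and why long runs are error-prone, can be sketched as follows. This is an illustrative simplification, not the instrument's base caller: the flow order follows the reagent cycle as described in the text (real instruments may use a different order), the signals are assumed to be already normalized so that one base gives a signal near 1.0, and the example values are invented.

```python
FLOW_ORDER = "ACGT"  # reagent cycle as described in the text; an assumption of this sketch

def call_flowgram(signals, flow_order=FLOW_ORDER):
    """Round each flow signal to a run length and emit that many copies of the base."""
    bases = []
    for i, signal in enumerate(signals):
        base = flow_order[i % len(flow_order)]
        run_length = int(signal + 0.5)  # nearest integer; signals are non-negative
        bases.append(base * run_length)
    return "".join(bases)

def ambiguous_flows(signals, tolerance=0.25):
    """Indices of flows whose signal lies far from any whole number of bases."""
    return [i for i, s in enumerate(signals) if abs(s - round(s)) > tolerance]

# Invented, normalized signals for two reagent cycles (A, C, G, T, A, C, G, T)
signals = [1.02, 0.05, 2.1, 0.0, 6.55, 0.9, 0.1, 1.0]
sequence = call_flowgram(signals)
uncertain = ambiguous_flows(signals)
```

The fifth flow (signal 6.55) is called as a run of seven A's, but it sits almost halfway between six and seven, so it is flagged as uncertain. This is the kind of run-length ambiguity that produces the frameshift errors described above.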
Illumina GA & GAII
The Illumina sequencing technology (a.k.a. Solexa, or SBS technology) uses reversible
dye terminators. First, individual DNA fragments are isolated and amplified in place,
forming clusters (polonies) on a 2D solid support.
In a variation of Sanger sequencing, a mixture of all four nucleotides is added, each
blocked for further extension by a blocking group carrying a different fluorescent color.
The base that is incorporated is read out by excitation of its fluorophore. The blocking
group is then removed, allowing the process to continue to the next base.
Images from http://seqanswers.com
ABI SOLiD system
The ABI SOLiD system immobilizes and amplifies DNA fragments on beads much like the
454 system. The beads are deposited on a glass plate. Sequencing is by successive
ligation to differentially colored fluorescent oligonucleotides with specific sequence in
the 1st and 2nd position followed by 3 degenerate bases. This gives sequence
information about the bases 1,2,6,7,11,12... from the primer. Then the extended primer is
stripped and replaced with a primer one base longer. This is repeated 3 more times. The
series of colors generated in all five extensions is then analyzed to produce the sequence.
Image from the ABI web site.
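The pattern of interrogated positions can be worked out with a short sketch. It takes the description above at face value and makes several assumptions of its own: each successive primer shifts the starting point by one base, the read length is set to 25, and the function name and parameters are invented for the example. Decoding the colors into bases is not shown.

```python
from collections import Counter

def interrogated_positions(primer_offset, read_length, probe_len=5, informative=2):
    """Positions (1-based, counted from the first primer) read in one primer round."""
    positions = []
    start = primer_offset + 1
    while start <= read_length:
        # the first two bases of each ligated probe carry sequence information
        for p in range(start, min(start + informative, read_length + 1)):
            positions.append(p)
        start += probe_len  # the remaining bases of the probe are degenerate
    return positions

rounds = [interrogated_positions(offset, 25) for offset in range(5)]
coverage = Counter(p for r in rounds for p in r)
```

The first round gives positions 1, 2, 6, 7, 11, 12 and so on, as stated in the text. Over the five rounds every position from 2 onward is interrogated twice under these assumptions, which is why the five color series can be analyzed together to produce the sequence.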
Summary of characteristics of these three systems:
From Horner et al., Brief Bioinform 11:181-197 (2009)
Technology    Roche 454            Illumina      ABI SOLiD
Platform      GS 20  FLX    Ti     GA     GA II  1      2      3
Reads (M)     0.5    0.5    1      28     100    40     115    400
Read length   100    200    350    35     75     25     35     50
Run time (d)  0.2    0.3    0.4    5      4.5    6      5      6-7
Images (TB)   0.01   0.03   0.03   0.5    1.7    1.8    2.5    3
New Systems
Ion Torrent
Ion Torrent sequencing works much like 454 sequencing, except that instead of detecting
released PPi by conversion to a flash of light, the production of H+ ions is sensed using
semiconductor technology. This makes the instrument much cheaper, roughly $50,000
instead of $500,000.
Single molecule sequencing (SMRT)
SMRT is carried out with incorporation of nucleotides that give off a base specific
fluorescent signal on each addition. The templates are distributed one per well (or more
exactly, there is one DNA polymerase per well). The wells are constructed with highly
sensitive sensors so that the signal from one DNA molecule can be read.
A variety of other systems are nearing the market.
Multiple displacement amplification
Multiple displacement amplification is a new method for amplifying genomic DNA prior
to next generation sequencing. It can prepare sufficient DNA for sequencing starting
from a single cell. It is used when variation in the DNA sequence within each individual
cell is of interest. In human cells, this might be relevant, for example, if one were
observing the different DNA rearrangements within individual lymphocytes. In
microbiology, this is used to recover sequences of genomes from bacteria that cannot be
cultured.
The method usually involves a cell sorter to sort out an individual cell. The cell is
opened, the DNA is denatured, and then polymerization is carried out with phi29 DNA
polymerase and random hexamer primers. Each extending strand encounters the back of
some other primed unit and displaces it. The displaced strand is also the target for
hexamer priming, which displaces strands which then become targets for more priming,
etc. There is no thermal cycling. The polymerase has proofreading activity and produces
low error rates. The number of polymerizations to the final product is fewer than in PCR
and so fewer errors build up. The error rate can be as low as 10^-6. The highly nested
DNA is then disaggregated by digestion with the single strand-specific nuclease S1. This
DNA can then be the subject of next generation sequencing.
Ref: Zhang et al., Nat. Biotechnol. 24:680-686 (2006)
Qiagen sells a kit to conduct MDA.
Image from the Qiagen Web site.