Next generation sequencing

Next generation sequencing refers to massively parallel DNA sequencing, which brings the cost per base pair sequenced down by orders of magnitude relative to the previous standard method (Sanger sequencing with four-color dye terminators, thermal cycling, and capillary electrophoresis). Massively parallel sequencing generates 100 Mb to 1 Gb of short sequence reads in a single experiment at a cost of ~$15,000. Third generation sequencing methods are now approaching 3 Gb for $1000.

Applications can be divided into deep sequencing (measuring the amount of gene expression by the number of cDNAs from an RNA sample that align with each gene in the genome) and de novo sequencing (sequencing a new genome). The practice of trying to assemble all the exons of an organism (previously the role of EST sequencing) is called 'exomics'.

The equipment for next generation sequencing was initially expensive, so it was nearly always run by a company or a core facility that spread the purchase cost over multiple users. Some third generation equipment is now cheap enough that individual labs may purchase a 'personal' next generation sequencer.

The nature of the experiment is to generate a large amount of data in a single run. Hence, it is not useful for determinations that require a small amount of data, e.g. verifying a cloned insert in a plasmid. The economics often revolve around combining multiple experiments into a single sequencing run. There are strategies, called 'barcoding', in which several samples are mixed together in such a way that the data can be segregated after the determination.

Deep Sequencing

In deep sequencing, the number of reads mapping to each gene is taken as a measure of its expression. The reads are searched against either the predetermined genome sequence or a predetermined set of cDNA sequences to assign them to different transcription units.
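
The idea of using read counts as an expression measure can be illustrated with a minimal Python sketch. It assumes the mapping step has already produced one gene name per assigned read; the function name, the gene names, and the scaling to reads per million are illustrative assumptions, not part of any vendor's software.

```python
from collections import Counter

def expression_from_assignments(assigned_genes):
    """Count reads per gene and scale each count to reads per million assigned reads."""
    counts = Counter(assigned_genes)
    total = sum(counts.values())
    # Each gene maps to (raw read count, reads per million assigned reads).
    return {gene: (n, 1e6 * n / total) for gene, n in counts.items()}

# Hypothetical assignments: one gene name per mapped read.
reads = ["geneA", "geneB", "geneA", "geneA", "geneC", "geneB"]
table = expression_from_assignments(reads)
for gene, (n, rpm) in sorted(table.items()):
    print(gene, n, round(rpm))
```

Scaling by the total number of assigned reads makes counts comparable between runs of different depth; it does not correct for gene length.
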
Because of the very large number of reads to search, speed improvements over BlastN have been sought to support these assignments. The relevant strategy is to recompose the target sequence as an alphabetized list of oligos. The search then becomes an exercise in looking up a sequence in an alphabetized list of sequences, which is a much faster process than searching through the target sequence in its natural order. BlastN also performs such an operation, but doesn't save the alphabetized list of subsequences for the target genome between searches. Applications that perform this operation are provided by each of the major vendors. Three open source third-party applications that carry out this function are compared below.

From Horner et al., Brief Bioinform 11:181-197 (2009); mapping 4,604,890 reads versus the human transcriptome allowing 2 mismatches.

                         Bowtie               PASS                 SOAP
Reads mapped             4,168,549 (90.52%)   4,183,679 (90.85%)   4,058,196 (88.13%)
Reads mapped correctly   3,987,222 (86.59%)   3,259,096 (70.77%)   3,359,094 (78.36%)
Processor time           255.9s               1928.0s              78.6s

Another major saving in time is that there is no attempt to align with gaps. Reads that fail to match within some criterion are simply discarded. Since the number of reads is generally far in excess of what is needed for the experiment, the discarded reads are not important.

Depending on the cost, the sequence service provider may provide a guarantee, usually that some minimum number of reads will be assigned. Since sequence reads that have errors beyond the assignment threshold are discarded, this is also a de facto guarantee on the quality of the sequence. In order to get the guarantee, there is usually some provision that the input RNA must be characterized up to some standard.

Barcoding

In order to combine samples from multiple investigators, or in a deep sequencing experiment from multiple cDNA samples, a procedure called "barcoding" is used.
In all the next generation sequencing methods used for deep sequencing and for which barcoding can be used, the DNA is sheared down to some particular size, adapters are ligated on, the molecules are diluted to one molecule per reaction unit, and then some variety of PCR is used with primers complementary to the adapters to amplify enough DNA within the reaction unit for the sequencing chemistry to be conducted. In barcoding, one of the adapters used for PCR priming is made longer so that part of it appears at the beginning of each sequence read. Adapters used for different samples have different sequences in that part of the adapter. The software that processes the individual sequence reads strips these extra bases from the reads, but sorts the reads into different files based on the stripped sequence so that each read is associated with its original sample. Barcoding requires a separate ligation reaction for each sample, and so adds a cost per sample. That cost has been dropping, and is now in the $100-$200 range for each sample.

De Novo Sequencing

In de novo sequencing, there is no pre-existing sequence to search against, so the reads must be searched against each other to determine overlaps and join them into contiguous segments of sequence (called 'contigs'). The alignment is conducted progressively, with each read that overlaps another being replaced by a contig sequence.

To determine accuracy, one might compare the individual reads to the consensus of the contigs. However, some methods make the same mistakes reproducibly, so agreement among reads does not prove the consensus is correct. Another method is to include a segment of DNA of known sequence with which to measure the error rate. If the sequencing contract includes a guarantee, you have to discuss with the vendor exactly what is guaranteed.
Many consider the guarantee to be satisfied if some number of reads are aligned, and will balk at repeating the determination if you point out that the assembled consensus for the included control template is full of errors.

A case study with de novo sequencing using 454 pyrosequencing:

From one vendor, sequencing a template of known sequence, we found that the errors were related to redundancy. All errors were in the lengths of runs. The error distribution was consistent with the returned quality scores (see below).

From a second vendor, again with a template of known sequence, we obtained an error rate of ~1/3000 even at a redundancy over 200x. The error distribution was not consistent with the returned quality scores.

The moral of the story is that the quality of next generation sequencing depends on redundancy, but also on who is doing it. There are many facilities doing deep sequencing, which is highly tolerant of errors. Those operators haven't been under pressure to optimize accuracy or to check that their returned quality scores are quantitatively accurate. Don't attempt de novo sequencing in a facility with no track record for de novo 2nd generation sequencing.

Quality Scores:

In addition to returning one or more assembled consensus sequences (referred to as 'contigs'), the vendor should return a file with a string of quality scores representing the confidence of each base called in each contig. In sequencing, quality scores are as follows:

10: 1 error per 10 nt.
20: 1 error per 100 nt.
30: 1 error per 1,000 nt.
40: 1 error per 10,000 nt.
etc.

The standard for acceptable de novo sequence is usually an aggregate of 1 error per 100,000 to 1 error per 1,000,000 nt. If the quality scores accurately reflect the likely position of errors, one can 'finish' with primer directed Sanger sequence to confirm the sequence at the sites of likely error.
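
The scores listed above follow the standard Phred convention, Q = -10 log10(P), where P is the probability that the base call is wrong. A minimal Python sketch of the conversion in both directions is given here; the function names are illustrative and not taken from any particular package.

```python
import math

def error_probability(q):
    """Probability that a base call is wrong, given its Phred-style quality score."""
    return 10 ** (-q / 10)

def quality_score(p):
    """Phred-style quality score for a given error probability."""
    return -10 * math.log10(p)

# Print the error probability for the scores in the list above.
for q in (10, 20, 30, 40):
    print(q, error_probability(q))
```

On this scale the de novo standard of 1 error per 100,000 to 1,000,000 nt corresponds to aggregate scores of 50 to 60.
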
In 454 sequencing at adequate redundancy, the quality scores will likely be at a maximum everywhere except that a base with a low score will be inserted at the ends of long runs.

This is a display showing inclusion of a confirmatory Sanger sequence read with the 454 data to resolve the length of a run of A's. The program is Consed (www.phrap.org/phredphrapconsed.html).

Assembly with in-house reads:

Our in-house sequencing operation uses Beckman equipment with on-board software for calling bases and assigning read quality values. Phrap was implemented at the Bioinformatics center for assembly and calculation of consensus quality values. Consed was implemented for data display (www.phrap.org/phredphrapconsed.html). In order to combine the 454 data with in-house reads, the 454 consensus sequences and associated quality values were converted into the format of a single long sequence read and presented to Phrap along with in-house reads for computation of a joint quality value on the overall consensus.

In this Consed view into a region of ambiguous 454 sequence, the 454 consensus (454contig00086) indicated the possibility of an 8th A in the run of A's. The quality value was 4, which is essentially 50/50. A confirmatory read with Sanger sequencing (the bottom read) was assembled, confirming 7 A's in the run. The confirmatory run exhibited some ambiguity as to whether there could be a T hiding in the run of A's, but it definitely counted 7 A's. Phrap estimated the combined quality score for the sequence AAAAAAAGTT to be 64 (less than one chance per million of an error).

Paired end sequencing

For de novo assembly of a sequence containing repetitive elements, it is necessary to have some way to pair the sequence on the left and right of each repeat. In first generation shotgun sequencing, this was done by cloning inserts of a defined size, and keeping track of the sequence generated from sequencing primers from the left and right end of each clone.
In 454 sequencing, a paired end strategy is to ligate an adapter to the sized, sheared target DNA, ligate the molecules into circles, capture the circles by the adapter, break and shear the circles down to size, and proceed to 454 style sequencing across the adapter region. The Illumina sequencing strategy, which involves flipping the fragment from one end to the other on the solid support during amplification, is easily modified to sequence one end of a fragment, and then flip it to sequence the other end. The ABI Solid system also has a paired end sequencing strategy.

Platforms

454 Life Sciences (Roche) has marketed two generations of sequencers, of which only the latter model, GS FLX Titanium, is currently available. A depiction of the 454 chemistry is shown below (taken from your core course materials):

First, fragmented molecules of the target DNA are ligated to adapters so that they can be subjected to PCR with primers complementary to those adapters. The adapter-tagged molecules are diluted so as to get one molecule attached per bead. The beads are then placed in a confined environment so that PCR can be conducted, creating enough copies, all attached to the same bead, to support DNA sequencing. The beads are deposited one per well in a microtiter plate with millions of wells, each of which is independently monitored. During sequencing, the dNTPs are washed over the plate one at a time. In each well where a nucleotide can add, a pyrophosphate is released, which is converted to ATP by reaction with adenosine phosphosulfate (APS). The ATP is used by luciferase to create a flash of light proportional to the amount of pyrophosphate released. The enzyme apyrase is present to destroy residual dNTP prior to addition of the next dNTP. Because of the central role of pyrophosphate in this chemistry, the process is called "pyrosequencing".

Modified from Agah et al., Nucleic Acids Res. 32: e166.
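
The flow-by-flow readout of the chemistry described above can be sketched in Python. This is an idealized, noise-free model under stated assumptions: the input is the sequence of the strand being synthesized, nucleotides are flowed in the order A, C, G, T, and the light signal equals the number of bases incorporated in that flow. The function names are hypothetical.

```python
def flowgram(synthesized, flow_order="ACGT", n_cycles=4):
    """Return (nucleotide, signal) pairs; signal is the run length added in each flow."""
    signals = []
    pos = 0
    for _ in range(n_cycles):
        for nt in flow_order:
            run = 0
            # All consecutive matching bases are incorporated in a single flow.
            while pos < len(synthesized) and synthesized[pos] == nt:
                run += 1
                pos += 1
            signals.append((nt, run))
    return signals

def decode(signals):
    """Call bases by rounding each signal to the nearest whole number of bases."""
    return "".join(nt * round(height) for nt, height in signals)

ideal = flowgram("AAAAAAAGTT", n_cycles=1)
print(ideal)
print(decode(ideal))
# A measured signal of 7.6 instead of 7.0 for the A flow is rounded to 8 A's.
print(decode([("A", 7.6), ("C", 0.0), ("G", 1.0), ("T", 2.0)]))
```

Because a whole run of identical bases is read as a single signal height, a small measurement error on a long run changes the called run length rather than the identity of a base.
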
Pyrosequencing produces a flowgram that reveals the sequence as the identity and number of the nucleotides that add as the reagents cycle through dATP, dCTP, dGTP, dTTP.

Fig. 3 from Margulies et al. [2005] Nature 437:376. A flowgram and its correspondence to the actual sequence.

The weakness of the pyrosequencing approach is that there can be ambiguity in the height of a flow bar, resulting in misreporting of the number of bases in a run. Margulies et al. reported one residual error per 3000 bp in 20-fold redundant sequence. The errors at this level of redundancy are essentially all frameshifts. The error rate per read is about 1/200 nt. Read length is 200-400 nt.

Illumina GA & GAII

The Illumina sequencing technology (also known as Solexa, or SBS technology) uses reversible dye terminators. First, individual DNA fragments are isolated and amplified in place, forming clusters (polonies) on a 2D solid support. In a variation of Sanger sequencing, a mixture of all four nucleotides is added, each blocked for further extension by a blocking group carrying a different fluorescent color. The base that is incorporated is read out by excitation of its fluorophore. The blocking group is then removed, allowing the process to continue to the next base.

Images from http://seqanswers.com

ABI Solid system

The ABI Solid system immobilizes and amplifies DNA fragments on beads much like the 454 system. The beads are deposited on a glass plate. Sequencing is by successive ligation of differentially colored fluorescent oligonucleotides with specific sequence in the 1st and 2nd positions followed by 3 degenerate bases. This gives sequence information about bases 1, 2, 6, 7, 11, 12... from the primer. Then the extended primer is stripped and replaced with a primer offset by one base, and the ligation series is run again. This is repeated 3 more times. The series of colors generated in all five extensions is then analyzed to produce the sequence.

Image from the ABI web site.
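
The way five offset primer rounds cover the read in the ligation scheme described above can be sketched in Python. This is only an illustration under stated assumptions: each ligation reads 2 bases and skips 3, and each new primer shifts the interrogated positions by +1 (the direction of the shift is an assumption made here for simplicity). The function name and the read length of 25 are illustrative.

```python
from collections import Counter

def interrogated_positions(read_length, offset):
    """1-based positions read in one primer round: 2 bases read, then 3 skipped."""
    positions = []
    start = 1 + offset
    while start <= read_length:
        for p in (start, start + 1):
            if p <= read_length:
                positions.append(p)
        start += 5
    return positions

# Five primer rounds, each assumed to be shifted by one base from the previous.
rounds = {offset: interrogated_positions(25, offset) for offset in range(5)}
for offset, positions in rounds.items():
    print(offset, positions)

# How many rounds interrogate each position of the read.
coverage = Counter(p for positions in rounds.values() for p in positions)
```

In this model every position other than the first is interrogated in two different rounds, which is why the colors from all five extensions have to be analyzed together to produce the sequence.
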
Summary of characteristics of these three systems:

From Horner et al., Brief Bioinform 11:181-197 (2009)

Technology     Roche 454               Illumina        ABI Solid
Platform       GS 20   FLX     Ti      GA      GA II   1       2       3
Reads (M)      0.5     0.5     1       28      100     40      115     400
Read length    100     200     350     35      75      25      35      50
Run time (d)   0.2     5       0.3     0.4     4.5     6       5       6-7
Images (TB)    0.01    0.03    0.03    0.5     1.7     1.8     2.5     3

New Systems

Ion Torrent

Ion Torrent sequencing works much like 454 sequencing, except that instead of detecting released PPi by conversion to a flash of light, the production of H+ ions is sensed using semiconductor technology. This makes the instrument much cheaper, roughly $50,000 instead of $500,000.

Single molecule sequencing (SMRT)

SMRT is carried out with incorporation of nucleotides that give off a base-specific fluorescent signal on each addition. The templates are distributed one per well (or more exactly, there is one DNA polymerase per well). The wells are constructed with highly sensitive sensors so that the signal from one DNA molecule can be read.

A variety of other systems are nearing the market.

Multiple displacement amplification

Multiple displacement amplification (MDA) is a new method for amplifying genomic DNA prior to next generation sequencing. It can prepare sufficient DNA for sequencing starting from a single cell. It is used when variation in the DNA sequence within each individual cell is of interest. In human cells, this might be relevant, for example, if one were observing the different DNA rearrangements within individual lymphocytes. In microbiology, it is used to recover genome sequences from bacteria that cannot be cultured.

The method usually involves a cell sorter to isolate an individual cell. The cell is opened, the DNA is denatured, and then polymerization is carried out with phi29 DNA polymerase and random hexamer primers. Each extending strand encounters the back of some other primed unit and displaces it.
The displaced strand is also a target for hexamer priming, which displaces further strands that in turn become targets for more priming, and so on. There is no thermal cycling. The polymerase has proofreading activity and produces low error rates. The number of polymerizations to the final product is fewer than in PCR, so fewer errors build up. The error rate can be as low as 10^-6. The highly nested DNA is then disaggregated by digestion with the single strand-specific nuclease S1. This DNA can then be the subject of next generation sequencing.

Ref: Zhang et al., Nat. Biotechnol. 24:680-686 (2006)

Qiagen sells a kit to conduct MDA. Image from the Qiagen Web site.