Next Generation Sequencing, Assembly, and Alignment Methods
Andy Nagar

Agenda
• Background
• Next Generation Sequencing
• Sequence Assembly
• Sequence Alignment
• Traditional Alignment Algorithms
• Next Generation Alignment Algorithms
• Conclusion

Background
• Earlier sequencing methods were based on Sanger sequencing, which goes back to the 1970s.
• Sequencing was slow: bases were read one at a time.
• Separation is done by electrophoresis.
• Readout is by fluorescent tags.
Source: [Wikipedia]

Background
• Completing second-generation genome projects such as the Human Genome Project required faster, high-throughput sequencing.
• Next-generation sequencing technologies are based on various implementations of cyclic array sequencing.
• Cyclic array sequencing is the sequencing of an array of DNA features by iterative cycles of enzymatic manipulation and imaging-based data collection.

Growth in Sequencing
Growth of Next-Gen Sequencing: doubles every month
Source: [6]

Next Generation Sequencing
• Workflow:
  - DNA is fragmented.
  - Adaptors are ligated to the fragments.
  - Several possible protocols yield an array of PCR colonies.
  - Enzymatic extension with fluorescently tagged nucleotides.
  - Cyclic readout by imaging the array.
Source: [10]

Next Generation Sequencing
• Reads are done in parallel to speed up the sequencing.
Source: [11]

NGS - Products
• Products based on cyclic array sequencing include:
  - Roche’s 454
  - Illumina’s Genome Analyzer
  - ABI’s SOLiD
  - HeliScope
• They allow the sequencing of millions of short sequences (reads) simultaneously, and can sequence an entire human genome in a few days [Magi et al. 2010].

NGS - Products
Source: [13]

Comparison of existing methods
Source: [4]

Whole Genome Shotgun Sequences (WGS)
• DNA is broken up randomly into numerous small segments.
• Multiple overlapping reads for the target DNA are obtained by performing several rounds of this fragmentation and sequencing.
• Computer programs then use the overlapping ends of different reads to assemble them into a continuous sequence.

Sequencing
Source: [9]

How to ensure enough coverage
Source: [9]

Whole Genome Shotgun Sequences (WGS)
Source: http://www.nature.com/scitable/topicpage/complex-genomes-shotgun-sequencing-609

Assembly - Reconstructing the Genome
• 2 possible methods of assembly:
1. Overlap-Consensus Assembly:
  - The overlap-consensus assembly method uses the overlap between sequence reads to create a link between them.
  - The contig is eventually formed by reading along the links as far as possible.
  Problematic for short reads:
  - Overlaps must be calculated over a large proportion of the read.
  - The huge number of reads increases the number of links, so the contig path is difficult to compute.

Assembly - Reconstructing the Genome
• 2 possible methods of assembly:
2. de Bruijn Graph Approach:
  - All k-mers are computed and the reads are represented as a path through the k-mers.
  - A de Bruijn graph is a graph in which the nodes are sequences of symbols (e.g. nucleotide k-mers) and the edges represent overlaps between those sequences. This is a convenient way to represent data such as overlapping sequence reads.
  - de Bruijn graphs handle redundancy better and can assemble sequences more efficiently.

Assembly - Reconstructing the Genome
Source: [13]

Assembly - Reconstructing the Genome
Source: [12]

Assembly - de Bruijn Graph
• Reads are parsed into 4-mers.
• Matches are found and the de Bruijn graph is created.
• There can be more than one path in the graph. => Practical problems of assembly.
Source: [12]

What can we do about repeats?
Two main approaches:
• Cluster the reads
• Link the reads
Source: [9]
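
The de Bruijn graph approach from the assembly slides can be sketched in a few lines of Python. This is a minimal illustration, not a real assembler: it assumes error-free reads, uses k = 4 as in the 4-mer slide, takes (k-1)-mers as nodes with one edge per k-mer, and only walks a path while it is unambiguous. The function names and the toy reads are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k=4):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while the current node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:          # stop instead of looping around a cycle
            break
        contig += nxt[-1]        # each step adds one new base
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical toy reads that overlap end to end.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous(graph, "ATG")
print(contig)
```

With a repeated (k-1)-mer, a node gains more than one successor and the walk stops there, which is the "more than one path" problem noted on the slide.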
Traditional Sequence Alignment
• 2 types of traditional sequence alignment algorithms:
1. Hash-table based
  - e.g. BLAST (and its variants) => keeps track of each k-mer in a hash table, with the k-mer sequence as the key [14][15].
  - SSAHA => builds a position-sensitive hash table [17].
  Advantage: Fast search, allows gapped searches.
  Drawback: Large memory requirement to store the hash table.

Traditional Sequence Alignment
2. Tree-based search
  - e.g. suffix and prefix tries
  Advantage: Fast search, can easily search for substrings or patterns.
  Drawback: Inserting new sequences requires rebuilding the tree.

Traditional Sequence Alignment - Suffix Tree
Figure labels: one node represents “NA”, another represents “ANA”; NA is a suffix of ANA, so a suffix link connects them.
Suffix tree for the string BANANA. Each substring is terminated with the special character $. The six paths from the root to a leaf (shown as boxes) correspond to the six suffixes A$, NA$, ANA$, NANA$, ANANA$ and BANANA$. The numbers in the leaves give the start position of the corresponding suffix. Suffix links are drawn dashed.
Source: [19]

Next Generation Sequence Alignment
• With high-throughput sequencing, millions of reads are obtained in a single run.
• “Read-mapping” problem: how do the reads fit in the reference genome?
  - Find hits where these reads occur in the genome.
  - Report position(s) and frequency of hits.
  - A short read may map to many places in a genome, including on different chromosomes.

Next Generation Sequence Alignment
Source: [25]

Next Generation Sequence Alignment
• The Burrows-Wheeler Transform can be used to find matches of a query string inside a reference string.
Steps:
1. Create an array in which each element is a cyclic permutation of the original string, terminated by the end character “$”.
Example: String “googol”.
Original string: googol$
1st circular permutation => oogol$g
2nd circular permutation => ogol$go
… until $ moves to the front of the string
Last circular permutation => $googol
Source: [27]

Next Generation Sequence Alignment
Steps:
2. Sort the elements of the array in lexicographic order.
• $ is lexicographically the smallest character.
• i is the index of a row in the sorted array (the BW array).
• S(i) is the suffix array value for row i: the start position of that row in the original string.
Note: All occurrences of any substring appear next to each other, as prefixes of consecutive rows of the BW array. Such a range is called the Suffix Array Interval (SA Interval).
For example, “go” occurs as a prefix in positions 1 and 2, so the SA Interval of “go” = [1,2].
Source: [27]

Next Generation Sequence Alignment
Steps:
• SA Interval of “go” = [1,2].
• The values of S(i) give the corresponding positions in the original string. Here the S(i) values are 3 and 0.
• The original string is X = googol$.
• This algorithm has many extensions for finding inexact and gapped matches. More details are in reference [27].
Source: [27]

Conclusion
• Next Generation Sequencing is transforming the fields of genetics, molecular biology and bioinformatics.
• Sequencing projects produce enormous amounts of data.
• Computing and data analysis are lagging behind.
• More efficient data analysis and storage methods are needed.
• Use data mining to find useful information quickly, without needing to store the entire data set.

Conclusion
• More efficient assembly and alignment techniques are needed.
• There is a need for “metagenomic” analysis: finding out which organisms or species are present in a biological or environmental sample.
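
The Burrows-Wheeler steps in the “googol” example can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not the implementation from reference [27]: the rotations are sorted naively, the text is assumed not to contain “$”, and the interval is found with the standard backward-search update (counts C and Occ), which the slides do not spell out. The function names are hypothetical.

```python
def suffix_array_and_bwt(text):
    """Sort all rotations of text + '$'; return (suffix array S, BW string)."""
    s = text + "$"                    # '$' sorts before letters in ASCII
    n = len(s)
    # order[i] = start position in s of the i-th smallest rotation, i.e. S(i)
    order = sorted(range(n), key=lambda i: s[i:] + s[:i])
    # The BW string is the last character of each sorted rotation.
    bwt = "".join(s[i - 1] for i in order)
    return order, bwt


def sa_interval(bwt, query):
    """Backward search: inclusive (lo, hi) SA interval of query, or None if absent."""
    first_col = sorted(bwt)
    C = {}                            # C[c] = number of characters smaller than c
    for idx, c in enumerate(first_col):
        C.setdefault(c, idx)

    def occ(c, i):
        # Number of occurrences of c in bwt[0:i] (naive count, for clarity).
        return bwt[:i].count(c)

    lo, hi = 0, len(bwt)              # half-open interval [lo, hi)
    for c in reversed(query):         # process the query from its last character
        if c not in C:
            return None
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return None
    return lo, hi - 1


S, bwt = suffix_array_and_bwt("googol")
interval = sa_interval(bwt, "go")
positions = sorted(S[i] for i in range(interval[0], interval[1] + 1))
print(bwt, interval, positions)
```

For “googol” this gives the SA Interval [1,2] for “go” and the original-string positions 0 and 3, matching the slide. Production aligners precompute the occurrence counts instead of rescanning the BW string at every step.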