Next Generation Sequence Alignment

Next Generation Sequencing,
Assembly, and Alignment Methods
Andy Nagar
Agenda
• Background
• Next Generation Sequencing
• Sequence Assembly
• Sequence Alignment
• Traditional Alignment Algorithms
• Next Generation Alignment Algorithms
• Conclusion
Background
• Earlier sequencing methods were based on Sanger
sequencing, which goes back to the 1970s.
• Sequencing was slow because bases were read one at a time.
• Fragments are separated by electrophoresis.
• Bases are read out using fluorescent tags.
Source:[Wikipedia]
Background
• Large genome projects such as the Human Genome
Project created a need for faster, high-throughput
sequencing.
• Next-Generation Sequencing technologies are based on
various implementations of cyclic array sequencing.
• Cyclic array sequencing is based on the idea of
sequencing an array of DNA features through iterative
cycles of enzymatic manipulation and imaging-based
data collection.
Growth in Sequencing
Growth of Next-Gen Sequencing: doubles every month
Source:[6]
Next Generation Sequencing
• Workflow:
  • DNA is fragmented.
  • Adaptors are ligated to the fragments.
  • Several possible protocols yield an array of PCR colonies.
  • Enzymatic extension with fluorescently tagged nucleotides.
  • Cyclic readout by imaging the array.
Source:[10]
Next Generation Sequencing
• Reads are done in parallel to speed up the sequencing.
Source:[11]
NGS - Products
• Products based on cyclic array sequencing include:
  - Roche’s 454
  - Illumina’s Genome Analyzer
  - ABI’s SOLiD
  - HeliScope
• They allow the sequencing of millions of short
sequences (reads) simultaneously, and can sequence an
entire human genome in a few days [Magi et al. 2010].
NGS - Products
Source:[13]
Comparison of existing methods
Source:[4]
Whole Genome Shotgun Sequences (WGS)
• DNA is broken up randomly into numerous small
segments.
• Multiple overlapping reads for the target DNA are
obtained by performing several rounds of this
fragmentation and sequencing.
• Computer programs then use the overlapping ends of
different reads to assemble them into a continuous
sequence.
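The fragmentation step can be illustrated with a small simulation. This is only a sketch: the function names `shotgun_reads` and `per_base_depth`, the toy genome, and the fixed read length are assumptions made for illustration, and real shotgun libraries have variable fragment lengths and sequencing errors.

```python
import random


def shotgun_reads(genome, read_len, n_reads, seed=0):
    """Sample reads of a fixed length at random start positions (hypothetical helper)."""
    rng = random.Random(seed)
    last_start = len(genome) - read_len
    starts = [rng.randrange(last_start + 1) for _ in range(n_reads)]
    return [(s, genome[s:s + read_len]) for s in starts]


def per_base_depth(genome_len, reads):
    """Count how many reads cover each position of the genome."""
    depth = [0] * genome_len
    for start, seq in reads:
        for pos in range(start, start + len(seq)):
            depth[pos] += 1
    return depth


if __name__ == "__main__":
    genome = "ACGT" * 25                      # toy 100-base "genome"
    reads = shotgun_reads(genome, read_len=10, n_reads=50)
    depth = per_base_depth(len(genome), reads)
    # Average coverage equals (number of reads * read length) / genome length.
    print(sum(depth) / len(genome))
    print(depth.count(0), "positions are not covered by any read")
```

Because the start positions are random, some positions can remain uncovered even when the average coverage is several-fold, which is why multiple rounds of fragmentation and sequencing are used.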
Sequencing
Source:[9]
How to ensure enough coverage
Source:[9]
Whole Genome Shotgun Sequences (WGS)
Source: http://www.nature.com/scitable/topicpage/complex-genomes-shotgun-sequencing-609
Assembly - Reconstructing the Genome
• Two possible methods of assembly:
1. Overlap Consensus Assembly:
The overlap consensus assembly method uses the overlaps
between sequence reads to create links between them. A
contig is formed by following the links as far as
possible.
Problematic for short reads:
- Overlaps must be calculated over a large proportion of each read.
- The huge number of reads increases the number of links, so the
contig path is difficult to compute.
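The overlap idea can be sketched in a few lines. The code below is a naive greedy merger written for illustration only, not the method of any particular assembler; the names `overlap` and `greedy_assemble` and the `min_len` threshold are assumptions. Its all-against-all comparison also shows why the number of links becomes a problem for large read sets.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap (naive sketch)."""
    reads = list(dict.fromkeys(reads))        # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        # All-against-all comparison: quadratic in the number of reads.
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return reads                      # no overlaps left: these are the contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTGCA", "TGCACCT"]))
    # ['ATGGCGTGCACCT']
```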
Assembly - Reconstructing the Genome
• Two possible methods of assembly:
2. de Bruijn Graph Approach:
- All k-mers are computed, and each read is represented as a path
through the k-mers.
- A de Bruijn graph is a graph in which the nodes are short
fixed-length strings of symbols (i.e. nucleotides) and the edges
represent overlaps between those strings. This is a convenient way
to represent data such as overlapping sequence reads.
- de Bruijn graphs handle redundancy better than the overlap
approach and can assemble sequences more efficiently.
Assembly - Reconstructing the Genome
Source:[13]
Assembly - Reconstructing the Genome
Source:[12]
Assembly - de Bruijn Graph
• Reads are parsed into 4-mers
• Matches are found and de Bruijn Graph is created
• There can be more than one path in the graph.
=> Practical problems of assembly.
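A minimal sketch of this construction follows. It assumes the variant in which each k-mer is a node and an edge joins two k-mers that are adjacent in some read (so they overlap by k-1 bases); the function name `de_bruijn` and the example reads are illustrative, not taken from the figure.

```python
from collections import defaultdict


def de_bruijn(reads, k=4):
    """Map each k-mer to the sorted list of k-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        # Consecutive k-mers in a read overlap by k-1 bases: add an edge between them.
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    return {node: sorted(successors) for node, successors in graph.items()}


if __name__ == "__main__":
    # Two reads that share the 4-mer ACGT and then diverge.
    print(de_bruijn(["ACGTC", "ACGTG"]))
    # {'ACGT': ['CGTC', 'CGTG']}
```

The node ACGT has two successors, so there is more than one path through the graph; an assembler cannot tell from the graph alone which continuation is correct.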
Source:[12]
What can we do about repeats?
Two main approaches:
• Cluster the reads
• Link the reads
Source:[9]
Traditional Sequence Alignment
• Two types of traditional sequence alignment
algorithms:
1. Hash-table based
e.g. BLAST (and its variants) => keeps track of each k-mer in a hash table, with the sequence as the key
[14][15].
SSAHA => builds a position-sensitive hash table [17].
Advantage: Fast search; allows gapped searches.
Drawback: Large memory requirement to store the hash
table.
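The hash-table idea can be sketched as a k-mer index over the reference. This is a toy illustration under stated assumptions, not the actual data structure of BLAST or SSAHA; `build_kmer_index` and `seed_hits` are hypothetical names.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Hash table: k-mer sequence (key) -> list of start positions in the reference."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_hits(index, query, k):
    """Return (query offset, reference position) pairs for every shared k-mer."""
    hits = []
    for offset in range(len(query) - k + 1):
        for pos in index.get(query[offset:offset + k], []):
            hits.append((offset, pos))
    return hits


if __name__ == "__main__":
    index = build_kmer_index("GATTACAGATTA", k=4)
    print(seed_hits(index, "GATTA", k=4))
    # [(0, 0), (0, 7), (1, 1), (1, 8)]
```

Each lookup is a constant-time dictionary access, which is the source of the fast search; the index holds one entry per reference position, which is the source of the memory cost.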
Traditional Sequence Alignment
2. Tree-based search
e.g. suffix and prefix tries
Advantage: Fast search; can easily search for substrings
or patterns.
Drawback: Inserting new sequences requires rebuilding the tree.
Traditional Sequence Alignment – Suffix Tree
[Figure labels: one node represents “NA” and another represents “ANA”; NA is a suffix of ANA, so the two nodes are joined by a suffix link.]
Suffix tree for the string BANANA.
Each suffix is terminated with the special character $.
The six paths from the root to a leaf (shown as boxes) correspond to the
six suffixes A$, NA$, ANA$, NANA$, ANANA$ and BANANA$.
The numbers in the leaves give the start position of the corresponding
suffix.
Suffix links are drawn dashed.
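The substring search that makes tree-based indexes attractive can be sketched with an uncompressed suffix trie. This is a simplification of the suffix tree in the figure: it has no edge compression and no suffix links, and it uses quadratic space. The names `build_suffix_trie` and `find`, and the 0-based start positions, are assumptions for illustration.

```python
def build_suffix_trie(text, end="$"):
    """Insert every suffix of text + end into a nested-dict trie; leaves store the start position."""
    text += end
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["start"] = start                 # multi-character key, cannot clash with a symbol
    return root


def find(root, pattern):
    """Return the sorted 0-based start positions of every occurrence of pattern."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    # Every leaf below this node is a suffix that begins with the pattern.
    starts, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "start":
                starts.append(child)
            else:
                stack.append(child)
    return sorted(starts)


if __name__ == "__main__":
    trie = build_suffix_trie("BANANA")
    print(find(trie, "ANA"))   # [1, 3]
    print(find(trie, "NA"))    # [2, 4]
```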
Source:[19]
Next Generation Sequence Alignment
• With high throughput sequencing, millions of reads
are obtained in a single run.
• “Read-mapping” problem:
How do the reads fit in the reference genome?
Find hits where these reads occur in the genome.
Report position(s) and frequency of hits.
A short read may map to many chromosomes in a
genome.
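The read-mapping problem can be stated as a naive exact-match baseline. The sketch below scans the reference once per read, which is far too slow for millions of reads against a real genome but makes the required output (positions and frequency of hits) concrete; `map_read` and `map_reads` are hypothetical names.

```python
def map_read(reference, read):
    """Return every 0-based position at which read occurs exactly in reference."""
    positions = []
    pos = reference.find(read)
    while pos != -1:
        positions.append(pos)
        pos = reference.find(read, pos + 1)   # continue after the previous hit, allowing overlaps
    return positions


def map_reads(reference, reads):
    """Map each read to its list of hit positions; the list length is the hit frequency."""
    return {read: map_read(reference, read) for read in reads}


if __name__ == "__main__":
    hits = map_reads("ACGTACGTTT", ["ACG", "TTT", "GGG"])
    for read, positions in hits.items():
        print(read, positions, len(positions))
```

A read such as ACG maps to more than one position, which is the ambiguity short reads introduce.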
Next Generation Sequence Alignment
Source:[25]
Next Generation Sequence Alignment
• Burrows-Wheeler Transform can be used to find
matches of a query string inside a reference string.
Steps:
1. Create a suffix array in which each element is a cyclic
permutation of the original string, terminated by the end character
“$”.
Example: String “googol”.
Original string: googol$
1st circular permutation => oogol$g
2nd circular permutation => ogol$go
…
until $ moves to the front of the string
Last circular permutation => $googol
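Step 1 can be reproduced with a one-line slice per rotation. The function name `rotations` is an assumption for illustration.

```python
def rotations(text, end="$"):
    """List every circular permutation of text + end, starting with the original string."""
    s = text + end
    return [s[i:] + s[:i] for i in range(len(s))]


if __name__ == "__main__":
    for i, rotation in enumerate(rotations("googol")):
        print(i, rotation)
    # 0 googol$
    # 1 oogol$g
    # 2 ogol$go
    # 3 gol$goo
    # 4 ol$goog
    # 5 l$googo
    # 6 $googol
```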
Source:[27]
Next Generation Sequence Alignment
Steps:
2. Sort the elements of the suffix array in lexicographic order.
$ is lexicographically the smallest character.
S(i) represents the index in the unsorted suffix array from step 1,
i.e. the start position in the original string.
i represents the index in the BW Array.
BW Array
Note: All occurrences of any substring
occur next to each other in the BW
Array. Such a range is called the
Suffix Array Interval (SA Interval).
For example, “go” occurs as a prefix
in positions 1 and 2.
SA Interval of “go” = [1,2]
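The sorted array and the SA Interval for “googol” can be checked with a short sketch. It sorts full rotations directly, which is fine for a seven-character example but not how large genomes are indexed; `bw_array` and `sa_interval` are hypothetical names, and the linear scan in `sa_interval` is only for illustration.

```python
def bw_array(text, end="$"):
    """Return (S, rows): S[i] is the start position of the i-th sorted rotation, rows[i] the rotation."""
    s = text + end
    # "$" sorts before the letters in ASCII, so it is the smallest character here.
    S = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    rows = [s[i:] + s[:i] for i in S]
    return S, rows


def sa_interval(rows, pattern):
    """Return the inclusive range of rows that start with pattern, or None if there is no match."""
    hits = [i for i, row in enumerate(rows) if row.startswith(pattern)]
    return (hits[0], hits[-1]) if hits else None


if __name__ == "__main__":
    S, rows = bw_array("googol")
    for i, row in enumerate(rows):
        print(i, S[i], row)
    print("".join(row[-1] for row in rows))   # last column of the BW Array: lo$oogg
    print(sa_interval(rows, "go"))            # (1, 2)
```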
Source:[27]
Next Generation Sequence Alignment
Steps:
SA Interval of “go” = [1,2]
The values of S(i) give the corresponding
positions in the original string.
Here the S(i) values are 3 and 0.
BW Array
X = googol$
This algorithm has many extensions for finding inexact and
gapped matches. More details are in reference [27].
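Exact matching on the transformed string is usually done with backward search, which narrows the SA Interval one query character at a time, from the last character to the first. The sketch below is an unoptimized illustration under stated assumptions: `bwt_and_sa` and `backward_search` are hypothetical names, and the occurrence count is recomputed by scanning the string, where practical tools precompute it.

```python
from collections import Counter


def bwt_and_sa(text, end="$"):
    """Return the Burrows-Wheeler transform of text + end and the sorted start positions S."""
    s = text + end
    sa = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    bwt = "".join(s[i - 1] for i in sa)       # character preceding each sorted rotation
    return bwt, sa


def backward_search(bwt, pattern):
    """Return the inclusive SA Interval of pattern, or None if it does not occur."""
    # C[c] = number of characters in the string that are lexicographically smaller than c.
    counts = Counter(bwt)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    lo, hi = 0, len(bwt)                      # half-open interval [lo, hi)
    for c in reversed(pattern):
        if c not in C:
            return None
        lo = C[c] + bwt[:lo].count(c)         # occurrences of c before row lo
        hi = C[c] + bwt[:hi].count(c)         # occurrences of c before row hi
        if lo >= hi:
            return None
    return lo, hi - 1


if __name__ == "__main__":
    bwt, sa = bwt_and_sa("googol")
    lo, hi = backward_search(bwt, "go")
    print((lo, hi))                                   # (1, 2)
    print(sorted(sa[i] for i in range(lo, hi + 1)))   # [0, 3]
```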
Source:[27]
Conclusion
• Next Generation Sequencing is transforming the
fields of genetics, molecular biology and
bioinformatics.
• Enormous amounts of data are produced by sequencing
projects.
• Computing and data analysis are lagging behind.
• More efficient data analysis and storage methods are
needed.
• Data mining can be used to find useful information fast,
without the need to store the entire data set.
Conclusion
• More efficient assembly and alignment techniques
are needed.
• Need for “metagenomic” analysis – find out which
organisms or species are present in a biological or
environmental sample.
References