CS 5263 Bioinformatics Lecture 7: Heuristic Sequence Alignment Tools (BLAST) Multiple Sequence Alignment

advertisement
CS 5263 Bioinformatics
Lecture 7: Heuristic Sequence
Alignment Tools (BLAST)
Multiple Sequence Alignment
Roadmap
• Last lecture review
– Sequence alignment statistics
• Today
– Heuristic alignment algorithms
• Basic Local Alignment Search Tools
– Multiple sequence alignment algorithms
Sequence Alignment Statistics
• Substitution matrices
– How is the BLOSUM-N matrix made
– How to make your own substitution matrix
– What’s the meaning of an arbitrary substitution matrix
• Significance of sequence alignments
– P-value estimation for
• Global alignment scores
• Local alignment scores
– Critical score for local alignment to guarantee
significance
Heuristic Local Aligners
-- BLAST and alike
State of biological databases
Sequenced Genomes:
Human
Mouse
Neurospora
Tetraodon
Drosophila
Rice
sea squirts
3109
2.7109
4107
3108
1.2108
1.0109
1.6108
Current rate of sequencing:
4 big labs  3 109 bp /year/lab
10s small labs
Private sectors
Yeast
Rat
Fugu fish
Mosquito
Worm
Arabidopsis
1.2107
2.6109
3.3108
2.8108
1.0108
1.2108
State of biological databases
Number of genes in these genomes:
Vertebrate: ~30,000
Insects:
~14,000
Worm:
~17,000
Fungi:
~6,000-10,000
Small organisms: 100s-1,000s
Each known or predicted gene has an associated protein
sequence
>1,000,000 known / predicted protein sequences
Some useful applications of
alignments
Given a newly discovered gene,
- Does it occur in other species?
Assume we try Smith-Waterman:
Our
new
gene
104
The entire genomic database
1010 - 1011
May take several weeks!
Some useful applications of
alignments
Given a newly sequenced organism,
- Which subregions align with other organisms?
- Potential genes
- Other functional units
Assume we try Smith-Waterman:
Our newly
sequenced
mammal
3109
The entire genomic database
1010 - 1011
> 1000 years ???
BLAST
• Basic Local Alignment Search Tool
– Altschul, Gish, Miller, Myers, Lipman, J Mol Biol 1990
– The most widely used bioinformatics tool
– One of the most cited papers in the history of science
• Which is better: long mediocre match or a few nearby,
short, strong matches with the same total score?
– Score-wise, exactly equivalent
– Biologically, later may be more interesting, & is common
– At least, if must miss some, rather miss the former
• BLAST is a heuristic algorithm emphasizing the later
– speed/sensitivity tradeoff: BLAST may miss former, but gains
greatly in speed
BLAST
• Available at NCBI (National Center for
Biotechnology Information) for download and
online use. http://blast.ncbi.nlm.nih.gov/
• Along with many sequence databases
query
Main idea:
1.Construct a dictionary of all the words in
the query
2.Initiate a local alignment for each word
match between query and DB
Running Time: O(MN)
However, orders of magnitude faster
than Smith-Waterman
DB
BLAST  Original Version
……
Dictionary:
All words of length k (~11 for DNA, 3
for proteins)
Alignment initiated between words of
alignment score  T (typically T = k)
query
……
Alignment:
Ungapped extensions until score
below statistical threshold
scan
DB
Output:
All local alignments with score
> statistical threshold
query
BLAST  Original Version
A C G A A G T A A G G T C C A G T
k = 4, T = 4
The matching word GGTC
initiates an alignment
Extension to the left and
right with no gaps until
alignment falls < 50%
Output:
GTAAGGTCC
GTTAGGTCC
C C C T T C C T G G A T T G C G A
Example:
Gapped BLAST
Added features:
•
•
Pairs of words can
initiate alignment
Extensions with
gaps in a band
around anchor
Output:
GTAAGGTCCAGT
GTTAGGTC-AGT
C T G A T C C T G G A T T G C G A
A C G A A G T A A G G T C C A G T
Example
Query: gattacaccccgattacaccccgattaca (29 letters)
[2 mins]
Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS
sequences) 1,726,556 sequences; 8,074,398,388 total letters
>gi|28570323|gb|AC108906.9| Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence,
complete sequence Length = 144487 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21
(95%) Strand = Plus / Plus
Query:
Sbjct:
4
tacaccccgattacaccccga 24
||||||| |||||||||||||
125138 tacacccagattacaccccga 125158
Score = 34.2 bits (17),
Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus
Query:
Sbjct:
4 tacaccccgattacaccccga 24
||||||| |||||||||||||
125104 tacacccagattacaccccga 125124
>gi|28173089|gb|AC104321.7|
Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence,
complete sequence Length = 139823 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21
(95%) Strand = Plus / Plus
Query:
Sbjct:
4 tacaccccgattacaccccga 24
||||||| |||||||||||||
3891 tacacccagattacaccccga 3911
Example
Query: Human atoh enhancer, 179 letters
[1.5 min]
Result: 57 blast hits
1.
2.
3.
4.
5.
6.
7.
8.
gi|7677270|gb|AF218259.1|AF218259 Homo sapiens ATOH1 enhanc...
gi|22779500|gb|AC091158.11| Mus musculus Strain C57BL6/J ch...
gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhanc...
gi|28875397|gb|AF467292.1| Gallus gallus CATH1 (CATH1) gene...
gi|27550980|emb|AL807792.6| Zebrafish DNA sequence from clo...
gi|22002129|gb|AC092389.4| Oryza sativa chromosome 10 BAC O...
gi|22094122|ref|NM_013676.1| Mus musculus suppressor of Ty ...
gi|13938031|gb|BC007132.1| Mus musculus, Similar to suppres...
355 1e-95
264 4e-68
256 9e-66
78 5e-12
54 7e-05
44 0.068
42 0.27
42 0.27
gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhancer sequence Length = 1517
Score = 256 bits (129), Expect = 9e-66 Identities = 167/177 (94%),
Gaps = 2/177 (1%) Strand = Plus / Plus
Query:
3 tgacaatagagggtctggcagaggctcctggccgcggtgcggagcgtctggagcggagca 62
||||||||||||| ||||||||||||||||||| ||||||||||||||||||||||||||
Sbjct: 1144 tgacaatagaggggctggcagaggctcctggccccggtgcggagcgtctggagcggagca 1203
Query:
63 cgcgctgtcagctggtgagcgcactctcctttcaggcagctccccggggagctgtgcggc 122
|||||||||||||||||||||||||| ||||||||| |||||||||||||||| |||||
Sbjct: 1204 cgcgctgtcagctggtgagcgcactc-gctttcaggccgctccccggggagctgagcggc 1262
Query:
123 cacatttaacaccatcatcacccctccccggcctcctcaacctcggcctcctcctcg 179
||||||||||||| || ||| |||||||||||||||||||| |||||||||||||||
Sbjct: 1263 cacatttaacaccgtcgtca-ccctccccggcctcctcaacatcggcctcctcctcg 1318
BLAST Score: bit score vs raw score
• Bit score is converted from raw score by
taking into account K and :
S’ = ( S – log K) / log 2
• To compute E-value from bit score:
E = KM’N’ e-S = M’N’ 2-S’
• Critical score is now:
S* = log2(M’N’)
If S’ >> S*: significant
If S’ << S*: not significant
(M’ ~ M, N’ ~ N)
Different types of BLAST
• blastn: search nucleic acid databases
• blastp: search protein databases
• blastx: you give a nucleic acid sequence,
search protein databases
• tblastn: you give a protein sequence,
search nucleic acid databases
• tblastx: you give a nucleic sequence,
search nucleic acid database, implicitly
translate both into protein sequences
BLAST cons and pros
• Advantages
– Fast!!!!
– A few minutes to search a database of 1011 bases
• Disadvantages
– Sensitivity may be low
– Often misses weak homologies
• New improvement
– Make it even faster
• Mainly for aligning very similar sequences or really long sequences
– E.g. whole genome vs whole genome
– Make it more sensitive
• PSI-BLAST: iteratively add more homologous sequences
• PatternHunter: discontinuous seeds
Variants of BLAST
NCBI-BLAST: most widely used version
WU-BLAST: (Washington University BLAST): another popular version
Optimized, added features
MEGABLAST:
Optimized to align very similar sequences. Linear gap penalty
BLAT: Blast-Like Alignment Tool
BlastZ:
Optimized for aligning two genomes
PSI-BLAST:
BLAST produces many hits
Those are aligned, and a pattern is extracted
Pattern is used for next search; above steps iterated
Sensitive for weak homologies
Slower
Pattern hunter
• Instead of exact matches of consecutive
matches of k-mer, we can
• look for discontinuous matches
– My query sequence looks like:
• ACGTAGACTAGCAGTTAAG
– Search for sequences in database that match
• AXGXAGXCTAXC
• X stands for don’t care
• Seed: 101011011101
Pattern hunter
• A good seed may give you both a higher
sensitivity and higher specificity
• You may think 110110110110 is the best seed
– Because mutation in the third position of a codon
often doesn’t change the amino acid
– Best seed is actually 110100110010101111
• Empirically determined
• How to design such seed is an open problem
• May combine multiple random seeds
Things we’ve covered so far
• Global alignment
– Needleman-Wunsch and variants
• Local Alignment
– Smith-Waterman
• Improvement on space and time
• More accurate gap penalty models
• Heuristic algorithms
– BLAST families
• Statistics for sequence alignment
• Commonality: They all deal with aligning two
sequences
– Pair-wise sequence alignment
• Next: Aligning multiple sequences all together
– Multiple sequence alignment
• Motivation:
– A faint similarity between two sequences becomes
very significant if present in many sequences
• Protein domains
• Motifs responsible for gene regulation
Definition
• Given N sequences x1, x2,…, xN:
– Insert gaps (-) in each sequence xi, such that
• All sequences have the same length L
• Score of the global mapping is maximum
• Pairwise alignment: a hypothesis on the
evolutionary relationship between the letters of
two sequences
• Same for a multiple alignment!
Scoring Function
• Ideally:
– Find alignment that maximizes probability that
sequences evolved from common ancestor
x
?
y
z
w
v
Phylogenetic tree
or evolution tree
Scoring Function (cont’d)
• Unfortunately: too many parameters
• Compromises:
– Ignore phylogenetic tree
• Compute from pair-wise scores
– Based on sum of all pair-wise scores
– Based on scores with a consensus sequence
First assumption
• Columns are independent
– Similar in pair-wise alignment
• Therefore, the score of an alignment is the sum
of all columns
• Need to decide how to score a single column
Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple
alignment
Example:
Induces: -
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
x: ACGCGG-C;
y: ACGC-GAC;
-
x: AC-GCGG-C;
z: GCCGC-GAG;
-
y: AC-GCGAG
z: GCCGCGAG
-
Sum Of Pairs (cont’d)
• The sum-of-pairs score of an alignment is
the sum of the scores of all induced
pairwise alignments
S(m) = k<l s(mk, ml)
s(mk, ml): score of induced alignment (k,l)
Example:
x:
y:
z:
AC-GCGG-C
AC-GC-GAG
GCCGC-GAG
A C G T A
1 -1 -1 -1 -1
C -1
(A,A) +
(A,G) x 2
= -1
(C,C) x 3 (-,A) x 2 +
=3
(A,A) = -1
Total score = (-1) + 3 + (-2) + 3 + 3
+ (-2) + 3 + (-1) + (-1) = 5
1 -1 -1 -1
G -1 -1
1 -1 -1
T -1 -1 -1
1 -1
- -1 -1 -1 -1
0
Sum Of Pairs (cont’d)
• Drawback: no evolutionary characterization
– Every sequence derived from all others
• Heuristic way to incorporate evolution tree
– Weighted Sum of Pairs:
S(m) = k<l wkl s(mk, ml)
wkl: weight decreasing with distance
Human
Mouse
Duck
Chicken
Consensus score
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC
Consensus sequence:
CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
• Find optimal consensus string m* to maximize
S(m) = i s(m*, mi)
s(mk, ml): score of pairwise alignment (k,l)
Multiple Sequence Alignments
Algorithms
Multidimensional Dynamic
Programming (MDP)
Generalization of Needleman-Wunsh:
• Find the longest path in a high-dimensional cube
– As opposed to a two-dimensional grid
• Uses a N-dimensional matrix
– As apposed to a two-dimensional array
• Entry F(i1, …, ik) represents score of optimal
alignment for s1[1..i1], … sk[1..ik]
F(i1,i2,…,iN) = max(all neighbors of a cell) (F(nbr)+S(current))
Multidimensional Dynamic
Programming (MDP)
• Example: in 3D (three sequences):
(i-1,j-1,k-1)
• 23 – 1 = 7 neighbors/cell
F(i-1,j-1,k-1) + S(xi, xj, xk),
F(i-1,j-1,k ) + S(xi, xj, -),
F(i-1,j ,k-1) + S(xi, -, xk),
F(i,j,k) = max F(i ,j-1,k-1) + S(-, xj, xk),
F(i-1,j ,k ) + S(xi, -, -),
F(i ,j-1,k ) + S(-, xj, -),
F(i ,j ,k-1) + S(-, -, xk)
(i-1,j-1,k)
(i,j-1,k-1)
(i,j-1,k)
(i-1,j,k-1)
(i-1,j,k)
(i,j,k-1)
(i,j,k)
Multidimensional Dynamic
Programming (MDP)
Running Time:
1. Size of matrix: LN;
Where L = length of each sequence
N = number of sequences
2. Neighbors/cell: 2N – 1
Therefore………………………… O(2N LN)
Faster MDP
• Carrillo & Lipman, 1988
– Branch and bound
– Other heuristics
• Practical for about 6 sequences of length
about 200-300.
Progressive Alignment
•
•
Multiple Alignment is NP-hard
Most used heuristic: Progressive Alignment
Algorithm:
1.
2.
3.
4.
Align two of the sequences xi, xj
Fix that alignment
Align a third sequence xk to the alignment xi,xj
Repeat until all sequences are aligned
Running Time: O(NL2)
Each alignment takes O(L2)
Repeat N times
Progressive Alignment
x
y
z
w
• When evolutionary tree is known:
– Align closest first, in the order of the tree
Example:
Order of alignments:
1. (x,y)
2. (z,w)
3. (xy, zw)
Progressive Alignment: CLUSTALW
CLUSTALW: most popular multiple protein alignment
Algorithm:
1. Find all dij: alignment dist (xi, xj)
•
High alignment score => short distance
2. Construct a tree
(Neighbor-joining hierarchical clustering. Will discuss in future)
3. Align nodes in order of decreasing similarity
+ a large number of heuristics
CLUSTALW example
•
•
•
•
S1 ALSK
S2 TNSD
S3 NASK
S4 NTSD
CLUSTALW example
•
•
•
•
S1 ALSK
S2 TNSD
S3 NASK
S4 NTSD
s1 s2 s3 s4
s1 0
9
4
7
s2
0
8
3
0
7
s3
s4
0
Distance matrix
CLUSTALW example
•
•
•
•
S1 ALSK
S2 TNSD
S3 NASK
S4 NTSD
s1 s2 s3 s4
s1 0
9
4
7
s2
0
8
3
0
7
s3
s4
0
s1
s3
s2
s4
CLUSTALW example
•
•
•
•
S1 ALSK
S2 TNSD
S3 NASK
S4 NTSD
-ALSK
NA-SK
s1 s2 s3 s4
s1 0
9
4
7
s2
0
8
3
0
7
s3
s4
0
s1
s3
s2
s4
CLUSTALW example
•
•
•
•
S1 ALSK
S2 TNSD
S3 NASK
S4 NTSD
-ALSK
NA-SK
-TNSD
NT-SD
s1 s2 s3 s4
s1 0
9
4
7
s2
0
8
3
0
7
s3
s4
0
s1
s3
s2
s4
CLUSTALW example
•
•
•
•
S1 ALSK
S2 TNSD
S3 NASK
S4 NTSD
-ALSK
-TNSD
NA-SK
NT-SD
s1 s2 s3 s4
s1 0
9
4
7
s2
0
8
3
0
7
s3
s4
0
-ALSK
NA-SK
-TNSD
NT-SD
s1
s3
s2
s4
Iterative Refinement
Problems with progressive alignment:
• Depend on pair-wise alignments
• If sequences are very distantly related, much higher likelihood of
errors
• Initial alignments are “frozen” even when new evidence comes
Example:
x:
y:
GAAGTT
GAC-TT
Frozen!
z:
w:
GAACTG
GTACTG
Now clear: correct y should be GA-CTT
Iterative Refinement
Algorithm (Barton-Stenberg):
1.
2.
3.
4.
Align most similar xi, xj
Align xk most similar to (xixj)
Repeat 2 until (x1…xN) are aligned
For j = 1 to N,
Remove xj, and realign to x1…xj-1xj+1…xN
5. Repeat 4 until convergence
Progressive
alignment
Iterative Refinement (cont’d)
For each sequence y
1. Remove y
2. Realign y
z
x
y
(while rest fixed)
allow y to vary
x,z fixed projection
Note: Guaranteed to converge (why?)
Running time: O(kNL2), k: number of iterations
Iterative Refinement
Example: align (x,y), (z,w), (xy, zw):
x:
y:
z:
w:
GAAGTTA
GAC-TTA
GAACTGA
GTACTGA
After realigning y:
x:
y:
z:
w:
GAAGTTA
G-ACTTA
GAACTGA
GTACTGA
+ 3 matches
Iterative Refinement
• Example not handled well:
x:
y1:
y2:
y3:
GAAGTTA
GAC-TTA
GAC-TTA
GAC-TTA
z:
w:
GAACTGA
GTACTGA
Realigning any single yi
changes nothing
Restricted MDP
•
Similar to bounded DP in pair-wise alignment
1. Construct progressive multiple alignment m
2. Run MDP, restricted to radius R from m
z
y
Running Time: O(2N RN-1 L)
x
Restricted MDP
x:
y1:
y2:
y3:
GAAGTTA
GAC-TTA
GAC-TTA
GAC-TTA
z:
w:
GAACTGA
GTACTGA
• Within radius 1 of the optimal
 Restricted MDP will fix it.
Other approaches
• Profile Hidden Markov Models
– Statistical learning methods
– Will discuss in future
Multiple alignment tools
• Clustal W (Thompson, 1994)
– Most popular
•
•
•
•
•
•
•
PRRP (Gotoh, 1993)
HMMT (Eddy, 1995)
DIALIGN (Morgenstern, 1998)
T-Coffee (Notredame, 2000)
MUSCLE (Edgar, 2004)
Align-m (Walle, 2004)
PROBCONS (Do, 2004)
In summary
• Multiple alignment algorithms:
– MDP (too slow)
• Branch & Bound doesn’t solve the problem entirely
– Progressive alignment: clustalW
– Iterative refinement
– Restricted MDP
Download