New stuff

Dynamic programming
We want to align the following two sequences:
ABCDE
PQRST
If you already have the optimal solution for:
A…D
P…R
then you know the next pair of characters will be one of
these:
A…DE    A…D-    A…DE
P…RS    P…RS    P…R-
(E aligned with S, a gap aligned with S, or E aligned with a gap)
You can extend the match by determining which of these has
the highest score.
New best alignment = previous best + local best
[Figure: alignment matrix with Sequence A and Sequence B on its axes, showing the best previous alignment being extended.]
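The rule "new best alignment = previous best + local best" can be sketched as a table fill in the style of Needleman-Wunsch. This is a minimal illustration, not a reference implementation: the function name and the scoring values (match = 1, mismatch = -1, gap = -2) are assumptions chosen for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of a and b.

    The scoring values are illustrative assumptions, not taken from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First row and column: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # The three ways to extend a previous best alignment:
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # align a[i-1] with a gap
                score[i][j - 1] + gap,       # align b[j-1] with a gap
            )
    return score[-1][-1]


print(global_alignment_score("ABCDE", "ABDE"))
```

Each cell reuses the three neighbouring cells that were already solved, which is the sequential dependency between subproblems.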
Dynamic programming (DP)
• General class of algorithms typically applied to
optimization problems.
• Recursive approach.
• Original problem is broken into smaller subproblems
and then solved.
• Pieces of larger problem have a sequential
dependency.
• 4th piece can be solved using solution of the 3rd
piece, the 3rd piece can be solved by using solution of
the 2nd piece and so on…
DP algorithms
• Global alignment - Needleman-Wunsch
• Local alignment - Smith-Waterman
• Guaranteed to provide the optimal alignment.
• Disadvantages:
• Slow due to the very large number of computational steps: O(n²) for two sequences of length n.
• Computer memory requirements also increase with the square of
the sequence lengths.
• Therefore, it is difficult to use the method for very long sequences.
• Many alignments may give the same optimum score, and none of
these necessarily corresponds to the biologically correct alignment.
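For contrast with global alignment, a local (Smith-Waterman style) score can be sketched with the same table fill, except that a cell never drops below zero. The sketch also counts the cells it fills to make the quadratic cost visible. The function name and scoring values are assumptions for illustration only.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return (best local score, number of table cells filled).

    Scoring values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    cells = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # Zero lets a local alignment restart anywhere in the table.
            value = max(0, prev[j - 1] + pair, prev[j] + gap, curr[j - 1] + gap)
            curr.append(value)
            best = max(best, value)
            cells += 1
        prev = curr
    return best, cells


print(local_alignment_score("XXABCYY", "ZZABCWW"))
```

Keeping only two rows is enough to obtain the score. Recovering the alignment itself requires the whole table, which is where the quadratic memory requirement comes from.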
Homology vs. similarity again
• Just a reminder of the important concept in sequence
analysis – homology. It is a conclusion about a common
ancestral relationship drawn from sequence similarity.
• Sequence similarity is a direct result of observation from
the sequence alignment. It can be quantified using
percentages, but homology cannot!
• It is important to understand this difference between
homology and similarity.
• If the similarity is high enough, a common evolutionary
relationship can be inferred.
Limits of the alignment detection
• However, how high is high enough? How many mutations can
occur before the differences make two sequences
unrecognizable?
• Intuitively, at some point two homologous sequences
become so divergent that they do not align well.
Twilight zone
• The similarity level at which a homologous relationship can be
inferred depends on the type of sequence (protein or nucleic acid)
and on the length of the alignment.
• Unrelated DNA sequences are expected to be at least 25% identical
by chance. For proteins, the figure is 5%. If gaps are allowed, this
percentage can increase up to 10-20%.
• The shorter the sequence, the higher the chance that some
alignment can be attributed to random chance.
• This suggests that shorter sequences require a higher cutoff
for inferring homology than longer sequences.
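The percentages discussed here are computed from an existing alignment. A minimal sketch follows, assuming identity is defined as identical columns divided by the total number of alignment columns (other definitions exist, for example dividing by the length of the shorter sequence). The function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns holding the same residue in both rows.

    Assumption: the denominator is the full alignment length, gaps included.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100.0 * identical / len(aligned_a)


print(percent_identity("ABCDE", "ABQ-E"))  # 3 identical columns out of 5
```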
[Figure: twilight zone plot with a 30% label. Source: Essential Bioinformatics, Xiong]
Determining homology
• It must be stressed that the percentage identity values
only provide a tentative guidance for homology
identification.
• This is not a precise rule for determining sequence
relationships, especially for sequences in the twilight
zone.
• A statistically more rigorous approach to determining
homologous relationships exists. The statistical
significance of the alignment (i.e. its score) can be tested.
• However, I will not cover this advanced topic in this
lecture.
Database similarity searching
Sequence database searching
[Figure: a query sequence is compared by pairwise alignment with each target in a sequence database; closely related matches are returned.]
Database searching requirements
• sensitivity – the ability to find as many correct hits (TP)
as possible
• selectivity (specificity) – ability to exclude incorrect hits
(FP)
• speed
• ideally: high sensitivity, high specificity, high speed
• reality: increase in sensitivity leads to decrease in
specificity, improvement in speed often comes at the cost
of lowered sensitivity and selectivity
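Sensitivity and specificity can be expressed with the usual confusion-matrix counts. The sketch below assumes the standard definitions (TP and FN for sensitivity, TN and FP for specificity). The function names and the example counts are made up for illustration.

```python
def sensitivity(tp, fn):
    """Fraction of true homologs that the search reported: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-homologs that the search excluded: TN / (TN + FP)."""
    return tn / (tn + fp)


# Hypothetical search: 80 of 100 true homologs found,
# 10 of 100 unrelated sequences reported by mistake.
print(sensitivity(tp=80, fn=20), specificity(tn=90, fp=10))
```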
Types of algorithms
• exhaustive
• uses a rigorous algorithm to find the exact solution for a particular
problem by examining all mathematical combinations
• example: dynamic programming
• heuristic
• computational strategy to find an empirical or near optimal solution
by using rules of thumb
Heuristic algorithms
• Perform faster searches because they examine only a
fraction of the possible alignments examined in regular
dynamic programming
• currently, there are two major algorithms:
• FASTA
• BLAST - Basic Local Alignment Search Tool, Google of the
sequence world
• Not guaranteed to find the optimal alignment or true
homologs, but are 50–100 times faster than DP.
• The increased computational speed comes at a moderate
expense of sensitivity and specificity of the search, which
is easily tolerated by working molecular biologists.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol. 1990 Oct 5;215(3):403-10.
Two components of BLAST
• BLAST consists of two components:
• a search algorithm and
• the evaluation of the quality of solutions
BLAST – ALGORITHM
BLAST strategy
• Basic Local Alignment Search Tool
• Find short stretches (words) of identical or nearly
identical letters in two sequences.
• The basic assumption is that two related sequences must
have at least one word in common.
• By first identifying word matches, a longer alignment can
be obtained by extending similarity regions from the
words.
• Once regions of high sequence similarity are found,
adjacent high-scoring regions can be joined into a full
alignment.
How BLAST works – 1st step
Divide a query sequence into words of length W (W = 3 for
proteins)
LGQALWGQIWW
LGQ
GQA
QAL
ALW
LWG
WGQ
GQI
QIW
IWW
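The word list above can be reproduced with a sliding window of length W. This is a minimal sketch; the function name is hypothetical.

```python
def split_into_words(sequence, w=3):
    """Return every overlapping word of length w, in order of occurrence."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


print(split_into_words("LGQALWGQIWW"))
```

A sequence of length n yields n - W + 1 words, here 11 - 3 + 1 = 9.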
How BLAST works – 1st step
For each of these words, a list of similar words is created
using a substitution matrix (default: BLOSUM62).
LGQALWGQIWW
Query word LWG scored against itself: L = 4, W = 11, G = 6, total 21.
Word   Score
LWG    21
IWG    19
MWG    19
VWG    18
FWG    17
LYG    12
LFG    11
FWS    11
AWS    10
...    ...
Words scoring below the threshold T are discarded.
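The word scores on this slide can be checked with a few BLOSUM62 entries. The sketch below hard-codes only the entries needed for the query word LWG, so it is not a full matrix. The function names, the candidate list and the choice T = 11 are assumptions for illustration. A real search considers every possible word of length W, not a hand-picked list.

```python
# Subset of BLOSUM62, keyed as (query residue, candidate residue).
BLOSUM62_SUBSET = {
    ("L", "L"): 4, ("L", "I"): 2, ("L", "M"): 2, ("L", "V"): 1,
    ("L", "F"): 0, ("L", "A"): -1,
    ("W", "W"): 11, ("W", "Y"): 2, ("W", "F"): 1,
    ("G", "G"): 6, ("G", "S"): 0,
}


def word_score(query_word, candidate):
    """Sum the substitution scores position by position."""
    return sum(BLOSUM62_SUBSET[(q, c)] for q, c in zip(query_word, candidate))


def neighborhood(query_word, candidates, threshold):
    """Keep the candidates that score more than the threshold T."""
    scored = [(c, word_score(query_word, c)) for c in candidates]
    return [(c, s) for c, s in scored if s > threshold]


candidates = ["LWG", "IWG", "MWG", "VWG", "FWG", "LYG", "LFG", "FWS", "AWS"]
for word, score in neighborhood("LWG", candidates, threshold=11):
    print(word, score)
```

With T = 11 the words LWG, IWG, MWG, VWG, FWG and LYG survive, which is the list scanned against the database in the 2nd step.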
How BLAST works – 2nd step
Scan the database sequences for exact matches with the
high-scoring words.
LWG
IWG
MWG
VWG
FWG
LYG
How BLAST works – 3rd step
Extend the exact matches to a high-scoring segment pair
(HSP).
Matched word: LYG (query word LWG)
query sequence: LGQALWGQIWW
database sequence: WTDFGYITALYGRINC
Ungapped alignment around the match, with the score of each column:
query      L  G  Q  A  L  W  G  Q  I  W  W
score     -1 -4 -1  4  4  2  6  1  4 -4 -2
database   Y  I  T  A  L  Y  G  R  I  N  C
Score S as the segment is extended:
LWG / LYG              S = 12
ALWGQ / ALYGR          S = 17
QALWGQI / TALYGRI      S = 20
GQALWGQIW / ITALYGRIN  S = 12
The extension that lowers the score is discarded. HSP: QALWGQI / TALYGRI, S = 20.
Recent improvement (BLAST 2.0) enables the explicit treatment of gaps.
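The score sequence on these slides (S = 12, 17, 20, 12) can be reproduced with a small sketch. It mirrors the simplified illustration on the slides, where the segment grows by one position on each side per step. Actual BLAST extends each direction independently. The function name and the drop-off value X = 5 are assumptions. The per-position scores are the ones shown on the slide.

```python
def extend_seed(scores, seed_start, seed_end, x_drop):
    """Grow an ungapped segment symmetrically around a seed word.

    scores: substitution score of each aligned column
    seed_start, seed_end: inclusive 0-based column indices of the seed
    x_drop: stop once the score falls more than x_drop below the best seen
    Returns (best score, (start, end) of the best segment, score trace).
    """
    left, right = seed_start, seed_end
    running = sum(scores[left:right + 1])
    best, best_span = running, (left, right)
    trace = [running]
    while left > 0 and right < len(scores) - 1:
        left -= 1
        right += 1
        running += scores[left] + scores[right]
        trace.append(running)
        if running > best:
            best, best_span = running, (left, right)
        elif best - running > x_drop:
            break  # score dropped too far: give up extending
    return best, best_span, trace


# Column scores for LGQALWGQIWW aligned with YITALYGRINC (from the slide).
column_scores = [-1, -4, -1, 4, 4, 2, 6, 1, 4, -4, -2]
print(extend_seed(column_scores, seed_start=4, seed_end=6, x_drop=5))
```

The best segment spans columns 2 to 8, that is QALWGQI against TALYGRI.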
How BLAST works
• Under certain conditions, HSPs can be joined to extend
the alignment:
• overlapping HSPs
• HSPs that are not too distant from each other
Summary:
1. The query sequence is cut into words of length W. For each word, the list of similar words is created using a substitution matrix.
2. The database sequences are scanned for matches to the words in the list.
3. Each match is extended on both sides of the word into a high-scoring pair.
BLAST parameters
W : Word size – find W-mers shared by target and query
2-3 (3) for proteins, 6-11 (28) for NA
T : Neighborhood word score threshold – keep only word pairs
scoring more than T
usually 11-13
X : Drop-off – stop extending when the score has dropped by more
than X
S : Score – the final score of an HSP (this is not a
parameter, just a result)
BLAST variants
BLAST parameters
• Adjusting T and W controls both the speed and the sensitivity (TP) of BLAST.
• When T is raised, the speed of the search is increased,
but fewer hits are registered, and so distantly related
database matches may be missed.
• When T is lowered, the search proceeds more slowly, but
many more word hits are evaluated, and thus sensitivity is
increased.
• To speed up BLASTN, increase W (T is not used in
BLASTN; words are always identical).
• To speed up BLASTP, set W = 3 and T to a large value.
• W and T are better for controlling speed than X.
Which sequence to search?
• The choice of the type of sequences also influences the
sensitivity of the search.
• Clear advantage of using protein sequences in detecting
homologs
• If the input sequence is a protein-encoding DNA sequence, use
BLASTX (the query is translated in all six reading frames before the sequence comparison).
• If you’re looking for protein homologs encoded in newly
sequenced genomes, you may use TBLASTN. This may
help to identify protein coding genes that have not yet
been annotated.
• If a DNA sequence is to be used as the query, a protein-level comparison can be done with TBLASTX.
• TBLASTN, TBLASTX are very computationally intensive
and the search process can be very slow.
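The six reading frames used by the translated searches are three offsets on the given strand and three on its reverse complement. The sketch below uses the standard genetic code and expects an unambiguous A/C/G/T sequence. The function names and the frame labels (+1 to -3) are assumptions, and the code does not reproduce how any BLAST program handles translation internally.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return a dict mapping frame label (+1..+3, -1..-3) to protein string."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(six_frame_translation("ATGGCC"))
```

Comparing all six translations is part of why the translated searches cost more than a plain protein search.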
BLAST – quality assessment
E-value
• expected value
• The E-value estimates the expected number of records
in the database that will be returned with a score as good
as or better than the score of the record under scrutiny.
• An E value of 1 means that in a database of the current
size one might expect to see 1 match with a similar score
simply by chance.
• A value close to zero means that you would expect
practically no unrelated sequence to score as high against
your query sequence.
The interpretation of E-value
• The primary use of the E-value is to help to answer the
question ‘Is this alignment statistically meaningful?’. It does not
tell whether the alignment has biological meaning!
• What is the highest E-value that I should consider as
significant?
• No definite answer, depends on your goals and sequences.
• Generally, the lower the better. Commonly used value: 1E-6
• But, in some cases, this may be too restrictive.
Bit score
• A typical BLAST output reports E values and scores.
• There are two kinds of scores: raw and bit scores.
• Raw scores S are calculated from the substitution matrix and the
gap penalty parameters.
• The bit score S’ is calculated from the raw score by normalizing
with the statistical variables that define a given scoring system.
• Bit scores from different alignments, even those
employing different scoring matrices in separate BLAST
searches, can be compared.
• E-values cannot be compared when searching in different
databases. The bit scores, however, will remain the same.
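The relation between raw score, bit score and E-value can be sketched with the Karlin-Altschul formulas S' = (λS − ln K) / ln 2 and E = m · n · 2^(−S'), where m is the query length and n the database length. The function names are hypothetical, and the values of λ, K and the lengths in the example are illustrative assumptions, not parameters read from any BLAST output.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalise a raw score S with the scoring system's lambda and K."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_length, database_length):
    """Expected number of chance hits with at least this bit score."""
    return query_length * database_length * 2 ** (-bits)


# Illustrative, assumed statistical parameters and sizes.
bits = bit_score(raw_score=80, lam=0.318, k=0.134)
small_db = e_value(bits, query_length=100, database_length=1_000_000)
large_db = e_value(bits, query_length=100, database_length=2_000_000)
print(bits, small_db, large_db)
```

The bit score is the same in both searches, while the E-value doubles when the database doubles in size.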