Pairwise Alignment - Center for Biological Sequence Analysis

Center for Biological Sequence Analysis
Database Searching
Using alignment algorithms for
finding similar sequences
Why do we want to compare
sequences?
Evolutionary relationships
• Phylogenetic trees can be constructed based on
comparison of the sequences of a molecule (example:
16S rRNA) taken from different species
• Residues conserved during evolution play an important
role
Prediction of protein structure and function
• Proteins which are very similar in sequence generally
have similar 3D structure and function as well
• By searching a sequence of unknown structure against
a database of known proteins the structure and/or
function can in many cases be predicted
Things to keep in mind when
working with alignments
Pairwise alignment programs always find the
optimal alignment of two sequences
• They do so even if it does not make any sense at all to
align the two sequences
• "Optimal" means optimal according to the substitution
matrix and gap penalties you choose, even if you
choose the wrong ones
Generally the underlying assumptions are wrong
• The frequency of substitution is not the same at all
positions
• Nor are the frequencies of insertions and deletions the
same
• Affine gap penalties do not properly model indel events
Using sequence alignment to
search databases
The most common usage of pairwise sequence
alignment is searching databases for related
sequences
Although the alignments themselves may be
unreliable, the alignment scores give a lot of
information about which sequences are related
and which are not
Having a set of related sequences is a lot more
informative than just one sequence - even if
nothing is known about the related sequences
Requirements in addition to an
alignment method
A very fast method to find potentially related
sequences
• Systematically searching through the databases with
the alignment methods takes too long, even though
dynamic programming is fast
• Some method to initially identify possible matches is
therefore needed to speed up the search
A method to evaluate which matches to trust
• Statistics on the alignment score distributions can be
used to calculate the significance of an alignment
• This way we can not only rank which matches are
better than others but also tell if any of them are good
at all
Local or global alignment
Generally local alignment is used for performing
database searches
• In most cases you are interested in knowing whether
any part of your sequence looks like something else
• The protein sequence databases have not been split
into domains
It is not always the optimal thing to do but …
• In the case where the complete sequence should match
the local alignment score will be almost identical to the
global one
• If you really want a global alignment you can make it
afterwards
Differences between global and
local alignments
Extra constraint on scoring function: The
expected score for a random alignment must be
negative
Because a new alignment can be started
anywhere, dynamic programming scores cannot
become negative
The trace-back is started at the highest value in
the matrix rather than at the lower right corner
The trace-back is stopped as soon as a zero is
encountered
The Smith-Waterman algorithm
(local alignment)
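The rules for local alignment (scores clamped at zero, trace-back from the highest cell, stop at the first zero) can be sketched in a few lines of Python. This is only a minimal illustration: the match, mismatch and gap values are assumptions chosen for the example, and a linear gap penalty is used for brevity, whereas real searches use a substitution matrix and usually affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one optimal local alignment.

    Scoring values are illustrative assumptions, not recommended settings.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below zero because a new
    # alignment may start anywhere.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # (mis)match
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("XXXACGTYYY", "ZZACGTWW"))
```

With these assumed scores, only the shared ACGT stretch is reported; the unrelated flanking residues are left out of the alignment, which is exactly what distinguishes a local from a global alignment.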
Alignment score distributions
The local similarity scores for ungapped
alignment of random sequences can be shown to
follow an extreme value distribution:
P(Sx) = 1-exp(-Kmne-x),
where m and n are the sequence lengths while K
and  are free parameters
This turns out to be a very good approximation
for gapped alignment as long as reasonably large
gap penalties are used
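As a rough illustration, the extreme value formula P(S >= x) = 1 - exp(-K m n exp(-lambda x)) can be evaluated directly. The function name and the values K = 0.1 and lam = 0.3 below are assumptions made for this example only; in practice K and lambda are fitted for a given substitution matrix and gap penalty combination.

```python
import math


def evd_pvalue(score, m, n, K, lam):
    """P(S >= score) under the extreme value distribution.

    m, n : lengths of the two sequences
    K, lam : distribution parameters (illustrative values are used below)
    """
    expected_hits = K * m * n * math.exp(-lam * score)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-expected_hits)


# The probability of reaching a score by chance falls off rapidly as the score grows.
for score in (30, 50, 70):
    print(score, evd_pvalue(score, m=300, n=300, K=0.1, lam=0.3))
```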
Database searching
Positive reporting: When searching in a database
we report only the few good matches
The expected number of database hits with a
score of at least x can be calculated as:
E(Sx) = DP(Sx),
where D is the number of entries in the database
E-values are much better for evaluating
alignments than raw alignment scores or
"percent identity"
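A small sketch of the E-value calculation, E = D * P(S >= x), is given here to show how the same alignment score becomes less significant as the database grows. The function name and the parameter values (K = 0.1, lam = 0.3, sequence lengths of 300) are assumptions chosen only for illustration.

```python
import math


def e_value(score, m, n, K, lam, db_entries):
    """Expected number of database hits scoring at least `score`.

    Computes E = D * P(S >= score), with P taken from the extreme value
    distribution. K and lam are illustrative, not fitted, values.
    """
    p = -math.expm1(-K * m * n * math.exp(-lam * score))
    return db_entries * p


# Same score, same sequences, increasingly large databases.
for D in (1_000, 100_000, 10_000_000):
    print(D, e_value(60, 300, 300, K=0.1, lam=0.3, db_entries=D))
```

Because E scales linearly with D, a hit that looks convincing in a small database can end up with an E-value above 1 in a very large one.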
A curse or a blessing?
Large databases are a blessing …
• They are more likely to contain something similar to
the query
… and a curse
• Increasing the size of the database decreases the
significance of the hits you get
• Searching huge databases requires fast computers
What requirements this puts on software
development
• The programs must be sped up or database
searches will take longer and longer
• The false positive rate must be reduced to not lose
specificity
Heuristic search algorithms
FASTA (Pearson 1995)
Uses heuristics to avoid
calculating the full
dynamic programming
matrix
Speeds up searches by an
order of magnitude
compared to full Smith-Waterman
The statistical side of
FASTA is still stronger
than that of BLAST
BLAST (Altschul 1990,
1997)
Uses rapid word lookup
methods to completely
skip most of the database
entries
Extremely fast
• One order of magnitude
faster than FASTA
• Two orders of magnitude
faster than Smith-Waterman
Coffee break
Top 10 ways to tell you drink too much
coffee
10 Juan Valdez names his
donkey after you
9 You get a speeding ticket even
when you're parked
8 You grind your coffee beans in
your mouth
7 You sleep with your eyes open
6 You watch videos in fast-forward
5 You lick your coffeepot clean
4 Your eyes stay open when you
sneeze
3 The nurse needs a scientific
calculator to take your pulse
2 You can type sixty words a
minute with your feet
1 You can jump-start your car
without jumper cables.
How BLAST works
The search is sped up by indexing the
sequence databases in a so-called suffix array
• Three letter subsequences are used as keys to the
sequences
• Closely related substitutions are also included
• This gives ~150 index keys for each sequence
This is used in two ways
• To quickly discard sequences that are not similar at all
before even beginning to align them
• To constrain the alignment and thereby speed up the
alignment procedure itself
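The indexing idea can be sketched as follows. This is a simplified illustration, not BLAST's actual data structure: a plain dictionary is used instead of a suffix array, and the similarity groups in GROUPS are a toy assumption standing in for a score threshold on a real substitution matrix. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Toy groups of "closely related" amino acids (an assumption for this sketch).
GROUPS = ["ILVM", "KR", "DE", "ST", "FYW", "NQ"]
SIMILAR = {aa: group for group in GROUPS for aa in group}


def neighbours(word):
    """The word itself plus all variants with similar residues substituted."""
    choices = [SIMILAR.get(c, c) for c in word]
    return {"".join(p) for p in product(*choices)}


def build_index(database, k=3):
    """Map every k-letter word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def candidate_hits(query, index, k=3):
    """Return {sequence name: [(query position, database position), ...]}."""
    hits = defaultdict(list)
    for qpos in range(len(query) - k + 1):
        for word in neighbours(query[qpos:qpos + k]):
            for name, dpos in index.get(word, ()):
                hits[name].append((qpos, dpos))
    return hits


database = {"globin": "VLSPADKTNVKAAW", "other": "GGGGPPPPGGGG"}
hits = candidate_hits("MLSPAEK", build_index(database))
print(dict(hits))
```

Sequences that never appear in `hits` can be discarded without any alignment, and the word hits for the remaining sequences all lie on one diagonal (equal query and database offsets), which is the kind of constraint that can be used to restrict the subsequent alignment.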
Variations on a theme
BLASTN
• Nucleotide query
sequence
• Nucleotide database
TBLASTN
• Protein query sequence
• Nucleotide database
• "On the fly" six frame
translation of database
BLASTP
• Protein query sequence
• Protein database
BLASTX
• Nucleotide query
sequence
• Protein database
• Compares all six reading
frames with the database
TBLASTX
• Nucleotide query
sequence
• Nucleotide database
• Compares all reading
frames of query with all
reading frames of the
database
BLAST at NCBI
http://www.ncbi.nlm.nih.gov/BLAST/
Very fast computer
dedicated to running
BLAST searches
Many databases that
are always up to
date
Nice simple web
interface
But you still need
knowledge about
BLAST to use it
Performing a simple BLAST
search
We will now do a small exercise together
The purpose of the exercise is simply to
perform a simple BLAST search "hands on"
Open a web browser on the page
http://www.cbs.dtu.dk/dtucourse/cookbooks/ljj/exercise1.html
The most common and
effective way to ruin your
database search
What you should never ever do: take the
nucleotide sequence of a gene and compare it
with a database at the nucleotide level
• Unfortunately this is a very intuitive thing to do
• On the NCBI BLAST homepage nucleotide search
methods are listed before protein search – making it
even more intuitive
What you should do instead
• Extract the coding part of the DNA sequence, translate
it, and search with the resulting protein sequence
• Use a search method (such as BLASTX or TBLASTX)
which compares the sequences at the protein level
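Comparing at the protein level means translating the nucleotide sequence in all six reading frames, which is what BLASTX and TBLASTX do internally. The sketch below shows the idea using the standard genetic code; the function names are hypothetical, and real programs additionally handle ambiguity codes and alternative genetic codes.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the six translations keyed by strand and frame, e.g. '+1' or '-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frames("ATGGCCAAA").items():
    print(frame, protein)
```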
The limits of sequence
similarity
Expectation values in BLAST
BLAST uses precomputed extreme value
distributions to calculate E-values from
alignment scores
• For this reason BLAST only allows certain
combinations of substitution matrices and gap
penalties
• This also means that the fit is based on a different data
set than the one you are working on
A word of caution: BLAST tends to overestimate
the significance of its matches
• E-values from BLAST are fine for identifying sure hits
• One should be careful using BLAST’s E-values to judge
if a marginal hit can be trusted
Evaluating BLAST results
We will now do a second exercise together
The main point of this exercise is careful
interpretation of the BLAST output
Open a web browser on the page
http://www.cbs.dtu.dk/dtucourse/cookbooks/ljj/exercise2.html
Pairwise alignment of
hemoglobin alpha chain and
myoglobin
24.7% identity; Global alignment score: 130

HBA_HU VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKG---
MYG_PH VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED

HBA_HU ---HGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNF-KLLSHCLLVTLAAHL
MYG_PH LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKI-PIKYLEFISEAIIHVLHSRH

HBA_HU PAEFTPAVHASLDKFLASVSTVLTSKYR------
MYG_PH PGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
A multiple sequence alignment
of globins
HBB_HUMAN   --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
HBB_HORSE   --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
HBA_HUMAN   ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
HBA_HORSE   ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
MYG_PHYCA   ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
GLB5_PETMA  PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
LGB2_LUPLU  --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

HBB_HUMAN   PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
HBB_HORSE   PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
HBA_HUMAN   ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
HBA_HORSE   ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
MYG_PHYCA   EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
GLB5_PETMA  ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
LGB2_LUPLU  VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV
Why multiple alignment is
better
More sequences contain more information
Multiple sequence alignment allows us to
compare all related proteins simultaneously
It allows us to identify features that are
conserved among the sequences
Using a multiple sequence alignment (a profile)
one can find more related sequences than by
simple pairwise comparison
Coffee break
Coffee break quiz:
Why is the lasT gene in
E. coli called lasT?
Did some researchers
fail to get the joke?
An iterative scheme for using
profiles in database searches
Search your sequence against a large database
using a pairwise alignment method (often
BLAST) to obtain a set of closely related
sequences
Make a multiple sequence alignment (using
ClustalW) and estimate a profile
Search the profile against the database in an
attempt to find more distantly related sequences
Include these in the profile and redo the profile
search
If only life was so simple …
In the databases one may find large clusters of
almost identical sequences
• These will heavily bias the profile towards "their
sequence"
• To avoid this a sequence weighting scheme must be
used during construction of the profile
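One possible weighting scheme is sketched below. The slides do not say which scheme is meant; the position-based weights of Henikoff and Henikoff are used here purely as an assumed example. In each column, a residue type shared by many sequences contributes little weight to each of them, so near-identical sequences end up sharing their influence on the profile.

```python
from collections import Counter


def position_based_weights(alignment):
    """Normalised position-based sequence weights for equal-length aligned sequences.

    Each column adds 1 / (k * c) to a sequence's weight, where k is the number
    of distinct symbols in the column and c is how many sequences share this
    sequence's symbol. Gap characters are treated like any other symbol here,
    which is a simplification.
    """
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        for idx, residue in enumerate(column):
            weights[idx] += 1.0 / (len(counts) * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]


# Three identical sequences and one divergent sequence.
weights = position_based_weights(["ACD", "ACD", "ACD", "AEF"])
print(weights)
```

In this toy alignment the three identical sequences each receive a weight of 7/36, while the single divergent sequence receives 5/12, so the cluster no longer dominates the profile simply by being large.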
How should one estimate the frequencies of rare
mutations that have not been observed?
• A more general problem: What to do when you have too
few observations to make a reliable estimate of a
frequency
• The solution is called regularization, which involves
using prior knowledge on mutations (such as
substitution matrices)
Regularization by pseudo
counts
In addition to the real counts actually observed
in the sequences some extra pseudo counts are
added
The simplest approach is to add 1 to all
counters before calculating frequencies
PSI-BLAST adds pseudo counts based on
observations multiplied by a substitution matrix
• This means that pseudo counts are mainly added to
the amino acids which are similar to the observed ones
• The number of pseudo counts is adjusted so that
pseudo counts are mainly used when there are few real counts
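The simplest scheme, adding one pseudo count to every amino acid, can be sketched as follows. The function name is hypothetical, and this is the add-one variant only, not PSI-BLAST's substitution-matrix-based pseudo counts.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(column, pseudocount=1.0):
    """Estimate amino acid frequencies for one alignment column.

    `pseudocount` is added to every amino acid's count, so residues that
    were never observed still get a small non-zero frequency. Symbols
    outside the 20 standard amino acids (such as gaps) are ignored.
    """
    counts = Counter(column)
    observed = sum(counts[aa] for aa in AMINO_ACIDS)
    total = observed + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


freqs = column_frequencies("LLLI")
print(freqs["L"], freqs["I"], freqs["W"])
```

With only four observations the pseudo counts dominate: L is estimated at 4/24 rather than 3/4. With hundreds of observations the same pseudo counts would barely change the estimate, which is the behaviour described in the last bullet above.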
An overview of PSI-BLAST
A fast heuristic method for doing profile
searches which is almost as good as "the real
thing"
Outline of the algorithm
• First ordinary BLAST is used to find close homologs
• Rather than making a real multiple alignment the close
homologs are all just aligned to the query sequence (a
master-slave alignment)
• A profile is constructed using a very simple empirical
weighting scheme combined with substitution matrix
pseudo-counts
• Ignoring the positional variation of indels the profile is
again searched against the database
PSI-BLAST’s E-values
BLAST generally tends to overestimate the
significance of database hits
PSI-BLAST E-values are not the E-value of the
query sequence matching the database sequence
Instead the E-values represent the expectation
value of the profile matching the database
sequence
The profile might be wrong due to spurious hits
in earlier iterations!
Using PSI-BLAST
We will now use PSI-BLAST to find more
homologs to our query sequence
Again the emphasis is on the interpretation of
the results
Open a web browser on the page
http://www.cbs.dtu.dk/dtucourse/cookbooks/ljj/exercise3.html
Conserved domain BLAST
PSI-BLAST attempts to build a profile for the
query sequence and search it against a sequence
database
CD-BLAST instead builds a database of profiles
and searches the query sequence against this
• This means that CD-BLAST is not iterative and thus
faster
• CD-BLAST works for sequences with no close homologs
• The profiles come from the PFAM database which is
checked by experts to make sure that no unrelated
sequences are included in the profiles
• However CD-BLAST can only identify conserved
domains which are in the PFAM database
Is it really worth the trouble?
Yes!
Profile based search methods like PSI-BLAST
have been shown to find ~3 times as many
homologs without increasing the number of false
positives
This essentially translates into three times
higher chance of finding a homolog with known
structure or function
Using profiles rather than single sequences
improved secondary structure prediction by
~10%
Searching for conserved protein
domains using CD-BLAST
We will now use CD-BLAST to see if the query
sequence has matches to any known protein
families
The results of this search should be compared to
those found using PSI-BLAST
Open a web browser on the page
http://www.cbs.dtu.dk/dtucourse/cookbooks/ljj/exercise4.html
Important things to remember
when using alignment to search
databases
When searching in databases, size does matter!
• Searching large databases takes a very long time
• The significance of matches drops when the database is
expanded
Doing things differently can lead to different
conclusions
• Nucleotide comparison vs. protein comparison
• CD-BLAST vs. PSI-BLAST
Think before and after you search
• The obvious thing to do is not always the right thing to
do