Center for Biological Sequence Analysis

Database Searching
Using alignment algorithms for finding similar sequences

Why do we want to compare sequences?
Evolutionary relationships
  • Phylogenetic trees can be constructed by comparing the sequences of a molecule (example: 16S rRNA) taken from different species
  • Residues that are conserved during evolution are likely to be important for structure or function
Prediction of protein structure and function
  • Proteins that are very similar in sequence generally have similar 3D structure and function as well
  • By searching a sequence of unknown structure against a database of known proteins, the structure and/or function can in many cases be predicted

Things to keep in mind when working with alignments
Pairwise alignment programs always find the optimal alignment of two sequences
  • They do so even if it does not make any sense at all to align the two sequences
  • "Optimal" means optimal according to the substitution matrix and gap penalties you choose, even if you choose the wrong ones
Generally the underlying assumptions are wrong
  • The frequency of substitution is not the same at all positions
  • Nor are the frequencies of insertions and deletions the same at all positions
  • Affine gap penalties do not properly model indel events

Using sequence alignment to search databases
The most common use of pairwise sequence alignment is searching databases for related sequences
Although the alignments themselves may be unreliable, the alignment scores give a lot of information about which sequences are related and which are not
Having a set of related sequences is a lot more informative than having just one sequence, even if nothing is known about the related sequences

Requirements in addition to an alignment method
A very fast method to find potentially related sequences
  • Systematically searching through the databases with the alignment methods takes too long, even though dynamic programming is fast
  • Some method to initially identify possible matches is therefore needed to speed up the search
A method to evaluate which matches to trust
  • Statistics on the alignment score distributions can be used to calculate the significance of an alignment
  • This way we can not only rank which matches are better than others but also tell whether any of them are good at all

Local or global alignment
Generally local alignment is used for performing database searches
  • In most cases you are interested in knowing whether any part of your sequence looks like something else
  • The protein sequence databases have not been split into domains
It is not always the optimal thing to do but …
  • In the case where the complete sequence should match, the local alignment score will be almost identical to the global one
  • If you really want a global alignment you can make it afterwards

Differences between global and local alignments
Extra constraint on the scoring function: the expected score for a random alignment must be negative
Because you can start a new alignment anywhere, dynamic programming scores cannot become negative
The trace-back is started at the highest value rather than in the lower right corner
The trace-back is stopped as soon as a zero is encountered

The Smith-Waterman algorithm (local alignment)

Alignment score distributions
The local similarity scores for ungapped alignment of random sequences can be shown to follow an extreme value distribution: $P(S \geq x) = 1 - \exp(-K m n e^{-\lambda x})$, where $m$ and $n$ are the sequence lengths while $K$ and $\lambda$ are free parameters
This turns out to be a very good approximation for gapped alignment as well, as long as reasonably large gap penalties are used

Database searching
Positive reporting: when searching in a database we report only the few good matches
The expected number of database hits with a score of at least $x$ can be calculated as $E(S \geq x) = D \cdot P(S \geq x)$, where $D$ is the number of entries in the database
E-values are much better for evaluating alignments than raw alignment scores or "percent identity"

A curse or a blessing?
Large databases are a blessing …
  • They are more likely to contain something similar to the query
… and a curse
  • Increasing the size of the database decreases the significance of the hits you get
  • Searching huge databases requires fast computers
The requirements this puts on software development
  • The programs must be sped up or database searches will take longer and longer
  • The false positive rate must be reduced so as not to lose specificity

Heuristic search algorithms
FASTA (Pearson 1995)
Uses heuristics to avoid calculating the full dynamic programming matrix
Speeds up searches by an order of magnitude compared to full Smith-Waterman
The statistical side of FASTA is still stronger than that of BLAST
BLAST (Altschul 1990, 1997)
Uses rapid word lookup methods to completely skip most of the database entries
Extremely fast
  • One order of magnitude faster than FASTA
  • Two orders of magnitude faster than Smith-Waterman

Coffee break
Top 10 ways to tell you drink too much coffee
10 Juan Valdez names his donkey after you
9 You get a speeding ticket even when you're parked
8 You grind your coffee beans in your mouth
7 You sleep with your eyes open
6 You watch videos in fast-forward
5 You lick your coffeepot clean
4 Your eyes stay open when you sneeze
3 The nurse needs a scientific calculator to take your pulse
2 You can type sixty words a minute with your feet
1 You can jump-start your car without jumper cables.
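The full Smith-Waterman search that FASTA and BLAST are compared against can be sketched in a few lines. The sketch below follows the local alignment rules given earlier: cell scores never drop below zero, the trace-back starts at the highest cell and stops at the first zero. The scoring scheme is an assumption made for illustration only: fixed match and mismatch scores and a linear gap penalty stand in for a real substitution matrix and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one optimal local alignment.

    The match/mismatch/gap values are illustrative assumptions, not the
    parameters of any particular program.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets a new alignment start anywhere,
            # so scores cannot become negative.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace-back: start at the highest value, stop at the first zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

Calling `smith_waterman("XXXACGTXXX", "YYACGTYY")` picks out the shared `ACGT` segment and ignores the unrelated flanks. The nested loop fills every cell of the matrix, which is exactly the cost the heuristic methods try to avoid.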
How BLAST works
The search is sped up by indexing the sequence databases in a so-called suffix array
  • Three-letter subsequences are used as keys to the sequences
  • Closely related substitutions are also included
  • This gives ~150 index keys for each sequence
This is used in two ways
  • To quickly discard sequences that are not similar at all before even beginning to align them
  • To constrain the alignment and thereby speed up the alignment procedure itself

Variations on a theme
BLASTN
  • Nucleotide query sequence
  • Nucleotide database
TBLASTN
  • Protein query sequence
  • Nucleotide database
  • "On the fly" six-frame translation of the database
BLASTP
  • Protein query sequence
  • Protein database
BLASTX
  • Nucleotide query sequence
  • Protein database
  • Compares all six reading frames of the query with the database
TBLASTX
  • Nucleotide query sequence
  • Nucleotide database
  • Compares all reading frames of the query with all reading frames of the database

BLAST at NCBI
http://www.ncbi.nlm.nih.gov/BLAST/
Very fast computer dedicated to running BLAST searches
Many databases that are always up to date
Nice simple web interface
But you still need knowledge about BLAST to use it

Performing a simple BLAST search
We will now do a small exercise together
The purpose of the exercise is simply to perform a simple BLAST search "hands on"
Open a web browser on the page http://www.cbs.dtu.dk/dtucourse/cookbooks/ljj/exercise1.html

The most common and effective way to ruin your database search
What you should never ever do: take the nucleotide sequence of a gene and compare it with a database at the nucleotide level
  • Unfortunately this is a very intuitive thing to do
  • On the NCBI BLAST homepage nucleotide search methods are listed before protein search methods, making it even more intuitive
What you should do instead
  • Extract the coding part of the DNA sequence, translate it, and search with the resulting protein sequence
  • Use a search method (such as BLASTX or TBLASTX) which compares the sequences at the protein level

The limits of sequence similarity

Expectation values in BLAST
BLAST uses precomputed extreme value distributions to calculate E-values from alignment scores
  • For this reason BLAST only allows certain combinations of substitution matrices and gap penalties
  • This also means that the fit is based on a different data set than the one you are working on
A word of caution: BLAST tends to overestimate the significance of its matches
  • E-values from BLAST are fine for identifying sure hits
  • One should be careful when using BLAST's E-values to judge whether a marginal hit can be trusted

Evaluating BLAST results
We will now do a second exercise together
The main point of this exercise is careful interpretation of the BLAST output
Open a web browser on the page http://www.cbs.dtu.dk/dtucourse/cookbooks/ljj/exercise2.html

Pairwise alignment of hemoglobin alpha chain and myoglobin
24.7% identity; Global alignment score: 130

HBA_HU  VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKG---
MYG_PH  VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED

HBA_HU  ---HGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNF-KLLSHCLLVTLAAHL
MYG_PH  LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKI-PIKYLEFISEAIIHVLHSRH

HBA_HU  PAEFTPAVHASLDKFLASVSTVLTSKYR------
MYG_PH  PGDFGADAQGAMNKALELFRKDIAAKYKELGYQG

A multiple sequence alignment of globins

HBB_HUMAN   --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
HBB_HORSE   --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
HBA_HUMAN   ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
HBA_HORSE   ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
MYG_PHYCA   ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
GLB5_PETMA  PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
LGB2_LUPLU  --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

HBB_HUMAN   PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
HBB_HORSE   PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
HBA_HUMAN   ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
HBA_HORSE   ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
MYG_PHYCA   EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
GLB5_PETMA  ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
LGB2_LUPLU  VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV

Why multiple alignment is better
More sequences contain more information
Multiple sequence alignment allows us to compare all related proteins simultaneously
It allows us to identify features that are conserved among the sequences
Using a multiple sequence alignment (a profile) one can find more related sequences than by simple pairwise comparison

Coffee break
Coffee break quiz: Why is the lasT gene in E. coli called lasT? Did some researchers fail to get the joke?
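The idea that a multiple alignment exposes conserved features, and can be summarized as a profile, can be sketched as follows. This is a minimal illustration, not the method of any particular program: the function names are hypothetical, gaps are simply ignored when counting, and the optional `pseudocount` argument is an assumption of the sketch (a flat count added to every residue so that unobserved residues do not get a frequency of exactly zero).

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(alignment, alphabet=AMINO_ACIDS, pseudocount=0.0):
    """Return one {residue: frequency} dict per alignment column.

    `alignment` is a list of equal-length aligned sequences; '-' marks a gap.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    profile = []
    for col in range(length):
        # Count residues in this column, skipping gaps.
        counts = Counter(seq[col] for seq in alignment if seq[col] != "-")
        total = sum(counts[aa] for aa in alphabet) + pseudocount * len(alphabet)
        if total == 0:
            # Column contains only gaps and no pseudocounts were requested.
            profile.append({aa: 0.0 for aa in alphabet})
        else:
            profile.append({aa: (counts[aa] + pseudocount) / total
                            for aa in alphabet})
    return profile


def conserved_columns(alignment):
    """Return indices of columns where every sequence has the same residue."""
    return [col for col in range(len(alignment[0]))
            if alignment[0][col] != "-"
            and len({seq[col] for seq in alignment}) == 1]


toy = ["VLSPADK", "VLSAADK", "VLSEGEW", "-LTESQA"]
print(conserved_columns(toy))
print(column_profile(toy)[2]["S"])
```

The toy alignment is made up for the example. On real data the same counting would be done per column of an alignment like the globin one, normally with some sequence weighting so that near-identical sequences do not dominate the counts.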
An iterative scheme for using profiles in database searches
Search your sequence against a large database using a pairwise alignment method (often BLAST) to obtain a set of closely related sequences
Make a multiple sequence alignment (using ClustalW) and estimate a profile
Search the profile against the database in an attempt to find more distantly related sequences
Include these in the profile and redo the profile search

If only life was so simple …
In the databases one may find large clusters of almost identical sequences
  • These will heavily bias the profile towards "their sequence"
  • To avoid this a sequence weighting scheme must be used during construction of the profile
How should one estimate the frequencies of rare mutations that have not been observed?
  • A more general problem: what to do when you have too few observations to make a reliable estimate of a frequency
  • The solution is called regularization, which involves using prior knowledge on mutations (such as substitution matrices)

Regularization by pseudo counts
In addition to the real counts actually observed in the sequences some extra pseudo counts are added
The simplest approach is to simply add 1 to all counters before calculating frequencies
PSI-BLAST adds pseudo counts based on the observations multiplied by a substitution matrix
  • This means that pseudo counts are mainly added to the amino acids which are similar to the observed ones
  • The number of pseudo counts is adjusted so that pseudo counts are mainly used when there are few real counts

An overview of PSI-BLAST
A fast heuristic method for doing profile searches which is almost as good as "the real thing"
Outline of the algorithm
  • First ordinary BLAST is used to find close homologs
  • Rather than making a real multiple alignment the close homologs are all just aligned to the query sequence (a master-slave alignment)
  • A profile is constructed using a very simple empirical weighting scheme combined with substitution matrix pseudo counts
  • Ignoring the positional variation of indels the profile is again searched against the database

PSI-BLAST's E-values
BLAST generally tends to overestimate the significance of database hits
PSI-BLAST E-values are not the E-values of the query sequence matching the database sequence
Instead the E-values represent the expectation value of the profile matching the database sequence
The profile might be wrong due to spurious hits in earlier iterations!

Using PSI-BLAST
We will now use PSI-BLAST to find more homologs to our query sequence
Again the emphasis is on the interpretation of the results
Open a web browser on the page http://www.cbs.dtu.dk/dtucourse/cookbooks/ljj/exercise3.html

Conserved domain BLAST
PSI-BLAST attempts to build a profile for the query sequence and search it against a sequence database
CD-BLAST instead builds a database of profiles and searches the query sequence against this
  • This means that CD-BLAST is not iterative and thus faster
  • CD-BLAST works for sequences with no close homologs
  • The profiles come from the PFAM database, which is checked by experts to make sure that no unrelated sequences are included in the profiles
  • However CD-BLAST can only identify conserved domains which are in the PFAM database

Is it really worth the trouble?
Yes!
Profile-based search methods like PSI-BLAST have been shown to find ~3 times as many homologs without increasing the number of false positives
This essentially translates into a three times higher chance of finding a homolog with known structure or function
Using profiles rather than single sequences improved secondary structure prediction by ~10%

Searching for conserved protein domains using CD-BLAST
We will now use CD-BLAST to see if the query sequence has matches to any known protein families
The results of this search should be compared to those found using PSI-BLAST
Open a web browser on the page http://www.cbs.dtu.dk/dtucourse/cookbooks/ljj/exercise4.html

Important things to remember when using alignment to search databases
When searching in databases, size does matter!
  • Searching large databases takes a very long time
  • The significance of matches drops when the database is expanded
Doing things differently can lead to different conclusions
  • Nucleotide comparison vs. protein comparison
  • CD-BLAST vs. PSI-BLAST
Think before and after you search
  • The obvious thing to do is not always the right thing to do
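The point that significance drops as the database grows follows directly from the extreme value formula and the E-value definition given earlier in these slides. The sketch below assumes that form. The function names are hypothetical, and the values used for K and lambda are placeholders chosen only for illustration, not parameters taken from any program.

```python
import math


def p_value(score, m, n, K, lam):
    """P(S >= score) = 1 - exp(-K * m * n * exp(-lam * score)).

    m and n are the two sequence lengths; K and lam are the free
    parameters of the extreme value distribution.
    """
    # -expm1(-y) equals 1 - exp(-y) but stays accurate when y is tiny.
    return -math.expm1(-K * m * n * math.exp(-lam * score))


def e_value(score, m, n, K, lam, db_entries):
    """Expected number of database hits scoring at least `score`."""
    return db_entries * p_value(score, m, n, K, lam)


# Placeholder parameters (assumptions, for illustration only).
K, LAM = 0.041, 0.267
for entries in (10_000, 1_000_000):
    e = e_value(score=80, m=150, n=300, K=K, lam=LAM, db_entries=entries)
    print(f"{entries:>9} entries: E = {e:.3g}")
```

The same alignment score yields an E-value one hundred times larger in the database that is one hundred times bigger, so a hit that looks convincing in a small database can be marginal in a large one.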