EMBL-EBI Sequence Searching and Alignments, 2013
Sequence Searching and Alignments - Andrew Cowley EMBL-EBI

Contents

USING ALIGNMENT TOOLS AT EMBL-EBI
    Step 1 – input
    Step 2 – parameters
    Step 3 – submit
RESULTS
    Results page
    Alignment
    Submission details
    Implementing the Methods for Sequence Searching Tools: BLAST
    Implementing the Methods for Sequence Searching Tools: FASTA
    BLAST & FASTA Sensitivity
    Sequence Searching Similarity tools at the EBI
USING FASTA
    Step 1 – database selection
    Step 2 – input
    Step 3 – parameters
    Step 4 – submit
RESULTS
    Summary Table
    Tool Output
    Visual Output
    Functional Predictions
    Submission Details
INTERPRETING AN ALIGNMENT
USING BLAST
    NCBI BLAST parameters
    WU-BLAST parameters
BLAST RESULTS
DIFFERENCES BETWEEN BLAST AND FASTA
    When to use what?
    PSI-BLAST Threshold
PROBLEMS WITH ITERATIVE SEARCHES
HOMOLOGOUS OVER-EXTENSION (HOE)
FILTERS
DATABASE COMPOSITION
VECTOR CONTAMINATION
MULTIPLE SEQUENCE ALIGNMENT
CLUSTALW2
RESULTS
    Alignments
    Result Summary
    Jalview
    Guide Tree
    Phylogenetic Tree
    Submission Details
    Clustal Omega
    T-Coffee
    MUSCLE
    MAFFT
    Kalign
PHYLOGENY
CLUSTALW2 - PHYLOGENY
    Saving phylogenetic trees
WEBPRANK
    Results
RELATED ARTICLES FROM THE EBI
FURTHER READING

Information: Course materials (including a copy of this manual) can be downloaded from: http://www.ebi.ac.uk/~apc/Courses/Immuno/

Using alignment tools at EMBL-EBI

There are a variety of sequence alignment tools available at EMBL-EBI. The simplest tools are pairwise alignment tools.

Instruction: Navigate to the EBI pairwise sequence alignment page

You can either type in an address directly (www.ebi.ac.uk/Tools/psa/), search with the term PSA, or go through the services list as demonstrated.
You will be faced with a choice of programs, split into global, local or genomic alignments. Once you have chosen a program, the input page looks like this:

Input for pairwise alignments is very simple.

Step 1 – input

Here you can enter your two sequences. Most common formats are recognised, but please don’t try to invent your own using Word! You can either paste/type the sequence directly, or upload a file containing it from your computer using the browse button.

Step 2 – parameters

The program will be set up with some default parameters; however, you can change them if you wish. You can click on any parameter title to get help on it. The main options are:

Matrix: The comparison matrix used to score the alignment
Gap open: Penalty to start a gap in the alignment
Gap extend: Penalty for each base or residue in the gap

Step 3 – submit

When you’re happy with everything else, select submit to run the job.

Results

Results page

The results page is fairly simple. At the top of the page are some tabs that switch between the alignment, submission details, and the form to submit a new job. The parameters used for the job are shown, and there is a link to the output file. The output file text is also displayed further down the page.

Alignment

This page shows the results of the alignment. The table gives a summary of the results:

Length reports the length of the alignment.
Identity reports the number of identical residues that are found aligned between the two sequences.
Similarity reports the number of aligned residues that score positively in the substitution matrix (ie similar types of residues).
Gaps reports the number of gaps inserted into the alignment.
Score gives the literal score of the alignment as worked out by the algorithm.

Then the alignment itself is displayed: the top line is the first sequence and the bottom is the second sequence. Gaps in a sequence are displayed with a ‘-’ character. Where two identical residues line up they are connected with a ‘|’; where two very similar residues line up they are connected with a ‘:’. Where less well conserved substitutions are made they are connected with a ‘.’.

Submission details

This page gives details of how the program was run. It tells you what version of the tool was run, when it was launched, details of your input and the tool output, as well as the original command line used to launch the job and details of the selected parameters. These can all be useful if you need to recreate a job on your local machine, or to repeat an alignment in the future. This page is also very useful to us if you have a problem and need to contact us.

Instruction: Try running your own global alignment using the provided sequences:
- Use the EMBOSS needle program
- Leave the parameters set at their defaults

Question 1 (global): What is the Length, Score, %identity, %similarity and %gaps of the alignment?

Length    Score    Identity%    Similarity%    Gaps%

Now let’s try a local alignment.

Instruction: Try running your own local alignment using the provided sequences:
- This time use the EMBOSS water program
- Leave the parameters set at their defaults

Question 2 (local): What is the Length, Score, %identity, %similarity and %gaps of the alignment?

Length    Score    Identity%    Similarity%    Gaps%

Question 3: In words, how would you describe the key differences between the global and local alignment results? Can you think of ways to improve the alignment?

Now let’s try changing the matrix parameter.
Instruction: Try running your own local alignment using the provided sequences:
- Use the EMBOSS water program
- Change the matrix to BLOSUM 40, and leave the other parameters at their defaults

Question 4: What is the Length, Score, %identity, %similarity and %gaps of our new alignment? Can you describe what has happened?

Length    Score    Identity%    Similarity%    Gaps%

Alignment against databases

While it’s possible (and very accurate) to run optimal alignments against databases (using SSEARCH or GGSEARCH at the EBI, for example), the computational requirements are such that it takes a very long time and uses a large amount of memory. It is more practical to use a heuristic-based method such as BLAST or FASTA.

Implementing the Methods for Sequence Searching Tools: BLAST

BLAST stands for Basic Local Alignment Search Tool and was developed by Altschul and colleagues in 1990. BLAST uses an approximation of the Smith–Waterman algorithm, which makes it quite fast; however, this gain in speed is offset by a decrease in accuracy. Unlike the true Smith–Waterman algorithm, BLAST is not guaranteed to find the optimal alignment between your query sequence and the test sequences. However, it will find good alignments and provides a statistical means of gauging your confidence in each alignment:

(1) It searches for ‘words’ of a user-defined length (the shorter the word, the more sensitive the search).
(2) It then extends these words in both directions until it finds a mismatch.
(3) It then performs an approximation of the Smith–Waterman algorithm to create a gapped alignment between the query sequence and the test sequence.
(4) Finally it calculates and reports the probability of the alignment occurring by chance.
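Steps (1) and (2) above can be illustrated with a toy sketch. This is not the BLAST implementation: it only finds exact word matches and extends them until the first mismatch, exactly as the simplified description says, and it skips the gapped alignment and statistics of steps (3) and (4). The function names and the example sequences are made up for illustration.

```python
from collections import defaultdict


def index_words(seq, k):
    """Map every word of length k in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def seed_and_extend(query, subject, k=3):
    """Return ungapped exact matches as (query_start, subject_start, length).

    Step (1): find words of length k shared by query and subject.
    Step (2): extend each shared word left and right until a mismatch.
    """
    index = index_words(query, k)
    hits = set()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            # Extend to the left while the residues keep matching.
            q_start, s_start = i, j
            while (q_start > 0 and s_start > 0
                   and query[q_start - 1] == subject[s_start - 1]):
                q_start -= 1
                s_start -= 1
            # Extend to the right while the residues keep matching.
            q_end, s_end = i + k, j + k
            while (q_end < len(query) and s_end < len(subject)
                   and query[q_end] == subject[s_end]):
                q_end += 1
                s_end += 1
            hits.add((q_start, s_start, q_end - q_start))
    # Longest matches first.
    return sorted(hits, key=lambda hit: -hit[2])


print(seed_and_extend("MKTAYIAKQR", "GGTAYIAKWW", k=3))  # [(2, 2, 6)]
```

A smaller k finds more seeds (more sensitive, slower); a larger k finds fewer (faster, less sensitive), which is the trade-off noted in step (1).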
When doing a BLAST search one can set the “expectation threshold” (EXP THR), which establishes a statistical significance threshold for reporting database sequence matches. EXP THR is interpreted as the upper limit on the expected frequency of a chance occurrence of a match within the context of the entire database search; in other words, it sets an upper limit on the E-value. Any database sequence whose BLAST alignment to the query sequence satisfies EXP THR is reported in the output file. An alignment with an E-value of ≥1.0 is expected to be found at least once by chance in the searched database, and one with an E-value of ≥5.0 is expected to be found at least five times (see figure below). Raising this threshold increases the likelihood of reporting distantly related matches, but with EXP THR set above 1.0 the number of chance matches reported will tend to grow much faster than the number of real matches.

Implementing the Methods for Sequence Searching Tools: FASTA

David Lipman and William Pearson (1988) developed FASTA, which gets around the speed problem as follows:

(1) FASTA breaks the query and test sequences into overlapping words and looks for exact matches.
(2) Then it re-scores these matches using a substitution matrix.
(3) Next it tries to join the highest-scoring segments. A ‘joining threshold’ set by the user eliminates segments that are unlikely to be part of the alignment.
(4) Finally, FASTA uses the Smith–Waterman method to optimise the alignment, using only the part of the dynamic programming matrix that contains the top-scoring segments.

BLAST & FASTA Sensitivity

FASTA therefore provides a means of performing a sensitive search against a large database in a reasonable time. Nowadays, it is possible to approach the sensitivity of a FASTA search by using BLAST and setting high sensitivity values.
However, the alignment statistics for FASTA can be considered more robust than those for gapped-BLAST. This is because FASTA produces and scores an alignment of the query sequence with a large sampling of the database, giving it a distribution of scores that represents the entire database:sequence range of alignments. BLAST is fast because it does not bother producing an alignment for most database sequences; alignments are only triggered if the initial word-match criteria are met. Consequently, BLAST does not have a complete distribution of alignment scores over the database from which to calculate the significance of the reported matches. Instead, BLAST uses pre-computed values for the score distribution rather than calculating values that are specific to the search carried out.

Sensitivity: There is a trade-off between sensitivity and search speed. Increasing the sensitivity makes the search more computationally intensive and therefore slower. Decreasing the sensitivity, for example when you are looking for almost exact matches, can dramatically increase the search speed. There is also a trade-off between sensitivity and specificity: increasing sensitivity tends to decrease specificity (greater propensity for chance matches). So for example: if you are looking for vector contamination you will choose low sensitivity, whereas if you are looking for distantly related sequences you will opt for high sensitivity.

Sequence Searching Similarity tools at the EBI

The EBI provides you with the option to use several sequence similarity searching tools at www.ebi.ac.uk/Tools/sss. These are maintained by the External Service (ES) team. The ES team puts considerable effort into providing you with state-of-the-art tools for sequence searching, and with the flexibility to tailor your queries to the most appropriate search parameters.
Therefore you will find several options, not only in the types of tools but also, within each tool, in the number of variables and parameters you can change.

We have talked about BLAST and FASTA, but what are all the variations above? A quick way to distinguish among these tools is to label them as either heuristic or rigorous. A fundamental challenge in computer science is to make algorithms that find verifiably good solutions within a provably bounded amount of computation time. A heuristic algorithm gives up one or both of these goals. In other words, a heuristic is an algorithm that produces an acceptable solution to a problem in many practical scenarios, but for which there is no formal proof that the solution is optimal. Heuristics are typically used when there is no known method to find an optimal solution, either under the given constraints (of time, space etc.) or at all. Rigorous, on the other hand, means applying an algorithm that is proven to give the optimal solution. It is therefore exhaustive and should provide you with the best answer; however, it is slow. Following these definitions, the Sequence Searching Tools mentioned above are labelled as shown in this panel:

Using FASTA

Let’s have a go using FASTA for ourselves.

Instruction: Navigate to the EBI sequence search tools page

You can either type in an address directly (www.ebi.ac.uk/Tools/sss/), search with the term FASTA, or go through the services list as demonstrated.

FASTA is launched by picking a type of database listed next to the FASTA tool – this sets up the defaults appropriately for your choice, but you can always change the database once in the tool if you wish.

Instruction: Select the Protein database for FASTA

The framework from which we launch our tools was revamped recently, and now forms a common basis for many programs. For more information about this framework see: A new bioinformatics analysis tools framework at EMBL–EBI (2010) Goujon et al. doi:10.1093/nar/gkq313

You should end up at a screen that looks like the following:

This screen will hopefully look quite familiar to you when conducting other searches as well. There are four steps to submitting a job.

Step 1 – database selection

Here is where you select which databases to search against. You can select more than one, and you can also expand subsections to narrow your search, by taxonomic division for example. If you change your mind and want to do a different type of search (for example, against nucleotide sequences) then you can select that here as well and it will reset the form.

Step 2 – input

Here you can enter your query sequence. Most common formats are recognised, but please don’t try to invent your own using Word! You can either paste/type the sequence directly, or upload a file containing it from your computer using the browse button.

Step 3 – parameters

Most important here is the choice of program – FASTA is grouped together with others like SSEARCH (because they were authored by the same person and come in the same package). It should be set up correctly according to the button you pressed to get to this page.

The program will be set up with some default parameters; to look at these or change them you need to click the ‘More options’ button. You can click on any parameter title to get help on it.
Here is a summary:

Matrix: The comparison matrix to be used to score alignments when searching the database
Gap open: Penalty to start a gap in the alignment
Gap extend: Penalty for each base or residue in the gap
Ktup: ‘Word size’ used to identify runs in the first stage of alignment
Expectation upper value: Allows you to ignore results with an expectation value above a given limit (ie more distant matches)
Expectation lower value: Allows you to ignore results with an expectation value below a given limit (ie ignore close relatives)
DNA strand: When searching DNA you can specify which strand is used. By default both are searched
Histogram: Turn on/off the display of the statistical histogram in FASTA results
Filter: Which low complexity filter to use
Statistical estimates: Which statistical method to use to evaluate values used in the Expect score calculation
Scores: Maximum number of scores reported in the summary
Alignments: Maximum number of alignments reported in the summary
Sequence range: Allows you to specify which portion of the query sequence to use in the search
Database range: Allows you to cut back on the database sequences searched against by specifying a range for the number of residues

Step 4 – submit

When you’re happy with everything else, here is where you click to submit the job. For longer runs it is recommended that you tick the box to send you an email when the job is complete. This will contain a link back to the results so there is no need to keep your browser open. Email jobs are usually stored on the servers for longer as well, while interactive job results are deleted more quickly.

Once you’ve submitted your job, the first thing that happens is that the input is validated – few things are more frustrating than preparing a job and firing it off, only to check your email later in the day and find that you made a minor mistake somewhere and the job failed.
If everything is okay then the job will run, and you will see a job running screen if you ran it interactively. Eventually the job will finish and you will either be taken to the results page automatically or emailed a link to it.

Results

Summary Table

There is a lot of information to take in on the results page! You are first presented with the summary table, which quickly lists the top results in a table format. You can change the ordering of the table by clicking the arrows next to each column header – by default results are ranked by Expect value, or E().

The first column contains a tick box which allows you to select database results for further actions, for example to view the sequence annotation or the detailed alignment with the query sequence. You can also use the buttons to clear selections, select all, or invert the selection. To download the selected sequences click the download button.

The second column (DB:ID) gives the database ID of the sequence.
The third column (Source) gives some quick information about the sequence, as well as cross references that have been found referring to it in other resources across the EBI – so you can quickly look up more information about the sequence in these resources.
Length reports the length of the database sequence.
Score gives the literal score of the alignment.
Identities reports the number of identical residues that are found aligned between the query and database sequence.
Positives reports the number of aligned residues that score positively in the substitution matrix (ie similar types of residues).
E() gives the Expect score for the alignment – this is a measure of how likely you are to find that alignment by chance. When the numbers are very small it reports them as 1.0E-10 for example. This is the same as 1e-10, or 0.0000000001.
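As a small worked example of reading E() values, the sketch below parses the scientific notation shown in the table and converts an Expect value into the probability of seeing at least one match that good by chance. It assumes chance matches follow a Poisson distribution with mean E, which is the usual textbook treatment rather than something stated in this manual; the function name is made up for illustration.

```python
import math


def chance_probability(e_value):
    """Probability of at least one chance match, given an Expect value.

    Assumption: the number of chance matches is Poisson-distributed with
    mean e_value, so P(at least one) = 1 - exp(-e_value).
    """
    # expm1 keeps precision when e_value is very small.
    return -math.expm1(-e_value)


# The three spellings from the summary table are the same number.
assert float("1.0E-10") == 1e-10 == 0.0000000001

for e_value in (1e-10, 1.0, 5.0):
    print(f"E() = {e_value:g}  ->  P(chance match) = {chance_probability(e_value):.3g}")
```

For very small values E() and the probability are practically identical, which is why tiny E() values can be read directly as "very unlikely to be a chance match".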
Tool Output

This tab switches the view to the raw, original output from FASTA – this can be useful when you want to view the full text output from the program in case it contains something the summary or other pages don’t cover. You can download it as a text or XML file. Clicking on the icon will jump you straight to the alignment. Clicking on the sequence ID will take you to the original database entry for that sequence.

Visual Output

This output gives you a nice way of visualising which portions of the sequence are aligning, as well as colour-coding the alignment by Expect score. Hovering your mouse over the sequence ID on the left-hand side will show a guide box around the alignment. Clicking the mouse will take you to the alignment (from the raw output of the program).

Functional Predictions

This tab is another example of how we are bringing together different resources at the EBI to give you extra information. It searches a variety of resources for family and domain predictions and shows you the results graphically, so you can easily see which portions of your sequence and alignments correspond to these features. Again, you can click on the links on the left-hand side to jump to the entry information at the resource.

Submission Details

This page contains all the original parameters used to launch your job, together with easy links to the exact input used and the output results. This information is really useful for several reasons: if you want to repeat a job then you might want to use exactly the same parameters; if you’re interested in running the command line version of the tool then this will give you the exact command line used; and finally if you need help or support then this page contains all of the information you need to give us to be able to help you quickly.
Interpreting an alignment

The figure below shows a typical alignment.

Key:
  -  Gap
  :  Identity
  .  Similarity
  X  Filtered

The header shows some information about the database sequence, followed by some of the raw scores from the program itself. The key bits of information are the E() score and the % identity and % similarity numbers. Below that is the actual alignment itself – the top line is the query sequence and the bottom is the database sequence. Gaps in a sequence are displayed with a ‘-’ character (none in the above alignment). Where two identical residues line up they are connected with a ‘:’; where two similar residues line up they are connected with a ‘.’.

Instruction: Try running your own FASTA search using the provided sequence:
- Search against the UniProtKB/Swiss-Prot database only – if you search against the full UniProt Knowledgebase it will take a very long time!
- Make sure the FASTA program is selected
- Leave the parameters set at their defaults

Question 5: What are the default gap open, gap extend, ktup and matrix parameter settings for this search?

Question 6: What can you say about the likely function of this protein?

Question 7: What are the DB:IDs, Scores, %identities, %positives and E() of the top two results?

ID    Score    Identity%    Positives%    E()

Question 8: Have a look at the actual alignments with the top two results – what can you say about them?

Using BLAST

Now that we are familiar with running FASTA searches, using BLAST should be very easy – the interface is effectively the same. Simply select a database next to the BLAST tool of interest to enter the tool.
NCBI BLAST: This version of BLAST is the version maintained at the NCBI
WU-BLAST: This version of BLAST was created by Dr Warren Gish, formerly of Washington University

Both versions can trace their history back to the same algorithm, but were developed separately, often implementing ideas that first appeared in the other version. The parameters are handled slightly differently in each case: NCBI BLAST provides access to parameters like those used by FASTA or SSEARCH (eg gap open, gap extend). WU-BLAST hides direct access to those, but instead provides a sensitivity parameter which combines several adjustments at different stages of the algorithm. The raw results are slightly different between the two, but as we parse the results they will appear the same to you (unless you look at the Tool Output page).

NCBI BLAST parameters

Matrix: The comparison matrix to be used to score alignments when searching the database
Gap open: Penalty to start a gap in the alignment
Gap extend: Penalty for each base or residue in the gap
Exp. Thr (expectation threshold): Allows you to ignore results with an expectation value above a given limit (ie more distant matches)
Filter: Which low complexity filter to use
Drop off: Controls how far a potential HSP is allowed to extend
Scores: Maximum number of scores reported in the summary
Alignments: Maximum number of alignments reported in the summary
Sequence range: Allows you to specify which portion of the query sequence to use in the search
Gap align: Allows gapped extensions of alignments
Alignment views: Options for formatting the alignment output (Tool Output)

WU-BLAST parameters

Matrix: The comparison matrix to be used to score alignments when searching the database
Exp. Thr (expectation threshold): Allows you to ignore results with an expectation value above a given limit (ie more distant matches)
Filter: Which low complexity filter to use
View filter: Display any sequence filtered out (in Tool Output)
Sensitivity: General parameter affecting search sensitivity – this makes adjustments to several internal parameters
Scores: Maximum number of scores reported in the summary
Alignments: Maximum number of alignments reported in the summary
Sort: Choose which value to sort the Tool Output results by
Stats: Choice of statistical methods used in generation of Expect statistics
topcomboN: In WU-BLAST, HSPs are classified into a number of sets; you can use this parameter to restrict the display to only the N highest-scoring sets
Alignment views: Options for formatting the alignment output (Tool Output)

BLAST results

Unsurprisingly, the results from BLAST runs on our servers are displayed in the same way as FASTA results, and all the tabs are equivalent. The main differences in format will only appear if you look at the raw output via the Tool Output page.

Key:
  -          Gap
  [residue]  Identity
  +          Similarity
  X          Filtered

The above picture shows the alignment section of an NCBI BLAST run. This time identical residues are actually listed between the two sequences, which are labelled Query for the query sequence and Sbjct (subject) for the database sequence. Similar residues (Positives) are indicated with a +. The number of gaps inserted into the overlapping sequence regions is also reported.

Instruction: Try running your own NCBI BLAST search using the provided sequence:
- Search against the UniProtKB/Swiss-Prot database only – if you search against the full UniProt Knowledgebase it will take a very long time!
  Make sure the BLASTP program is selected.
  Leave the parameters set at their defaults.

Question 9: What are the default gap open, gap extend, drop off and matrix parameter settings for this search?

Question 10: What are the DB:IDs, Scores, %identities, %positives and E() of the top two results?

  ID    Score    Identity%    Positives%    E()

Question 11: Are these different from our FASTA search earlier?

Differences between BLAST and FASTA

The following table summarises the key differences between BLAST and FASTA.

  BLAST                             FASTA
  Fast                              Not as fast as BLAST
  Good with proteins                Good with proteins and DNA
  Might miss potential alignment    Aligns against all database sequences
  Produces HSPs                     Produces S&W alignments
  Good for siblings                 Good for cousins

When to use what?

In general, the larger the database the faster the algorithm you should use, and likewise the larger the query sequence the faster the algorithm you should use. For very small queries or databases, dynamic programming methods like SSEARCH can be a great choice.

There is another reason you might want to choose FASTA based tools (including the -SEARCH tools such as SSEARCH and GGSEARCH), which is to do with the way the statistics, and thus significance, are calculated. More on that later.

PSI-BLAST

Position-Specific Iterative BLAST, or PSI-BLAST, is a clever tool which allows you to create your own custom scoring matrix based on the conservation of residues you find in your own searches, rather than on a general model built from other sequences.

PSI-BLAST workflow

It starts with a normal BLAST search, but you can then select which sequences in the results will be used to build a profile. These sequences are then aligned, and conserved residues at each position are scored more highly in a new type of matrix which allows for different scores at different positions in the sequence.
A new BLAST search is then run with this matrix, called a Position Specific Scoring Matrix or PSSM. The results can themselves be used to create another PSSM for another run, and so the process is iterative.

The aim of PSI-BLAST is to concentrate the alignment on positions that are important, while allowing for more variability in areas that are less important. For example, a functional area or binding motif might be more important than sequence that forms part of a loop.

Searches made with a PSSM can find matches with sequences that scored too low to be considered in a normal BLAST search, but that score more highly with the new matrix. These are marked as 'new' by the PSI-BLAST tool.

You can also save your search and continue it at a later date, or save the PSSM itself.

The workflow, step by step:
  1. Normal BLAST search
  2. Align selected results
  3. Create PSSM
  4. Use PSSM to score new BLAST alignment
  Go to Step 2 if required

The parameters for PSI-BLAST are the same as for NCBI BLAST, with the addition of a new threshold:

  PSI-BLAST Threshold: This expectation value controls the default selection of sequences used to generate the PSSM. Sequences with an expectation value higher than this (i.e. those that do not align as well) will not be included.

Once the first iteration has run, controls appear in addition to those of a normal BLAST result: the PSI-BLAST Threshold can be changed again, or individual sequences can be added to or removed from the selection by ticking the box in the first Summary Table column.

The View Threshold Limit button jumps the view down the table so you can see the cut-off.

Controls to download a checkpoint file or PSSM only appear after the second iteration.
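The profile-building step of the workflow above can be sketched in a few lines. This is a deliberately simplified illustration, not the PSI-BLAST implementation: it assumes the selected hits are already aligned to equal length, uses a uniform background frequency and a flat pseudocount, and ignores the sequence weighting that the real program applies. The names build_pssm and score are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build a toy position-specific scoring matrix (log-odds, base 2).

    `aligned` is a list of equal-length aligned sequences. Returns one
    dict per column, mapping each amino acid to its score.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(length):
        # Gap characters and unknown symbols are simply not counted.
        counts = Counter(seq[col] for seq in aligned if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score(pssm, seq):
    """Score an ungapped sequence of the same length against the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


# Toy "selected results" (hypothetical, already aligned).
hits = ["GKSTL", "GKTTL", "GRSAL", "GKSSL"]
pssm = build_pssm(hits)
print(round(score(pssm, "GKSTL"), 2), round(score(pssm, "AAAAA"), 2))
```

The fully conserved first column rewards G more strongly than the variable fourth column rewards any single residue, which is the position-specific behaviour that a single general-purpose matrix cannot provide.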
Instruction: Try running your own PSI-BLAST search using the provided sequence:
  Search against the UniProtKB/Swiss-Prot database only. If you search against the full UniProt Knowledgebase it will take a very long time!
  Make sure the PSI-BLAST program is selected.
  Leave the parameters set at their defaults.
  For the moment, stop after the first run.

Question 12: Looking at the first run (normal BLAST results), how many sequences score above our default threshold of 1.0e-3?

Instruction: For the second iteration you can choose which sequences to include in the PSSM generation.
  Untick the top scoring sequence (simply because it scores so much better than the other results; you would not necessarily do this normally).
  Leave everything else set to defaults.
  Click the 'run next iteration' button.

Question 13: Looking at the second iteration, how many sequences now score above our default threshold of 1.0e-3? (Hint: use the View Threshold Limit button.) What is a likely explanation?

Question 14: Have any new sequences been scored well enough to appear in our results?

Problems with iterative searches

Iterative searches like PSI-BLAST aim to use profile construction methods to increase search sensitivity; however, they can inadvertently decrease selectivity, particularly if the profile becomes contaminated with information that is not relevant to homology with our query sequence. One situation where this occurs is described as Homologous Over-Extension (HOE).

Homologous Over-Extension (HOE)

Iterative search strategies using profiles (i.e. the PSSM in PSI-BLAST) might help increase the sensitivity of a search. However, while the aim is to have the profile reflect areas of interest (a domain, for example), there is a danger that it will be contaminated with information that is not relevant to your query.
Low complexity regions are one example of this, but these can be fixed with the use of filters. Another cause of contamination that was recently described is Homologous Over-Extension (HOE).

HOE can occur in a profile based alignment when the alignment region picks up a portion of sequence that is not biologically relevant to our query but that is conserved in other sequences brought back by the search. The influence of this region on the scoring matrix can be such that the alignment region extends even further beyond the domain of interest. It can even begin to cover a domain that is not present in the query sequence. Once this happens, the weighting of the scoring matrix can influence the alignment so much that sequences not at all biologically related to the query start to be found as significant, resulting in an increase in false positives.

Ideally this is prevented by careful selection of which sequences to include in the generation of the PSSM, making sure that they do not have other domains near the boundaries of the alignment that might cause alignment extension. Our functional prediction page might help with this. But as this is a manual method, and domain information might not be present in the functional predictions, we have created a method to automatically reduce the likelihood of HOE occurring by masking boundaries at the edge of the original alignment.

At the moment this method applies only to PSI-Search, a tool that combines sensitive Smith-Waterman based local alignment with the PSI-BLAST profile construction strategy. It is controlled by the option 'HOE region masking', which is set to yes by default.

Filters

Filtering can eliminate statistically significant but biologically uninteresting reports from the BLAST output by masking segments of the query sequence that are non-specific for sequence similarity searches.
This leaves the more biologically interesting regions of the query sequence available for specific matching against database sequences. For example, you may want to mask acidic, basic or proline-rich segments of a protein that would otherwise yield overwhelming amounts of uninteresting, non-specific matches against a wide array of protein families.

The SEG program (Wootton and Federhen, 1993) masks regions of low compositional complexity, while XNU (Claverie and States, 1993) masks regions containing short-periodicity internal repeats. SEG+XNU combines the two. The DUST program by Tatusov and Lipman can only be used with DNA searches and masks simple repeats in DNA/RNA sequences.

Instruction: Perform a FASTA search against the UniProtKB/Swiss-Prot database for the filtertest sequence that the demonstrator has provided.
  Make sure to select more options and set the Histogram display to YES.
  You could also change the Expectation upper value to 0.001 to help make the results clearer.
  To see the histogram, go to the Tool Output tab in the results.

Question 15: Describe how the observed vs expected histogram looks. What does this mean? How many results have an alignment with an expect score better than 0.001?

Instruction: Repeat the search, but this time use the SEG filter from the more options parameters.
  Make sure that the Histogram display is still set to YES, and the expectation value to 0.001 if you want to compare clearly.

Question 16: Now how does the observed vs expected histogram look? How many results have an alignment with an expect score better than 0.001?

Database Composition

We have seen how the statistics can be skewed by low complexity in the query sequence, but the database composition can also affect the statistics of the result.
In order to detect homologous sequences, both FASTA and BLAST make the assumption that your database contains a wide range of unrelated sequences, so that the distribution of scores can be used to assess how likely an observed similarity is to be due to homology rather than just chance. This is a safe assumption for the main databases at EMBL-EBI, but there are some specialist databases that require different treatment.

Using the histogram output from FASTA is a good way of checking for problems with the statistics. Again, you are hoping to see a good correlation between the observed distribution of scores and the expected distribution over most of the score range.

If there is a lack of correlation due to the database, we can correct for this by effectively bulking out the database with randomly shuffled versions of a sample of the database entries. This gives us a fuller range of scores and allows us to statistically score our real hits more accurately.

Unfortunately, this is not possible using BLAST, but in the FASTA tools (including SSEARCH, GGSEARCH etc.) there is an option under the more options button called 'Statistical estimates'. Here you can select shuffle modes, for example Regress/shuf.

Instruction: Perform a FASTA search against the IMGT/HLA database for the allele sequence that the demonstrator has provided.
  Make sure to select more options and set the Histogram display to YES.
  To see the histogram, go to the Tool Output tab in the results.

Question 17: Describe how the observed vs expected histogram looks. What does this mean? How many results have an alignment with an expect score better than 0.001?

Instruction: Repeat the search, but this time change the statistical estimates to use a shuffled method.
  Make sure that the Histogram display is still set to YES.

Question 18: Now how does the observed vs expected histogram look? How has this affected the results?
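The idea of bulking out a database with shuffled entries can be illustrated with a short sketch. The FASTA programs do this internally when a shuffle mode is selected, so the code below is only an assumed, simplified picture of the principle: each decoy keeps the length and residue composition of a real entry but has its residue order randomised, so any score it achieves against the query is due to chance. The function name and example sequences are hypothetical.

```python
import random


def shuffled_decoys(sequences, copies=1, seed=0):
    """Return shuffled copies of each sequence.

    Each decoy has exactly the same residues as its source sequence,
    in a random order. A fixed seed keeps the example reproducible.
    """
    rng = random.Random(seed)
    decoys = []
    for seq in sequences:
        for _ in range(copies):
            residues = list(seq)
            rng.shuffle(residues)  # in-place random permutation
            decoys.append("".join(residues))
    return decoys


# Toy sample of database entries (not real sequences).
sample = ["MKTAYIAKQR", "GGGAAAPPP"]
decoys = shuffled_decoys(sample, copies=2, seed=1)
for decoy in decoys:
    print(decoy)
```

Scoring the query against decoys like these gives a distribution of chance scores even when the real database entries are all closely related to one another.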
Vector Contamination

Another reason you might not get the results you are expecting is vector contamination, a common problem if your sequence is fresh from the sequencer. If you suspect this, one way to check is to search against a specialist dataset containing vector sequences only. At the EBI the EMVEC database does exactly this, and there is also an NCBI mode that performs this role.

Question 19: A student has given you two sequences and they have forgotten whether they have already trimmed them for vector contamination. Use the BLAST tools at EBI to determine whether they have vector contaminants or not.

  Sequence 1:
  Sequence 2:

Multiple Sequence Alignment

We have already seen how we can apply rigorous algorithms to align a pair of sequences, but what happens when you want to align more than two sequences? This is where multiple sequence alignment (MSA) comes in.

Ideally, a multiple alignment would carry out rigorous alignments between every possible combination of sequences, and then use this information to optimise a final alignment between all of the sequences. Unfortunately this Weighted Sum of Pairs method is incredibly computationally demanding!

  Sequences    Time
  2            1 second
  3            150 seconds
  4            6.25 hours
  5            39 days
  6            16 years

As a result, we have to use heuristics again to bring alignment times down to something that is viable.

One method is called progressive alignment. Here alignments of subsets of the sequences are carried out and then fixed, and further sequences are aligned to them. This builds up the multiple alignment in a tree fashion.

ClustalW2

ClustalW2 is an example of a progressive, tree based multiple sequence alignment tool. It performs a quick pairwise alignment of the sequences before fixing the highest scoring aligned pair and treating them as a single sequence.
Other sequences are then aligned to this and fixed in turn, building up a progressive alignment.

A guide tree is the term used for the tree created as part of a progressive alignment process, and is used to help order and arrange sequences to be added to a multiple sequence alignment. This is not a phylogenetic tree! A very common mistake is for people to use a guide tree as a phylogenetic tree.

The progressive approach works very quickly, but has some drawbacks, especially if the highest scoring pair is badly aligned, as this alignment error will propagate through the rest of the alignment.

ClustalW2 at the EBI can be found in the services section of the website or by searching for ClustalW2 via EBI Search.

Instruction: Navigate to the ClustalW2 page. You should see something like the following:

The layout is similar to the pairwise alignment tools we looked at earlier, with an input section and a parameters section. As usual, information about each parameter can be found by clicking the links above each parameter. Key options are described below.

  Alignment Type: ClustalW2 has the option of performing 'slow' or 'fast' initial alignments. Slow is already quite quick, so choose this in most cases.
  Matrix: The comparison matrix to be used to score alignments
  Gap open: Penalty for the first residue in a gap
  Gap extension: Penalty for each additional residue in a gap
  Gap distances: ClustalW has an additional gap separation penalty
  No end gaps: Exclude end gaps. When set to no, the gap separation penalty is ignored at the ends of the alignment
  Iteration: Iteration type
  Numiter: Maximum number of iterations to run
  Clustering: Neighbour Joining is the default clustering option, but UPGMA is available, which might help with very large numbers of sequences
  Output formats: What format you want the output file to be in
  Output order: Here you can choose to keep the original input order or to order by alignment

Results

When your job has finished, you should see the following:

Like other tools in our framework, there are several tabs to switch between different results pages. At the top of this screenshot there are links to the Input form (if you want to submit a new job), Web services (links to the relevant documentation for using ClustalW2 programmatically) and Help & Documentation.

Alignments

This page shows the alignment, along with a button to download or show the alignment text file, and another button to colour the sequences (if they are protein) according to their physico-chemical properties:

  Colour     Property                                              Residues
  Red        Small (small + hydrophobic (incl. aromatic - Y))      AVFPMILW
  Blue       Acidic                                                DE
  Magenta    Basic - H                                             RK
  Green      Hydroxyl + sulfhydryl + amine + G                     STYHCNGQ
  Grey       Unusual amino/imino acids etc                         Others

The key is similar to that which we have seen before for other alignment results. However, all the sequences are lined up together vertically, and consensus symbols are displayed at the bottom of the columns. Gaps in a sequence are displayed with a '-' character.
Where all sequences have the same residue in a column, a '*' character is displayed beneath the column. Where similar residues line up there is a ':' character. Where less well conserved substitutions are made they are marked with a '.'.

Result Summary

The result summary page lists the files that the program produces and displays the scores table from ClustalW2, which lists the alignment scores for each pair of sequences used to make up the multiple alignment. Note that these are not the most accurate pairwise scores (for this, use a pairwise alignment tool that can produce optimal alignments), nor do they represent the scores of the final (multiple) alignment.

There is also a button to launch Jalview (if your browser has Java installed and up to date).

Jalview

Jalview is a standalone multiple sequence alignment viewer that allows for more useful viewing than simply looking at the text output of a ClustalW alignment. It can be downloaded and run on its own. However, at the EBI we have incorporated an applet version of it into the website, so all you need to do is click the Jalview button on the results page and, assuming Java is set up correctly on your machine, Jalview should eventually load with your multiple sequence alignment ready for viewing.
Jalview is quite a powerful tool, with more options than we can go into in this document. Full documentation can be found at the Jalview homepage: http://www.jalview.org/

The graphs under the alignment represent various properties:

  Conservation: measures the number of conserved physico-chemical properties for each column
  Quality: measures the likelihood of observing mutations in a particular column. A high score suggests there are no mutations, or that the mutations found are favourable according to the BLOSUM 62 matrix
  Consensus: shows the most common residue per column and the percentage of sequences that contain this residue. By default gaps are included in this calculation

Guide Tree

The next tab is the guide tree. This tab displays any guide trees produced by the MSA tool. Please note that this is NOT a phylogenetic tree! You can download the data for the tree via the button here, or from the link in the Result Summary tab.

Phylogenetic Tree

The next tab is the phylogenetic tree. Unlike the guide tree, this is a real, though basic, phylogenetic tree. It is generated by separately running the Clustalw_phylogeny tool; see later in this manual for details.

Submission Details

This tab contains all the information about how your job was submitted to the servers, including the command line that was run and all the parameters. This is very useful if you want to replicate a job on a local machine, and the information on this page is also useful to us in Support if you need help with a problem.
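The column-wise summaries described in this Results section (the '*' symbol under fully conserved columns and the Jalview Consensus track) can be approximated with a short sketch. It is an illustration under stated assumptions, not the ClustalW2 or Jalview code: the alignment is given as equal-length strings, only the '*' symbol is computed (the ':' and '.' symbols need residue similarity groups that are not modelled here), and the function names are hypothetical.

```python
from collections import Counter


def column_consensus(alignment, include_gaps=True):
    """Most common residue per column and the percentage of rows holding it.

    With include_gaps=True the percentage is taken over all rows, so gaps
    lower it; with include_gaps=False only non-gap rows are counted.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows must have the same length")
    result = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            result.append(("-", 0.0))
            continue
        residue, count = Counter(residues).most_common(1)[0]
        denominator = len(column) if include_gaps else len(residues)
        result.append((residue, 100.0 * count / denominator))
    return result


def identity_line(alignment):
    """'*' under columns where every row has the same residue, else a space."""
    return "".join(
        "*" if len(set(column)) == 1 and column[0] != "-" else " "
        for column in zip(*alignment)
    )


# Toy alignment (not real data).
alignment = ["MKT-LV", "MKTALV", "MRT-LI"]
print("\n".join(alignment))
print(identity_line(alignment))
print(column_consensus(alignment))
```

In the toy alignment the fourth column holds a single A and two gaps, so its consensus percentage is one third when gaps are included but 100% when they are excluded, which shows why the gap setting matters when reading the Consensus track.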
Instruction: Try running your own ClustalW2 alignment using the provided sequences:
  You can use email, but the job should be quick enough to run interactively.
  Leave the parameters set at their defaults.
  Turning on colours may make it easier to see regions of similar properties, or you can use Jalview to display the alignment.

Question 20: This example includes the two sequences we tried to align earlier in the roadshow. Does the multiple alignment give any insight into the result we achieved before?

So we have tried a fairly simple (and small) multiple sequence alignment. The next few alignments with ClustalW2 replicate the errors that people ask us for help with, so you know what to do if you see them!

Question 21: Perform a ClustalW alignment using the file 'Problem_MSA1.fsa'. What is the error message shown? What is wrong with our input that caused this error?

Question 22: Perform a ClustalW alignment using the file 'Problem_MSA2.fsa'. What is the error message shown? What is wrong with our input that caused this error?

Question 23: Perform a ClustalW alignment using the file 'Problem_MSA3.fsa'. What is the error message shown? What is wrong with our input that caused this error?

Question 24: Perform a ClustalW alignment using the file 'Problem_MSA4.fsa'. What is the error message shown? What is wrong with our input that caused this error?

Other Multiple Sequence Alignment Tools

While ClustalW2 is currently one of the most popular MSA tools, it has a number of drawbacks, notably the danger of propagating errors from the initial alignments throughout the whole alignment, and the way it deals with unusual alignments. Nonetheless its speed and wide usage make it a useful tool.

Clustal Omega

Clustal Omega is the latest tool from the Clustal authors.
It uses a number of new techniques to significantly improve alignments over ClustalW2, including seeded guide tree generation and HMM-HMM alignments.

Other MSA tools are available at the EBI to enable you to perform more accurate alignments, or at least to compare the results between different alignments. Some of these are mentioned below.

T-Coffee

T-Coffee is a tree-based variant of the COFFEE tool, which aims to keep some of the accuracy while enabling it to be run on a viable timescale. It still makes high demands on computer hardware, however, and large jobs can take a very long time to run!

MUSCLE

MUSCLE is another progressive alignment tool, but it goes about things in a much cleverer way than ClustalW, with the result that accuracy is claimed to be higher for the same or better speed.

MAFFT

MAFFT uses Fast Fourier Transforms to perform accurate and fast alignments.

Kalign

Kalign uses an approximate string matching algorithm to estimate sequence distances very rapidly, concentrating on local regions rather than globally aligning, and is the fastest algorithm we offer for large numbers of sequences.

The tools can all be found either from the services pages or from the Multiple Sequence Alignments page at the EBI (www.ebi.ac.uk/Tools/msa):

The way the tools are used is very similar to the way ClustalW2 is used. You should be able to launch Jalview for most of our MSA tools. If the option is missing, you can launch Jalview from another tool and paste the alignment from your tool into it to view the alignment.
Instruction: Try running your own alignments using the provided sequences and trying out different MSA tools:
  Choose from any or all of Clustal Omega, MUSCLE, MAFFT, T-Coffee, Kalign.
  See if you can tell any difference in running speed (you might not be able to, as this is a very short alignment).
  Compare the alignment results with other tools including ClustalW2.
  You might find it easier to use Jalview to compare several alignments.
  You can cut/paste sequences in Jalview to re-order them.

Question 25: Note any comments you have about the different alignment tools:

Phylogeny

So far we have been looking at multiple sequence alignment, which is an attempt to align three or more sequences as accurately as possible. Multiple sequence alignments help identify regions of conservation between sequences, but they do not directly model evolutionary relationships between sequences.

Phylogeny is (a hypothesis of) the evolutionary history between sequences, and can be examined via phylogenetic analysis. This is too large a topic to go into on this course, but a crucial part of any phylogenetic analysis is a good multiple sequence alignment.

ClustalW2 - Phylogeny

At EMBL-EBI we do not yet offer comprehensive phylogenetic analysis programs, but there is a basic phylogenetic analysis (neighbour-joining) and tree drawing package included with the ClustalW2 program, which we have made available at http://www.ebi.ac.uk/Tools/phylogeny/clustalw2_phylogeny/

It is used in a similar manner to our other MSA tools, but instead of starting with unaligned input, the input for phylogenetic analysis must be pre-aligned sequences, i.e. the output from any of our MSA tools (or your own alignment). The input does not have to be from ClustalW2; any multiple sequence alignment in a known format will work.
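To see why the input must be pre-aligned, it helps to look at the kind of quantity a neighbour-joining method starts from: a matrix of pairwise distances read column by column from the alignment. The sketch below computes a simple proportion-of-differences distance (often called a p-distance) and skips any column where either sequence has a gap. It is an assumed illustration only; the distance calculation and correction options that ClustalW2 actually uses may differ, and the function names and sequences are hypothetical.

```python
def p_distance(a: str, b: str) -> float:
    """Proportion of differing residues over ungapped aligned columns."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # ignore columns with a gap in either sequence
        compared += 1
        if x != y:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differences / compared


def distance_matrix(alignment: dict) -> dict:
    """Pairwise p-distances keyed by (name, name) for an aligned set."""
    names = list(alignment)
    return {
        (p, q): p_distance(alignment[p], alignment[q])
        for p in names
        for q in names
    }


# Toy pre-aligned input (not real data).
aligned = {"seqA": "MKT-LV", "seqB": "MKTALV", "seqC": "MRTALI"}
matrix = distance_matrix(aligned)
for pair, dist in sorted(matrix.items()):
    print(pair, round(dist, 3))
```

Because each distance is read directly from the columns, any misalignment in the input changes the distances and therefore the tree, which is why a good multiple sequence alignment matters so much here.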
The default output of the ClustalW2 - Phylogeny package is a tree file and Phylogram representation:

Note that the Phylogram requires Java to display. Modern browsers might disable Java by default, so if you have problems displaying the tree image check these settings.

Saving phylogenetic trees

Unfortunately, because the tree image is dynamically generated by Java, you cannot download it as a simple image. To create an image, either take a screenshot of your browser, or download the tree data, which is in a standard tree format, and use third-party tree viewing software to create a tree image.

WebPRANK

So far the tools we have looked at for MSA tend towards roughly similar behaviours. It has been speculated that this might be because they are usually benchmarked against only a few specific datasets, and thus they tend to be optimised towards high scoring results for those tests. They also tend to favour multiple independent deletions over insertions, leading to sequences that shrink in length over evolution, which is not a view backed up by evolutionary evidence.

PRANK is a tool which tries to address these shortcomings by using phylogenetic information to keep track of deletions as they occur through the sequence evolution. PRANK was developed by the Goldman group (and Ari Löytynoja in particular) at the EBI, so it has its own page as part of the Goldman research group pages, which can be found at http://www.ebi.ac.uk/goldman/.

Instruction: Navigate to http://www.ebi.ac.uk/goldman-srv/webprank/

As you can see, it looks a little different from our general sequence analysis tools. It should open with the Sequence input and submission section open, and it is here that you can paste or upload your sequence. This is also where the Start alignment button is located.
To access the options for changing parameters, you have to click on the links below the input section. The previously open section will contract and the new section will open and allow you to view or make changes. You can also retrieve previously submitted jobs, or use the webPRANK tools to view alignments from another source (or that were previously saved).

Results

Once the job has run you have several options. You can open the results in the browser, open them in the webPRANK viewer, or download the results. The job ID is also listed, should you want to note it down to retrieve the job at a later date.

The webPRANK viewer allows you to view the alignment interactively, as well as the phylogenetic information that has helped inform the alignment. There is also a reliability score which allows you to remove sites with lower reliability, either based on the currently selected node or on the lowest score. This will mask portions of the alignment, allowing you to export just the higher reliability sections.

Instruction: Try running your own webPRANK alignment using the provided sequences.
  Input the sequences in the top section.
  Have a look at the different options available in the other sections, but keep to the defaults for this run.
  When the job is finished you can view the results with several methods; try the webPRANK viewer.

Question 26: How does the alignment in webPRANK compare with ClustalW2? How long is the alignment? What are the likely reasons for this?
Getting HELP

Read the database Documentation
Frequently Asked Questions: http://www.ebi.ac.uk/help/faq.html
EBI Support: http://www.ebi.ac.uk/support/
Hands-on training programme: http://www.ebi.ac.uk/training/handson/

Related articles from the EBI

The IMGT/HLA database [http://dx.doi.org/10.1093/nar/gks949]
PSI-Search: iterative HOE-reduced profile SSEARCH searching [http://dx.doi.org/10.1093/bioinformatics/bts240]
A new bioinformatics analysis tools framework at EMBL-EBI [http://dx.doi.org/10.1093/nar/gkq313]
The European Bioinformatics Institute's data resources [http://dx.doi.org/10.1093/nar/gkp986]
Web services at the European Bioinformatics Institute-2009 [http://dx.doi.org/10.1093/nar/gkp302]
The Universal Protein Resource (UniProt) in 2010 [http://dx.doi.org/10.1093/nar/gkp846]
The IntAct molecular interaction database in 2010 [http://dx.doi.org/10.1093/nar/gkp878]
The Gene Ontology in 2010: extensions and refinements [http://dx.doi.org/10.1093/nar/gkp1018]
The Proteomics Identifications database: 2010 update [http://dx.doi.org/10.1093/nar/gkp964]
InterPro: the integrative protein signature database [http://dx.doi.org/10.1093/nar/gkn785]
Reactome knowledgebase of human biological pathways and processes [http://dx.doi.org/10.1093/nar/gkn863]

Further reading

Needleman, S. B. and Wunsch, C. D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453
Smith, T. F. and Waterman, M. S. (1981) Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197
Pearson, W. R. and Lipman, D. J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. U. S. A. 85, 2444–2448
Ning, Z., Cox, A. J. and Mullikin, J. C. (2001) SSAHA: a fast search method for large DNA databases. Genome Res. 11, 1725–1729
Kent, W. J. (2002) BLAT: the BLAST-like alignment tool. Genome Res. 12, 656–664
Dayhoff, M. O., Schwartz, R. M. and Orcutt, B. C. (1978) A model for evolutionary change in proteins. In Atlas of Protein Sequence and Structure (Ed. Dayhoff, M. O.) Vol. 5, pp. 345–352 (National Biomedical Research Foundation)
Henikoff, S. and Henikoff, J. G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. U. S. A. 89, 10915–10919
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990) Basic local alignment search tool. J. Mol. Biol. 215, 403–410
Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389–3402
Lopez, R., Silventoinen, V., Robinson, S., Kibria, A. and Gish, W. (2003) WU-Blast2 server at the European Bioinformatics Institute. Nucleic Acids Res. 31(13), 3795–3798
Mackey, A. J., Haystead, T. A. and Pearson, W. R. (2002) Getting more from less: algorithms for rapid protein identification with multiple short peptide sequences. Molecular and Cellular Proteomics 1(2), 139–147
Brown, N. P., Leroy, C. and Sander, C. (1998) MView: a web-compatible database search or multiple alignment viewer. Bioinformatics 14(4), 380–381
Thompson, J. D., Plewniak, F., Thierry, J. C. and Poch, O. (2000) DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Res. 28(15), 2919–2926
Goujon, M., McWilliam, H., Li, W., Valentin, F., Squizzato, S., Paern, J. and Lopez, R. (2010) A new bioinformatics analysis tools framework at EMBL-EBI. Nucleic Acids Res. 38(Suppl. 2), W695–W699
Gonzalez, M. W. and Pearson, W. R. (2010) Homologous over-extension: a challenge for iterative similarity searches. Nucleic Acids Res. 38(7), 2177–2189