EMBL-EBI
Introduction to Sequence Searching and Alignments, 2013

Contents

USING ALIGNMENT TOOLS AT EMBL-EBI
PAIRWISE ALIGNMENT TOOLS
  Step 1 – input
  Step 2 – parameters
  Step 3 – submit
RESULTS
  Results page
  Alignment
  Submission details
  Implementing the Methods for Sequence Searching Tools: BLAST
  Implementing the Methods for Sequence Searching Tools: FASTA
  BLAST & FASTA Sensitivity
  Sequence Searching Similarity tools at the EBI
USING FASTA
  Step 1 – database selection
  Step 2 – input
  Step 3 – parameters
  Step 4 – submit
RESULTS
  Summary Table
  Tool Output
  Visual Output
  Functional Predictions
  Submission Details
INTERPRETING AN ALIGNMENT
USING BLAST
  NCBI BLAST parameters
  WU-BLAST parameters
BLAST RESULTS
DIFFERENCES BETWEEN BLAST AND FASTA
  When to use what?
PSI-BLAST
  PSI-BLAST Threshold
PROBLEMS WITH ITERATIVE SEARCHES
HOMOLOGOUS OVER-EXTENSION (HOE)
FILTERS
VECTOR CONTAMINATION
GETTING HELP
RELATED ARTICLES FROM THE EBI
FURTHER READING

Introduction to Sequence Searching and Alignments - Andrew Cowley, EMBL-EBI

Information: Course materials (including a copy of this manual) can be downloaded from: http://www.ebi.ac.uk/~apc/Courses/Faroe/

Using alignment tools at the EBI

There are a variety of sequence alignment tools available at EMBL-EBI. The simplest are pairwise alignment tools, which are designed to align two sequences. To align three or more sequences together you must use a multiple sequence alignment tool.

Pairwise alignment tools

Instruction: Navigate to the EBI pairwise sequence alignment page

You can either type in an address directly (www.ebi.ac.uk/Tools/psa/), search with the term PSA, or go through the services list as demonstrated. You will be faced with a choice of programs, split into global, local or genomic alignments. Once you choose a program, the input page looks like this:

Input for pairwise alignments is very simple.

Step 1 – input
Here you can enter your two sequences. Most common formats are recognised, but please don't try to invent your own using Word! You can either paste or type the sequence directly, or upload a file containing it from your computer using the browse button.
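As an illustration of one widely used input format, the sketch below shows two sequences in FASTA format and a small Python function that splits such text into records. The sequences, the identifiers (`seq_a`, `seq_b`) and the function name `parse_fasta` are made up for this example, and the parser is a simplified assumption, not the validation that the EBI servers perform.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A '>' line starts a new record; store the previous one first.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical sequences, one record per input box on the form.
example = """>seq_a hypothetical protein fragment
MKTAYIAKQR
QISFVKSHFS
>seq_b hypothetical protein fragment
MKTAYLAKQR
"""

for name, seq in parse_fasta(example):
    print(name.split()[0], len(seq))
```

Each record starts with a '>' line holding the identifier and description, followed by one or more lines of sequence. Plain text like this is what the tools expect, which is why word-processor files cause problems.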
Step 2 – parameters
The program will be set up with some default parameters, but you can change them if you wish. You can click on any parameter title to get help on it. The main options are:

Matrix: The comparison matrix used to score the alignment
Gap open: Penalty to start a gap in the alignment
Gap extend: Penalty for each base or residue in the gap

Step 3 – submit
When you're happy with everything else, select submit to run the job.

Results

Results page
The results page is fairly simple. At the top of the page are some tabs that switch between the alignment, submission details, and the form to submit a new job. The parameters used for the job are shown, and there is a link to the output file. The output file text is also displayed further down the page.

Alignment
This page shows the results of the alignment. The table gives a summary of the results:

Length: the length of the alignment.
Identity: the number of identical residues that are found aligned between the two sequences.
Similarity: the number of aligned residues that score positively in the substitution matrix (i.e. similar types of residues).
Gaps: the number of gaps inserted into the alignment.
Score: the literal score of the alignment as worked out by the algorithm.

Then the alignment itself is displayed. The top line is the first sequence and the bottom line is the second sequence. Gaps in a sequence are displayed with a '-' character. Where two identical residues line up they are connected with a '|'; where two very similar residues line up they are connected with a ':'; and where less well conserved substitutions are made they are connected with a '.'.

Submission details
This page gives details of how the program was run.
It tells you what version of the tool was run, when it was launched, details of your input and the tool output, as well as the original command line used to launch the job and details of the selected parameters. These can all be useful if you need to recreate a job on your local machine, or to repeat an alignment in the future. This page is also very useful to us if you have a problem and need to contact us.

Instruction: Try running your own global alignment using the provided sequences:
- Use the EMBOSS needle program
- Leave the parameters set at their defaults

Question 1 (global): What is the Length, Score, %identity, %similarity and %gaps of the alignment?

Length    Score    Identity%    Similarity%    Gaps%

Now let's try a local alignment.

Instruction: Try running your own local alignment using the provided sequences:
- This time use the EMBOSS water program
- Leave the parameters set at their defaults

Question 2 (local): What is the Length, Score, %identity, %similarity and %gaps of the alignment?

Length    Score    Identity%    Similarity%    Gaps%

Question 3: In words, how would you describe the key differences between the global and local alignment results? Can you think of ways to improve the alignment?

Now let's try changing the matrix parameter.

Instruction: Try running your own local alignment using the provided sequences:
- Use the EMBOSS water program
- Change the matrix to BLOSUM 40, and leave the other parameters at their defaults

Question 4: What is the Length, Score, %identity, %similarity and %gaps of our new alignment? Can you describe what has happened?
Length    Score    Identity%    Similarity%    Gaps%

Alignment against databases

While it's possible (and very accurate) to run optimal alignments against databases (using SSEARCH or GGSEARCH at the EBI, for example), the computational requirements are such that it takes a very long time and uses a large amount of memory. It is more practical to use a heuristic-based method such as BLAST or FASTA.

Implementing the Methods for Sequence Searching Tools: BLAST

BLAST stands for Basic Local Alignment Search Tool and was developed by Altschul and colleagues in 1990. BLAST uses an approximation of the Smith–Waterman algorithm, which makes it quite fast; however, this gain in speed is offset by a decrease in accuracy. Unlike the true Smith–Waterman algorithm, BLAST is not guaranteed to find the optimal alignment between your query sequence and the test sequences. However, it will find good alignments and provides a statistical means of gauging your confidence in each alignment:

(1) It searches for 'words' of a user-defined length (the shorter the word, the more sensitive the search).
(2) It then extends these words in both directions until it finds a mismatch.
(3) It then performs an approximation of the Smith–Waterman algorithm to create a gapped alignment between the query sequence and the test sequence.
(4) Finally it calculates and reports the probability of the alignment occurring by chance.

When doing a BLAST search you can set the "expectation threshold" (EXP THR), which establishes a statistical significance threshold for reporting database sequence matches. EXP THR is interpreted as the upper limit on the expected frequency of a chance occurrence of a match within the context of the entire database search; in other words, it sets an upper limit on the E-value.
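To make the idea of an expected number of chance matches concrete, here is a minimal Python sketch. It assumes the Karlin-Altschul form E = K * m * n * exp(-lambda * S) that is commonly used to describe BLAST statistics. That formula, the values of `k` and `lam`, the threshold, the hit names and the scores are not taken from this manual and are illustrative only. It is not the calculation performed on the EBI servers.

```python
import math


def expect_value(raw_score, query_len, db_len, k=0.041, lam=0.267):
    """Expected number of chance alignments scoring at least raw_score.

    k and lam are illustrative placeholder constants (assumption); real
    values depend on the scoring matrix and gap penalties in use.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


def reported(hits, query_len, db_len, exp_thr):
    """Keep only hits whose E-value does not exceed the expectation threshold."""
    kept = []
    for name, score in hits:
        e_value = expect_value(score, query_len, db_len)
        if e_value <= exp_thr:
            kept.append((name, score, e_value))
    return kept


# Hypothetical hits as (name, raw alignment score) pairs.
hits = [("hit_1", 250), ("hit_2", 80), ("hit_3", 40)]

for name, score, e_value in reported(hits, query_len=300, db_len=2e8, exp_thr=10.0):
    print(f"{name}\t{score}\t{e_value:.2g}")
```

In this model the E-value grows in proportion to the size of the database, so the same alignment score is less significant when a larger database is searched.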
Any database sequence whose BLAST alignment to the query sequence satisfies EXP THR is reported in the output file. An alignment with an E value of ≥1.0 is expected to be found at least once by chance in the searched database, and an E value of ≥5.0 is expected to be found at least five times (see figure below). Raising this threshold increases the likelihood of reporting distantly related matches, but with EXP THR set above 1.0 the frequency of chance matches reported will tend to grow at a much faster rate than that of real matches.

Implementing the Methods for Sequence Searching Tools: FASTA

David Lipman and William Pearson (1988) developed FASTA, which gets around the speed problem as follows:

(1) FASTA breaks the query and test sequences into overlapping words and looks for exact matches.
(2) Then it re-scores these matches using a substitution matrix.
(3) Next it tries to join the highest-scoring segments. A 'joining threshold' set by the user eliminates segments that are unlikely to be part of the alignment.
(4) Finally, FASTA uses the Smith–Waterman method to optimise the alignment, using only the part of the matrix that contains the top-scoring segments.

BLAST & FASTA Sensitivity

FASTA therefore provides a means of performing a sensitive search against a large database in a reasonable time. Nowadays, it is possible to approach the sensitivity of a FASTA search by using BLAST and setting high sensitivity values. However, the alignment statistics for FASTA can be considered more robust than those for gapped-BLAST. This is because FASTA produces and scores an alignment of the query sequence with a large sampling of the database, giving it a distribution of scores that represents the full range of alignments between the query and the database sequences. BLAST is fast because it does not bother producing an alignment for most database sequences; alignments are only triggered if the initial word-match criteria are met. Consequently, BLAST does not have a complete distribution of alignment scores over the database from which to calculate the significance of the reported matches. Instead, BLAST uses pre-computed values for the score distribution rather than calculating values that are specific to the search carried out.

Sensitivity: There is a trade-off between sensitivity and search speed. Increasing the sensitivity makes the search more computationally intensive and therefore slower. Decreasing the sensitivity, for example when you are looking for almost exact matches, can dramatically increase the search speed. There is also a trade-off between sensitivity and specificity: increasing sensitivity tends to decrease specificity (greater propensity for chance matches). For example, if you are looking for vector contamination you will choose low sensitivity, whereas if you are looking for distantly related sequences you will opt for high sensitivity.

Sequence Searching Similarity tools at the EBI

The EBI offers several sequence similarity searching tools at www.ebi.ac.uk/Tools/sss. These are maintained by the External Services (ES) team. The ES team puts considerable effort into providing you with state-of-the-art tools for sequence searching, and with the flexibility to tailor your queries to the most appropriate search parameters. You will therefore find several options, not only in the types of tools but also, within each tool, in the number of variables and parameters you can change.

We have talked about BLAST and FASTA but what are all the variations above?
A quick way to distinguish among these tools is to label them as either heuristic or rigorous. A fundamental challenge in computer science is to make algorithms that find provably good solutions within a provably bounded amount of computation time. A heuristic algorithm gives up one or both of these goals. In other words, a heuristic is an algorithm that can produce an acceptable solution to a problem in many practical scenarios, but for which there is no formal proof of correctness. Heuristics are typically used when there is no known method to find an optimal solution, either under the given constraints (of time, space etc.) or at all. Rigorous, on the other hand, means applying an algorithm that is proven to give the optimal solution. It is therefore exhaustive and should provide you with the best answer, but it is slow. Following these definitions, the Sequence Searching Tools mentioned above are labelled as shown on this panel:

Using FASTA

Let's have a go at using FASTA for ourselves.

Instruction: Navigate to the EBI sequence search tools page

You can either type in an address directly (www.ebi.ac.uk/Tools/sss/), search with the term FASTA, or go through the services list as demonstrated.

FASTA is launched by picking a type of database listed next to the FASTA tool. This sets up the defaults appropriately for your choice, but you can always change the database once in the tool if you wish.

Instruction: Select the Protein database for FASTA

The framework from which we launch our tools was revamped recently, and now forms a common basis for many programs. For more information about this framework see: A new bioinformatics analysis tools framework at EMBL–EBI (2010) Goujon et al. doi:10.1093/nar/gkq313

You should end up at a screen that looks like the following:

This screen will hopefully look quite familiar to you when conducting other searches as well. There are four steps to submitting a job.

Step 1 – database selection
Here is where you select which databases to search against. You can select more than one, and you can also expand subsections to narrow your search, by taxonomic division for example. If you change your mind and want to do a different type of search (for example, against nucleotide sequences) then you can select that here as well and it will reset the form.

Step 2 – input
Here you can enter your query sequence. Most common formats are recognised, but please don't try to invent your own using Word! You can either paste or type the sequence directly, or upload a file containing it from your computer using the browse button.

Step 3 – parameters
Most important here is the choice of program. FASTA is grouped together with others like SSEARCH (because they were authored by the same person and come in the same package). It should be set up correctly according to the button you pressed to get to this page.

The program will be set up with some default parameters, but to look at these or change them you need to click the 'More options' button. You can click on any parameter title to get help on it.
Here is a summary:

Matrix: The comparison matrix to be used to score alignments when searching the database
Gap open: Penalty to start a gap in the alignment
Gap extend: Penalty for each base or residue in the gap
Ktup: 'Word size' used to identify runs in the first stage of alignment
Expectation upper value: Allows you to ignore results whose expectation score is above a certain value (i.e. more distant matches)
Expectation lower value: Allows you to ignore results below a certain expectation (i.e. ignore close relatives)
DNA strand: When searching DNA you can specify which strand is used. By default both are searched
Histogram: Turn on/off the display of the statistical histogram in FASTA results
Filter: Which low complexity filter to use
Statistical estimates: Which statistical method to use to evaluate values used in the Expect score calculation
Scores: Maximum number of scores reported in the summary
Alignments: Maximum number of alignments reported in the summary
Sequence range: Allows you to specify which portion of the query sequence to use in the search
Database range: Allows you to cut back on the database sequences searched against by specifying a range for the number of residues

Step 4 – submit
When you're happy with everything else, here is where you click to submit the job. For longer runs it is recommended that you tick the box to send you an email when the job is complete. This will contain a link back to the results so there is no need to keep your browser open. Email jobs are usually stored on the servers for longer as well, while interactive job results are deleted more quickly.

Once you've submitted your job, the first thing that happens is that the input is validated. Few things are more frustrating than preparing a job and firing it off, only to check your email later in the day and find that you made a minor mistake somewhere and the job failed.
If everything is okay then the job will run and you will see a job running screen if you ran it interactively. Eventually the job will finish and you will either be taken to the results page automatically or emailed a link to it.

Results

Summary Table
There is a lot of information to take in on the results page! You are first presented with the summary table, which quickly lists the top results in a table format. You can change the ordering of the table by clicking the arrows next to each column header; by default the results are ranked by Expect value or E().

- The first column contains a tick box which allows you to select database results for further actions, for example to view the sequence annotation or the detailed alignment with the query sequence. You can also use the buttons to clear selections, select all, or invert the selection. To download the selected sequences click the download button.
- The second column (DB:ID) gives the database ID of the sequence.
- The third column (Source) gives some quick information about the sequence, as well as cross references that have been found referring to it in other resources across the EBI, so you can quickly look up more information about the sequence in these resources.
- Length reports the length of the database sequence.
- Score gives the literal score of the alignment.
- Identities reports the number of identical residues that are found aligned between the query and database sequence.
- Positives reports the number of aligned residues that score positively in the substitution matrix (i.e. similar types of residues).
- E() gives the Expect score for the alignment. This is a measure of how likely you are to find that alignment by chance. When the numbers are very small it reports them as 1.0E-10, for example. This is the same as 1e-10, or 0.0000000001.
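The Identities column and the E() notation can be illustrated with a short Python sketch. The aligned sequences, hit names and E() strings below are invented for the example, and the counting rule (identical non-gap characters in the same column, divided by the aligned length) is a simplifying assumption rather than the exact calculation used by FASTA.

```python
def alignment_identity(aligned_query, aligned_subject, gap="-"):
    """Return (identities, aligned_length, percent_identity) for two aligned strings."""
    if not aligned_query or len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must be non-empty and of equal length")
    identities = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != gap
    )
    length = len(aligned_query)
    return identities, length, 100.0 * identities / length


# The three notations mentioned above denote the same number.
assert float("1.0E-10") == 1e-10 == 0.0000000001

# Hypothetical summary-table rows as (DB:ID, E() string) pairs.
rows = [("hit_b", "3.2E-05"), ("hit_a", "1.0E-10"), ("hit_c", "0.84")]

# Rank by E() after converting the text to a number: smallest first.
ranked = sorted(rows, key=lambda row: float(row[1]))
print([name for name, _ in ranked])

print(alignment_identity("MKTAYIAK-QR", "MKTAYLAKEQR"))
```

Sorting on the numeric value rather than on the text matters, because "1.0E-10" and "0.84" do not order correctly as strings.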
Tool Output
This tab switches the view to the raw, original output from FASTA. This can be useful when you want to view the full text output from the program, in case it contains something the summary or other pages don't cover. You can download it as a text or XML file. Clicking on the icon will jump you straight to the alignment. Clicking on the sequence ID will take you to the original database entry for that sequence.

Visual Output
This output gives you a nice way of visualising which portions of the sequence are aligning, as well as colour-coding the alignment by Expect score. Hovering your mouse over the sequence ID on the left-hand side will show a guide box around the alignment. Clicking the mouse will take you to the alignment (from the raw output of the program).

Functional Predictions
This tab is another example of how we are bringing together different resources at the EBI to give you extra information. It searches a variety of resources for family and domain predictions and shows you the results graphically, so you can easily see which portions of your sequence and alignments correspond to these features. Again, you can click on the links on the left-hand side to jump to the entry information at the resource.

Submission Details
This page contains all the original parameters used to launch your job, together with easy links to the exact input used and the output results. This information is really useful for several reasons: if you want to repeat a job then you might want to use exactly the same parameters; if you're interested in running the command line version of the tool then this will give you the exact command line used; and finally, if you need help or support then this page contains all of the information you need to give us to be able to help you quickly.
Interpreting an alignment

The figure below shows a typical alignment.

Key:
  -  Gap
  :  Identity
  .  Similarity
  X  Filtered

The header shows some information about the database sequence, followed by some of the raw scores from the program itself. The key bits of information are the E() score and the % identity and % similarity numbers. Below that is the actual alignment itself: the top line is the query sequence and the bottom line is the database sequence. Gaps in a sequence are displayed with a '-' character (there are none in the alignment shown). Where two identical residues line up they are connected with a ':'; where two similar residues line up they are connected with a '.'.

Instruction: Try running your own FASTA search using the provided sequence:
- Search against the UniProtKB/Swiss-Prot database only. If you search against the full UniProt Knowledgebase it will take a very long time!
- Make sure the FASTA program is selected
- Leave the parameters set at their defaults

Question 5: What are the default gap open, gap extend, ktup and matrix parameter settings for this search?

Question 6: What can you say about the likely function of this protein?

Question 7: What are the DB:IDs, Scores, %identities, %positives and E() of the top two results?

ID    Score    Identity%    Positives%    E()

Question 8: Have a look at the actual alignments with the top two results. What can you say about them?

Using BLAST

Now that we are familiar with running FASTA searches, using BLAST should be very easy, because the interface is effectively the same. Simply select a database next to the BLAST tool of interest to enter the tool.
NCBI BLAST: This version of BLAST is the version maintained at the NCBI
WU-BLAST: This version of BLAST was created by Dr Warren Gish, formerly of Washington University

Both versions can trace their history back to the same algorithm, but were developed separately, often implementing ideas that first appeared in the other version. The parameters are handled slightly differently in each case: NCBI BLAST provides access to parameters like those used by FASTA or SSEARCH (e.g. gap open, gap extend). WU-BLAST hides direct access to those, but instead provides a sensitivity parameter which combines several adjustments at different stages of the algorithm. The raw results are slightly different between the two, but because we parse the results they will appear the same to you (unless you look at the Tool Output page).

NCBI BLAST parameters

Matrix: The comparison matrix to be used to score alignments when searching the database
Gap open: Penalty to start a gap in the alignment
Gap extend: Penalty for each base or residue in the gap
Exp. Thr (expectation threshold): Allows you to ignore results whose expectation score is above a certain value (i.e. more distant matches)
Filter: Which low complexity filter to use
Drop off: Controls how far a potential HSP is allowed to extend
Scores: Maximum number of scores reported in the summary
Alignments: Maximum number of alignments reported in the summary
Sequence range: Allows you to specify which portion of the query sequence to use in the search
Gap align: Allows gapped extensions of alignments
Alignment views: Options for formatting the alignment output (Tool Output)

WU-BLAST parameters

Matrix: The comparison matrix to be used to score alignments when searching the database
Exp. Thr (expectation threshold): Allows you to ignore results whose expectation score is above a certain value (i.e. more distant matches)
Filter: Which low complexity filter to use
View filter: Display any sequence filtered out (in Tool Output)
Sensitivity: General parameter affecting search sensitivity; this makes adjustments to several internal parameters
Scores: Maximum number of scores reported in the summary
Alignments: Maximum number of alignments reported in the summary
Sort: Choose which value to sort the Tool Output results by
Stats: Choice of statistical methods used in generation of Expect statistics
topcomboN: In WU-BLAST, HSPs are classified into a number of sets; you can use this parameter to restrict the display to only the N highest scoring sets
Alignment views: Options for formatting the alignment output (Tool Output)

BLAST results

Unsurprisingly, the results from BLAST runs on our servers are displayed in the same way as FASTA results, and all the tabs are equivalent. The main differences in format will only appear if you look at the raw output via the Tool Output page.

Key:
  -          Gap
  [residue]  Identity
  +          Similarity
  X          Filtered

The above picture shows the alignment section of an NCBI BLAST run. This time identical residues are actually listed between the two sequences, which are labelled Query for the query sequence and Sbjct (subject) for the database sequence. Similar residues (Positives) are indicated with a +. The number of gaps inserted into the overlapping sequence regions is also reported.

Instruction: Try running your own NCBI BLAST search using the provided sequence:
- Search against the UniProtKB/Swiss-Prot database only. If you search against the full UniProt Knowledgebase it will take a very long time!
- Make sure the BLASTP program is selected
- Leave the parameters set at their defaults

Question 9: What are the default gap open, gap extend, drop off and matrix parameter settings for this search?

Question 10: What are the DB:IDs, Scores, %identities, %positives and E() of the top two results?

ID    Score    Identity%    Positives%    E()

Question 11: Are these different from our FASTA search earlier?

Differences between BLAST and FASTA

The following table summarises the key differences between BLAST and FASTA.

BLAST                             FASTA
Fast                              Not as fast as BLAST
Good with proteins                Good with proteins and DNA
Might miss potential alignment    Aligns against all database sequences
Produces HSPs                     Produces S&W alignments
Good for siblings                 Good for cousins

When to use what?
In general, the larger the database the faster the algorithm you should use, and likewise the larger the query sequence the faster the algorithm you should use. For very small queries or databases, dynamic programming methods like SSEARCH can be great.

PSI-BLAST

Position-Specific Iterative BLAST, or PSI-BLAST, is a clever tool which allows you to create your own custom scoring matrix based on the conservation of residues you find in your own searches, rather than some model made with different sequences.

PSI-BLAST workflow
  1. Normal BLAST search

It starts with a normal BLAST search, but you can then select which sequences in the results will be used to build a profile. These sequences are then aligned, and conserved residues at each position are scored more highly in a new type of matrix which allows for different scores at different positions in the sequence. A new BLAST search is run with this matrix, called a Position Specific Scoring Matrix or PSSM.
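The idea of position-specific scores can be sketched in Python as follows. This is a deliberately simplified illustration: it assumes a gap-free toy alignment, uniform background frequencies and a simple pseudocount, none of which are taken from this manual, and it omits the sequence weighting and other refinements that PSI-BLAST itself applies. All names and sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log-odds score in bits."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(aligned_seqs) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        scores = {}
        for aa in AMINO_ACIDS:
            # Pseudocounts keep unseen residues from getting a score of -infinity.
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_with_pssm(pssm, segment):
    """Sum the position-specific scores for an ungapped segment."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Toy alignment standing in for the sequences selected from the first search.
selected_hits = ["GKSTL", "GKTTL", "GKSSI", "GRSTL"]
pssm = build_pssm(selected_hits)

# A residue conserved in every selected sequence scores higher than an unseen one.
print(pssm[0]["G"] > pssm[0]["A"])
print(score_with_pssm(pssm, "GKSTL") > score_with_pssm(pssm, "AKSTL"))
```

Unlike a single substitution matrix, the score for a given residue here depends on the column it falls in, so a mismatch at a conserved position costs more than one at a variable position.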
The results can themselves be used to create another PSSM for another run, and so the process is iterative. The aim of PSI-BLAST is to concentrate the alignment on positions that are important, while allowing for more variability in areas that are less important. A functional area or binding motif, for example, might be more important than sequence that forms part of a loop.

Searches made with a PSSM can find matches to sequences that scored too low to be considered in a normal BLAST search but that score more highly with the new matrix. These are marked as ‘new’ by the PSI-BLAST tool.

You can also save your search and continue it at a later date, or save the PSSM itself.

PSI-BLAST workflow (cont.)
2. Align selected results
3. Create PSSM
4. Use PSSM to score new BLAST alignment (go to Step 2 if required)

The parameters for PSI-BLAST are the same as for NCBI BLAST, with the addition of a new threshold:

PSI-BLAST Threshold: This expectation value controls the default selection of sequences used to generate the PSSM. Sequences with an expectation value higher than this (i.e. that don’t align as well) won’t be included.

Once the first iteration is run, additional controls appear that are not present on a normal BLAST result: the PSI-BLAST Threshold can be changed again, and individual sequences can be added to or removed from the selection by ticking the box in the first Summary Table column. The View Threshold Limit button jumps the view down the table so you can see the cut-off.

Controls to download a checkpoint file or PSSM only appear after the second iteration.

Instruction: Try running your own PSI-BLAST search using the provided sequence:
- Search against the UniProtKB/Swiss-Prot database only. If you search against the full UniProt Knowledgebase it will take a very long time!
- Make sure the PSI-BLAST program is selected.
- Leave the parameters set at their defaults.
- For the moment, stop after the first run.

Question 12: Looking at the first run (normal BLAST results), how many sequences score above our default threshold of 1.0e-3?

Instruction: For the second iteration you can choose which sequences to include in the PSSM generation.
- Untick the top scoring sequence (simply because it scores so much better than the other results; you wouldn’t necessarily do this normally).
- Leave everything else set to defaults.
- Click the ‘run next iteration’ button.

Question 13: Looking at the second iteration, how many sequences now score above our default threshold of 1.0e-3? (Hint: use the View Threshold Limit button.) What is a likely explanation?

Question 14: Have any new sequences been scored well enough to appear in our results?

Problems with iterative searches

Iterative searches like PSI-BLAST use profile construction methods to increase search sensitivity. However, they can inadvertently decrease selectivity, particularly if the profile becomes contaminated with information that is not relevant to homology with our query sequence. One situation where this occurs is described as Homologous Over-Extension (HOE).

Homologous Over-Extension (HOE)

Iterative search strategies using profiles (i.e. the PSSM in PSI-BLAST) can help increase the sensitivity of a search. The aim is for the profile to reflect areas of interest (a domain, for example), but there is a danger that it will be contaminated with information that is not relevant to your query. Low complexity regions are one example of this, but these can be dealt with by using filters. Another cause of contamination that was recently described is Homologous Over-Extension (HOE).
HOE can occur in a profile-based alignment when the alignment region picks up a portion of sequence that is not biologically relevant to our query but that is conserved in other sequences brought back by the search. The influence of this region on the scoring matrix can be such that the alignment region extends even further beyond the domain of interest. It can even begin to cover a domain that is not present in the query sequence. Once this happens, the weighting of the scoring matrix can influence the alignment so much that sequences not at all biologically related to the query start to be found as significant, resulting in an increase in false positives.

Ideally this is prevented by careful selection of which sequences to include in the generation of the PSSM, making sure that they do not have other domains near the boundaries of the alignment that might cause alignment extension. Our functional prediction page might help with this. But as this is a manual method, and domain information might not be present in the functional predictions, we have created a method to automatically reduce the likelihood of HOE occurring by masking boundaries at the edge of the original alignment. At the moment this method only applies to PSI-Search, a tool that combines sensitive Smith-Waterman based local alignment with the PSI-BLAST profile construction strategy. There it is enabled by setting the option ‘HOE region masking’ to yes (which is the default setting).

Filters

Filtering can eliminate statistically significant but biologically uninteresting reports from the BLAST output by masking out segments of the query sequence that are non-specific for sequence similarity searches. This leaves the more biologically interesting regions of the query sequence available for specific matching against database sequences.
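The masking idea described above can be illustrated with a short sketch. This is a deliberately simplified stand-in, not the SEG, XNU or DUST algorithm: it slides a fixed window along the query, measures the Shannon entropy of the residue composition in each window, and replaces low-entropy windows with X. The function names, the window size of 12, the entropy cut-off of 2.2 bits and the example query are all assumptions made for illustration.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (in bits) of the residue composition of a window."""
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.2):
    """Replace residues in low-entropy windows with 'X'.

    A simplified illustration only; window and min_entropy are
    hypothetical values, and real filters use more refined rules.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# Hypothetical query containing a proline-rich stretch of 14 residues.
query = "MKTAYIAKQRQISFVKSHFSRQ" + "P" * 14 + "LEERLGLIEVQAPILSRVGDGTQDNLSGAEK"
print(mask_low_complexity(query))
```

The masked residues (shown as X, as in the alignment key) no longer contribute to matches, so a compositionally biased stretch cannot generate large numbers of non-specific hits, while the rest of the query is searched as normal.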
For example, it may be desirable to mask acidic, basic or proline-rich segments of a protein that would otherwise yield overwhelming amounts of uninteresting, non-specific matches against a wide array of protein families.

The SEG program (Wootton and Federhen, 1993) masks low compositional complexity regions, while XNU (Claverie and States, 1993) masks regions containing short-periodicity internal repeats. SEG+XNU combines the two. The DUST program by Tatusov and Lipman can only be used with DNA searches and masks simple repeats in DNA/RNA sequences.

Instruction: Perform a FASTA search against the UniProtKB/Swiss-Prot database for the filtertest sequence that the demonstrator has provided.
- Make sure to select more options and set the Histogram display to YES.
- You could also change the Expectation upper value to 0.001 to help make the results clearer.
- To see the histogram, go to the Tool Output tab in the results.

Question 15: Describe how the observed vs expected histogram looks. What does this mean? How many results have an alignment with an expect score better than 0.001?

Instruction: Repeat the search, but this time use the SEG filter from the more options parameters. Make sure that the Histogram display is still set to YES, and the expectation value to 0.001 if you want to compare clearly.

Question 16: Now how does the observed vs expected histogram look? How many results have an alignment with an expect score better than 0.001?

Vector Contamination

Another reason you might not get the results you are expecting is vector contamination, a common problem if your sequence is fresh from the sequencer.
If you suspect this problem, one way to check is to search against a specialist dataset containing vector sequences only. At the EBI the EMVEC database does exactly this, and there is an NCBI mode to perform this role.

Question 17: A student has given you two sequences and has forgotten whether they have already been trimmed for vector contamination. Use the BLAST tools at EBI to determine whether they have vector contaminants or not.

Sequence 1:
Sequence 2:

Getting HELP

Read the database Documentation
Frequently Asked Questions: http://www.ebi.ac.uk/help/faq.html
EBI Support: http://www.ebi.ac.uk/support/
Hands-on training programme: http://www.ebi.ac.uk/training/handson/

Related articles from the EBI

PSI-Search: iterative HOE-reduced profile SSEARCH searching [http://dx.doi.org/10.1093/bioinformatics/bts240]
A new bioinformatics analysis tools framework at EMBL-EBI [http://dx.doi.org/10.1093/nar/gkq313]
The European Bioinformatics Institute’s data resources [http://dx.doi.org/10.1093/nar/gkp986]
Web services at the European Bioinformatics Institute-2009 [http://dx.doi.org/10.1093/nar/gkp302]
The Universal Protein Resource (UniProt) in 2010 [http://dx.doi.org/10.1093/nar/gkp846]
The IntAct molecular interaction database in 2010 [http://dx.doi.org/10.1093/nar/gkp878]
The Gene Ontology in 2010: extensions and refinements [http://dx.doi.org/10.1093/nar/gkp1018]
The Proteomics Identifications database: 2010 update [http://dx.doi.org/10.1093/nar/gkp964]
InterPro: the integrative protein signature database [http://dx.doi.org/10.1093/nar/gkn785]
Reactome knowledgebase of human biological pathways and processes [http://dx.doi.org/10.1093/nar/gkn863]

Further reading

Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443-453.
Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197.
Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. U. S. A. 85, 2444-2448.
Ning, Z., Cox, A.J. and Mullikin, J.C. (2001) SSAHA: a fast search method for large DNA databases. Genome Res. 11, 1725-1729.
Kent, J. (2002) BLAT: the BLAST-like alignment tool. Genome Res. 12, 656-664.
Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C. (1978) A model for evolutionary change in proteins. In Atlas of Protein Sequence and Structure (Ed. Dayhoff, M.O.), Vol. 5, pp. 345-352 (National Biomedical Research Foundation).
Henikoff, S. and Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. U. S. A. 89, 10915-10919.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol. 215, 403-410.
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389-3402.
Lopez, R., Silventoinen, V., Robinson, S., Kibria, A. and Gish, W. (2003) WU-Blast2 server at the European Bioinformatics Institute. Nucleic Acids Res. 31(13), 3795-3798.
Mackey, A.J., Haystead, T.A. and Pearson, W.R. (2002) Getting more from less: algorithms for rapid protein identification with multiple short peptide sequences. Molecular and Cellular Proteomics 1(2), 139-147.
Brown, N.P., Leroy, C. and Sander, C. (1998) MView: a web-compatible database search or multiple alignment viewer. Bioinformatics 14(4), 380-381.
Thompson, J.D., Plewniak, F., Thierry, J.C. and Poch, O. (2000) DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Res. 28(15), 2919-2926.
Goujon, M., McWilliam, H., Li, W., Valentin, F., Squizzato, S., Paern, J. and Lopez, R. (2010) A new bioinformatics analysis tools framework at EMBL-EBI. Nucleic Acids Res. 38 (Web Server issue), W695-W699.
Gonzalez, M.W. and Pearson, W.R. (2010) Homologous over-extension: a challenge for iterative similarity searches. Nucleic Acids Res. 38(7), 2177-2189.