materials - European Bioinformatics Institute

EMBL-EBI
Introduction to Sequence Searching and Alignments, 2013
Contents
USING ALIGNMENT TOOLS AT EMBL-EBI
PAIRWISE ALIGNMENT TOOLS
Step 1 – input
Step 2 – parameters
Step 3 – submit
RESULTS
Results page
Alignment
Submission details
Implementing the Methods for Sequence Searching Tools: BLAST
Implementing the Methods for Sequence Searching Tools: FASTA
BLAST & FASTA Sensitivity
Sequence Searching Similarity tools at the EBI
USING FASTA
Step 1 – database selection
Step 2 – input
Step 3 – parameters
Step 4 – submit
RESULTS
Summary Table
Tool Output
Visual Output
Functional Predictions
Submission Details
INTERPRETING AN ALIGNMENT
USING BLAST
NCBI BLAST parameters
WU-BLAST parameters
BLAST RESULTS
DIFFERENCES BETWEEN BLAST AND FASTA
When to use what?
PSI-BLAST
PSI-BLAST Threshold
PROBLEMS WITH ITERATIVE SEARCHES
HOMOLOGOUS OVER-EXTENSION (HOE)
FILTERS
VECTOR CONTAMINATION
GETTING HELP
RELATED ARTICLES FROM THE EBI
FURTHER READING
Introduction to Sequence Searching and Alignments - Andrew Cowley
EMBL-EBI
Information:
Course materials (including a copy of this manual) can be downloaded from:
http://www.ebi.ac.uk/~apc/Courses/Faroe/
Using alignment tools at the EBI
There are a variety of sequence alignment tools available at EMBL-EBI. The simplest tools are
pairwise alignment tools which are designed to align two sequences. To align three or more
sequences together you must use a multiple sequence alignment tool.
Pairwise alignment tools
Instruction:
Navigate to the EBI pairwise sequence alignment page
You can either type in an address directly (www.ebi.ac.uk/Tools/psa/), search with the term PSA, or
go through the services list as demonstrated.
You will be faced with a choice of programs, split into global, local or genomic alignments.
Once you choose a program, the input page looks like this:
Input for pairwise alignments is very simple.
Step 1 – input
Here you can enter your two sequences. Most common formats are recognised, but please don’t try
to invent your own using Word! You can either paste/type the sequence directly, or upload a file
containing it from your computer using the browse button.
Step 2 – parameters
The program will be set up with some default parameters, however you can change them if you
wish.
You can click on any parameter title to get help on it. The main options are:
Matrix: The comparison matrix to be used to score alignments when searching the database
Gap open: Penalty to start a gap in the alignment
Gap extend: Penalty for each base or residue in the gap
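To make the two gap parameters concrete, here is a minimal Python sketch of an affine gap penalty. The convention used (the first gap position costs the gap open penalty, each further position costs the gap extend penalty) and the default values are assumptions for illustration; check the help text of the program you are using, as conventions differ between tools.

```python
def gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Penalty for a single gap spanning `length` positions (affine model).

    Assumed convention: the first gap position costs `gap_open`,
    every additional position costs `gap_extend`.
    """
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


# One long gap is much cheaper than many separate short gaps.
for n in (1, 2, 5, 10):
    print(f"gap of {n:2d}: penalty {gap_penalty(n):.1f}")
print("five separate gaps of 1:", 5 * gap_penalty(1))
```

Raising the gap open penalty discourages the aligner from starting new gaps at all, while raising the gap extend penalty discourages long gaps.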
Step 3 – submit
When you’re happy with everything else, select submit to run the job.
Results
Results page
The results page is fairly simple. At the top of the page are some tabs that switch between the
alignment, submission details, and the form to submit a new job. The parameters used for the job
are shown, and there is a link to the output file. The output file text is also displayed further down
the page.
Alignment
This page shows the results of the alignment. The table gives a summary of the results:
Length reports the length of alignment.
Identity reports the number of identical residues that are found aligned between the two
sequences.
Similarity reports the number of aligned residues that score positively in the substitution matrix (ie
similar types of residues).
Gaps reports the number of gaps inserted into the alignment.
Score gives the literal score of the alignment as worked out by the algorithm.
Then the alignment itself is displayed:
The top line is the first sequence and the bottom is the second sequence. Gaps in the sequence are
displayed with a ‘-’ character. Where two identical residues line up they are connected with a ‘|’;
where two very similar residues line up they are connected with a ‘:’. Where less well conserved
substitutions are made they are connected with a ‘.’.
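The summary values and the markup line can be reproduced from any pair of aligned sequences. The Python sketch below is only an illustration: the set of "similar" residue pairs is a tiny hypothetical stand-in for a real substitution matrix, and it draws only the ‘|’ and ‘:’ marks (no ‘.’ for weaker substitutions).

```python
# Hypothetical, deliberately small set of "similar" residue pairs.
# A real tool decides similarity from positive substitution matrix scores.
SIMILAR_PAIRS = {frozenset(p) for p in ("IL", "IV", "LV", "KR", "DE", "ST")}


def summarise_alignment(seq_a, seq_b):
    """Count identities, similarities and gap positions in a pairwise alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    identity = similarity = gaps = 0
    markup = []
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            gaps += 1
            markup.append(" ")
        elif a == b:
            identity += 1
            similarity += 1  # identical residues also count as similar
            markup.append("|")
        elif frozenset((a, b)) in SIMILAR_PAIRS:
            similarity += 1
            markup.append(":")
        else:
            markup.append(" ")
    return {
        "length": len(seq_a),
        "identity": identity,
        "similarity": similarity,
        "gaps": gaps,
        "markup": "".join(markup),
    }


top, bottom = "MKTLLV-AG", "MRTILVSAG"
stats = summarise_alignment(top, bottom)
print(top)
print(stats["markup"])
print(bottom)
for key in ("identity", "similarity", "gaps"):
    print(f"{key}: {stats[key]}/{stats['length']} "
          f"({100 * stats[key] / stats['length']:.1f}%)")
```

The percentages are taken over the alignment length, which is why a local alignment (shorter, with fewer gaps) can report a higher %identity than a global alignment of the same two sequences.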
Submission details
This page gives details of how the program was run. It tells you what version of the tool was run,
when it was launched, details of your input and the tool output, as well as the original command line
used to launch the job and details of the selected parameters. These can all be useful if you need to
recreate a job on your local machine, or to repeat an alignment in the future. This page is also very
useful to us if you have a problem and need to contact us.
Instruction:
Try running your own global alignment using the provided sequences:


Use the EMBOSS needle program
Leave the parameters set at their defaults
Question 1 (global):
What is the Length, Score, %identity, %similarity and %gaps of the alignment?
Length
Score
Identity%
Similarity%
Gaps%
Now let’s try a local alignment.
Instruction:
Try running your own local alignment using the provided sequences:


This time use the EMBOSS water program
Leave the parameters set at their defaults
Question 2 (local):
What is the Length, Score, %identity, %similarity and %gaps of the alignment?
Length
Score
Identity%
Similarity%
Gaps%
Question 3:
In words, how would you describe the key differences between the global and local
alignment results? Can you think of ways to improve the alignment?
Now let’s try changing the matrix parameter.
Instruction:
Try running your own local alignment using the provided sequences:


Use the EMBOSS water program
Change the matrix to BLOSUM 40, and leave the other parameters at their defaults
Question 4:
What is the Length, Score, %identity, %similarity and %gaps of our new alignment?
Can you describe what has happened?
Length
Score
Identity%
Similarity%
Gaps%
Alignment against databases
While it’s possible (and very accurate) to run optimal alignments against databases (using SSEARCH or
GGSEARCH at the EBI, for example), the computational requirements are such that it takes a very long
time and uses a large amount of memory. It is more practical to use a heuristics-based method such
as BLAST or FASTA.
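To see where the cost comes from, here is a minimal Python sketch of optimal local alignment scoring in the style of Smith–Waterman. It is a simplification under stated assumptions: a flat match/mismatch score instead of a substitution matrix, a single linear gap penalty, and only the best score is returned rather than the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty).

    Every one of the len(a) * len(b) cells is filled, which is what
    makes an optimal search against a whole database expensive.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


query, subject = "TTACGTTT", "GGACGGG"
print("score:", smith_waterman_score(query, subject))
print("cells filled:", len(query) * len(subject))
```

The number of cells grows with the product of the query length and the total number of residues in the database, which is the work that heuristic methods try to avoid.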
Implementing the Methods for Sequence Searching Tools: BLAST
BLAST stands for Basic Local Alignment Search Tool and was developed by Altschul and colleagues in
1990. BLAST uses an approximation of the Smith–Waterman algorithm, which makes it quite fast;
however, this gain in speed is offset by a decrease in accuracy. Unlike the true Smith–Waterman
algorithm, BLAST is not guaranteed to find the optimal alignment between your query sequence
and the test sequences. However, it will find good alignments and provides a statistical means of
gauging your confidence in each alignment. It works as follows:
(1) It searches for ‘words’ of a user-defined length (the shorter the word, the more sensitive the
search).
(2) It then extends these words in both directions, stopping once the alignment score falls too far
below the best score found so far.
(3) It then performs an approximation of the Smith–Waterman algorithm to create a gapped
alignment between the query sequence and the test sequence.
(4) Finally it calculates and reports how often an alignment this good would be expected to occur by
chance (the E-value).
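The first two steps can be sketched in a few lines of Python. This is an illustration only, not the BLAST implementation: it seeds on exact word matches (protein BLAST also accepts high-scoring similar words), scores with a flat match/mismatch scheme, and stops each extension with a simple drop-off rule, all of which are assumptions of this sketch.

```python
def find_seeds(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where an exact word is shared."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return seeds


def extend(query, subject, i, j, step, match=1, mismatch=-2, drop_off=3):
    """Ungapped extension in one direction (step is +1 or -1).

    Stops when the running score falls `drop_off` below the best seen.
    Returns (best_score, number_of_positions_kept).
    """
    score = best = best_len = 0
    k = 0
    while 0 <= i + k * step < len(query) and 0 <= j + k * step < len(subject):
        same = query[i + k * step] == subject[j + k * step]
        score += match if same else mismatch
        if score > best:
            best, best_len = score, k + 1
        elif best - score >= drop_off:
            break
        k += 1
    return best, best_len


def ungapped_hit(query, subject, i, j, word_size=3):
    """Extend a seed at (i, j) both ways; the seed scores 1 per residue."""
    right_score, right_len = extend(query, subject, i + word_size, j + word_size, +1)
    left_score, left_len = extend(query, subject, i - 1, j - 1, -1)
    segment = query[i - left_len:i + word_size + right_len]
    return word_size + left_score + right_score, segment


query, subject = "ACGTTGCA", "TTCGTTGAA"
seeds = find_seeds(query, subject)
print("seeds:", seeds)
print("extended hit:", ungapped_hit(query, subject, *seeds[0]))
```

A shorter word size produces more seeds, and therefore more extensions to try, which is why it makes the search more sensitive but slower.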
When doing a BLAST search one can set the
“expectation threshold” (EXP THR) which
establishes a statistical significance threshold for
reporting database sequence matches. EXP THR
is interpreted as the upper limit on the expected
frequency of a chance occurrence of a match
within the context of the entire database search;
in other words, it sets an upper limit on the E-value. Any database sequence whose BLAST
alignment to the query sequence satisfies EXP THR is reported in the output file. An alignment with
an E value of ≥1.0 is expected to be found at least once by chance in the searched database and an E
value of ≥5.0 is expected to be found at least five times (see figure below). Raising this threshold
increases the likelihood of reporting distantly related matches, but the frequency of chance
matches reported will tend to grow at a much faster rate than real matches with EXP THR set above
1.0.
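The manual does not give a formula for the E-value. A commonly used form, from Karlin–Altschul statistics, is E = K·m·n·e^(−λS), where S is the alignment score, m the query length and n the database length. The Python sketch below uses that form; the values of λ and K are illustrative placeholders, since real values depend on the scoring matrix and gap penalties.

```python
import math


def expect_value(score, query_len, db_len, lam=0.318, k=0.134):
    """Expected number of chance alignments scoring at least `score`.

    `lam` and `k` are placeholder statistical parameters for illustration.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def p_value(e_value):
    """Probability of seeing at least one such chance alignment."""
    return -math.expm1(-e_value)


def is_reported(e_value, exp_thr=10.0):
    """A hit is reported when its E-value does not exceed the threshold."""
    return e_value <= exp_thr


for score in (30, 50, 70):
    e = expect_value(score, query_len=300, db_len=100_000_000)
    print(f"score {score}: E = {e:.3g}, P = {p_value(e):.3g}, "
          f"reported = {is_reported(e)}")
```

Note that the E-value scales with database size: the same alignment score becomes less significant when the search is run against a larger database.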
Implementing the Methods for Sequence Searching Tools: FASTA
David Lipman and William Pearson (1988) developed FASTA, which gets around the speed problem as
follows:
(1) FASTA breaks the query and test sequences into overlapping words and looks for exact matches.
(2) Then it re-scores these matches using a substitution matrix.
(3) Next it tries to join the highest-scoring segments. A ‘joining threshold’ set by the user eliminates
segments that are unlikely to be part of the alignment.
(4) Finally, FASTA uses the Smith–Waterman method to optimise the alignment, using only the part
of the matrix that contains the top-scoring segments.
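Step (1) can be illustrated with a short Python sketch. It looks up exact word (ktup) matches and counts them per diagonal, since a diagonal carrying many word hits marks a region worth re-scoring and joining in the later steps. The function name and the toy sequences are hypothetical; this is not FASTA's own code.

```python
from collections import Counter


def diagonal_hits(query, subject, ktup=2):
    """Count exact ktup word matches on each diagonal (subject_pos - query_pos)."""
    positions = {}
    for i in range(len(query) - ktup + 1):
        positions.setdefault(query[i:i + ktup], []).append(i)
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], []):
            diagonals[j - i] += 1
    return diagonals


hits = diagonal_hits("ACGTTGCA", "TTCGTTGAA")
for diagonal, count in hits.most_common():
    print(f"diagonal {diagonal:+d}: {count} word hits")
```

Restricting the final Smith–Waterman step to a band around the best diagonals is what keeps the search fast while still producing an optimised alignment.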
BLAST & FASTA Sensitivity
FASTA therefore provides a means of performing a sensitive search against a large database in a
reasonable time. Nowadays, it is possible to approach the sensitivity of a FASTA search by using
BLAST and setting high sensitivity values. However, the alignment statistics for FASTA can be
considered more robust than those for gapped-BLAST. This is because FASTA produces and scores an
alignment of the query sequence with a large sampling of the database, giving it a distribution of
scores that represents the entire database:sequence range of alignments. BLAST is fast because it
does not bother producing an alignment for most database sequences; alignments are only triggered
if the initial word-match criteria are met. Consequently, BLAST does not have a complete
distribution of alignment scores over the database from which to calculate the significance of the
reported matches. Instead, BLAST uses pre-computed values for the score distribution rather than
calculating values that are specific to the search carried out.
Sensitivity: There is a trade-off between sensitivity and search speed. Increasing the sensitivity
makes the search more computationally intensive and therefore slower. Decreasing the sensitivity,
for example when you are looking for almost exact matches, can dramatically increase the search
speed. There is also a trade-off between sensitivity and specificity: increasing sensitivity tends
to decrease specificity (greater propensity for chance matches). So for example: if you are looking
for vector contamination you will choose low sensitivity, whereas if you are looking for long
distance related sequence you will opt for high sensitivity.
Sequence Searching Similarity tools at the EBI
The EBI offers several sequence similarity search tools at www.ebi.ac.uk/Tools/sss. These are
maintained by the External Service (ES) team. The ES team puts considerable effort into providing
state-of-the-art sequence searching tools, along with the flexibility to tailor your queries with the
most appropriate search parameters. You will therefore find a choice not only of tools but also,
within each tool, of variables and parameters that you can change.
We have talked about BLAST and FASTA, but what are all the variations above? A quick way to
distinguish these tools is to label them as either heuristic or rigorous. A fundamental challenge in
computer science is to design algorithms that find provably good solutions within a provably bounded
amount of computation time. A heuristic algorithm gives up one or both of these goals. In other
words, a heuristic is an algorithm that produces an acceptable solution to a problem in many
practical scenarios, but for which there is no formal proof that the solution is optimal. Heuristics
are typically used when there is no known method of finding an optimal solution under the given
constraints (of time, space etc.), or at all. A rigorous algorithm, on the other hand, is guaranteed
to find the optimal solution; it is exhaustive and should give you the best answer, but it is slow.
Following these definitions, the sequence searching tools mentioned above are labelled as shown in
this panel:
Using FASTA
Let’s have a go using FASTA for ourselves.
Instruction:
Navigate to the EBI sequence search tools page
You can either type in an address directly (www.ebi.ac.uk/Tools/sss/), search with the term FASTA, or
go through the services list as demonstrated.
FASTA is launched by picking a type of database listed next to the FASTA tool – this sets up the
defaults appropriately for your choice, but you can always change the database once in the tool if
you wish.
The framework from which we launch our tools was revamped recently, and now forms a common basis
for many programs. For more information about this framework see: A new bioinformatics analysis
tools framework at EMBL–EBI (2010) Goujon et al. doi:10.1093/nar/gkq313
Instruction:
Select the Protein database for FASTA
You should end up at a screen that looks like the following:
This screen will hopefully look quite familiar to you when conducting other searches as well. There
are four steps to submitting a job.
Step 1 – database selection
Here is where you select which databases to search against. You can select more than one, and you
can also expand subsections to narrow your search by taxonomic division for example. If you
changed your mind and want to do a different type of search (for example, against nucleotide
sequences) then you can select that here as well and it will reset the form.
Step 2 – input
Here you can enter your query sequence. Most common formats are recognised, but please don’t
try to invent your own using Word! You can either paste/type the sequence directly, or upload a file
containing it from your computer using the browse button.
Step 3 – parameters
Most important here is choice of program – FASTA is grouped together with others like SSEARCH
(because they were authored by the same person and come in the same package). It should be set
up correctly according to the button you pressed to get to this page. The program will be set up with
some default parameters, however to look at these or change them you need to click the ‘More
options’ button.
You can click on any parameter title to get help on it. Here is a summary:
Matrix: The comparison matrix to be used to score alignments when searching the database
Gap open: Penalty to start a gap in the alignment
Gap extend: Penalty for each base or residue in the gap
Ktup: ‘Word size’ used to identify runs in the first stage of alignment
Expectation upper value: This allows you to ignore results that have above a certain expectation score (ie become more distant)
Expectation lower value: This allows you to ignore results below a certain expectation (ie ignore close relatives)
DNA strand: When searching DNA you can specify which strand is used. By default both are searched
Histogram: Turn on/off the display of statistical histogram in FASTA results
Filter: Which low complexity filter to use
Statistical estimates: Which statistical method to use to evaluate values used in the Expect score calculation
Scores: Maximum number of scores reported in the summary
Alignments: Maximum numbers of alignments reported in the summary
Sequence range: Allows you to specify which portion of the query sequence to use in the search
Database range: Allows you to cut back on database sequences searched against by specifying a number of residues range
Step 4 – submit
When you’re happy with everything else, here is where you click to submit the job. For longer runs it
is recommended that you tick the box to send you an email when the job is complete. This will
contain a link back to the results so there is no need to keep your browser open. Email jobs are
usually stored on the servers for longer as well, while interactive job results are deleted more
quickly.
Once you’ve submitted your job the first thing that happens is that the input is validated – few
things are more frustrating than preparing a job and firing it off to then check your email later in the
day and find that you made a minor mistake somewhere and the job failed. If everything is okay then
the job will run and you will see a job running screen if you ran it interactively. Eventually the job will
finish and you will either be taken to the results page automatically or emailed a link to it.
Results
Summary Table
There is a lot of information to take in for the results page! You are first presented with the summary
table which quickly lists the top results in a table format. You can change the ordering of the table by
clicking the arrows next to each column header – by default they are ranked by Expect value or E().
The first column contains a tick box which allows you to select database results for further actions,
for example to view the sequence annotation or detailed alignment with the query sequence. You
can also use the buttons to clear selections, select all, or invert selection. To download the selected
sequences click the download button.
The second column (DB:ID) gives the database ID of the sequence.
The third column (Source) gives some quick information about the sequence, as well as cross
references that have been found referring to it in other resources across the EBI – so you can quickly
look up more information about the sequence in these resources.
Length reports the length of the database sequence.
Score gives the literal score of the alignment.
Identities reports the number of identical residues that are found aligned between the query and
database sequence.
Positives reports the number of aligned residues that score positively in the substitution matrix (ie
similar types of residues).
E() gives the Expect score for the alignment – this is a measure of how likely you are to find that
alignment by chance. When the numbers are very small it reports them as 1.0E-10 for example. This
is the same as 1e-10, or 0.0000000001.
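If you download the results and process them yourself, the different ways of writing an Expect value all parse to the same number. A minimal Python sketch follows; the hit identifiers are hypothetical, and the upper and lower cut-offs simply mirror the idea of the expectation upper and lower value parameters.

```python
def rank_hits(hits, upper=10.0, lower=0.0):
    """Parse E() strings, keep hits within [lower, upper], sort best first."""
    parsed = [(name, float(e_text)) for name, e_text in hits]
    kept = [hit for hit in parsed if lower <= hit[1] <= upper]
    return sorted(kept, key=lambda hit: hit[1])


# All three spellings denote the same value.
print(float("1.0E-10") == float("1e-10") == float("0.0000000001"))

# Hypothetical hit identifiers paired with E() values as text.
hits = [("HYPO_C", "2.5"), ("HYPO_A", "1.0E-10"), ("HYPO_B", "3e-4")]
for name, e_value in rank_hits(hits, upper=1.0):
    print(name, e_value)
```

Smaller E() values come first, which matches the default ordering of the summary table.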
Tool Output
This tab switches the view to the raw, original output from FASTA – this can be useful when you
want to view the full text output from the program in case it contains something the summary or
other pages don’t cover. You can download it as a text or XML file.
Clicking on the icon will jump you straight to the alignment. Clicking on the sequence ID will take
you to the original database entry for that sequence.
Visual Output
This output gives you a nice way of visualising which portions of the sequence are aligning, as well as
colour-coding the alignment by Expect score.
Hovering your mouse over the sequence ID on the left-hand side will show a guide box around the
alignment. Clicking the mouse will take you to the alignment (from the raw output of the program).
Functional Predictions
This tab is another example of how we are bringing together different resources at the EBI to give
you extra information. It searches a variety of resources for family and domain predictions and
shows you the results graphically, so you can easily see which portions of your sequence and
alignments correspond to these features.
Again, you can click on the links on the left-hand side to jump to the entry information at the
resource.
Submission Details
This page contains all the original parameters used to launch your job, together with easy links to
the exact input used and the output results.
This information is really useful for several reasons: If you want to repeat a job then you might want
to use exactly the same parameters; if you’re interested in running the command line version of the
tool then this will give you the exact command line used; and finally if you need help or support then
this page contains all of the information you need to give us to be able to help you quickly.
Interpreting an alignment
The figure below shows a typical alignment.
Key
- Gap
: Identity
. Similarity
X Filtered
The header shows some information about the database sequence, followed by some of the raw
scores from the program itself. The key bits of information are the E() score, the % identity and %
similarity numbers.
Below that is the alignment itself: the top line is the query sequence and the bottom is the
database sequence. Gaps in the sequence are displayed with a ‘-’ character (none in the above
alignment). Where two identical residues line up they are connected with a ‘:’; where two similar
residues line up they are connected with a ‘.’.
Instruction:
Try running your own FASTA search using the provided sequence:



Search against the UniProtKB/Swiss-Prot database only – if you search against the
full UniProt Knowledgebase it will take a very long time!
Make sure the FASTA program is selected
Leave the parameters set at their defaults
Question 5:
What are the default gap open, gap extend, ktup and matrix parameter settings for this
search?
Question 6:
What can you say about the likely function of this protein?
Question 7:
What are the DB:IDs, Scores, %identities, %positives and E() of the top two results?
ID
Score
Identity%
Positives%
E()
Question 8:
Have a look at the actual alignments with the top two results – what can you say about
them?
Using BLAST
Now that we are familiar with running FASTA searching, using BLAST should be very easy – the
interface is effectively the same. Simply select a database next to the BLAST tool of interest to enter
the tool.
NCBI BLAST: This version of BLAST is the version maintained at the NCBI.
WU-BLAST: This version of BLAST was created by Dr Warren Gish, formerly of Washington University.
Both versions can trace their history back to the same algorithm, but were developed separately,
often implementing ideas that first appeared in the other version.
The parameters are handled slightly differently in each case: NCBI BLAST provides access to
parameters like those used by FASTA or SSEARCH (eg gap open, gap extend). WU-BLAST hides direct
access to those, but instead provides a sensitivity parameter which combines several adjustments at
different stages of the algorithm. The raw results are slightly different between the two, but as we
parse the results they will appear the same to you (unless you look at the Tool Output page).
NCBI BLAST parameters
Matrix: The comparison matrix to be used to score alignments when searching the database
Gap open: Penalty to start a gap in the alignment
Gap extend: Penalty for each base or residue in the gap
Exp. Thr (expectation threshold): This allows you to ignore results that have above a certain expectation score (ie become more distant)
Filter: Which low complexity filter to use
Drop off: Controls how far a potential HSP is allowed to extend
Scores: Maximum number of scores reported in the summary
Alignments: Maximum numbers of alignments reported in the summary
Sequence range: Allows you to specify which portion of the query sequence to use in the search
Gap align: Allows gapped extensions of alignments
Alignment views: Options for formatting the alignment output (Tool Output)
WU-BLAST parameters
Matrix: The comparison matrix to be used to score alignments when searching the database
Exp. Thr (expectation threshold): This allows you to ignore results that have above a certain expectation score (ie become more distant)
Filter: Which low complexity filter to use
View filter: Display any sequence filtered out (in Tool Output)
Sensitivity: General parameter affecting search sensitivity – this makes adjustments to several internal parameters
Scores: Maximum number of scores reported in the summary
Alignments: Maximum numbers of alignments reported in the summary
Sort: Choose which value to sort the Tool Output results by.
Stats: Choice of statistic methods used in generation of Expect statistics
topcomboN: In WU-BLAST HSPs are classified into a number of sets, you can use this parameter to restrict the display to only the N highest scoring sets.
Alignment views: Options for formatting the alignment output (Tool Output)
BLAST results
Unsurprisingly, the results from BLAST runs on our servers are displayed in the same way as FASTA
results, and all the tabs are equivalent.
The main differences in format will only appear if you look at the raw output via the Tool Output
page.
Key
- Gap
[residue] Identity
+ Similarity
X Filtered
The above picture shows the alignment section of an NCBI BLAST run. This time identical residues
are actually listed between the two sequences, which are labelled Query for query sequence and
Sbjct (subject) for database sequence. Similar residues (Positives) are indicated with a +. The number
of gaps inserted into the overlapping sequence regions is also reported.
Instruction:
Try running your own NCBI BLAST search using the provided sequence:



Search against the UniProtKB/Swiss-Prot database only – if you search against the
full UniProt Knowledgebase it will take a very long time!
Make sure the BLASTP program is selected
Leave the parameters set at their defaults
Question 9:
What are the default gap open, gap extend, drop off and matrix parameter settings for
this search?
Question 10:
What are the DB:IDs, Scores, %identities, %positives and E() of the top two results?
ID    Score    Identity%    Positives%    E()
Question 11:
Are these different from our FASTA search earlier?
Differences between BLAST and FASTA
The following table summarises the key differences between BLAST and FASTA.
BLAST                             FASTA
Fast                              Not as fast as BLAST
Good with proteins                Good with proteins and DNA
Might miss potential alignments   Aligns against all database sequences
Produces HSPs                     Produces S&W alignments
Good for siblings                 Good for cousins
When to use what?
In general, the larger the database or the query sequence, the faster the algorithm you should use.
For very small queries or databases, rigorous dynamic programming methods such as SSEARCH are a
good choice.
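To see why dynamic programming becomes expensive for large searches, here is a minimal Python
sketch of Smith-Waterman local alignment scoring. It is a simplification: it uses a plain
match/mismatch score and a linear gap penalty (both values are illustrative assumptions), whereas
tools such as SSEARCH use substitution matrices and separate gap open and extend penalties.

```python
def smith_waterman_score(a, b, match=3, mismatch=-3, gap=-2):
    """Best local alignment score of a and b (illustrative scoring values)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally with a match or a mismatch
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


print(smith_waterman_score("TGTTACGG", "GGTTGACTA"))
```

Every cell of the query-by-subject grid is filled in, so the work grows with the product of the
two sequence lengths, and this is repeated for every sequence in the database. Heuristic tools
such as BLAST and FASTA gain their speed by not exploring the whole grid.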
PSI-BLAST
Position-Specific Iterative BLAST, or PSI-BLAST, is a tool which allows you to create your own
custom scoring matrix based on the conservation of residues found in your own searches, rather
than relying on a general model built from a different set of sequences.
PSI-BLAST workflow
Workflow steps (from the figure):
1. Normal BLAST search
2. Align selected results
3. Create PSSM
4. Use PSSM to score new BLAST alignment (go to Step 2 if required)
It starts with a normal BLAST search. You can then select which sequences in the results will be
used to build a profile. These sequences are aligned, and conserved residues at each position are
scored more highly in a new type of matrix which allows for different scores at different
positions in the sequence.
A new BLAST search is run with this matrix, called a Position Specific Scoring Matrix or PSSM. The
results can themselves be used to create another PSSM for another run, and so the process is
iterative.
The aim of PSI-BLAST is to concentrate the alignment on positions that are important, while
allowing for more variability in areas that aren’t so important. So a functional area or binding
motif might be more important than sequence that forms part of a loop, for example.
Searches made with a PSSM can find matches with sequences that were scored too low to be
considered in a normal BLAST search, but have scored more highly with the new matrix. These are
marked as ‘new’ by the PSI-BLAST tool.
You can also save your search and continue it at a later date, or save the PSSM itself.
The parameters for PSI-BLAST are the same as for NCBI BLAST, with the addition of a new threshold:
PSI-BLAST Threshold
This expectation value controls the default selection of sequences to be used for generation of
the PSSM. Sequences with an expectation value higher than this (ie those that don’t align as well)
won’t be included.
Once the first iteration is run, additional controls over a normal BLAST result appear. The
PSI-BLAST Threshold can be changed again, or individual sequences can be added or removed from the
selection by ticking the box in the first Summary Table column.
The View Threshold limit button jumps the view down the table so you can see the cut off.
Controls to download a checkpoint file or PSSM only appear after the second iteration.
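The idea of selecting sequences by threshold and turning them into a position-specific matrix can
be sketched in a few lines of Python. This is a simplified illustration, not the PSI-BLAST
implementation: it assumes uniform background frequencies, a fixed pseudocount, no sequence
weighting and ungapped scoring, and the function names and example segments are made up.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)   # assumption: uniform background frequencies


def select_for_pssm(hits, threshold=1.0e-3):
    """Keep aligned segments whose E-value is at or below the threshold."""
    return [segment for segment, evalue in hits if evalue <= threshold]


def build_pssm(segments, pseudocount=1.0):
    """Return one {residue: log-odds score} dictionary per alignment column."""
    length = len(segments[0])
    if any(len(s) != length for s in segments):
        raise ValueError("segments must be aligned to the same length")
    pssm = []
    for column in zip(*segments):
        counts = Counter(r for r in column if r in ALPHABET)   # gap characters are ignored
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        pssm.append({
            r: math.log2(((counts[r] + pseudocount) / total) / BACKGROUND)
            for r in ALPHABET
        })
    return pssm


def score_with_pssm(pssm, sequence):
    """Ungapped score of a sequence that has the same length as the profile."""
    return sum(column[residue] for column, residue in zip(pssm, sequence))


# Made-up (aligned segment, E-value) pairs from a hypothetical first iteration
hits = [("MKVLA", 1e-40), ("MKILA", 1e-12), ("MRVLG", 5e-4), ("QWERT", 0.8)]
selected = select_for_pssm(hits)
pssm = build_pssm(selected)
print(len(selected), "segments used to build the profile")
print("conserved-like:", round(score_with_pssm(pssm, "MKVLA"), 2))
print("unrelated:     ", round(score_with_pssm(pssm, "QWERT"), 2))
```

Residues that are conserved in a column receive positive scores at that position only, which is
how a PSSM differs from a general substitution matrix that scores a residue pair the same way
everywhere.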
Instruction:
Try running your own PSI-BLAST search using the provided sequence:
- Search against the UniProtKB/Swiss-Prot database only. If you search against the full UniProt
  Knowledgebase it will take a very long time!
- Make sure the PSI-BLAST program is selected
- Leave the parameters set at their defaults
- For the moment, stop after the first run
Question 12:
Looking at the first run (normal BLAST results), how many sequences score above our
default threshold of 1.0e-3?
Instruction:
For the second iteration you can choose which sequences to include in the PSSM generation.
- Untick the top scoring sequence (simply because it scores so much better than the other
  results; you wouldn’t necessarily normally do this)
- Leave everything else set to defaults
- Click the ‘run next iteration’ button
Question 13:
Looking at the second iteration, how many sequences now score above our default
threshold of 1.0e-3? (Hint: use the View Threshold Limit button). What is a likely explanation?
Question 14:
Have any new sequences been scored well enough to appear in our results?
Problems with iterative searches
Iterative searches like PSI-BLAST use profile construction methods to increase search sensitivity.
However, they can inadvertently decrease selectivity, particularly if the profile becomes
contaminated with information that is not relevant to homology with the query sequence. One
situation where this occurs is described as Homologous Over-Extension (HOE).
Homologous Over-Extension (HOE)
Iterative search strategies using profiles (ie the PSSM in PSI-BLAST) can help increase the
sensitivity of a search. However, while the aim is to have the profile reflect areas of interest
(a domain, for example), there is a danger that it will be contaminated with information that is
not relevant to your query. Low complexity regions are one example of this, but these can be dealt
with by using filters. Another cause of contamination that was recently described is Homologous
Over-Extension (HOE).
HOE can occur in a profile based alignment when the alignment region picks up a portion of
sequence that is not biologically relevant to the query but that is conserved in other sequences
brought back by the search. The influence of this region on the scoring matrix can be such that
the alignment region extends even further beyond the domain of interest. It can even begin to
cover a domain that is not present in the query sequence. Once this happens, the weighting of the
scoring matrix can influence the alignment so much that sequences not at all biologically related
to the query start to be reported as significant, resulting in an increase in false positives.
Ideally this is prevented by careful selection of which sequences to include in the generation of
the PSSM, making sure that they do not have other domains near the boundaries of the alignment
that might cause alignment extension. Our functional prediction page might help with this. But as
this is a manual method, and domain information might not be present in the functional
predictions, we have created a method to automatically reduce the likelihood of HOE occurring by
masking boundaries at the edge of the original alignment. At the moment this method applies only
to PSI-Search, a tool that combines sensitive Smith-Waterman based local alignment with the
PSI-BLAST profile construction strategy. It is controlled by the ‘HOE region masking’ option,
which is set to yes by default.
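The general idea of masking beyond the original alignment boundaries can be illustrated with a
short Python sketch. This is only a loose illustration under stated assumptions, not the
PSI-Search implementation: the function names, the 1-based inclusive coordinates and the example
identifier are all hypothetical.

```python
def clip_to_original(first_seen, seq_id, start, end):
    """Clip an alignment on a database sequence to its first-seen boundaries.

    first_seen maps a sequence id to the (start, end) of the alignment from the
    iteration in which that sequence was first included in the profile.
    """
    if seq_id not in first_seen:
        first_seen[seq_id] = (start, end)   # first inclusion: remember the boundaries
        return start, end
    orig_start, orig_end = first_seen[seq_id]
    return max(start, orig_start), min(end, orig_end)


def mask_outside(sequence, start, end, mask_char="X"):
    """Mask residues outside the 1-based inclusive range [start, end]."""
    return (mask_char * (start - 1)
            + sequence[start - 1:end]
            + mask_char * (len(sequence) - end))


boundaries = {}
# Iteration 1: the sequence is first found with an alignment covering residues 20-60
print(clip_to_original(boundaries, "HYPOTHETICAL_1", 20, 60))
# Iteration 2: the alignment has grown to 5-90, but only 20-60 may feed the profile
print(clip_to_original(boundaries, "HYPOTHETICAL_1", 5, 90))
print(mask_outside("ABCDEFGHIJ", 3, 6))
```

Because the extended flanks never contribute to the next matrix, a neighbouring domain cannot
gradually pull the profile away from the region that was originally aligned.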
Filters
Filtering can eliminate statistically significant but biologically uninteresting reports from the
BLAST output by masking segments of the query sequence that give non-specific matches in sequence
similarity searches. This leaves the more biologically interesting regions of the query sequence
available for specific matching against database sequences. For example, you may want to mask
acidic, basic or proline-rich segments of a protein that would otherwise yield overwhelming
amounts of uninteresting, non-specific matches against a wide array of protein families. The SEG
program (Wootton and Federhen, 1993) masks regions of low compositional complexity, while XNU
(Claverie and States, 1993) masks regions containing short-periodicity internal repeats. SEG+XNU
combines the two. The DUST program by Tatusov and Lipman can only be used with DNA searches and
masks simple repeats in DNA/RNA sequences.
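To give a feel for what a low complexity filter does, the Python sketch below masks windows whose
residue composition has low Shannon entropy. It is a deliberately simple stand-in and is not the
SEG, XNU or DUST algorithm; the window size, the entropy cut-off and the example sequence are
illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    return -sum((n / len(window)) * math.log2(n / len(window)) for n in counts.values())


def mask_low_complexity(sequence, window=12, min_entropy=2.2, mask_char="X"):
    """Replace every residue that falls in a low-entropy window with mask_char."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < min_entropy:
            masked[start:start + window] = mask_char * window
    return "".join(masked)


# Made-up protein with a proline-rich stretch in the middle
protein = "MKTAYIAKQR" + "P" * 16 + "DLGWHESVNC"
print(protein)
print(mask_low_complexity(protein))
```

The masked residues appear as X, matching the Filtered symbol in the alignment key, and no longer
contribute matches to the search. Note that this simple window approach also masks a few residues
on either side of the repeat.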
Instruction:
Perform a FASTA search against the UniProtKB/Swiss-Prot database for the filtertest sequence that
the demonstrator has provided.
- Make sure to select more options and set the Histogram display to YES
- You could also change the Expectation upper value to 0.001 to help make the results clearer
- To see the histogram, go to the Tool Output tab in the results
Question 15:
Describe how the observed vs expected histogram looks. What does this mean? How
many results have an alignment with an expect score better than 0.001?
Instruction:
Repeat the search, but this time use the SEG filter from the more options parameters.
- Make sure that the Histogram display is still set to YES, and the expectation value to 0.001,
  if you want a clear comparison.
Question 16:
Now how does the observed vs expected histogram look? How many results have an
alignment with an expect score better than 0.001?
Vector Contamination
Another reason you might not get the results you are expecting is vector contamination, a
common problem if your sequence is fresh from the sequencer.
If you suspect this problem, one way to check is to search against a specialist dataset
containing vector sequences only. At the EBI the EMVEC database does exactly this, and there is an
NCBI mode that performs the same role.
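The principle behind such a check can be sketched in Python by looking for exact short words that
a read shares with a known vector sequence. This is only an illustration of the idea: a real check
would use a BLAST search against a vector database as described above, would also consider the
reverse strand, and the sequences below are made up.

```python
def shared_kmer_positions(sequence, vector, k=12):
    """Return start positions in sequence of k-mers that also occur in vector."""
    vector_kmers = {vector[i:i + k] for i in range(len(vector) - k + 1)}
    return [i for i in range(len(sequence) - k + 1)
            if sequence[i:i + k] in vector_kmers]


# Made-up sequences: a fragment of a hypothetical cloning site and a hypothetical insert
vector_fragment = "GGATCCGAATTCAAGCTTGCGGCCGC"
insert = "ATGACCATTGACGTTAACGGT"
untrimmed_read = vector_fragment[-15:] + insert   # vector still attached at the 5' end

print(shared_kmer_positions(untrimmed_read, vector_fragment))
print(shared_kmer_positions(insert, vector_fragment))
```

Hits that cluster at one end of the read, as in the first call, are the typical sign of untrimmed
vector sequence.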
Question 17:
A student has given you two sequences and they have forgotten whether they have
already trimmed them for vector contamination. Use the BLAST tools at EBI to determine
whether they have vector contaminants or not.
Sequence 1:
Sequence 2:
Getting HELP
- Read the database Documentation
- Frequently Asked Questions: http://www.ebi.ac.uk/help/faq.html
- EBI Support: http://www.ebi.ac.uk/support/
- Hands-on training programme: http://www.ebi.ac.uk/training/handson/
Related articles from the EBI
PSI-Search: iterative HOE-reduced profile SSEARCH searching
[http://dx.doi.org/10.1093/bioinformatics/bts240]
A new bioinformatics analysis tools framework at EMBL-EBI
[http://dx.doi.org/10.1093/nar/gkq313]
The European Bioinformatics Institute’s data resources
[http://dx.doi.org/10.1093/nar/gkp986]
Web services at the European Bioinformatics Institute-2009
[http://dx.doi.org/10.1093/nar/gkp302]
The Universal Protein Resource (UniProt) in 2010
[http://dx.doi.org/10.1093/nar/gkp846]
The IntAct molecular interaction database in 2010
[http://dx.doi.org/10.1093/nar/gkp878]
The Gene Ontology in 2010: extensions and refinements
[http://dx.doi.org/10.1093/nar/gkp1018]
The Proteomics Identifications database: 2010 update
[http://dx.doi.org/10.1093/nar/gkp964]
InterPro: the integrative protein signature database
[http://dx.doi.org/10.1093/nar/gkn785]
Reactome knowledgebase of human biological pathways and processes
[http://dx.doi.org/10.1093/nar/gkn863]
Further reading
- Needleman, S. B. and Wunsch, C. D. (1970) A general method applicable to the search for
  similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453
- Smith, T. F. and Waterman, M. S. (1981) Identification of common molecular subsequences. J. Mol.
  Biol. 147, 195–197
- Pearson, W. R. and Lipman, D. J. (1988) Improved tools for biological sequence comparison. Proc.
  Natl Acad. Sci. U. S. A. 85, 2444–2448
- Ning, Z., Cox, A. J. and Mullikin, J. C. (2001) SSAHA: a fast search method for large DNA
  databases. Genome Res. 11, 1725–1729
- Kent, W. J. (2002) BLAT – the BLAST-like alignment tool. Genome Res. 12, 656–664
- Dayhoff, M. O., Schwartz, R. M. and Orcutt, B. C. (1978) A model for evolutionary change in
  proteins. In Atlas of Protein Sequence and Structure (Ed. Dayhoff, M. O.) Vol. 5, pp. 345–352
  (National Biomedical Research Foundation)
- Henikoff, S. and Henikoff, J. G. (1992) Amino acid substitution matrices from protein blocks.
  Proc. Natl Acad. Sci. U. S. A. 89, 10915–10919
- Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990) Basic local
  alignment search tool. J. Mol. Biol. 215, 403–410
- Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman,
  D. J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
  Nucleic Acids Res. 25(17), 3389–3402
- Lopez, R., Silventoinen, V., Robinson, S., Kibria, A. and Gish, W. (2003) WU-Blast2 server at
  the European Bioinformatics Institute. Nucleic Acids Res. 31(13), 3795–3798
- Mackey, A. J., Haystead, T. A. and Pearson, W. R. (2002) Getting more from less: algorithms for
  rapid protein identification with multiple short peptide sequences. Mol. Cell. Proteomics 1(2),
  139–147
- Brown, N. P., Leroy, C. and Sander, C. (1998) MView: a web-compatible database search or
  multiple alignment viewer. Bioinformatics 14(4), 380–381
- Thompson, J. D., Plewniak, F., Thierry, J. C. and Poch, O. (2000) DbClustal: rapid and reliable
  global multiple alignments of protein sequences detected by database searches. Nucleic Acids
  Res. 28(15), 2919–2926
- Goujon, M., McWilliam, H., Li, W., Valentin, F., Squizzato, S., Paern, J. and Lopez, R. (2010)
  A new bioinformatics analysis tools framework at EMBL-EBI. Nucleic Acids Res. 38 (Web Server
  issue), W695–W699
- Gonzalez, M. W. and Pearson, W. R. (2010) Homologous over-extension: a challenge for iterative
  similarity searches. Nucleic Acids Res. 38(7), 2177–2189