Sequence Searching & Alignments - European Bioinformatics Institute

Bioinformatics Roadshow
Sequence Searching and Alignments, May 2012
EBI Bioinformatics Roadshow
Contents
INTRODUCTION AND COURSE CONCEPT
ABOUT THIS VERSION
BIOINFORMATICS
EMBL-EBI
NUCLEOTIDE SEQUENCES AND DATABASES
ENA
PROTEIN SEQUENCES AND DATABASES
SEARCHING DATABASES
EBI SEARCH
EBI Search – advanced
DATABASE WEB PAGES
HOMOLOGY VS. SIMILARITY
WHY MIGHT YOU NEED SEQUENCE SIMILARITY SEARCHING?
SEQUENCE SIMILARITY SEARCHING: AN OVERVIEW
WAYS OF MATCHING SEQUENCES
Optimal Global Alignment
Optimal local alignment
Substitution matrices
USING ALIGNMENT TOOLS AT THE EBI
Step 1 – input
Step 2 – parameters
Step 3 – submit
RESULTS
Results page
Alignment
Submission details
Implementing the Methods for Sequence Searching Tools: BLAST
Implementing the Methods for Sequence Searching Tools: FASTA
BLAST & FASTA Sensitivity
Sequence Searching Similarity tools at the EBI
USING FASTA
Step 1 – database selection
Step 2 – input
Step 3 – parameters
Step 4 – submit
RESULTS
Summary Table
Tool Output
Visual Output
Functional Predictions
Submission Details
Submit Another Job
INTERPRETING AN ALIGNMENT
USING BLAST
NCBI BLAST parameters
WU-BLAST parameters
BLAST RESULTS
DIFFERENCES BETWEEN BLAST AND FASTA
When to use what?
PSI-BLAST
PSI-BLAST Threshold
FILTERS
HOMOLOGOUS OVER-EXTENSION (HOE)
VECTOR CONTAMINATION
MULTIPLE SEQUENCE ALIGNMENT
CLUSTALW
RESULTS
Alignments
Result Summary
Jalview
Guide Tree
Submission Details
Submit Another Job
Clustal Omega
T-Coffee
MUSCLE
MAFFT
Kalign
WEBPRANK
Results
RELATED ARTICLES FROM THE EBI
FURTHER READING
APPENDIX
NUCLEOTIDE CODES
AMINO ACID CODES
Introduction and course concept
Welcome to this Roadshow on bioinformatics and, more specifically, on a series of resources and tools
provided and/or hosted by the European Bioinformatics Institute (EBI), http://www.ebi.ac.uk/. This
course is funded by the European Commission under SLING, “Serving Life-science Information for the
Next Generation”.
The EBI maintains the world’s most comprehensive range of molecular databases. As we move
towards understanding biology at the systems level, access to large data sets of many different types
has become crucial. Technologies such as genome-sequencing, microarrays, proteomics and
structural genomics have provided ‘parts lists’ for many living organisms.
This course aims to introduce you to sequence searching and alignment at the EBI, which we hope
will be useful for you in your immediate or future research.
There is so much we would like to show you, or at least make you aware of, and a few days are not
much time. So do come and talk to us during the coffee and lunch breaks and, of course, do not
hesitate to ask questions.
What we expect from YOU: In order to make sure you and your fellow participants can gain the most
from this roadshow, do have a look at the signs below:
SHHHHHH!!!
Switch off/silence your phones
We hope you enjoy the course, and remember: The more you interact, the better the course will be!
About this version
v1.6 May 2012
Contributing authors
Andrew Cowley – Bioinformatics trainer
Hamish McWilliam – Software engineer
Vicky Schneider – Training Programme project leader
Bioinformatics
Knowledge discovery starts by collecting, selecting and cleaning the data to finally fill a database. A
database is a collection of files of consistent data that are stored in a uniform and efficient manner.
A relational database consists of a set of tables, each storing records. A record is represented as a
set of attributes, each of which defines a property of the record. Attributes are identified by name
and store a value. All records in a table have the same number and type of attributes.
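The terms above (table, record, attribute) can be made concrete with a minimal sketch using Python's built-in sqlite3 module. The table name, attribute names and accession values are hypothetical and chosen only for illustration; they do not correspond to any EBI database schema.

```python
import sqlite3

# Hypothetical table and attribute names, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sequence_entry (
        accession   TEXT PRIMARY KEY,  -- unique identifier of the record
        organism    TEXT NOT NULL,
        description TEXT NOT NULL,
        length      INTEGER NOT NULL
    )
    """
)

# Each tuple is one record; every record has the same number and type of attributes.
records = [
    ("DEMO0001", "Homo sapiens", "example gene A", 1200),
    ("DEMO0002", "Mus musculus", "example gene A", 1185),
    ("DEMO0003", "Homo sapiens", "example gene B", 640),
]
conn.executemany("INSERT INTO sequence_entry VALUES (?, ?, ?, ?)", records)

# Attributes are addressed by name and each one stores a value.
human_rows = conn.execute(
    "SELECT accession, length FROM sequence_entry "
    "WHERE organism = ? ORDER BY accession",
    ("Homo sapiens",),
).fetchall()
print(human_rows)
```

Here the rows of `sequence_entry` are the records and its columns are the attributes; the query retrieves two attributes of every record whose `organism` attribute matches.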
Primary or archived databases contain information and annotation of DNA and protein sequences,
DNA and protein structures and DNA and protein expression profiles straight from experimental
results. Secondary or derived databases are so called because they contain the results of analysis on
the primary resources themselves, including information on sequence patterns or motifs, variants
and mutations and evolutionary relationships.
Question 1:
Which data resources do you know of?
A very important characteristic of a database record is a unique identifier. This is crucial in biology,
as in many other disciplines, given the large number of situations where an entity has many names,
or a name can refer to multiple entities. The use of an accession number, a primary key assigned by a
reference database to identify that entity's record in that database, helps to overcome
this problem. Unfortunately, different databases may use their own conventions when assigning
identifiers, so working out which identifiers in different resources refer to the same entity can be a
major challenge.
The Protein Identifier Cross-Reference Service (PICR) at the EBI (http://www.ebi.ac.uk/Tools/picr/)
is a tool to help find accession numbers and cross-references across databases. This is particularly
useful when you are using programs to interface with these databases automatically.
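The naming problem and the role of cross-references can be sketched in a few lines of Python. This is only an illustration of the idea: every accession and database name below is invented, and the functions are hypothetical helpers, not the PICR interface.

```python
# All identifiers are invented for illustration; this is not the PICR interface.
NAMES_BY_ACCESSION = {
    "DBA:0001": ["TP53", "p53", "tumour protein p53"],  # one entity, many names
    "DBA:0002": ["CAT", "catalase"],                    # "CAT" is ambiguous:
    "DBA:0003": ["CAT", "chloramphenicol acetyltransferase"],
}

# Hypothetical mapping from an accession in one database to another database.
CROSS_REFERENCES = {
    "DBA:0001": {"database_b": "B-12"},
    "DBA:0002": {"database_b": "B-77"},
}


def accessions_for_name(name):
    """Return every accession whose record carries this name (case-insensitive)."""
    wanted = name.lower()
    return sorted(
        accession
        for accession, names in NAMES_BY_ACCESSION.items()
        if wanted in (n.lower() for n in names)
    )


def cross_reference(accession, target_database):
    """Return the matching identifier in another database, or None if unknown."""
    return CROSS_REFERENCES.get(accession, {}).get(target_database)


print(accessions_for_name("CAT"))               # a name shared by two entities
print(cross_reference("DBA:0001", "database_b"))
```

A name search can return several records, whereas an accession always points to exactly one; the cross-reference table is what lets a program move between resources without relying on names.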
Database design is a critical step that covers:
i) Conceptual design: defining the data requirements of the application, including the
entities and their relationships.
ii) Logical design: implementing the database using a database management system
(DBMS), which helps to keep the process scalable.
iii) Physical design: estimating the workload and refining the database design
accordingly. During this phase the design is optimized, and indexing and clustering
approaches are implemented and tuned. These are fundamental steps in order to obtain
fast responses to frequent queries without risking the integrity of the database.
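The effect of indexing in the physical design phase can be illustrated with a small sketch, again using Python's sqlite3 module. The table, column and index names are assumptions made for this example. SQLite's `EXPLAIN QUERY PLAN` statement is used to show how the same query is answered before and after an index is created on a frequently queried column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table of 1000 toy entries spread over 50 gene names.
conn.execute(
    "CREATE TABLE entry (accession TEXT PRIMARY KEY, gene TEXT, length INTEGER)"
)
conn.executemany(
    "INSERT INTO entry VALUES (?, ?, ?)",
    [(f"DEMO{i:05d}", f"gene{i % 50}", 100 + i) for i in range(1000)],
)


def query_plan(connection, sql, params=()):
    """Return SQLite's description of how it would execute the query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text


sql = "SELECT accession FROM entry WHERE gene = ?"

plan_before = query_plan(conn, sql, ("gene7",))  # no index on gene yet
conn.execute("CREATE INDEX idx_entry_gene ON entry (gene)")
plan_after = query_plan(conn, sql, ("gene7",))   # the index is now available

matches = conn.execute(sql, ("gene7",)).fetchall()
print(plan_before)
print(plan_after)
print(len(matches))
```

The query returns the same rows in both cases; only the access path changes, from scanning the whole table to searching the index. The exact wording of the plan text varies between SQLite versions.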
Top Challenges
1) Precise, predictive model of transcription initiation and termination: the ability to predict
where and when transcription will occur in a genome (fundamental for HTS and Proteomics);
2) Precise, predictive model of RNA splicing/alternative splicing: ability to predict the splicing
pattern of any primary transcript in any tissue (fundamental for Transcriptomics and Proteomics);
3) Precise, quantitative models of signal transduction pathways: ability to predict cellular responses
to external stimuli (required in proteomics and pathways analysis);
4) Determining effective protein:DNA, protein:RNA and protein:protein recognition codes
(important for recognition of interactions among the various types of molecules);
Once we have the data we can then start examining it. For example,
clustering methods are used to identify patterns in the data, in other words
to recognise what is similar, to identify what is different and from there to
investigate whether differences are significant and biologically meaningful.
Biological research tries to recognise patterns in order to infer relationships
with previously-characterised sequences. It is essential to keep in mind that
identifying similarity between sequences (e.g. nucleotide or amino acid
sequences) is not necessarily equivalent to identifying other properties of
such sequences, for example their function.
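As a minimal sketch of the clustering idea, the code below groups toy sequences by the fraction of identical positions. It assumes sequences of equal length, a simple greedy strategy and an arbitrary 75% threshold; all three are simplifications made for illustration and not the method of any particular tool.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares sequences of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(sequences, threshold):
    """Assign each sequence to the first cluster whose representative is similar enough."""
    clusters = []  # each cluster is a list; its first member is the representative
    for seq in sequences:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])  # no similar cluster found: start a new one
    return clusters


toy = ["ACGTACGT", "ACGTACGA", "TTTTCCCC", "TTTTCCCG", "ACGTTCGT"]
clusters = greedy_cluster(toy, threshold=0.75)
for members in clusters:
    print(members)
```

The result tells us which sequences are similar to each other. Whether members of a cluster also share a function is a separate question that the similarity measure alone cannot answer.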
A crucial step in bioinformatics is to pick the appropriate representation of
the data. A simple but highly resourceful approach has been the use of
controlled vocabularies (CVs), which provide a standardised dictionary of
terms for representing and managing information. Ontologies are structured
controlled vocabularies. As well as controlling the vocabularies, the data itself
has to be standardised via different formats.
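A controlled vocabulary can be pictured as a fixed dictionary against which free-text annotation is checked. The sketch below is a hypothetical example: the terms and synonyms are invented for illustration and are not taken from any real ontology.

```python
# Hypothetical controlled vocabulary: preferred term -> accepted synonyms.
CONTROLLED_VOCABULARY = {
    "nucleus": {"cell nucleus"},
    "mitochondrion": {"mitochondria"},
    "cytoplasm": set(),
}


def normalise(term):
    """Map a free-text term to its preferred CV term, or None if it is not in the CV."""
    cleaned = term.strip().lower()
    for preferred, synonyms in CONTROLLED_VOCABULARY.items():
        if cleaned == preferred or cleaned in synonyms:
            return preferred
    return None


print(normalise(" Mitochondria "))  # a synonym is mapped to the preferred term
print(normalise("nuclues"))         # a misspelling is rejected rather than stored
```

Because every annotator ends up with the same preferred term, records annotated by different people can be searched and compared reliably.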
Question 2:
What data formats have you encountered before?
Top Challenges (continued)
5) Accurate ab initio protein structure prediction (required for proteomics and pathways analysis);
6) Rational design of small molecule inhibitors of proteins (Chemigenomics);
7) Mechanistic understanding of protein evolution: understanding exactly how new protein functions
evolve (comparative genomics);
8) Mechanistic understanding of speciation: molecular details of how speciation occurs (Comparative
genome sequences, sequence variation);
9) Continued development of effective gene ontologies - systematic ways to describe the
functions of any gene or protein (genomics, transcriptomics, proteomics).
With the advances in molecular biology, in particular the ever-growing
progress in high-throughput technologies, bioinformatics is continuously
challenged to provide efficient access to large sets of data, allowing
analysis of thousands of items at a time, perhaps with complex constraints.
Without bioinformatics it would be impossible to make sense of the huge
amount of data produced (e.g. by Omics research). Look at the increase in size
of the EMBL Nucleotide Sequence Database (EMBL-Bank): Release 101 on
26-Aug-2009 contained 163,656,234 sequence entries comprising
283,748,816,763 nucleotides.
Bioinformatics provides the structure that enables storage of information in
such a way that it is retrievable and comparable, not only to similar data but
also to other types of information. In 2002 Burge and colleagues produced a
list defining, in their view, the “Top 10 Future Challenges for Bioinformatics”.
The current challenges do not differ essentially from those listed by Burge et al.
(2002); they have simply been stretched by the volume of data now produced
(see the Top Challenges panel).
EMBL-EBI
The EMBL-EBI is one of the outstations of the European Molecular Biology Laboratory (EMBL).
Today, the European Bioinformatics Institute (EBI) holds the world’s most comprehensive collection
of databases and resources.
EBI shares its four central mission objectives with EMBL,
although focussed on bioinformatics rather than
molecular biology:
• To provide freely available data and bioinformatics
services to all facets of the scientific community in
ways that promote scientific progress.
• To contribute to the advancement of biology
through basic investigator-driven research in
bioinformatics.
• To provide advanced bioinformatics training to
scientists at all levels, from PhD students to
independent investigators.
• To help disseminate cutting-edge technologies to
industry.
But how did it all start? Walgate (1982) reported in
Nature the pressure in Europe to set up a suitable storage
system for sequence data, and the need to do so in
collaboration with the relevant journals. One of the
fundamental aspects of the discussion among interested
parties was the establishment of a central data bank where
sequences could be deposited and made freely available
to the whole scientific community. Through “information
engineering” the technical aspects were overcome,
transforming rough drafts into the final computerised
format we have today.
Nucleotide sequences and databases
The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) constitutes Europe's primary
nucleotide sequence resource. Main sources for DNA and RNA sequences are direct submissions
from individual researchers, genome sequencing projects and patent applications.
The database is produced in an international collaboration with GenBank (USA) and the DNA
Database of Japan (DDBJ), forming the INSDC (International Nucleotide Sequence Database
Collaboration). Each of the three groups collects a portion of the total sequence data reported
worldwide, and all new and updated database entries are exchanged between the groups on a daily
basis. The current database release (Release 103, March 2010) is available from the EBI servers,
together with its release notes and user manual. A publication in Nucleic Acids Research
(2009, 37: D19-D25) provides further information and details.
The EMBL nucleotide sequence database, together with the Sequence Read Archive (SRA) and Trace
Archive, forms part of the European Nucleotide Archive (ENA), and at the EBI it is maintained by the
Protein and Nucleotide Database Group (PANDA) under Ewan Birney.
ENA
ENA captures and presents information relating to experimental workflows that are based around
nucleotide sequencing. A typical workflow includes the isolation and preparation of material for
sequencing, a run of a sequencing machine in which sequencing data are produced and a subsequent
bioinformatic analysis pipeline. ENA records this information in a data model that covers:
i) input information (sample, experimental setup, machine configuration),
ii) output machine data (sequence traces, reads and quality scores) and
iii) interpreted information (assembly, mapping, functional annotation).
Data arrive at the ENA from a variety of sources. These include submissions of raw data, assembled
sequences and annotation from small-scale sequencing efforts, data provision from the major
European sequencing centres and routine and comprehensive exchange with our partners in the
INSDC. Provision of nucleotide sequence data to ENA or its INSDC partners has become a central and
mandatory step in the dissemination of research findings to the scientific community. ENA works
with publishers of scientific literature and funding bodies to ensure compliance with these principles
and to provide optimal submission systems and data access tools that work seamlessly with the
published literature. Although the ENA has almost 30 years of history, the data and services are
constantly changing to reflect growing volumes of data, ever improving sequencing technology and
the broadening of applications to which sequencing is now put.
As part of the global effort to improve access to and usability of nucleotide sequencing data, we
collaborate extensively in the development of our services and technologies and in standards
activities. Data submitted to ENA undergo a variety of validation steps that include automated
quality checking and, where possible, manual inspection and curation. More information about the
ENA and how to access the data can be found here: http://www.ebi.ac.uk/ena/
Protein sequences and databases
Just as we have the ENA for nucleotide sequence information, the EBI also hosts resources that
deal with protein sequences. The closest equivalent to EMBL-Bank is the UniProt Knowledgebase, or
UniProtKB. UniProtKB consists of two sections:
TrEMBL is made up of automatically translated and annotated EMBL-Bank entries.
Swiss-Prot is manually annotated, and importantly, reviewed, by experts. Because of this it’s much
smaller than TrEMBL, but the entries are higher quality.
As well as the core knowledgebase there are other supporting databases:
UniRef contains sequences clustered together at different identity levels.
UniParc is an archive of unique sequences (only), with a unique identifier for each one.
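The idea of an archive that stores each distinct sequence only once, with its own identifier, can be sketched as follows. This is an illustration of the principle only: the class, the checksum choice (SHA-256) and the `ARCH` identifier format are assumptions and do not describe how UniParc is actually implemented.

```python
import hashlib


class UniqueSequenceArchive:
    """Toy archive that keeps one copy of every distinct sequence."""

    def __init__(self):
        self._id_by_checksum = {}
        self._sequence_by_id = {}

    def add(self, sequence):
        """Store the sequence if it is new and return its archive identifier."""
        normalised = sequence.strip().upper()
        checksum = hashlib.sha256(normalised.encode("ascii")).hexdigest()
        if checksum not in self._id_by_checksum:
            # Hypothetical identifier format, invented for this sketch.
            new_id = f"ARCH{len(self._id_by_checksum) + 1:06d}"
            self._id_by_checksum[checksum] = new_id
            self._sequence_by_id[new_id] = normalised
        return self._id_by_checksum[checksum]

    def __len__(self):
        return len(self._sequence_by_id)


archive = UniqueSequenceArchive()
first = archive.add("MKTAYIAK")
again = archive.add("mktayiak\n")  # same sequence submitted a second time
other = archive.add("MKTAYIAR")    # differs by one residue, so it is a new entry
print(first, again, other, len(archive))
```

Submitting the same sequence twice returns the same identifier, so the archive grows only when a genuinely new sequence arrives.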
Until a few years ago, EBI and SIB together produced Swiss-Prot and TrEMBL, while PIR produced the
Protein Sequence Database (PIR-PSD). These two data sets coexisted with different protein
sequence coverage and annotation priorities. TrEMBL (Translated EMBL Nucleotide Sequence Data
Library) was originally created because sequence data was being generated at a pace that exceeded
Swiss-Prot's ability to keep up. Meanwhile, PIR maintained the PIR-PSD and related databases,
including iProClass, a database of protein sequences and curated families. In 2002 the three
institutes decided to pool their resources and expertise and formed the UniProt Consortium.
Searching databases
Having the data in databases is one thing, but that’s only half the story - we need to be able to
retrieve it for it to be useful.
Question 3:
What methods of searching databases do you know of?
At the EBI there are many different ways of retrieving data. The easiest method is if you already have
the unique identifier, as you can then perform a lookup on this information and retrieve the exact
sequence of interest. But what if you don’t have that information?
The information in sequence databases can be thought of as being contained in two parts: the
sequence data itself, and the annotation or metadata that describes that sequence. This
annotation could be anything from accession number and title to keywords describing something of
interest about the sequence. If you are interested in a particular field then you can search for terms
in a few different ways at the EBI.
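The two retrieval routes described above, direct lookup by identifier and searching the annotation, can be contrasted in a short sketch. The entries, accessions and field names are invented for illustration and do not reflect the structure of any EBI database.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    accession: str                                # unique identifier
    sequence: str                                 # the sequence data itself
    title: str                                    # annotation
    keywords: list = field(default_factory=list)  # annotation


# Hypothetical toy entries, indexed by accession.
ENTRIES = {
    entry.accession: entry
    for entry in [
        Entry("DEMO0001", "ATGGCC", "toy DNA repair gene", ["DNA repair", "tumour suppressor"]),
        Entry("DEMO0002", "ATGAAA", "toy kinase gene", ["kinase"]),
        Entry("DEMO0003", "ATGTTT", "toy repair helicase", ["DNA repair", "helicase"]),
    ]
}


def lookup(accession):
    """Direct retrieval: exactly one entry, or None if the identifier is unknown."""
    return ENTRIES.get(accession)


def search_annotation(term):
    """Text search over title and keywords: may return any number of entries."""
    needle = term.lower()
    hits = []
    for entry in ENTRIES.values():
        text = " ".join([entry.title, *entry.keywords]).lower()
        if needle in text:
            hits.append(entry.accession)
    return sorted(hits)


print(lookup("DEMO0002").sequence)
print(search_annotation("dna repair"))
```

The lookup returns one exact record, while the annotation search returns every record whose description mentions the term, which is why the quality of the annotation matters so much for retrieval.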
EBI search
The EBI Search box is on the main page and in the toolbar at the top of every page at the EBI. It
allows you to search all the main databases very quickly in one go.
Instruction:
Open a browser and navigate to www.ebi.ac.uk
It should look something like this:
Instruction:
Conduct a search for the term:
BRCA1
The results will look something like this:
The results are split into several categories, listed on the top left, while the main pane shows you
example entries in each of the categories. They can be expanded by clicking the ‘View all results’ link
at the bottom of the section. You can narrow results to a particular category by clicking on the
category name in the top-left panel.
You can then view each entry directly in the selected database by clicking on the name of the entry,
or can use the View links to explore different ways of seeing the data, for example the raw text file,
or through other databases at the EBI which look at the same data.
Below the View links are the References – these are links to database entries that reference this
sequence, for example an InterPro pattern or GO term.
Instruction:
Spend a minute exploring the results
Question 4:
How many entries are there in the nucleotide databases for BRCA1?
EBI Search – advanced
If you want more control over your search, you can perform an advanced search from the results
page of any normal search by clicking the advanced search link.
This brings up a control that allows you to search for exact phrases or to exclude specific words from
the search.
For even more control, you can search specific domains and fields by following the link from the
advanced search page.
If you really want to, you can form complex queries using the default search bar with the right query
syntax. This is beyond the scope of this tutorial, but you can read more about it here:
http://lucene.apache.org/java/2_4_0/queryparsersyntax.html
Once you’ve done an advanced search the query syntax that made up the search is shown in the
toolbar.
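To give a flavour of what such a query string looks like, here is a small sketch that assembles one in the Lucene style (`field:"value"` clauses combined with AND and NOT, as described in the Lucene documentation linked above). The helper functions are hypothetical, and the field names `gene` and `description` are assumptions; the fields actually available depend on the database being searched.

```python
def quote(value):
    """Wrap a value in double quotes, escaping backslashes and embedded quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_query(field_terms, excluded_terms=()):
    """Combine field:"value" clauses with AND, then exclude terms with NOT."""
    if not field_terms:
        raise ValueError("at least one field term is needed; NOT cannot stand alone")
    clauses = [f"{name}:{quote(value)}" for name, value in field_terms.items()]
    query = " AND ".join(clauses)
    for term in excluded_terms:
        query += f" NOT {quote(term)}"
    return query


# Field names below are assumptions made for this example.
simple = build_query({"gene": "BRCA1"}, excluded_terms=["pseudogene"])
combined = build_query({"gene": "BRCA1", "description": "breast cancer"})
print(simple)
print(combined)
```

Comparing a string built this way with the syntax shown in the toolbar after an advanced search is a quick way to learn which field names a given database accepts.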
Question 5:
How many entries are there in the EMBL-Bank (last release, normal divisions) which
have ‘BRCA1’ in their gene field? Why is this different from our answer to question 4?
Database web pages
So far we’ve looked at site-wide database searching, but each individual database (or group of
similar databases) has its own web area at the EBI as well, and many of these contain their own
powerful search methods.
There are several routes to reach database web pages from the main EBI page.
The most commonly used databases can be reached directly from the left hand column of links.
The middle section contains links to groups of databases/resources organised by type.
And of course there is the Databases drop down menu, which also contains links to groups of
databases/resources organised by type:
Finally if you’re looking for a specific database and can’t find it using the above, there is a full list of
database web pages available by following the drop down from Databases > Databases Index and
then selecting the Databases A-Z link on the left of the resulting page.
Instruction:
Spend a minute exploring different ways of navigating to databases
Sequence searching & Alignments
You will learn about:
• Homology vs. similarity
• Global and local alignments
• Substitution matrices (PAM & BLOSUM)
• The algorithms used in search tools such as BLAST and FASTA
• Multiple sequence alignments (MSAs)
• The tools available at the EBI for all of the above
Homology vs. Similarity
Since the mid-19th century, zoologists and botanists have learned to make a distinction between
homologous organs (e.g. bat's wing and human's hand) and similar (analogous) organs (e.g. bat's
wing and butterfly's wing). Homologous organs are not necessarily similar (at least the similarity may
not be obvious); similar organs are not necessarily homologous.
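Phrases like “sequence (structural) homology”, “high homology”, “significant homology”, or even
“35% homology” are common, even in top scientific journals, but they are incorrect given the
distinction above. Homology means descent from a common ancestor, so two sequences are either
homologous or they are not; it is similarity (or identity) that can be measured as a percentage. In
such phrases the term “homology” is used basically as a glorified substitute for “sequence
(or structural) similarity”.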
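What can legitimately be quoted as a percentage is identity (or similarity) over an alignment. The sketch below computes percent identity for two already-aligned sequences, assuming "-" marks a gap and counting only columns where both sequences have a residue. Other conventions for the denominator exist, so this is one possible definition rather than the definitive one.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over the columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # gap columns are skipped under this convention
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy alignment: 7 ungapped columns, 6 of them identical.
value = percent_identity("ACGT-ACGT", "ACGAAAC-T")
print(round(value, 1))
```

A result such as this supports a statement like "86% identity"; whether the two sequences are homologous is an inference about ancestry that has to be argued separately.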
Why might you need sequence similarity searching?
Sequence similarity searching is used to infer relationships between sequences. Armed with a new
piece of nucleic acid or protein sequence, you can compare it with all the sequences in the publicly
available databases to find out whether anything similar or identical has been sequenced before.
You can infer evolutionary relationships and functional information.
Let’s imagine that you are studying the effects of rhinovirus infection on koalas. You have just
sequenced a collection of expressed sequence tags (ESTs) from infected koala bronchioles. There isn’t
much koala DNA sequence in the public domain, but nevertheless, you might expect to find that some of
your EST sequences are identical to predicted transcripts from the genomic sequences of marsupials
that have been sequenced. If these marsupial genes haven’t been very well characterised, you could
look for similarities between your koala ESTs and those of a well-characterised species, such as
human or mouse. This might help you to identify some of the genes that are expressed in your sample.
You might also find some sequences that share only small regions of similarity to mammalian proteins,
but bear a close resemblance to viral sequences. In short, you can learn a lot about the types of
genes expressed in your sample through sequence similarity searching.
Rhinoviruses are the most common viral infective agents in humans, and a causative agent of the
common cold. They are lytic in nature. There are 99 recognized types of rhinoviruses that differ
based on their varying surface proteins.
ESTs are small portions of an entire gene that can be used to help identify unknown genes and to map
their positions within a genome.
The applications of sequence similarity searching are numerous, ranging from the characterization
of newly sequenced genomes, through phylogenetics, to species identification in environmental
samples. In this tutorial, we will guide you through the basic principles of how the most widely used
sequence similarity programmes work, and will help you to practice using these tools.
Sequence similarity searching: an overview
All computational methods of sequence similarity searching try to align your query sequence with
all of the sequences in a database. In the very simplest case, if we imagine our query sequence and
a sample database sequence represented on two sides of a grid, we can plot which nucleotides (or
amino acid residues) are identical in the two sequences. This is known as a dot plot.
It’s immediately obvious that there’s more than one way of aligning even a short sequence.
A dot plot is a way of visualising a pairwise sequence alignment. A grid is created with a column for
each position of one sequence and a row for each position in the other. Matches can then be marked in
the appropriate square of the grid.
There are two different algorithms that are commonly used in creating dotplots:
1) The first method involves matching identical regions of sequence and plotting a dot in these
areas.
2) The second involves using “sliding windows” to compare two sequences using a threshold score
value. A window size is selected as a run of adjacent nucleotide or amino acid residues, and a score
chosen to reflect the degree of similarity of sequence required. Each window of sequence A is
compared to each window of sequence B, and a dot is only placed in that region if the match score
equals or exceeds the set threshold level.
A score is calculated between two sequences using a two dimensional array of numbers called a matrix.
The matrix for DNA is relatively simple, with each exact base match scoring 5 and each mismatch
scoring –4. Various other scores represent matches between ambiguity codes. Matrices for amino acids
are more complex, and are discussed later in the chapter.
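The sliding-window method can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind any EBI tool: the function name dot_plot, the simple match/mismatch scores and the example sequences are assumptions made for the example. With window=1 and threshold=1 it reduces to the first method (a dot for every identical position).

```python
def dot_plot(seq_a, seq_b, window=1, threshold=1, match=1, mismatch=0):
    """Return the dot plot as a list of strings.

    Columns follow seq_a and rows follow seq_b. A '*' is placed where the
    window starting at that column/row scores at least `threshold`.
    """
    rows = []
    for j in range(len(seq_b) - window + 1):
        row = []
        for i in range(len(seq_a) - window + 1):
            # Score the two windows position by position.
            score = sum(
                match if seq_a[i + k] == seq_b[j + k] else mismatch
                for k in range(window)
            )
            row.append("*" if score >= threshold else ".")
        rows.append("".join(row))
    return rows


if __name__ == "__main__":
    # Hypothetical sequences; a window of 3 with threshold 3 keeps only
    # runs of three consecutive identical bases.
    for line in dot_plot("ATGTATACGC", "AGTATAGC", window=3, threshold=3):
        print(line)
```

Raising the window size or the threshold removes isolated dots and leaves the diagonal runs that correspond to real stretches of similarity.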
Ways of matching sequences
There is no unique, precise, or universally applicable notion of similarity. An alignment is an
arrangement of two sequences, which shows where the two sequences are similar, and where they
differ. An optimal alignment, of course, is one that exhibits the most similarities and the fewest
differences. Broadly, there are three categories of methods for sequence comparison.
• Segment methods compare all overlapping segments of a predetermined length (e.g., 10 amino
acids) from one sequence to all segments from the other. This is the approach used in dotplots.
• Optimal global alignment methods allow the best overall score for the comparison of the two
sequences to be obtained, including a consideration of gaps. These programs align sequences over
their whole length.
• Optimal local alignment algorithms seek to identify the best local similarities between two
sequences also including explicit consideration of gaps. Alignment may only be over a short span of
sequence.
Question 6:
Which of the following is a global alignment and which local?
A:
A T G T A T A C G C
A G T A T A - G C
B:
A - T G T A T A C G C
A G T A T A - - - G C
Optimal Global Alignment
Let’s look at one example for working out an optimal global alignment. Imagine we have the
following query sequence and database sequence shown below. To work out the best alignment,
we can assign a score to each dot in the grid, say 1 for every match, -3 for every mismatch and -5
for every gap. Adding these scores as we progress from top left to bottom right, and for each point
choosing the maximum possible score, we produce a matrix with the maximum alignment score in
the bottom right hand corner. Tracing back through the matrix from that maximal score reveals the
alignment that produces the maximum score; this is the optimal global alignment. The
Needleman-Wunsch algorithm, developed in 1970, works in this way. However, to do this we have to fill
in the entire grid, which is computationally intensive and therefore slow. It also seldom works well
for alignments across larger evolutionary distances because domains and motifs are shuffled.
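The grid-filling and trace-back procedure described above can be sketched as follows. This is a minimal illustration using the scores from the text (1 for a match, -3 for a mismatch, -5 for a gap); the function name and the simple linear gap penalty are assumptions for the example, and production tools such as EMBOSS needle use substitution matrices and separate gap open/extend penalties instead.

```python
def needleman_wunsch(a, b, match=1, mismatch=-3, gap=-5):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]

    # First row and column: aligning a prefix against nothing costs gaps.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill the whole grid, keeping the best of the three possible moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom right corner to the top left.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

Both the time and the memory needed grow with the product of the two sequence lengths, which is why filling the entire grid becomes impractical for database-scale searches.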
Optimal local alignment
The earliest and most rigorous local alignment method is the Smith–Waterman method. This sets all
negatively scoring cells in the grid to zero, making any positively scoring local alignments visible.
The method then traces back from the highest scoring cell in the grid until it reaches a score of
zero. The revealed local alignment is guaranteed to be the highest scoring local alignment between
the sequences for the scoring scheme used. As with the Needleman-Wunsch method, the entire grid
must be filled in, making this method very slow.
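A minimal sketch of the Smith–Waterman idea is shown below. The function name, the scores (2 for a match, -1 for a mismatch, -2 for a gap) and the linear gap penalty are arbitrary choices made for illustration; EMBOSS water and SSEARCH use substitution matrices and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Negative cells are reset to zero, so an alignment can start anywhere.
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest scoring cell until a zero is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

Unlike the global method, the unrelated flanking regions are simply left out of the reported alignment.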
Substitution matrices
Sequence similarity searches of coding sequences have to take into account the fact that some
mismatches are more conservative than others. For example, if one positively charged amino acid
residue is substituted for another, protein function is less likely to be affected than if it’s changed to
a hydrophobic residue. A conservative change should therefore incur less of a penalty than a radical
change. A range of different substitution matrices has been developed for this purpose.
Substitution of one amino acid for another
with similar physicochemical properties is
usually not selected against and
represents a conservative change.
Substitution of each amino acid for every other one
can be given a score; conservative changes are not
penalised heavily (e.g. substitution of Y for X scores
–1 in this matrix) whereas a change to an amino acid
with significantly different properties (e.g. F to P) is
penalised more.
The first of these, developed by Margaret Dayhoff and colleagues in 1978, is the Point Accepted
Mutation (PAM) matrix. Based on the occurrence of observed amino acid replacements between
closely related (>74% identity) protein sequences and normalising for divergence, Dayhoff et al.
produced a matrix of expected substitution probabilities for proteins that are 1% diverged (the
PAM1 matrix). Several versions of the PAM matrix have been derived by extrapolation from the
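PAM1 matrix, for example PAM40 (40 accepted point mutations per 100 residues) and PAM250 (250
accepted point mutations per 100 residues; because a site can change more than once, such proteins
still share roughly 20% identity). Because the PAM matrices were derived from alignments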
of closely related sequences, there are biases favouring the substitutions of amino acids that can be
achieved through single nucleotide changes in the genetic (codon) code. These biases are not
relevant over larger distances because there is plenty of time for multiple nucleotide changes, so
PAM matrices are more appropriate for searching or aligning more closely related sequences.
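The extrapolation from PAM1 to higher PAM numbers amounts to multiplying the PAM1 probability matrix by itself repeatedly. The sketch below illustrates this with a deliberately artificial four-letter alphabet in which every letter has a 1% chance of changing per step; the matrix, the alphabet size and the function names are assumptions for the example and are not Dayhoff's data.

```python
def toy_pam1(alphabet_size=4, change=0.01):
    """Toy 'PAM1': each letter stays with probability 0.99 and otherwise
    changes to one of the other letters with equal probability."""
    off = change / (alphabet_size - 1)
    return [
        [1 - change if i == j else off for j in range(alphabet_size)]
        for i in range(alphabet_size)
    ]


def mat_mul(x, y):
    """Multiply two square matrices given as lists of lists."""
    size = len(x)
    return [
        [sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def extrapolate(pam1, n):
    """Return the n-step matrix, i.e. pam1 multiplied by itself n times."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


if __name__ == "__main__":
    pam1 = toy_pam1()
    for n in (1, 40, 250):
        unchanged = extrapolate(pam1, n)[0][0]
        print(f"PAM{n}: probability that a position is unchanged = {unchanged:.3f}")
```

Even after 250 steps a noticeable fraction of positions is still (or again) unchanged, because many of the accepted mutations hit positions that have already changed.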
BLOSUM (Blocks Substitution Matrix), developed by Henikoff and Henikoff in 1992,
uses frequency tables of substitutions observed in multiple alignments. BLOSUM50 scores
according to an alignment of proteins with 50% overall identity; BLOSUM62 uses an alignment of
proteins with 62% identity. In contrast to the PAM matrices, larger BLOSUM numbers therefore
represent smaller evolutionary distances.
Using alignment tools at the EBI
Let’s have a go using these algorithms for ourselves.
Instruction:
Navigate to the EBI pairwise sequence alignment page
You can either type in an address directly (www.ebi.ac.uk/Tools/psa/) or use the Tools drop down
menu.
You will be faced with a choice of programs, split into global or local alignments.
Once you have chosen a program, the input page looks like this:
Input for pairwise alignments is very simple
Step 1 – input
Here you can enter your two sequences. Most common formats are recognised, but please don’t try
to invent your own using Word! You can either paste/type the sequence directly, or upload a file
containing it from your computer using the browse button.
Step 2 – parameters
The program will be set up with some default parameters, however you can change them if you
wish.
You can click on any parameter title to get help on it. The main options are:
Matrix: The comparison matrix to be used to score alignments when searching the database
Gap open: Penalty to start a gap in the alignment
Gap extend: Penalty for each base or residue in the gap
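To see how the two gap parameters combine, the sketch below computes the total penalty for a gap of a given length. It follows the description in the table above (one opening penalty plus an extension penalty per residue); the function name and the default values are placeholders for illustration. Some programs count the first residue as covered by the opening penalty, which the hypothetical open_covers_first switch mimics, so check the help text of the tool you are using.

```python
def gap_penalty(gap_length, gap_open=10.0, gap_extend=0.5, open_covers_first=False):
    """Total penalty for one gap of `gap_length` residues.

    open_covers_first=False: open + extend * length (as in the table above).
    open_covers_first=True:  open + extend * (length - 1).
    """
    if gap_length <= 0:
        return 0.0
    extended = gap_length - 1 if open_covers_first else gap_length
    return gap_open + gap_extend * extended


if __name__ == "__main__":
    for length in (1, 3, 10):
        print(length, gap_penalty(length))
```

Because opening a gap costs much more than extending one, a single long gap is penalised less than several short gaps of the same total length.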
Step 3 – submit
When you’re happy with everything else, select submit to run the job.
Results
Results page
The results page is fairly simple. At the top of the page are some tabs that switch between the
alignment, submission details, and the form to submit a new job. The parameters used for the job are
shown, and there is a link to the output file. The output file text is also displayed further down the
page.
Alignment
This page shows the results of the alignment. The table gives a summary of the results:
Length reports the length of alignment.
Identity reports the number of identical residues that are found aligned between the two
sequences.
Similarity reports the number of aligned residues that score positively in the substitution matrix (ie
similar types of residues).
Gaps reports the number of gaps inserted into the alignment.
Score gives the literal score of the alignment as worked out by the algorithm.
Then the alignment itself is displayed:
The top line is the first sequence and the bottom is the second sequence. Gaps in sequence are
displayed with a ‘-’ character. Where two identical residues line up they are connected with a ‘|’;
where two very similar residues line up they are connected with a ‘:’; and where less well conserved
substitutions are made they are connected with a ‘.’.
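The summary values and the markup line can be reproduced from any pair of aligned strings. The sketch below is only an illustration: the STRONG and WEAK sets are small hypothetical stand-ins for "scores positively in the substitution matrix", and the function name summarise is an assumption; the real tools derive similarity from the chosen matrix.

```python
# Hypothetical similarity groups standing in for a real substitution matrix.
STRONG = {frozenset("KR"), frozenset("DE"), frozenset("IL"), frozenset("ST")}
WEAK = {frozenset("AG"), frozenset("NQ")}


def summarise(top, bottom):
    """Return length, identity, similarity, gaps and the markup line
    for two aligned sequences of equal length ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    identity = similarity = gaps = 0
    markup = []
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            gaps += 1
            markup.append(" ")
        elif x == y:
            identity += 1
            similarity += 1  # identical residues also score positively
            markup.append("|")
        elif frozenset((x, y)) in STRONG:
            similarity += 1
            markup.append(":")
        elif frozenset((x, y)) in WEAK:
            similarity += 1
            markup.append(".")
        else:
            markup.append(" ")
    return {
        "length": len(top),
        "identity": identity,
        "similarity": similarity,
        "gaps": gaps,
        "markup": "".join(markup),
    }


if __name__ == "__main__":
    result = summarise("KA-LT", "RGCLS")
    print("KA-LT")
    print(result["markup"])
    print("RGCLS")
    print(result)
```

Dividing identity, similarity and gaps by the alignment length gives the percentages asked for in the questions that follow.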
Submission details
This page gives details of how the program was run. It tells you what version of the tool was run,
when it was launched, details of your input and the tool output, as well as the original command line
used to launch the job and details of the selected parameters. These can all be useful if you need to
recreate a job on your local machine, or to repeat an alignment in the future. This page is also very
useful to us if you have a problem and need to contact us.
Instruction:
Try running your own global alignment using the provided sequences:
• Use the EMBOSS needle program
• Leave the parameters set at their defaults
Question 7 (global):
What is the Length, Score, %identity, %similarity and %gaps of the alignment?
Length
Score
Identity%
Similarity%
Gaps%
Now let’s try a local alignment.
Instruction:
Try running your own local alignment using the provided sequences:
• This time use the EMBOSS water program
• Leave the parameters set at their defaults
Question 8 (local):
What is the Length, Score, %identity, %similarity and %gaps of the alignment?
Length
Score
Identity%
Similarity%
Gaps%
Question 9:
In words, how would you describe the key differences between the global and local
alignment results? Can you think of ways to improve the alignment?
Now let’s try changing the matrix parameter.
Instruction:
Try running your own local alignment using the provided sequences:
• Use the EMBOSS water program
• Change the matrix to BLOSUM40, and leave the other parameters at their defaults
Question 10:
What is the Length, Score, %identity, %similarity and %gaps of our new alignment?
Can you describe what has happened?
Length
Score
Identity%
Similarity%
Gaps%
Alignment against databases
While it’s possible (and very accurate) to run optimal alignments against databases (using SSearch or
GGsearch at the EBI for example), the computational requirements are such that it takes a very long
time and uses a large amount of memory. It is more practical to use a heuristic-based method such
as BLAST or FASTA.
Implementing the Methods for Sequence Searching Tools: BLAST
BLAST stands for Basic Local Alignment Search Tool and was developed by Altschul and colleagues in
1990. BLAST uses an approximation of the Smith–Waterman algorithm, which makes it quite fast;
however, this gain in speed is offset by a decrease in accuracy. Unlike the true Smith–Waterman
algorithm, BLAST is not guaranteed to find the optimal alignment between your query sequence
and the test sequences. However, it will find good alignments and provides a statistical means of
gauging your confidence in each alignment:
(1) It searches for ‘words’ of a user-defined length (the shorter the word, the more sensitive the
search).
(2) It then extends these words in both directions for as long as the alignment score keeps
improving, stopping once the score has dropped too far below its best value.
(3) It then performs an approximation of the Smith–Waterman algorithm to create a gapped
alignment between the query sequence and the test sequence.
(4) Finally it calculates and reports the probability of the alignment occurring by chance.
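Steps (1) and (2) can be sketched as a "seed and extend" routine. This is a simplified illustration and not the BLAST implementation: it uses exact word matches only, simple match/mismatch scores, and all function and parameter names are assumptions. The x_drop parameter sets how far the running score may fall below its best value before extension stops; with x_drop=0 extension stops at the first mismatch.

```python
from collections import defaultdict


def index_words(seq, word_size):
    """Map every word of length word_size to its start positions in seq."""
    index = defaultdict(list)
    for pos in range(len(seq) - word_size + 1):
        index[seq[pos:pos + word_size]].append(pos)
    return index


def extend(query, subject, q, s, step, x_drop, match, mismatch):
    """Ungapped extension from (q, s) in direction step (+1 or -1).
    Returns (best score gained, number of positions kept)."""
    score = best = best_len = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break
        q += step
        s += step
    return best, best_len


def seed_and_extend(query, subject, word_size=3, x_drop=2, match=1, mismatch=-1):
    """Return ungapped hits as (score, query_start, subject_start, length),
    best first."""
    index = index_words(subject, word_size)
    hits = set()
    for q in range(len(query) - word_size + 1):
        for s in index.get(query[q:q + word_size], []):
            right, right_len = extend(query, subject, q + word_size,
                                      s + word_size, 1, x_drop, match, mismatch)
            left, left_len = extend(query, subject, q - 1, s - 1, -1,
                                    x_drop, match, mismatch)
            hits.add((word_size * match + left + right,
                      q - left_len, s - left_len,
                      left_len + word_size + right_len))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    for hit in seed_and_extend("ACGTACGT", "TTACGTACGTTT", word_size=4):
        print(hit)
```

Because most database sequences contain no matching word at all, they are never extended or aligned, which is where the speed comes from.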
When doing a Blast search one can set the
“expectation threshold” (EXP THR) which
establishes a statistical significance threshold for
reporting database sequence matches. EXP THR
is interpreted as the upper limit on the expected
frequency of a chance occurrence of a match
within the context of the entire database search;
in other words, it sets an upper limit on the E-value. Any database sequence whose BLAST
alignment to the query sequence satisfies EXP THR is reported in the output file. An alignment with
an E value of ≥1.0 is expected to be found at least once by chance in the searched database and an E
value of ≥5.0 is expected to be found at least five times (see figure below). Raising this threshold
increases the likelihood of reporting distantly related matches, but the frequency of chance
matches reported will tend to grow at a much faster rate than real matches with EXP THR set above
1.0.
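The relationship between alignment score, search space and E-value is usually written with the Karlin–Altschul formula E = K·m·n·e^(−λS), where m is the query length, n the total database length and S the raw score. The sketch below applies it; the default values of lam and k are illustrative placeholders only (the real values depend on the scoring matrix and gap penalties and are chosen by the program), and the function names are assumptions.

```python
import math


def expect_value(raw_score, query_len, db_len, lam=0.318, k=0.134):
    """Expected number of chance alignments scoring at least raw_score.
    lam and k are placeholder values, not those of any particular search."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def chance_probability(e_value):
    """Probability of seeing at least one such chance alignment."""
    return 1.0 - math.exp(-e_value)


def is_reported(e_value, exp_thr=10.0):
    """A hit is reported when its E-value does not exceed the threshold."""
    return e_value <= exp_thr


if __name__ == "__main__":
    # Hypothetical search: 250-residue query against 100 million residues.
    for score in (40, 60, 80, 100):
        e = expect_value(score, 250, 100_000_000)
        print(score, f"{e:.3g}", is_reported(e))
```

The same score becomes less significant as the database grows, since E scales linearly with the size of the search space.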
Implementing the Methods for Sequence Searching Tools: FASTA
David Lipman and William Pearson (1988) developed FASTA, which gets around the speed problem as
follows:
(1) FASTA breaks the query and test sequences into overlapping words and looks for exact matches.
(2) Then it re-scores these matches using a substitution matrix.
(3) Next it tries to join the highest-scoring segments. A ‘joining threshold’ set by the user eliminates
segments that are unlikely to be part of the alignment.
(4) Finally, FASTA uses the Smith–Waterman method to optimise the alignment, using only the part
of the matrix that contains the top-scoring segments.
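Step (1) can be illustrated by counting exact word (ktup) matches along each diagonal of the comparison grid; diagonals with many hits mark the regions that are later re-scored, joined and optimised. This sketch covers only that first stage, and the function names are assumptions rather than anything taken from the FASTA package.

```python
from collections import Counter, defaultdict


def ktup_diagonals(query, subject, ktup=2):
    """Count exact ktup word matches on each diagonal.

    The diagonal is (subject position - query position), so all hits that
    could belong to one ungapped alignment share the same diagonal number.
    """
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)

    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], []):
            diagonals[j - i] += 1
    return diagonals


def best_diagonal(query, subject, ktup=2):
    """Return (diagonal, hit count) for the densest diagonal, or None."""
    diagonals = ktup_diagonals(query, subject, ktup)
    return diagonals.most_common(1)[0] if diagonals else None


if __name__ == "__main__":
    print(best_diagonal("ACGTACGT", "TTACGTACGTTT", ktup=4))
```

A larger ktup gives fewer, more specific hits and a faster search; a smaller ktup finds weaker similarities at the cost of speed.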
BLAST & FASTA Sensitivity
FASTA therefore provides a means of performing a sensitive search against a large database in a
reasonable time. Nowadays, it is possible to approach the sensitivity of a FASTA search by using
BLAST and setting high sensitivity values. However, the alignment statistics for FASTA can be
considered more robust than those for gapped-BLAST. This is because FASTA produces and scores an
alignment of the query sequence with a large sampling of the database, giving it a distribution of
scores that represents the entire database:sequence range of alignments. BLAST is fast because it
does not bother producing an alignment for most database sequences; alignments are only triggered if
the initial word-match criteria are met. Consequently, BLAST does not have a complete distribution of
alignment scores over the database from which to calculate the significance of the reported matches.
Instead, BLAST uses pre-computed values for the score distribution rather than calculating values
that are specific to the search carried out.
Sensitivity: There is a trade-off between sensitivity and search speed. Increasing the sensitivity
makes the search more computationally intensive and therefore slower. Decreasing the sensitivity, for
example when you are looking for almost exact matches, can dramatically increase the search speed.
There is also a trade-off between sensitivity and specificity: increasing sensitivity tends to
decrease specificity (greater propensity for chance matches). So for example: if you are looking for
vector contamination you will choose low sensitivity, whereas if you are looking for long distance
related sequence you will opt for high sensitivity.
Sequence Searching Similarity tools at the EBI
The EBI offers several sequence similarity search tools at www.ebi.ac.uk/Tools/sss. These are
maintained by the External Service (ES) team. The ES team puts considerable effort into providing
state-of-the-art sequence searching tools, together with the flexibility to tailor your queries with
the most appropriate search parameters. You will therefore find several options, not only in the
choice of tool but also in the number of variables and parameters you can change within each tool.
We have talked about BLAST and FASTA, but what are all the variations above? A quick way to
distinguish these tools is to label them as either heuristic or rigorous. A fundamental challenge in
computer science is to design algorithms that find provably good solutions within a provably bounded
amount of computation time. A heuristic algorithm gives up one or both of these goals: it produces an
acceptable solution in many practical scenarios, but there is no formal proof that the solution is
optimal. Heuristics are typically used when no method is known for finding an optimal solution under
the given constraints (of time, space etc.), or at all. A rigorous algorithm, on the other hand, is
guaranteed to return the optimal solution for the scoring scheme used; it is exhaustive and should
give you the best answer, but it is slow.
Following these definitions, the sequence searching tools mentioned above are labelled as shown in
this panel:
Using FASTA
Let’s have a go using FASTA for ourselves.
Instruction:
Navigate to the EBI sequence search tools page
You can either type in an address directly (www.ebi.ac.uk/Tools/sss/) or use the Tools drop down
menu. At the time of writing the tools drop down points to our old framework while the address goes
to the new one. You can click the link to access the new one, which we will be using in this tutorial.
The framework from which we launch our tools was revamped recently, and now forms a common basis for
many programs. For more information about this framework see: A new bioinformatics analysis tools
framework at EMBL–EBI (2010) Goujon et al. doi:10.1093/nar/gkq313
FASTA is launched by picking a type of database listed next to the FASTA tool – this sets up the
defaults appropriately for your choice, but you can always change the database once in the tool if
you wish.
Instruction:
Select the Protein database for FASTA
You should end up at a screen that looks like the following:
This screen will hopefully look quite familiar to you when conducting other searches as well. There
are four steps to submitting a job.
Step 1 – database selection
Here is where you select which databases to search against. You can select more than one, and you
can also expand subsections to narrow your search by taxonomic division for example. If you
changed your mind and want to do a different type of search (for example, against nucleotide
sequences) then you can select that here as well and it will reset the form.
Step 2 – input
Here you can enter your query sequence. Most common formats are recognised, but please don’t
try to invent your own using Word! You can either paste/type the sequence directly, or upload a file
containing it from your computer using the browse button.
Step 3 – parameters
Most important here is choice of program – FASTA is grouped together with others like SSEARCH
(because they were authored by the same person and come in the same package). It should be set
up correctly according to the button you pressed to get to this page. The program will be set up with
some default parameters, however to look at these or change them you need to click the ‘More
options’ button.
You can click on any parameter title to get help on it. Here is a summary:
Matrix: The comparison matrix to be used to score alignments when searching the database
Gap open: Penalty to start a gap in the alignment
Gap extend: Penalty for each base or residue in the gap
Ktup: ‘Word size’ used to identify runs in the first stage of alignment
Expectation upper value: This allows you to ignore results that have above a certain expectation score (ie become more distant)
Expectation lower value: This allows you to ignore results below a certain expectation (ie ignore close relatives)
DNA strand: When searching DNA you can specify which strand is used. By default both are searched
Histogram: Turn on/off the display of statistical histogram in FASTA results
Filter: Which low complexity filter to use
Statistical estimates: Which statistical method to use to evaluate values used in the Expect score calculation
Scores: Maximum number of scores reported in the summary
Alignments: Maximum numbers of alignments reported in the summary
Sequence range: Allows you to specify which portion of the query sequence to use in the search
Database range: Allows you to cut back on database sequences searched against by specifying a number of residues range
Step 4 – submit
When you’re happy with everything else, here is where you click to submit the job. For longer runs it
is recommended that you tick the box to send you an email when the job is complete. This will
contain a link back to the results so there is no need to keep your browser open. Email jobs are
usually stored on the servers for longer as well, while interactive job results are deleted more
quickly.
Once you’ve submitted your job the first thing that happens is that the input is validated – few
things are more frustrating than preparing a job and firing it off to then check your email later in the
day and find that you made a minor mistake somewhere and the job failed. If everything is okay then
the job will run and you will see a job running screen if you ran it interactively. Eventually the job will
finish and you will either be taken to the results page automatically or emailed a link to it.
Results
Summary Table
There is a lot of information to take in for the results page! You are first presented with the summary
table which quickly lists the top results in a table format. You can change the ordering of the table by
clicking the arrows next to each column header – by default they are ranked by Expect value or E().
The first column contains a tick box which allows you to select database results for further actions,
for example to view the sequence annotation or detailed alignment with the query sequence. You
can also use the buttons to clear selections, select all, or invert selection. To download the selected
sequences click the download button.
The second column (DB:ID) gives the database ID of the sequence.
The third column (Source) gives some quick information about the sequence, as well as cross
references that have been found referring to it in other resources across the EBI – so you can quickly
look up more information about the sequence in these resources.
Length reports the length of the database sequence.
Score gives the literal score of the alignment.
Identities reports the number of identical residues that are found aligned between the query and
database sequence.
Positives reports the number of aligned residues that score positively in the substitution matrix (ie
similar types of residues).
E() gives the Expect score for the alignment – this is a measure of how likely you are to find that
alignment by chance. When the numbers are very small it reports them as 1.0E-10 for example. This
is the same as 1e-10, or 0.0000000001.
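If you process results yourself, the E() notation can be read directly as a floating point number, and the expectation upper and lower values from the parameter table act as a simple filter. The sketch below uses made-up hit names and E() values purely for illustration; the function name is an assumption.

```python
def filter_by_expect(hits, upper=10.0, lower=0.0):
    """Keep (name, E) pairs whose E() lies between lower and upper.
    `hits` is a list of (name, E() as text) tuples."""
    kept = []
    for name, expect_text in hits:
        expect = float(expect_text)  # "1.0E-10" parses as 1e-10
        if lower <= expect <= upper:
            kept.append((name, expect))
    return kept


if __name__ == "__main__":
    # Hypothetical results, not taken from a real search.
    example_hits = [("hit_a", "1.0E-10"), ("hit_b", "0.5"), ("hit_c", "12")]
    print(filter_by_expect(example_hits, upper=1.0))
    print(filter_by_expect(example_hits, lower=1e-5))
```

Setting a lower value is a convenient way to hide near-identical sequences when you are only interested in more distant relatives.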
Tool Output
This tab switches the view to the raw, original output from FASTA – this can be useful when you
want to view the full text output from the program in case it contains something the summary or
other pages don’t cover. You can download it as a text or XML file.
Clicking on the icon will jump you straight to the alignment. Clicking on the sequence ID will take
you to the original database entry for that sequence.
Visual Output
This output gives you a nice way of visualising which portions of the sequence are aligning, as well as
colour-coding the alignment by Expect score.
Hovering your mouse over the sequence ID on the left-hand side will show a guide box around the
alignment. Clicking the mouse will take you to the alignment (from the raw output of the program).
Functional Predictions
This tab is another example of how we are bringing together different resources at the EBI to give
you extra information. It searches a variety of resources for family and domain predictions and
shows you the results graphically, so you can easily see which portions of your sequence and
alignments correspond to these features.
Again, you can click on the links on the left-hand side to jump to the entry information at the
resource.
Submission Details
This page contains all the original parameters used to launch your job, together with easy links to
the exact input used and the output results.
This information is really useful for several reasons: If you want to repeat a job then you might want
to use exactly the same parameters; if you’re interested in running the command line version of the
tool then this will give you the exact command line used; and finally if you need help or support then
this page contains all of the information you need to give us to be able to help you quickly.
Submit Another Job
The final tab just quickly takes you back to the page where you can start a new FASTA job.
Interpreting an alignment
The figure below shows a typical alignment
Key:
-  Gap
:  Identity
.  Similarity
X  Filtered
The header shows some information about the database sequence, followed by some of the raw
scores from the program itself. The key bits of information are the E() score, the % identity and %
similarity numbers.
Below that is the actual alignment itself – the top line is the query sequence and the bottom is the
database sequence. Gaps in sequence are displayed with a ‘-’ character (none in the above
alignment). Where two identical residues line up they are connected with a ‘:’; where two similar
residues line up they are connected with a ‘.’.
Instruction:
Try running your own FASTA search using the provided sequence:
• Search against the UniProtKB/Swiss-Prot database only – if you search against the
full UniProt Knowledgebase it will take a very long time!
• Make sure the FASTA program is selected
• Leave the parameters set at their defaults
Question 11:
What are the default gap open, gap extend, ktup and matrix parameter settings for this
search?
Question 12:
What can you say about the likely function of this protein?
Question 13:
What are the DB:IDs, Scores, %identities, %positives and E() of the top two results?
ID
Score
Identity%
Positives%
E()
Question 14:
Have a look at the actual alignments with the top two results – what can you say about
them?
Using BLAST
Now that we are familiar with running FASTA searching, using BLAST should be very easy – the
interface is effectively the same. Simply select a database next to the BLAST tool of interest to enter
the tool.
NCBI BLAST
This version of BLAST is the version maintained at the NCBI
WU-BLAST
This version of BLAST was created by Dr Warren Gish, formerly of Washington University
Both versions can trace their history back to the same algorithm, but were developed separately,
often implementing ideas that first appeared in the other version.
The parameters are handled slightly differently in each case: NCBI BLAST provides access to
parameters like those used by FASTA or SSEARCH (eg gap open, gap extend). WU-BLAST hides direct
access to those, but instead provides a sensitivity parameter which combines several adjustments at
different stages of the algorithm. The raw results are slightly different between the two, but as we
parse the results they will appear the same to you (unless you look at the Tool Output page).
NCBI BLAST parameters
Matrix: The comparison matrix to be used to score alignments when searching the database
Gap open: Penalty to start a gap in the alignment
Gap extend: Penalty for each base or residue in the gap
Exp. Thr (expectation threshold): This allows you to ignore results that have above a certain expectation score (ie become more distant)
Filter: Which low complexity filter to use
Drop off: Controls how far a potential HSP is allowed to extend
Scores: Maximum number of scores reported in the summary
Alignments: Maximum numbers of alignments reported in the summary
Sequence range: Allows you to specify which portion of the query sequence to use in the search
Gap align: Allows gapped extensions of alignments
Alignment views: Options for formatting the alignment output (Tool Output)
WU-BLAST parameters
Matrix: The comparison matrix to be used to score alignments when searching the database
Exp. Thr (expectation threshold): Allows you to ignore results with an expectation score above a certain value (ie more distant matches)
Filter: Which low complexity filter to use
View filter: Display any sequence filtered out (in Tool Output)
Sensitivity: General parameter affecting search sensitivity – this makes adjustments to several internal parameters
Scores: Maximum number of scores reported in the summary
Alignments: Maximum number of alignments reported in the summary
Sort: Choose which value to sort the Tool Output results by
Stats: Choice of statistical methods used in generation of Expect statistics
topcomboN: In WU-BLAST, HSPs are classified into a number of sets; you can use this parameter to restrict the display to only the N highest scoring sets
Alignment views: Options for formatting the alignment output (Tool Output)
BLAST results
Unsurprisingly, the results from BLAST runs on our servers are displayed in the same way as FASTA
results, and all the tabs are equivalent.
The main differences in format will only appear if you look at the raw output via the Tool Output
page.
Key
-          Gap
[residue]  Identity
+          Similarity
X          Filtered
The above picture shows the alignment section of an NCBI BLAST run. This time identical residues
are actually listed between the two sequences, which are labelled Query for query sequence and
Sbjct (subject) for database sequence. Similar residues (Positives) are indicated with a +. The number
of gaps inserted into the overlapping sequence regions is also reported.
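To make the alignment key concrete, here is a minimal sketch that builds the middle line of a BLAST-style alignment and counts identities, positives and gaps. The similarity groups are a hypothetical stand-in: a real BLAST run marks a pair with + when its score in the chosen comparison matrix is positive, which this sketch does not attempt to reproduce.

```python
# Hypothetical similarity groups, for illustration only.
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "NQ", "ST"]


def similar(a, b):
    """True if both residues fall in one of the illustrative groups."""
    return any(a in group and b in group for group in SIMILAR_GROUPS)


def midline(query, sbjct):
    """Return the line shown between Query and Sbjct."""
    if len(query) != len(sbjct):
        raise ValueError("aligned sequences must have the same length")
    out = []
    for q, s in zip(query, sbjct):
        if q == "-" or s == "-":
            out.append(" ")   # gap in either sequence
        elif q == s:
            out.append(q)     # identity: the residue itself is listed
        elif similar(q, s):
            out.append("+")   # similarity (positive)
        else:
            out.append(" ")
    return "".join(out)


def summarise(query, sbjct):
    """Count identities, positives (identities included) and gaps."""
    mid = midline(query, sbjct)
    identities = sum(1 for q, s in zip(query, sbjct) if q == s and q != "-")
    gaps = sum(1 for q, s in zip(query, sbjct) if q == "-" or s == "-")
    return {
        "length": len(query),
        "identities": identities,
        "positives": identities + mid.count("+"),
        "gaps": gaps,
    }


if __name__ == "__main__":
    query, sbjct = "MKV-LAT", "MRVELSS"
    print("Query", query)
    print("     ", midline(query, sbjct))
    print("Sbjct", sbjct)
    print(summarise(query, sbjct))
```

Dividing the identities and positives by the alignment length gives the percentages reported in the summary.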
Instruction:
Try running your own NCBI BLAST search using the provided sequence:



Search against the UniProtKB/Swiss-Prot database only – if you search against the
full UniProt Knowledgebase it will take a very long time!
Make sure the BLASTP program is selected
Leave the parameters set at their defaults
Question 15:
What are the default gap open, gap extend, drop off and matrix parameter settings for
this search?
Question 16:
What are the DB:IDs, Scores, %identities, %positives and E() of the top two results?
ID    Score    Identity%    Positives%    E()
Question 17:
Are these different from our FASTA search earlier?
Differences between BLAST and FASTA
The following table summarises the key differences between BLAST and FASTA:
BLAST                            FASTA
Fast                             Not as fast as BLAST
Good with proteins               Good with proteins and DNA
Might miss potential alignment   Aligns against all database sequences
Produces HSPs                    Produces S&W alignments
Good for siblings                Good for cousins
When to use what?
In general, the larger the database, the faster the algorithm you should use; likewise, the larger
the query sequence, the faster the algorithm you should use. For very small queries or databases,
dynamic programming methods like SSEARCH can be a good choice.
PSI-BLAST
Position-Specific Iterative BLAST, or PSI-BLAST, is a clever tool which allows you to create your own
custom scoring matrix based on the conservation of residues you find in your own searches, rather
than some model made with different sequences.
PSI-BLAST workflow
The workflow has four steps:
1. Normal BLAST search
2. Align selected results
3. Create PSSM
4. Use PSSM to score new BLAST alignment (go to Step 2 if required)
It starts with a normal BLAST search; however, you can then select which sequences in the results
will be used to build a profile. These sequences are then aligned, and conserved residues at each
position are scored more highly in a new type of matrix which allows for different scores at
different positions in the sequence.
A new BLAST search is run with this matrix, called a Position Specific Scoring Matrix or PSSM. The
results can themselves be used to create another PSSM for another run, and so the process is
iterative.
The aim of PSI-BLAST is to concentrate the alignment on positions that are important, while allowing
for more variability in areas that aren’t so important. So a functional area or binding motif might
be more important than sequence that forms part of a loop, for example.
Searches made with a PSSM can find matches with sequences that scored too low to be considered in
a normal BLAST search, but score more highly with the new matrix – these are marked as ‘new’ by the
PSI-BLAST tool.
You can also save your search and continue it at a later date, or save the PSSM itself.
The parameters for PSI-BLAST are the same as for NCBI BLAST, with the addition of a new threshold:
PSI-BLAST Threshold
This expectation value controls the default selection of sequences to be used for generation of the
PSSM – sequences with an expectation value higher than this (ie those that don’t align as well)
won’t be included.
Once the first iteration is run, additional controls over a normal BLAST result appear – the
PSI-BLAST Threshold can be changed again, or individual sequences can be added or removed from the
selection by ticking the box in the first Summary Table column.
The View Threshold limit button jumps the view down the table so you can see the cut off.
Controls to download a checkpoint file or PSSM only appear after the second iteration.
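The idea of a position specific scoring matrix can be sketched in a few lines. The example below is a simplified illustration, not the PSI-BLAST implementation: it assumes an ungapped set of aligned sequences, a uniform background frequency for all 20 amino acids and a simple pseudocount, and it ignores sequence weighting. The function names and the toy alignment are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build one dict of log-odds scores per alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(length):
        column = [seq[col] for seq in aligned if seq[col] in AMINO_ACIDS]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score(pssm, seq):
    """Score an ungapped sequence of the same length as the PSSM."""
    if len(seq) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    return sum(column[aa] for column, aa in zip(pssm, seq))


if __name__ == "__main__":
    selected = ["GKST", "GKTT", "GRST", "AKST"]  # toy 'selected results'
    pssm = build_pssm(selected)
    print(round(score(pssm, "GKST"), 2), round(score(pssm, "WWWW"), 2))
```

Unlike a general matrix, the same residue can score differently at different positions: here G is rewarded at the first position, where it is conserved, but penalised at the others.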
Instruction:
Try running your own PSI-BLAST search using the provided sequence:




Search against the UniProtKB/Swiss-Prot database only – if you search against
the full UniProt Knowledgebase it will take a very long time!
Make sure the PSI-BLAST program is selected
Leave the parameters set at their defaults
For the moment, stop after the first run
Question 18:
Looking at the first run (normal BLAST results), how many sequences score above our
default threshold of 1.0e-3?
Instruction:
For the second iteration you can choose which sequences to include in the PSSM
generation.



Untick the top scoring sequence (simply because it scores so much better than the
other results – you wouldn’t necessarily normally do this)
Leave everything else set to defaults
Click the ‘run next iteration’ button
Question 19:
Looking at the second iteration, how many sequences now score above our default
threshold of 1.0e-3? (Hint: use the View Threshold Limit button). What is a likely explanation?
Question 20:
Have any new sequences been scored well enough to appear in our results?
Filters
Filtering can eliminate statistically significant but biologically uninteresting reports from the BLAST
output by masking segments of the query sequence that tend to produce non-specific matches in
sequence similarity searches. This leaves the more biologically interesting regions of the query
sequence available for specific matching against database sequences. For example, it may be desired
to mask acidic, basic or proline-rich segments of a protein that would otherwise yield overwhelming
amounts of uninteresting, non-specific matches against a wide array of protein families. The SEG
program (Wootton and Federhen, 1993) masks low compositional complexity regions, while XNU
(Claverie and States, 1993) masks regions containing short-periodicity internal repeats. SEG+XNU
applies both of these filters. The DUST program by Tatusov and Lipman can only be used with DNA
searches and will mask simple repeats in DNA/RNA sequences.
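The effect of a low complexity filter can be illustrated with a much simpler rule than the real programs use. The sketch below is not the SEG, XNU or DUST algorithm: it slides a window along the query and masks any window whose Shannon entropy falls below a cut-off. The window size, the cut-off and the function names are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Replace residues in low-entropy windows with mask_char."""
    masked = list(seq)
    for start in range(0, len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    query = "ACDEFGHIKLMN" + "P" * 12 + "QRSTVWYACDEF"
    print(query)
    print(mask_low_complexity(query))
```

The masked residues appear as X, matching the Filtered symbol in the alignment key, and no longer contribute to matches against the database.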
Instruction:
Perform a FASTA search against the UniProtKB/Swiss-Prot database for the filtertest
sequence that the demonstrator has provided.



Make sure to select more options and set the Histogram display to YES
You could also change the Expectation upper value to 0.001 to help make the results
clearer
To see the histogram, go to the Tool Output tab in the results
Question 21:
Describe how the observed vs expected histogram looks? What does this mean? How
many results have an alignment with an expect score better than 0.001?
Instruction:
Repeat the search, but this time use the SEG filter from the more options parameters.

Make sure that the Histogram display is still set to YES, and expectation value to
0.001 if you want to clearly compare.
Question 22:
Now how does the observed vs expected histogram look? How many results have an
alignment with an expect score better than 0.001?
Homologous Over-Extension (HOE)
Iterative search strategies using profiles (ie the PSSM in PSI-BLAST) can help increase the
sensitivity of a search. However, while the aim is to have the profile reflect areas of interest (a
domain, for example), there is a danger that it will be contaminated with information that is not
relevant to your query. Low complexity regions are one example of this, but these can be dealt with
using filters. Another cause of contamination that was recently described is Homologous
Over-Extension (HOE).
HOE can occur in a profile based alignment when the alignment region picks up a portion of
sequence that is not biologically relevant to our query but that is conserved in other sequences
brought back by the search. The influence of this region on the scoring matrix can be such that the
alignment region extends even further beyond the domain of interest. This can even begin to cover a
domain that is not present in the query sequence. Once this happens, the weighting of the scoring
matrix can influence the alignment so much that sequences not at all biologically related to the
query start to be found as significant, resulting in an increase in false positives.
Ideally this is prevented by careful selection of which sequences to include in the generation of the
PSSM and making sure that they do not have other domains near the boundaries of the alignment
that might cause alignment extension – our functional prediction page might help with this. But as
this is a manual method, and domain information might not be present in the functional predictions,
we have created a method to automatically reduce the likelihood of HOE occurring by masking
boundaries at the edge of the original alignment. At the moment this method only applies to
PSI-Search – a tool that combines sensitive Smith-Waterman based local alignment with the PSI-BLAST
profile construction strategy. It is enabled by setting the option ‘HOE region masking’ to
yes (which is the default setting).
Vector Contamination
Another reason you might not get the results you are expecting is vector contamination – a
common problem if your sequence is fresh from the sequencer.
One way to check for this problem is to search against a specialist dataset containing vector
sequences only – at the EBI the EMVEC database does exactly this, and there is an
NCBI mode to perform this role.
Question 23:
A student has given you two sequences and they have forgotten whether they have
already trimmed them for vector contamination. Use the BLAST tools at EBI to determine
whether they have vector contaminants or not.
Sequence 1:
Sequence 2:
Multiple Sequence Alignment
We’ve already seen how we can apply rigorous algorithms to align a pair of sequences, but what
happens when you want to align more than two sequences? This is where multiple sequence
alignment (MSA) comes in.
Ideally, a multiple alignment would carry out rigorous alignments between every possible
combination of sequences, and then use this information to optimise a final alignment between
all of the sequences. Unfortunately this Weighted Sum of Pairs method is incredibly computationally
demanding!
Sequences   Time
2           1 second
3           150 seconds
4           6.25 hours
5           39 days
6           16 years
As a result, we have to use heuristics again to bring down alignment times to something that is
viable. One method is called progressive alignment. Here subsets of the alignments are carried out
and then fixed, to which further alignments take place. This builds up the multiple alignment in a
tree fashion.
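The timings quoted for the Weighted Sum of Pairs method are consistent with each extra sequence multiplying the run time by about 150. The sketch below reproduces that arithmetic; the factor of 150 is inferred from the first two rows of the table (150 seconds divided by 1 second) and is an assumption about how the figures were derived, not a measured property of any tool.

```python
BASE_SECONDS = 1.0  # time quoted for 2 sequences
FACTOR = 150.0      # inferred from the step from 2 to 3 sequences


def estimated_seconds(n_sequences):
    """Extrapolate the run time assuming a constant factor per added sequence."""
    if n_sequences < 2:
        raise ValueError("need at least two sequences")
    return BASE_SECONDS * FACTOR ** (n_sequences - 2)


def humanise(seconds):
    """Express a duration in the largest convenient unit."""
    units = (("years", 365.25 * 24 * 3600), ("days", 24 * 3600), ("hours", 3600))
    for unit, size in units:
        if seconds >= size:
            return f"{seconds / size:.2f} {unit}"
    return f"{seconds:.0f} seconds"


if __name__ == "__main__":
    for n in range(2, 8):
        print(n, humanise(estimated_seconds(n)))
```

Under the same assumption a seventh sequence would push the estimate into thousands of years, which is why heuristic methods are needed.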
ClustalW
ClustalW is an example of a progressive/tree based multiple sequence alignment tool. It performs a
quick pairwise alignment of the sequences before fixing the highest scoring aligned pair and treating
them as a single sequence. Other sequences are then aligned to this and fixed in turn, building up a
progressive alignment.
This works very quickly, but has some drawbacks, especially if the highest scoring pair are badly
aligned, as this alignment error will propagate through the rest of the alignment.
A guide tree is the term used for the tree created as part of a progressive alignment process, and is
used to help order and arrange sequences to be added to a multiple sequence alignment. This is not a
phylogenetic tree! A very common mistake is for people to use a guide tree as a phylogenetic tree.
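The following toy example shows how pairwise scores can decide the order in which sequences join a progressive alignment. It performs no alignment and is not ClustalW's guide tree method: similarity is approximated with the standard library's difflib.SequenceMatcher, and sequences are simply added in order of their average similarity to those already chosen. The function names and the four sequences are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Crude similarity between two unaligned sequences (0 to 1)."""
    return SequenceMatcher(None, a, b).ratio()


def join_order(sequences):
    """Return sequence names in the order they would join the alignment."""
    names = list(sequences)
    if len(names) < 2:
        raise ValueError("need at least two sequences")
    scores = {
        frozenset(pair): similarity(sequences[pair[0]], sequences[pair[1]])
        for pair in combinations(names, 2)
    }
    # Start from the highest scoring pair, which is 'fixed' first.
    first, second = max(combinations(names, 2), key=lambda p: scores[frozenset(p)])
    order = [first, second]
    remaining = [n for n in names if n not in order]
    while remaining:
        # Next, add the sequence most similar on average to those already placed.
        best = max(
            remaining,
            key=lambda n: sum(scores[frozenset((n, m))] for m in order) / len(order),
        )
        order.append(best)
        remaining.remove(best)
    return order


if __name__ == "__main__":
    seqs = {
        "a": "MKTAYIAKQRQISFVKSHFSRQ",
        "b": "MKTAYIAKQRQISFVKSHFSRE",
        "c": "MKTAYIAKWWWWWWVKSHFSRQ",
        "d": "GGGGGGGGGGPPPPPPPPPPPP",
    }
    print(join_order(seqs))
```

Because the first pair is fixed before anything else is added, a poor choice at this stage affects every later step.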
ClustalW at the EBI can be found in the sequence analysis section of the tools drop-down menu.
Instruction:
Navigate to the ClustalW page
You should see something like the following:
The layout is similar to the pairwise alignment tools we looked at earlier, with an input section and a
parameters section.
As usual, information about each parameter can be found by clicking the links above each
parameter. Key options are described below.
Alignment Type: ClustalW has the option of performing ‘slow’ or ‘fast’ initial alignments – slow is already quite quick so choose this in most cases
Matrix: The comparison matrix to be used to score alignments
Gap open: Penalty for the first residue in a gap
No end gaps: Exclude end gaps
Gap extension: Penalty for each additional residue in a gap
Gap distances: ClustalW has an additional gap separation penalty
No end gaps: When set to no this ignores the gap separation penalty at the ends of the alignment
Iteration: Iteration type
Numiter: Maximum number of iterations to run
Clustering: Neighbour Joining is the default clustering option, but UPGMA is available which might help with very large numbers of sequences
Output formats: What format you want the output file to be in
Output order: Here you can choose to keep the original input order or to order by alignment
Results
When your job has finished, you should see the following:
Like other tools in our framework, there are several tabs to switch between different results pages.
Alignments
This page shows the alignment, along with a button to download/show the alignment text file, and
another button to colour the sequences according to their physico-chemical properties:
Colour    Property                                          Residues
Red       Small (small + hydrophobic (incl. aromatic -Y))   AVFPMILW
Blue      Acidic                                            DE
Magenta   Basic - H                                         RK
Green     Hydroxyl + sulfhydryl + amine + G                 STYHCNGQ
Grey      Unusual amino/imino acids etc                     Others
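If you want to apply the same colouring to your own output, the table translates directly into a lookup. This is a minimal sketch; the dictionary and function names are made up for the example, and anything not listed in the table (including gaps) falls through to grey.

```python
# Residue groups taken from the colour table.
COLOUR_GROUPS = {
    "red": "AVFPMILW",
    "blue": "DE",
    "magenta": "RK",
    "green": "STYHCNGQ",
}


def colour_of(residue):
    """Return the colour for a single residue; anything unlisted is grey."""
    if len(residue) != 1:
        raise ValueError("expected a single character")
    residue = residue.upper()
    for colour, residues in COLOUR_GROUPS.items():
        if residue in residues:
            return colour
    return "grey"


def colour_sequence(seq):
    """Return one colour name per position in the sequence."""
    return [colour_of(residue) for residue in seq]


if __name__ == "__main__":
    print(colour_sequence("MKDS-B"))
```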
The key is similar to that which we’ve seen before for other alignment results; however, all the
sequences are lined up together vertically, and consensus symbols are displayed at the bottom of
the columns. Gaps in a sequence are displayed with a ‘-’ character. Where all sequences have the
same residue in a column a ‘*’ character is displayed beneath the column. Where similar residues
line up there is a ‘:’ character. Where less well conserved substitutions are made they are marked
with a ‘.’.
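The consensus symbols can be reproduced with a small lookup. In the sketch below, the 'strong' and 'weak' residue groups are the ones we recall from the ClustalW documentation; treat them as an assumption and check them against the documentation for your version. A column gets ‘:’ if all its residues fall within one strong group and ‘.’ if they fall within one weak group. The function names are hypothetical.

```python
# Assumed residue groups (as recalled from the ClustalW documentation).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column)
    if "-" in residues:
        return " "   # columns containing gaps get no symbol
    if len(residues) == 1:
        return "*"   # fully conserved
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"   # similar residues
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."   # less well conserved substitutions
    return " "


def consensus_line(alignment):
    """Build the symbol line shown beneath the aligned sequences."""
    return "".join(column_symbol(column) for column in zip(*alignment))


if __name__ == "__main__":
    alignment = ["MKVLA", "MRVIS", "MKV-T"]
    for row in alignment:
        print(row)
    print(consensus_line(alignment))
```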
Result Summary
The result summary page lists the files that the program produces and displays the scores table from
ClustalW, which lists the alignment scores for each pair of sequences used to make up the multiple
alignment. There is a button to launch Jalview as well.
Jalview
Jalview is a standalone multiple sequence alignment viewer that allows for more useful viewing than
simply looking at the text output of a ClustalW alignment. It can be downloaded and run on its own;
however, at the EBI we have incorporated an applet version of it into the website, so all you need to
do is click the Jalview button on the results page and, assuming Java is set up correctly on your
machine, Jalview should eventually load with your multiple sequence alignment all ready for
viewing.
Jalview is quite a powerful tool, with more options than we can go into in this document; however,
full documentation can be found at the Jalview homepage:
http://www.jalview.org/
The graphs under the alignment represent various properties:
Conservation measures the number of conserved physico-chemical properties for each column.
Quality measures the likelihood of observing mutations in a particular column – a high score
suggests there are no mutations, or that mutations found are favourable as given by the BLOSUM62
matrix.
Consensus shows the most common residue per column and the percentage of sequences that
contain this residue; by default gaps are included in this calculation.
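As an illustration of the Consensus graph, the sketch below finds the most common residue in each column and the percentage of sequences carrying it. It reflects one reading of the description above, not Jalview's own code: with gaps included, the percentage is taken over all sequences in the column; with gaps excluded, only over sequences that have a residue there. The function name is made up for the example.

```python
from collections import Counter


def consensus(alignment, include_gaps=True):
    """Return (residue, percentage) for each column of the alignment."""
    result = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            result.append(("-", 0.0))  # column made up entirely of gaps
            continue
        residue, count = Counter(residues).most_common(1)[0]
        denominator = len(column) if include_gaps else len(residues)
        result.append((residue, 100.0 * count / denominator))
    return result


if __name__ == "__main__":
    alignment = ["MK-", "MR-", "MKA", "M-A"]
    print(consensus(alignment))
    print(consensus(alignment, include_gaps=False))
```

Including gaps lowers the percentage for columns where some sequences have no residue, which makes sparsely populated columns stand out.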
Guide Tree
The next tab is the guide tree.
This tab displays any guide trees produced by the MSA tool. Please note that this is NOT a
phylogenetic tree! You can download the data for the tree via the button here, or from the link in
the Result Summary tab.
Submission Details
This tab contains all the information about how your job was submitted to the servers, including the
command line run and all the parameters. This is very useful if you want to replicate a job on
a local machine, and the information on this page is also useful to us in Support if you need help with
a problem.
Submit Another Job
This isn’t really a tab, but takes you back to the start of the form so you can submit a new job.
Instruction:
Try running your own ClustalW alignment using the provided sequences:



You can use email, but the job should be quick enough to run interactively
Leave the parameters set at their defaults
Turning on colours may make it easier to see regions of similar properties, or you
can use Jalview to display the alignment
Question 24:
This example includes the two sequences we tried to align earlier in the roadshow.
Does the multiple alignment give any insight into the result we achieved before?
So we’ve tried a fairly simple (and small) multiple sequence alignment. The next few alignments with
ClustalW replicate the errors that people ask us for help with, so you know what to do if you see
them!
Question 25:

Perform a ClustalW alignment using the file ‘Problem_MSA1.fsa’
What is the error message shown?
What is wrong with our input that caused this error?
Question 26:

Perform a ClustalW alignment using the file ‘Problem_MSA2.fsa’
What is the error message shown?
What is wrong with our input that caused this error?
Question 27:

Perform a ClustalW alignment using the file ‘Problem_MSA3.fsa’
What is the error message shown?
What is wrong with our input that caused this error?
Question 28:

Perform a ClustalW alignment using the file ‘Problem_MSA4.fsa’
What is the error message shown?
What is wrong with our input that caused this error?
Other Multiple Sequence Alignment Tools
While ClustalW is by far the most popular MSA tool currently, it has a number of drawbacks, notably
in the danger of propagating errors from the initial alignments throughout the whole alignment or
the way it deals with unusual alignments. Nonetheless its speed and wide usage make it a useful
tool.
Clustal Omega
Clustal Omega is the latest tool from the Clustal authors. It uses a number of new techniques to
significantly improve alignments over ClustalW including seeded guide tree generation and
HMM-HMM alignments.
Other MSA tools are available at the EBI to enable you to perform more accurate alignments or at
least to compare the results between different alignments. Some of these are mentioned below:
T-Coffee
T-Coffee is a tree-based variant of the COFFEE tool, which aims to keep some of the accuracy while
enabling it to be run on a viable timescale. It still has some high demands on computer hardware
however, and large jobs can take a very long time to run!
MUSCLE
MUSCLE is another progressive alignment tool, but goes about things in a much cleverer way than
ClustalW with the result that accuracy is claimed to be higher for the same or better speed.
MAFFT
MAFFT uses Fast Fourier Transforms to perform accurate and fast alignments.
Kalign
Kalign uses an approximate string matching algorithm to estimate sequence distances very rapidly,
concentrating on local regions rather than globally aligning, and is the fastest algorithm we offer for
large numbers of sequences.
The tools can all be found either from the drop down menus or from the Multiple Sequence
Alignments page at the EBI (www.ebi.ac.uk/Tools/msa):
The way the tools are used is very similar to the way ClustalW is used.
You should be able to launch Jalview for most of our MSA tools, however if the option is missing you
can launch Jalview from another tool and paste the alignment from your tool into it, to view the
alignment.
Instruction:
Try running your own alignments using the provided sequences and trying out
different MSA tools:





Choose from any or all of Clustal Omega, MUSCLE, MAFFT, T-Coffee, Kalign
See if you can tell any difference in running speed (you might not be able to –
this is a very short alignment)
Compare the alignment results with other tools including ClustalW
You might find it easier to use Jalview to compare several alignments
You can cut/paste sequences in Jalview to re-order them
Question 29:
Note any comments you have about the different alignment tools:
WebPRANK
So far the tools we’ve looked at for MSA tend towards roughly similar behaviours. It has been
speculated that this might be because they are usually benchmarked against only a few specific
datasets, and thus they tend to be optimised towards high scoring results for those tests. Also they
tend to favour multiple independent deletions over insertions, leading to sequences that shrink in
length over evolution, which isn’t a view backed up by evolutionary evidence.
PRANK is a tool which tries to address these shortcomings by using phylogenetic information to keep
track of deletions as they occur through the sequence evolution.
PRANK was developed by the Goldman group (and Ari Löytynoja in particular) at the EBI, so has its
own page as part of the Goldman research group pages, which can be found at
http://www.ebi.ac.uk/goldman/.
Instruction:
Navigate to http://www.ebi.ac.uk/goldman-srv/webprank/
As you can see, it looks a little different from our general sequence analysis tools. It should open with
the Sequence input and alignment section open, and it is here that you can paste or upload your
sequence. This is also where the Start alignment button is located.
To access the options for changing parameters, you have to click on the links below the input section.
The previously open section will contract and the new section will open and allow you to view or make
changes.
You can also retrieve previously submitted jobs, or use the webPRANK tools to view alignments from
another source (or that were previously saved).
Results
Once the job has run you have several options. You can open the results in the browser, open them
in the webPRANK viewer, or download the results. The job-ID is also listed should you want to note it
down to retrieve at another date.
The webPRANK viewer allows you to view the alignment interactively, as well as the phylogenetic
information that has helped inform the alignment. There is also a reliability score which allows you
to remove sites with lower reliability, either based on the currently selected node or on the lowest
score. This will mask portions of the alignment, allowing you to export just the higher reliability
sections.
Instruction:
Try running your own webPRANK alignment using the provided sequences.




Input the sequences in the top section
Have a look at the different options available in the other sections
But keep to defaults for this run
When the job is finished you can view the results with several methods, try the
webPRANK viewer
Question 30:
How does the alignment in webPRANK compare with ClustalW? How long is the
alignment? What are the likely reasons for this?
Getting HELP





Read the database Documentation
Frequently Asked Questions: http://www.ebi.ac.uk/help/faq.html
2can Support Portal: http://www.ebi.ac.uk/2can/
EBI Support: http://www.ebi.ac.uk/support/
Hands-on training programme: http://www.ebi.ac.uk/training/handson/
Related articles from the EBI
A new bioinformatics analysis tools framework at EMBL-EBI
[http://dx.doi.org/10.1093/nar/gkq313]
The European Bioinformatics Institute’s data resources
[http://dx.doi.org/10.1093/nar/gkp986]
Web services at the European Bioinformatics Institute-2009
[http://dx.doi.org/10.1093/nar/gkp302]
The Universal Protein Resource (UniProt) in 2010
[http://dx.doi.org/10.1093/nar/gkp846]
The IntAct molecular interaction database in 2010
[http://dx.doi.org/10.1093/nar/gkp878]
The Gene Ontology in 2010: extensions and refinements
[http://dx.doi.org/10.1093/nar/gkp1018]
The Proteomics Identifications database: 2010 update
[http://dx.doi.org/10.1093/nar/gkp964]
InterPro: the integrative protein signature database
[http://dx.doi.org/10.1093/nar/gkn785]
Reactome knowledgebase of human biological pathways and processes
[http://dx.doi.org/10.1093/nar/gkn863]
Further reading











Needleman, S. B. and Wunsch, C. D. (1970) A general method applicable to the search for similarities
in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453
Smith, T. F. and Waterman, M. S. (1981) Identification of common molecular subsequences. J. Mol.
Biol. 147, 195–197
Pearson, W. R. and Lipman, D. J. (1988) Improved tools for biological sequence comparison. Proc. Natl
Acad. Sci. U. S. A. 85, 2444–2448
Ning, Z., Cox, A. J. and Mullikin, J. C. (2001) SSAHA: a fast search method for large DNA databases.
Genome Res. 11, 1725–1729
Kent, W. J. (2002) BLAT – the BLAST-like alignment tool. Genome Res. 12, 656–664
Dayhoff, M. O., Schwartz, R. M. and Orcutt, B. C. (1978) A model for evolutionary change in proteins.
In Atlas of Protein Sequence and Structure (Ed. Dayhoff, M. O.) Vol. 5, pp. 345–352 (National
Biomedical Research Foundation)
Henikoff, S. and Henikoff, J. G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl
Acad. Sci. U. S. A. 89, 10915–10919
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990) Basic local alignment search
tool. J. Mol. Biol. 215, 403–410
Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J. (1997)
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids
Res. 25, 3389–3402
Lopez, R., Silventoinen, V., Robinson, S., Kibria, A. and Gish, W. (2003) WU-Blast2 server at the European
Bioinformatics Institute. Nucleic Acids Res. 31, 3795–3798
Mackey, A. J., Haystead, T. A. and Pearson, W. R. (2002) Getting more from less: algorithms for rapid
protein identification with multiple short peptide sequences. Mol. Cell. Proteomics 1, 139–147
Brown, N. P., Leroy, C. and Sander, C. (1998) MView: a web-compatible database search or multiple
alignment viewer. Bioinformatics 14, 380–381
Thompson, J. D., Plewniak, F., Thierry, J. C. and Poch, O. (2000) DbClustal: rapid and reliable global
multiple alignments of protein sequences detected by database searches. Nucleic Acids Res.
28, 2919–2926
Goujon, M., McWilliam, H., Li, W., Valentin, F., Squizzato, S., Paern, J. and Lopez, R. (2010) A new
bioinformatics analysis tools framework at EMBL-EBI. Nucleic Acids Res. 38, W695–W699
Gonzalez, M. W. and Pearson, W. R. (2010) Homologous over-extension: a challenge for
iterative similarity searches. Nucleic Acids Res. 38, 2177–2189
Appendix
Nucleotide codes
Code   Meaning            Etymology              Complement   Opposite
A      A                  Adenosine              T            B
T/U    T                  Thymidine/Uridine      A            V
G      G                  Guanine                C            H
C      C                  Cytidine               G            D
K      G or T             Keto                   M            M
M      A or C             Amino                  K            K
R      A or G             Purine                 Y            Y
Y      C or T             Pyrimidine             R            R
S      C or G             Strong                 S            W
W      A or T             Weak                   W            S
B      T or G or C        not A (B is next)      V            A
V      A or G or C        not T/U (V is next)    B            T/U
H      A or T or C        not G (H is next)      D            G
D      A or T or G        not C (D is next)      H            C
X/N    A or T or G or C   any                    N            .
.      .                  not A or T or G or C   .            N
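The Complement column is all you need to reverse-complement a sequence that contains ambiguity codes. The sketch below is a minimal example built from the table; the dictionary and function names are made up for illustration, and U and X are mapped following the T/U and X/N rows.

```python
# Complement of each nucleotide code, taken from the table above.
COMPLEMENT = {
    "A": "T", "T": "A", "U": "A", "G": "C", "C": "G",
    "K": "M", "M": "K", "R": "Y", "Y": "R", "S": "S", "W": "W",
    "B": "V", "V": "B", "H": "D", "D": "H",
    "N": "N", "X": "N", ".": ".",
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence in upper case."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


if __name__ == "__main__":
    print(reverse_complement("ATGC"))
    print(reverse_complement("AAKR"))
```

A code that is not in the table raises a KeyError, which is a useful early warning that the input is not a nucleotide sequence.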
Amino acid codes
Single letter code   3-letter code   Name
A                    Ala             Alanine
C                    Cys             Cysteine
G                    Gly             Glycine
H                    His             Histidine
I                    Ile             Isoleucine
L                    Leu             Leucine
M                    Met             Methionine
P                    Pro             Proline
S                    Ser             Serine
T                    Thr             Threonine
V                    Val             Valine
F                    Phe             Phenylalanine
N                    Asn             Asparagine
R                    Arg             Arginine
Y                    Tyr             Tyrosine
D                    Asp             Aspartic acid
E                    Glu             Glutamic acid
K                    Lys             Lysine
Q                    Gln             Glutamine
W                    Trp             Tryptophan
B                    Asx             Asp or Asn
Z                    Glx             Glu or Gln
X                                    Any
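Most of the tools in this handbook expect single letter codes, so a peptide written with 3-letter codes needs converting first. The sketch below does this using the table; the names are hypothetical, and the X row is left out because the table gives no 3-letter code for it.

```python
# 3-letter to single letter codes, taken from the table above.
THREE_TO_ONE = {
    "Ala": "A", "Cys": "C", "Gly": "G", "His": "H", "Ile": "I", "Leu": "L",
    "Met": "M", "Pro": "P", "Ser": "S", "Thr": "T", "Val": "V", "Phe": "F",
    "Asn": "N", "Arg": "R", "Tyr": "Y", "Asp": "D", "Glu": "E", "Lys": "K",
    "Gln": "Q", "Trp": "W", "Asx": "B", "Glx": "Z",
}


def three_to_one(peptide, sep="-"):
    """Convert e.g. 'Met-Lys-Val' to 'MKV'; case is ignored."""
    return "".join(THREE_TO_ONE[code.capitalize()] for code in peptide.split(sep))


if __name__ == "__main__":
    print(three_to_one("Met-Lys-Val"))
```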