Multiple Sequence Alignment

Module 6
Multiple Sequence Alignment
Aims

Objectives

Introduction
The result of searching databases using the techniques described in Module 4 is the establishment
of a list of sequences, either protein or nucleotide, which exhibit significant similarity and are
therefore inferred to be homologous. These sequences can then be subjected to multiple sequence
alignment, a process that attempts to place in the same column those residues that derive from a
common ancestral residue by substitution. Gaps can be inserted to represent residues lost or
gained by insertions and deletions (indels). The most successful alignment is the one that most
closely represents the evolutionary history of the sequences.
Why should one want to carry out this form of analysis? The most obvious reason is to attempt a
phylogenetic analysis of the sequences so as to construct evolutionary trees (a major area of
Bioinformatics, but one which would require a course to itself). However, there are other
very important reasons for carrying out multiple sequence alignments of related sequences. These
include the identification of functional sites, the identification of modules in multimodular
proteins (see Module 5), protein structure prediction (see Module 7), the detection of weak
similarities in databases using profiles (see PSI-BLAST) and the design of PCR primers for the
identification of related genes.
Global versus local alignments
Things would be much simpler if we only considered sequences that are homologous over their
entire length and could be globally aligned. However, homology is often restricted to certain
regions of sequence. Many proteins are multi-modular (as was discussed in Module 5) and the
shuffling of modules is part of the evolutionary process. Consequently any attempt to align, over
their entire length, a group of sequences that share some, but not all, of their modules would be
bound to lead to errors. In such a case a series of multiple local sequence alignments, one for
each of the modules, would be appropriate.
Substitutions and Gaps
In trying to establish the evolutionary trajectories of a group of related sequences the same
problem is encountered as in pairwise alignment, namely how to deal with substitutions and
gaps. The solution is the same: the use of gap penalties, gap extension penalties and
substitution matrices such as PAM and BLOSUM (see Module 4).
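As a rough illustration of how such a scoring scheme works, the sketch below scores two sequences that have already been aligned. The match and mismatch values stand in for a real substitution matrix and are purely illustrative (they are not PAM or BLOSUM values). The function name and the convention that the first position of a gap costs the gap penalty while each further position costs the gap extension penalty are assumptions made for this example.

```python
def score_aligned_pair(a, b, match=2, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two already-aligned, equal-length strings ('-' marks a gap).

    match/mismatch are toy stand-ins for a substitution matrix lookup.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap_a = in_gap_b = False  # are we inside a run of gaps in a or in b?
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # a column holding only gaps says nothing about this pair
        if x == "-":
            score += gap_extend if in_gap_a else gap_open
            in_gap_a, in_gap_b = True, False
        elif y == "-":
            score += gap_extend if in_gap_b else gap_open
            in_gap_a, in_gap_b = False, True
        else:
            score += match if x == y else mismatch
            in_gap_a = in_gap_b = False
    return score


# Three matches (+6), one gap opening (-5) and one gap extension (-1)
print(score_aligned_pair("AC--T", "ACGGT"))
```

With these toy values a single two-residue gap costs 6, whereas two separate one-residue gaps would cost 10, which is why an extension penalty smaller than the opening penalty favours fewer, longer indels.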
Multiple sequence alignment algorithms
There are essentially four major approaches to multiple sequence alignment:
• Optimal global sequence alignment
• Progressive global alignment
• Block-based global alignment
• Motif-based local alignment
Optimal global sequence alignment
As the name suggests, this approach attempts to align sequences along their entire length. The
term ‘optimal’ is used in its mathematical sense, in that it will give the best alignment amongst
all the possible solutions for a given scoring scheme. Whether the optimal alignment corresponds
with the biologically correct alignment will depend on the scoring scheme, in particular the
substitution matrix and the gap penalties used. Optimal global sequence alignment
programs are very computationally intensive and the complexity of the task increases
exponentially with the number of sequences. In consequence there are few programs which employ
this approach. One that is available on the Web is MSA.
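To give a feel for this exponential growth, the sketch below counts the work involved if pairwise dynamic programming is extended directly to N sequences, each of the same length. This is an assumption made for illustration and is not a description of how the MSA program is implemented. In that setting the scoring lattice has (L + 1)^N cells and each cell is reached from up to 2^N - 1 neighbouring cells. The function name is hypothetical.

```python
def dp_lattice_size(n_sequences, length):
    """Return (cells, predecessors per cell) for a full N-dimensional lattice."""
    cells = (length + 1) ** n_sequences      # one axis of length L + 1 per sequence
    predecessors = 2 ** n_sequences - 1      # every residue/gap choice except all gaps
    return cells, predecessors


for n in (2, 3, 5, 10):
    cells, preds = dp_lattice_size(n, 100)
    print(f"{n} sequences of length 100: {cells:.3e} cells, {preds} predecessors each")
```

Going from two to three sequences of 100 residues already raises the number of cells from about ten thousand to about one million.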
Progressive global alignment
Progressive global alignment employs multiple pairwise alignments in a series of three steps:
• Estimate alignment scores between all possible pairwise combinations of sequences in the set
• Build a ‘guide tree’ determined by the alignment scores from the previous step
• Align the sequences on the basis of the guide tree
Each step can be carried out in a number of ways, e.g. the first step can be carried out by
dynamic programming or by heuristic algorithms, the former giving more accurate scores and the
latter being faster. Progressive global alignment is the most commonly used method for aligning
nucleotide and protein sequences and the best known programs employing this approach are CLUSTALX
and CLUSTALW, which can be used at a number of web sites: PIR, BCM Search Launcher
and EBI (a sophisticated version).
[Figure: the input screen for CLUSTALW at BCM Search Launcher]
[Figure: a typical CLUSTALW output screen]
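The first two steps of a progressive alignment can be sketched in a few lines. The example below is a simplified illustration and not the CLUSTALW implementation: it assumes a heuristic distance based on shared two-residue words (1 minus the Jaccard similarity of the word sets) and builds the guide tree by average-linkage clustering. The function names and the toy sequences are hypothetical. The third step, aligning sequences and groups of sequences in the order given by the tree, is not shown.

```python
from itertools import combinations


def kmer_set(seq, k=2):
    """All overlapping words of length k found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=2):
    """Heuristic distance: 1 - (shared words / all words seen in either sequence)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def guide_tree(seqs, k=2):
    """seqs maps name -> sequence; returns the guide tree as nested tuples."""
    # Step 1: distances between all possible pairs of sequences
    dist = {frozenset(pair): kmer_distance(seqs[pair[0]], seqs[pair[1]], k)
            for pair in combinations(seqs, 2)}

    # Step 2: repeatedly join the two closest clusters (average linkage)
    clusters = {name: [name] for name in seqs}  # subtree -> names of its leaves

    def average(c1, c2):
        pairs = [(x, y) for x in clusters[c1] for y in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average(*p))
        leaves = clusters.pop(c1) + clusters.pop(c2)
        clusters[(c1, c2)] = leaves
    return next(iter(clusters))


toy = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTGGGG"}
print(guide_tree(toy))  # the most similar pair, a and b, is joined first
```

The nested tuples give the order of alignment: the innermost pair is aligned first and the remaining sequences are added as the tree is climbed.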
Block-based global alignment
The principle of block-based global alignment is to divide the sequences into blocks which,
depending on the program, are either exact (identical regions of sequence) or not exact, and
either uniform (found in every sequence) or not uniform. Once the blocks have been defined, other
approaches are employed to align the regions between the blocks. Examples of block-based global
alignment programs available on the Web are DCA and DIALIGN2.
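The terms ‘exact’ and ‘uniform’ can be made concrete with a small sketch that finds words of a fixed length occurring, without any mismatch, in every sequence of a set. This only illustrates what an exact, uniform block is; it is not the algorithm used by DCA or DIALIGN2, and the function name and toy sequences are hypothetical.

```python
def exact_uniform_blocks(seqs, k):
    """Return the words of length k present in every sequence, sorted."""
    common = None
    for s in seqs:
        words = {s[i:i + k] for i in range(len(s) - k + 1)}
        # keep only the words that every sequence seen so far contains
        common = words if common is None else common & words
    return sorted(common or [])


print(exact_uniform_blocks(["AAGGCTT", "TGGCAA", "CCGGCA"], 3))  # ['GGC']
```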
Motif-based local alignment
Most recent local alignment programs employ computationally efficient heuristics to solve
optimization calculations for local alignments. The Gibbs iterative sampling approach is used to
find blocks in programs such as the excellent MACAW. Unfortunately MACAW, although
available as freeware, is not available as a Web-based application. However, MEME is Web-based.
[Figure: a typical MACAW analysis showing the related sequence blocks]
Which method to use
Optimal global alignment programs are rarely employed because of their computationally
intensive requirements and their inability, at present, to handle more than a very small number of
sequences. Thus when the sequences to be aligned are homologous over their entire length a
progressive global alignment program such as CLUSTALW should be used. Where the sequences
share conserved modules in a consistent order, separated by non-conserved regions, a
block-based global alignment program such as DCA or DIALIGN2 is appropriate. Where the
sequences share conserved modules, but the order of modules is not consistent, a motif-based
local alignment program such as MEME is the approach of choice.
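The guidance in this section can be summarised as a small decision function. This is only a restatement of the text above in code form; the function and parameter names are hypothetical.

```python
def suggest_method(homologous_over_full_length, shared_modules=False,
                   consistent_module_order=False):
    """Map the properties of a set of sequences to the approach suggested above."""
    if homologous_over_full_length:
        return "progressive global alignment (e.g. CLUSTALW)"
    if shared_modules and consistent_module_order:
        return "block-based global alignment (e.g. DCA or DIALIGN2)"
    if shared_modules:
        return "motif-based local alignment (e.g. MEME)"
    return "no recommendation given in this section"


print(suggest_method(False, shared_modules=True, consistent_module_order=False))
```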
Multiple sequence alignment file types
The various multiple sequence alignment programs require different input file types and
there are also a variety of output file types. The sequences to be aligned are usually placed in a
single file, commonly in the FASTA format. The common output file formats are: NBRF/PIR,
EMBL/SWISSPROT, Pearson (FASTA), Clustal (*.aln), GCG/MSF (Pileup), GCG9/RSF and GDE
flat file.
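Since FASTA is the usual input format, it may help to see how little structure it has: each record is a header line beginning with ‘>’ followed by one or more lines of sequence. The sketch below reads such text into a dictionary. It is a minimal parser written for illustration, assuming that the first word of the header is a unique identifier; the function name and the two toy records are hypothetical.

```python
def parse_fasta(text):
    """Return {identifier: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""  # first word of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence may be wrapped over many lines
    return {n: "".join(parts) for n, parts in records.items()}


example = """>seq1 toy record
MKTAYIAK
QRQISFVK
>seq2 another toy record
MKTAHIAK
"""
print(parse_fasta(example))
```

A multiple alignment written in FASTA format looks the same, except that the sequences contain ‘-’ characters for gaps and all have the same length.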
Editing alignments and producing figures
It is best practice, once an alignment has been produced, to check it by eye to detect obvious
errors. It is also desirable to be able to produce a high quality coloured image of the alignment
which illustrates where conserved and semi-conserved residues occur. There are a variety of
Web-based tools for these purposes including JALVIEW, CINEMA, SEAVIEW and
BOXSHADE.
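The idea behind shading conserved columns can be sketched as follows. For each column of an alignment the function below reports the fraction of sequences that carry the most common residue, with gaps counting against conservation. This is one simple measure chosen for illustration and is not the rule used by BOXSHADE or the other tools; the function name and the toy alignment are hypothetical.

```python
from collections import Counter


def column_conservation(aligned):
    """aligned: list of equal-length strings with '-' for gaps."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("need at least one sequence, all of equal length")
    result = []
    for column in zip(*aligned):
        residues = [c for c in column if c != "-"]
        if not residues:
            result.append(0.0)  # a column containing only gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        result.append(top_count / len(column))
    return result


for i, value in enumerate(column_conservation(["ACG-", "ACGT", "AGGT"]), start=1):
    print(f"column {i}: {value:.2f}")
```

Columns scoring 1.0 are fully conserved; a viewer would typically give these the strongest shading.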
EXERCISES
1.
1.11 Search Entrez (proteins) to find records relating exclusively to human leukocyte surface
antigens.
1.12 Select them all by clicking in the selection boxes.
1.13 Get them displayed in the FASTA format.
1.14 Set the display to plain text.
1.15 Select all and copy.
1.16 Go to the CLUSTALW site at BCM Search Launcher.
1.17 Paste from the clipboard into the text box and press submit.
1.18 Once you get the results you can scroll down the page and you will find the alignment in
FASTA format highlighted in green.
1.19 Select the whole of the green region and copy.
1.20 Scroll to the bottom of the screen and click the link to BOXSHADE.
1.21 Select your output format (determined by how you want to use the output file).
1.22 Select ‘other’ from ‘Input sequence format’.
1.23 Paste the FASTA alignment into the text window.
1.24 Press ‘Run Boxshade’.
1.25 Once the program has run you will get a page with a hyperlink to your output file, which
you can then download for further use.
2.
Go back to your results at BCM Search Launcher and try looking at them in the Java
Alignment Viewer.
3.
Repeat the process but now use a search term of your own. (Don’t try to align more than
about 10-15 sequences; you may have to refine your initial search.)