Module 6 Multiple Sequence Alignment

Aims

Objectives

Introduction

The result of searching databases using the techniques described in Module 4 is a list of sequences, either protein or nucleotide, which exhibit significant similarity and are therefore inferred to be homologous. These sequences can then be subjected to multiple sequence alignment, a process that attempts to place in the same column residues that derive from a common ancestral residue by substitution. Gaps can be inserted to represent residues lost or gained by insertions and deletions (indels). The most successful alignment is the one that most closely represents the evolutionary history of the sequences.

Why should one want to carry out this form of analysis? The most obvious reason is to attempt a phylogenetic analysis of the sequences so as to construct evolutionary trees (a major area of Bioinformatics, but one which would require a course to itself). However, there are other very important reasons for carrying out multiple sequence alignments of related sequences. These include the identification of functional sites, the identification of modules in multimodular proteins (see Module 5), protein structure prediction (see Module 7), the detection of weak similarities in databases using profiles (see PSI-BLAST) and the design of PCR primers for the identification of related genes.

Global versus local alignments

Things would be much simpler if we only considered sequences that are homologous over their entire length and could be globally aligned. However, homology is often restricted to certain regions of sequence. Many proteins are multi-modular (as was discussed in Module 5) and the shuffling of modules is part of the evolutionary process. Consequently any attempt to align, over their entire length, a group of sequences that share some, but not all, of their modules would be bound to lead to errors.
In such a case a series of multiple local sequence alignments, one for each of the modules, would be appropriate.

Substitutions and Gaps

In trying to establish the evolutionary trajectories of a group of related sequences the same problem is encountered as in pairwise alignment, namely how to deal with substitutions and gaps. The solution is also the same: the use of gap penalties, gap extension penalties and substitution matrices such as PAM and BLOSUM (see Module 4).

Multiple sequence alignment algorithms

There are essentially four major approaches to multiple sequence alignment:

1. Optimal global sequence alignment
2. Progressive global alignment
3. Block-based global alignment
4. Motif-based local alignment

Optimal global sequence alignment

As the name suggests, this approach attempts to align sequences along their entire length. The term ‘optimal’ is used in its mathematical sense: it gives the best alignment amongst all the possible solutions for a given scoring scheme. Whether the optimal alignment corresponds with the biologically correct alignment will depend on a variety of factors such as the substitution matrix used, the gap penalty and the scoring scheme. Optimal global sequence alignment programs are very computer intensive and the complexity of the task increases exponentially with the number of sequences. In consequence there are few programs which employ this approach. There is one available on the Web, MSA.

Progressive global alignment

Progressive global alignment employs multiple pairwise alignments in a series of three steps:

1. Estimate alignment scores between all possible pairwise combinations of sequences in the set
2. Build a ‘guide tree’ determined by the alignment scores from the previous step
3. Align the sequences on the basis of the guide tree

Each step can be carried out in a number of ways, e.g. the first step can be carried out by dynamic programming or by heuristic algorithms, the former giving more accurate scores and the latter being faster.
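
The first two of these steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used by any particular program. The function names (nw_score, pairwise_scores, guide_tree) are hypothetical, the scores of +1 for a match, -1 for a mismatch and -2 per gap position are arbitrary stand-ins for a substitution matrix and gap penalties, and simple average-linkage clustering is assumed for building the tree. Step 3, the alignment itself, is left out.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of a and b by dynamic programming.

    The scoring values are illustrative assumptions; real programs use
    substitution matrices (PAM, BLOSUM) and gap extension penalties.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def pairwise_scores(seqs):
    """Step 1: score every pair of sequences (seqs maps name -> sequence)."""
    return {frozenset((x, y)): nw_score(seqs[x], seqs[y])
            for x, y in combinations(seqs, 2)}


def guide_tree(seqs):
    """Step 2: repeatedly join the two most similar clusters.

    Similarity between clusters is the mean pairwise score of their
    members (average linkage, an assumption made for this sketch).
    The tree is returned as nested tuples of sequence names.
    """
    scores = pairwise_scores(seqs)
    clusters = [(name, [name]) for name in seqs]  # (subtree, leaf names)

    def similarity(c1, c2):
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(scores[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: similarity(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Three made-up sequences: s1 and s2 differ at one position, s3 is unrelated.
example = {"s1": "GATTACA", "s2": "GATTTCA", "s3": "CCCCCCCC"}
print(guide_tree(example))  # ('s3', ('s1', 's2'))
```

The nested tuples give the order in which step 3 would proceed: s1 and s2 are aligned first, and s3 is then added to that alignment.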

Progressive global alignment is the most commonly used method for aligning nucleotide and protein sequences. The best known programs employing this approach are CLUSTALX and CLUSTALW, which can be used at a number of web sites: PIR, BCM Search Launcher and EBI (a sophisticated version). This shows the input screen for CLUSTALW at BCM Search Launcher and this shows a typical output screen.

Block-based global alignment

The principle of block-based global alignment is to divide the sequences into blocks which, depending on the program, may be exact (identical regions of sequence) or inexact, and uniform (found in every sequence) or non-uniform. Once the blocks have been defined, other approaches are employed to align the regions between the blocks. Examples of block-based global alignment programs available on the Web are DCA and DIALIGN2.

Motif-based local alignment

Most recent local alignment programs employ computationally efficient heuristics to solve the optimization calculations for local alignments. The Gibbs iterative sampling approach is used to find blocks in programs such as the excellent MACAW. Unfortunately MACAW, although available as freeware, is not available as a Web-based application. However, MEME is Web-based. Below is a typical MACAW analysis showing the related sequence blocks.

Which method to use

Optimal global alignment programs are rarely employed because of their computationally intensive requirements and their inability, at present, to handle more than a very small number of sequences. Thus when the sequences to be aligned are homologous over their entire length, a progressive global alignment program such as CLUSTALW should be used. Where the sequences share conserved modules in a consistent order, separated by non-conserved regions, a block-based global alignment program such as DCA or DIALIGN2 is appropriate.
Where the sequences share conserved modules, but the order of modules is not consistent, a motif-based local alignment program such as MEME is the approach of choice.

Multiple sequence alignment file types

The various multiple sequence alignment programs require different input file types and there are also a variety of output file types. The sequences to be aligned are usually placed in a single file, commonly in the Fasta format. The common output file formats are: NBRF/PIR, EMBL/SWISSPROT, Pearson (Fasta), Clustal (*.aln), GCG/MSF (Pileup), GCG9/RSF and GDE flat file.

Editing alignments and producing figures

It is best practice, once an alignment has been produced, to check it by eye to detect obvious errors. It is also desirable to be able to produce a high quality coloured image of the alignment which illustrates where conserved and semi-conserved residues occur. There are a variety of Web-based tools for these purposes including JALVIEW, CINEMA, SEAVIEW and BOXSHADE.

EXERCISES

1.
1.1 Search Entrez (proteins) to find records relating exclusively to human leukocyte surface antigens.
1.2 Select them all by clicking in the selection boxes.
1.3 Get them displayed in the FASTA format.
1.4 Set the display to plain text.
1.5 Select all and copy.
1.6 Go to the CLUSTALW site at BCM Search Launcher.
1.7 Paste from the clipboard into the text box and press submit.
1.8 Once you get the results you can scroll down the page and you will find the alignment in FASTA format highlighted in green.
1.9 Select the whole of the green region and copy.
1.10 Scroll to the bottom of the screen and click the link to BOXSHADE.
1.11 Select your output format (determined by how you want to use the output file).
1.12 Select other from Input sequence format.
1.13 Paste the FASTA alignment into the text window.
1.14 Press ‘Run Boxshade’.
1.15 Once the program has run you will get a page with a hyperlink to your output file, which you can then download for further use.

2. Go back to your results at BCM Search Launcher and try looking at them in the Java Alignment Viewer.

3. Repeat the process but now use a search term of your own. (Don’t try to align more than about 10-15 sequences; you may have to refine your initial search.)
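
Before pasting a set of sequences into an alignment server it can be useful to check how many records the FASTA text contains. The following is a minimal sketch in Python, assuming the usual layout of a header line beginning with ">" followed by one or more sequence lines. The function name read_fasta and the two example records are made up for illustration.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new record starts; the header is the line without ">".
            records.append([line[1:], ""])
        elif line and records:
            # Sequence lines are joined onto the most recent record.
            records[-1][1] += line
    return [(header, seq) for header, seq in records]


# Two made-up records, used only to show the expected layout.
example = """>seq1 hypothetical record
MKTAYIAK
QRQISFVK
>seq2 hypothetical record
MKTAHIAK
"""

records = read_fasta(example)
print(len(records), "sequences")
if len(records) > 15:
    print("Consider refining the search before aligning.")
```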