Multiple Sequence Alignment
• What is it
• Why do we use it
• How to use it
• Tools
• ClustalW
• Exercise

Multiple Sequence Alignment
• Many genes are represented in highly conserved forms in a wide range of organisms
• Patterns of change in these gene sequences can be analyzed by aligning the sequences simultaneously, which identifies conserved regions
• This is known as multiple sequence alignment (MSA)
• A multiple alignment arranges a set of sequences so that positions believed to be homologous are written in a common column

Applications of Multiple Sequence Alignment
• Predict protein function
• Predict protein structure (using structure superposition programs)
• Predict the evolutionary history of sequences (using phylogenetic analysis programs)
• Contig assembly (shotgun sequences and ESTs)
• Identify new family members
• Design PCR primers for amplification of related sequences
• Search databases with the consensus sequence to identify other sequences with a similar pattern

Multiple Sequence Alignment Guidelines
• Select the sequences carefully. Make sure they are members of the same family and all share a common ancestor
• Use protein sequences if possible. Translate if necessary, then convert back to DNA after the alignment
  • Protein sequences are three times shorter than the coding DNA and use a more informative alphabet
  • If there is little signal at the amino acid level, there will be no signal at the nucleotide level
• If you are interested in non-coding sequences you have no choice but to align DNA. Beware: DNA alignment is tricky and needs a very high level of conservation

Multiple Sequence Alignment Guidelines Cont.
• Ensure that at least half of the sequences share more than 30% identity, and avoid sequences that have more than 90% identity to another sequence
  • An alignment that contains only very similar sequences is not very informative
  • If each sequence is between 30% and 70% identical to half of the sequences in the set, you have made a reasonable compromise between new information and alignment quality

Multiple Sequence Alignment Guidelines Cont.
• Start with 10-15 sequences and avoid aligning more than 50 sequences (if you do, employ a high level of manual curation)
  • Multiple alignment programs are not good at handling large sets of sequences
  • Visualizing a large alignment is difficult; if it spans more than one page, interpretation can become difficult if not impossible
  • Aligning many sequences is computationally expensive and public servers have limited resources, so a run may take a long time. That makes it hard to fine-tune alignment parameters or try alternative sequences

Multiple Sequence Alignment Guidelines Cont.
• Tree building and structure prediction programs do not handle big alignments well
• Large alignments are hard to get right and less reliable, which makes it difficult to be confident that every sequence you have included really belongs to the family. It is best to start small and gradually increase the size of the alignment
• Before adding a sequence to a multiple alignment, you can check whether it is a good choice by doing a pairwise comparison

Multiple Sequence Alignment Guidelines Cont.
• Use sequences of similar length. Programs have problems aligning partial sequences with complete ones
• Repeated domains are problematic for alignment programs, especially if the number of domains differs between sequences
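The identity guidelines and the pairwise check suggested above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (global_align, percent_identity, is_good_addition) are hypothetical, the scoring scheme (match 1, mismatch -1, gap -2) is an arbitrary choice, and identity is defined here as matches divided by the number of aligned columns. A real analysis would use a substitution matrix and an established pairwise aligner.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty (toy scoring)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a, b):
    """Identical columns as a percentage of all aligned columns (assumed definition)."""
    al_a, al_b = global_align(a, b)
    if not al_a:
        return 0.0
    matches = sum(x == y for x, y in zip(al_a, al_b))
    return 100.0 * matches / len(al_a)


def is_good_addition(candidate, members, low=30.0, high=90.0):
    """Apply the slide guideline: reject a candidate that is more than `high`
    percent identical to any member, and require more than `low` percent
    identity with at least half of the members."""
    idents = [percent_identity(candidate, s) for s in members]
    if any(i > high for i in idents):
        return False
    return sum(i > low for i in idents) >= len(members) / 2
```

For example, is_good_addition(new_seq, current_set) could be called before each sequence is added, so the set grows gradually as recommended above.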
• Name sequences appropriately
  • Never use white space in a name: write clone2 or clone_2, not clone 2
  • Do not use special symbols; stick to plain letters, numbers and the underscore
  • Do not use names longer than 15 characters
  • Use a unique name for each sequence
  • Use informative names (OSJLBa0001A01f compared to Main_Clone1)

EXPASY INTEGRATED BLAST & MSA SERVER

EXPASY INTEGRATED BLAST & MSA SERVER (databases and options)
• Output of search displayed
• Links to Pfam
• Scroll down
• View Alignments (helps inform selection)
• Make selections for inclusion in MSA
• Send your selections options
• Select your sequences in FASTA format
• Send your selections options
• Example selected sequences
• Note the range of scores and E values selected

Accession  SwissProt entry  Description [organism]                           Score  E value
P20472     PRVA_HUMAN       Parvalbumin alpha [PVALB] [Homo sapiens          186    9e-47
P80079     PRVA_FELCA       Parvalbumin alpha [PVALB] [Felis silvestris...   162    1e-39
P02626     PRVA_AMPME       Parvalbumin alpha [Amphiuma (Salamand...         109    1e-23
P02619     PRVB_ESOLU       Parvalbumin beta [Esox lucius (Northern pike)]   95     2e-19
P43305     PRVU_CHICK       Parvalbumin, thymic CPV3 (Parvalbumin 3) [G      92     2e-18
P32930     ONCO_HUMAN       Oncomodulin (OM) (Parvalbumin beta) [OCM]        89     2e-17
Q91482     PRV1_SALSA       Parvalbumin beta 1 (Major allergen Sal s 1)...   85     3e-16
P02620     PRVB_MERME       Parvalbumin beta [Merluccius merluccius (Eu...   80     7e-15
P02622     PRVB_GADCA       Parvalbumin beta (Allergen Gad c 1) (Gad c ...   74     5e-13

Multiple Sequence Alignment Software
• ClustalW (Unix, Mac, PC, VMS)
• ClustalX (IGBMC, EBI) (graphical interface) (Unix, Mac, PC, VMS)
• Multalin
• MSA (Unix)
• DIALIGN (Unix)
• DCA (Unix)
• Multiple alignment by randomized iterative strategy (Unix)
• MACAW (Mac, PC)
• T-Coffee (Unix)
• MAFFT (Linux, Unix, Windows XP, Mac OS X)

Multiple Sequence Alignment Online Tools
• ClustalW at EBI (Hinxton, UK). Display and edit alignments with JalView
• ClustalW, Multalin at PBIL (Lyon, France). Colored alignments and secondary structure predictions
• ClustalW, MAP, PIMA at BCM
• MSA, ClustalW, ctree at IBC (St Louis, USA)
• Multalin at INRA (Toulouse, France). Colored alignments
• ClustalW, DCA, DIALIGN2 at Pasteur (Paris, France)
• ClustalW at EMBL (Heidelberg, Germany). Performs multiple alignment on homologous sequences detected by BLAST
• ClustalW at DDBJ (Mishima, Japan)
• MAP (Michigan Tech. Univ., USA)
• ProbModel at CBRG (Zurich, Switzerland)
• DIALIGN2 at BiBiServ (Bielefeld, Germany)
• DCA at BiBiServ (Bielefeld, Germany)
• ITERALIGN (Stanford, USA)
• T-COFFEE (Lausanne, Switzerland)
• MATCH-BOX (Namur, Belgium)
• BLOCK Maker at FHCRC (Washington, USA)
• MEME at SDSC (San Diego, USA)
• MEME at Pasteur (Paris, France)
• PIMA II at BMERC (Boston, USA)
• MAVID at UCB (Berkeley, USA)

Multiple Sequence Alignment Software: ClustalW
• First MSA program that could run on almost any platform
• Most widely used MSA program
• ClustalW is the latest version
• There are many Clustal servers around the world. Most run the same version, but their different interfaces give access to different options
• It is also available as a stand-alone package

Multiple Sequence Alignment Software: ClustalW
• ClustalW uses a progressive method to build its alignments
• It compares two sequences at a time and clusters them by similarity
• This clustering resembles a phylogenetic tree (the .dnd file in the ClustalW output) and is called a dendrogram

[Figure: dendrogram in which the root joins the pair A, B and the pair C, D]

• The dendrogram shows that A and B are more similar to each other than either is to C or D, and likewise for C and D
• To make the progressive alignment, ClustalW follows the dendrogram: it first aligns A with B, and then C with D
• It then treats each of these alignments like a single sequence and aligns them two by two
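The clustering step described above can be illustrated with a small Python sketch. It is a simplified average-linkage (UPGMA-style) clustering written for this example, not ClustalW's own tree-building code; the function name guide_tree and the distance values are hypothetical. It returns the tree in parenthesized (Newick-style) notation together with the order in which a progressive aligner would merge the groups.

```python
from itertools import combinations


def guide_tree(names, dist):
    """Cluster sequences by average pairwise distance.

    names: list of sequence labels
    dist:  dict mapping frozenset({a, b}) -> distance between a and b
    Returns (newick_string, merge_order).
    """
    # Each cluster is (tuple of member labels, parenthesized tree text).
    clusters = [((name,), name) for name in names]
    merge_order = []

    def average_distance(c1, c2):
        pairs = [(x, y) for x in c1[0] for y in c2[0]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the two closest clusters and join them.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        merge_order.append((c1[1], c2[1]))
        clusters.append((c1[0] + c2[0], f"({c1[1]},{c2[1]})"))

    return clusters[0][1] + ";", merge_order


# Hypothetical distances: A is close to B, C is close to D.
example_dist = {
    frozenset("AB"): 0.1, frozenset("CD"): 0.3,
    frozenset("AC"): 0.6, frozenset("AD"): 0.6,
    frozenset("BC"): 0.6, frozenset("BD"): 0.6,
}
tree, order = guide_tree(["A", "B", "C", "D"], example_dist)
print(tree)   # ((A,B),(C,D));
print(order)  # A with B first, then C with D, then the two groups
```

The merge order mirrors the slide: A and B are aligned first, then C and D, and finally the two sub-alignments are aligned to each other.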
Multiple Sequence Alignment Software: ClustalW
• Pairwise Scores
• These are the pairwise comparisons ClustalW uses to build its tree
• This section can be ignored

Multiple Sequence Alignment Software: ClustalW
• Shows the alignment
• Can be saved as a text file
• Can be viewed in color

Multiple Sequence Alignment Software: ClustalW
• The Guide Tree
• Shows the tree that ClustalW uses to guide its progressive alignment
• It is displayed in Phylip tree format
• A cladogram is a branching diagram (tree) assumed to be an estimate of a phylogeny in which the branches are of equal length. Cladograms therefore show common ancestry but do not indicate the amount of evolutionary "time" separating taxa

Multiple Sequence Alignment Software: ClustalW
• The Phylogram Tree
• A phylogram is a branching diagram (tree) assumed to be an estimate of a phylogeny in which branch lengths are proportional to the amount of inferred evolutionary change

Interpreting Multiple Sequence Alignments
• Interpreting an alignment is more an art than a science
• Unlike database searching, there are no E values to tell us how reliable the result is
• The best method of evaluation is based on knowledge of protein structures
  • Structures contain loops that evolve rapidly
  • Loops are softer portions of the protein that connect its more rigid portions
  • Protein structures also contain core regions inside the protein that act as support walls. These support walls evolve less rapidly than the loops on the surface
• In a good multiple alignment you can expect to find gap-free blocks that correspond to core regions and gap-rich regions that correspond to loops

Interpreting Multiple Sequence Alignments
• How can you tell whether a block is good?
• Take a look at the alignment symbols
  • * A star indicates an entirely conserved column
  • : A colon indicates a column where all the residues have roughly the same size and the same hydropathy
  • . A period indicates a column where the size or the hydropathy has been preserved in the course of evolution
• A typical good block is 10 to 30 amino acids long, with at least one to three stars, five to seven colons and a few periods

Multiple Sequence Alignment Tools
• BLAST servers with integrated MSA
• www.expasy.ch/cgi-bin/BLASTEMBnet-ch.pl
  • Extract entire sequences
  • Export sequences in FASTA format
  • Submit sequences to ClustalW
  • Submit sequences to T-Coffee
• http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_blast.html
  • Extract entire sequences
  • Extract sequence fragments
  • Export sequences in FASTA format
  • Submit sequences to ClustalW
• srs.ebi.ac.uk
  • Submit sequences to ClustalW
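Returning to the alignment symbols discussed under Interpreting Multiple Sequence Alignments, the short Python sketch below shows how a conservation line and a list of gap-free blocks could be derived from an alignment. It is an illustration only. The residue groups are the "strong" and "weak" groups as we recall them from the Clustal documentation and should be checked against the version you use; the function names and the toy alignment are hypothetical.

```python
# Residue groups assumed from the Clustal documentation (verify before relying on them).
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
        "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = {c.upper() for c in column}
    if "-" in residues:
        return " "                      # any gap: no symbol
    if len(residues) == 1:
        return "*"                      # entirely conserved column
    if any(residues <= set(group) for group in STRONG):
        return ":"                      # all residues fall in one strong group
    if any(residues <= set(group) for group in WEAK):
        return "."                      # all residues fall in one weak group
    return " "


def conservation_line(aligned):
    """aligned: list of equal-length strings, one per sequence."""
    return "".join(column_symbol(col) for col in zip(*aligned))


def gap_free_blocks(aligned, min_len=10):
    """Return (start, end) column ranges, end exclusive, that contain no gaps."""
    blocks, start = [], None
    ncols = len(aligned[0]) if aligned else 0
    for i, col in enumerate(zip(*aligned)):
        if "-" in col:
            if start is not None and i - start >= min_len:
                blocks.append((start, i))
            start = None
        elif start is None:
            start = i
    if start is not None and ncols - start >= min_len:
        blocks.append((start, ncols))
    return blocks


# Toy alignment, far shorter than a real block, just to show the symbols.
toy = ["MKV-L",
       "MRI-L",
       "MKLAL"]
print(repr(conservation_line(toy)))      # '*:: *'
print(gap_free_blocks(toy, min_len=2))   # [(0, 3)]
```

Counting the stars, colons and periods inside each block returned by gap_free_blocks gives a rough way to apply the "good block" rule of thumb from the slide.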