Last lecture summary
• Sequence database searching
  • exhaustive, heuristic
• BLAST
  • How it works, steps, parameters
  • BLAST variants, reading frame
1. Query sequence: the query sequence is cut into words of length W. For each word, a list of similar words is created using a substitution matrix.
2. Database sequences: the database is scanned for matches to the word list.
3. Extend: the similarity is extended on both sides of the word, giving a high-scoring pair.

New stuff
Multiple sequence alignment (MSA)

What is MSA
• Comparison of many (i.e., >2) sequences
• local or global

Why MSA
• Biological sequences often occur in families. Homologous sequences often retain similar structures and functions.
  • related genes within an organism
  • genes in various species
  • sequences within a population (polymorphic variants)
• MSA reveals more biological information than pairwise alignment
  • two sequences that may not align well to each other can be aligned via their relationship to a third sequence, thereby integrating information in a way not possible using only pairwise alignments

Sequence motif
• A short conserved region in a DNA, RNA or protein sequence
• Corresponds to a structural or functional feature in proteins
• Shared by several sequences, can be generated by MSA
Edgar R.S. et al. Peroxiredoxins are conserved markers of circadian rhythms. Nature 2012 485(7399):459-64

Protein family
• A protein family is a group of evolutionarily related proteins.
• Proteins in a family descend from a common ancestor and typically have similar three-dimensional structures, functions, and significant sequence similarity.
• Currently, several thousand protein families have been defined, although ambiguity in the definition of a protein family leads different researchers to wildly varying numbers.
Edgar R.S. et al. Peroxiredoxins are conserved markers of circadian rhythms.
Nature 2012 485(7399):459-64

Pfam - http://pfam.xfam.org/
• Database of protein families that includes their annotations and multiple sequence alignments

Sequence logo
Conserved: BIG letters with few others in that space
Divergent: small letters with many others in that space

Logo visualization of the alignment
• Make a logo from your alignment
• Can be easier to compare
• Nice graphic
• Students love them
• http://weblogo.berkeley.edu/logo.cgi

Doing MSA
• As with aligning a pair of sequences, the difficulty in aligning a group of sequences varies considerably with sequence similarity.
• If the amount of sequence variation is minimal, it is quite straightforward to align the sequences, even without the assistance of a computer program.
• If the amount of sequence variation is great, it may be very difficult to find an optimal alignment of the sequences because so many combinations of substitutions, insertions, and deletions, each predicting a different alignment, are possible.

Challenges of the MSA
• Finding an optimal alignment of more than two sequences that includes matches, mismatches and gaps, and that takes into account the degree of variation in all of the sequences at the same time poses a very difficult challenge.
• A second computational challenge is identifying a reasonable method of obtaining a cumulative score for the substitutions in the column of an MSA.
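
A small sketch of one common way to obtain such a cumulative column score, the sum-of-pairs score: every pair of symbols in a column is scored, and the pair scores are added up. The slides do not name a scoring scheme, so both the method and the toy values used here (match = 1, mismatch = 0, residue against gap = -1, gap against gap = 0) are assumptions for illustration. Real programs use a substitution matrix and more elaborate gap penalties.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=0, gap=-1):
    """Score one pair of symbols taken from the same column (toy values, assumed)."""
    if a == "-" and b == "-":
        return 0              # a gap aligned with a gap is ignored
    if a == "-" or b == "-":
        return gap            # residue aligned with a gap
    return match if a == b else mismatch


def sum_of_pairs(column):
    """Add up pair_score over all unordered pairs of symbols in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def alignment_score(rows):
    """Score a whole alignment given as equal-length gapped strings."""
    return sum(sum_of_pairs(col) for col in zip(*rows))


print(sum_of_pairs("AAAC"))                      # one column of four sequences
print(alignment_score(["AC-", "ACG", "A-G"]))    # three sequences, three columns
```

With n sequences a column contributes n(n − 1)/2 pair scores, so a fully conserved column of four sequences scores 6 under these toy values.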
MSA algorithms
• As with pairwise sequence comparisons, there are two types of multiple alignment algorithms
  • optimal
  • heuristic

Optimal algorithms
• Extension of dynamic programming to multiple sequences
• Exhaustive search
• Produce the best alignment
• Computationally expensive
• Not feasible for n>10 sequences of length m>200 residues

Heuristic algorithms
• Limit the exhaustive search
• Attempt to rapidly find a good, but not necessarily optimal, alignment
• Most popular methods:
  • progressive methods (ClustalW)
    • start from the most similar sequences and progressively add new sequences
  • iterative methods (MUSCLE)
    • make an initial crude alignment, then revise it

Progressive sequence alignment
• The most commonly used algorithm; the most commonly used software is ClustalW.
• Latest upgrade: Clustal Omega
• Permits the rapid alignment of even hundreds of sequences.
• Limitation: the final alignment depends on the order in which sequences are joined.
• Not guaranteed to provide the most accurate alignments.

ClustalW
• http://www.clustal.org
• EMBOSS – a free open source software analysis package (European Molecular Biology Open Software Suite) http://emboss.sourceforge.net
• Program emma is a ClustalW wrapper
• A variety of EMBOSS servers hosting emma are available, e.g.
http://embossgui.sourceforge.net/demo/emma.html
• ClustalX – a downloadable stand-alone program offering a graphical user interface for editing multiple sequence alignments
• http://www.clustal.org/clustal2/

Sequences to align
>neuroglobin 1OJ6A NP_067080.1 [Homo sapiens]
-------------MERPEPELIRQSWRAVSRSPLEHGTVLFARLFALEPDLLPLFQYNCR
QFSSPEDCLSSPEFLDHIRKVMLVI---DAAVTNVEDLSSLEEYLASLGRKHRAVGVKLS
SFSTVGESLLYMLEKCLGPA-FTPATRAAWSQLYGAVVQAMSRGWDGE---
>rice_globin 1D8U rice Non-Symbiotic Plant Hemoglobin NP_001049476.1 [Oryza sativa (japonica cultivar-group)]
MALVEDNNAVAVSFSEEQEALVLKSWAILKKDSANIALRFFLKIFEVAPSASQMFSF-LR
NSDVP-LEKNPKLKTHAMSVFVMTCEAAAQLRKAGKVTVRDTTLKRLGATHLKYGVGDA
HFEVVKFALLDTIKEEVPADMWSPAMKSAWSEAYDHLVAAIKQEMKPAE--
>soybean_globin 1FSL leghemoglobin P02238 LGBA_SOYBN [Glycine max]
----------MVAFTEKQDALVSSSFEAFKANIPQYSVVFYTSILEKAPAAKDLFSF-LA
NGVDP---TNPKLTGHAEKLFALVRDSAGQLKASGTVVAD----AALGSVHAQKAVTDP
QFVVVKEALLKTIKAAVGDK-WSDELSRAWEVAYDELAAAIKKA-------
>beta_globin 2hhbB NP_000509.1 [Homo sapiens]
----------MVHLTPEEKSAVTALWGKVNVD--EVGGEALGRLLVVYPWTQRFFES-FG
DLSTPDAVMGNPKVKAHGKKVLGAF---SDGLAHLDNLKGTFATLSELHCDKLH--VDPE
NFRLLGNVLVCVLAHHFGKE-FTPPVQAAYQKVVAGVANALAHKYH-----
>myoglobin 2MM1 NP_005359.1 [Homo sapiens]
-----------MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDK-FK
HLKSEDEMKASEDLKKHGATVLTAL---GGILKKKGHHEAEIKPLAQSHATKHK--IPVK
YLEFISECIIQVLQSKHPGD-FGADAQGAMNKALELFRKDMASNYKELGFQG

ClustalW – how does it work?
• Three stages

1st stage
• Global alignment (Needleman-Wunsch) is used to create pairwise alignments of every protein pair.
• Number of pairwise alignments for n sequences: n(n − 1)/2 (10 for the five sequences above)

2nd stage
• A guide tree is calculated from these scores.
• A guide tree is a template used in the third stage of ClustalW to define the order in which sequences are added to the multiple alignment.
• Newick format

The construction of a guide tree
• Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
  • A simple hierarchical clustering method
  • How does it work?
http://www.southampton.ac.uk/~re1u06/teaching/upgma/
• Neighbor joining

UPGMA
based on A Tutorial on Clustering Algorithms
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html

Similarity
[Figure: four objects labelled A, B, C and D]
Let’s describe each object by a vector containing:
• The size of the object (‘b’ or ‘s’)
• The color of the inner area (‘g’ or ‘b’)
• The color of the border line (‘r’ or ‘o’)
• The width of the border line (3 or 6)

Similarity
The more similar two objects are, the more features they share.
A = (b, g, r, 6)
B = (b, b, o, 6)
C = (b, b, o, 3)
D = (s, b, r, 6)

Similarity
Let’s score the vector pairs.
Match = 1
Mismatch = 0

Similarity and distance
While the similarity between two objects is the number of features they share, the distance is the number of features in which they differ.
pair  similarity  distance
AB    2           2
AC    1           3
AD    2           2
BC    3           1
BD    2           2
CD    1           3

Cluster analysis
We have data, we don’t know classes.
Assign data objects into groups (called clusters) so that data objects from the same cluster are more similar to each other than objects from different clusters.

Cluster analysis
• Many algorithms for cluster analysis exist.
• We are interested in so-called hierarchical clustering.
• UPGMA is a hierarchical clustering method.
• Hierarchical clustering algorithms connect objects to form clusters based on their distance. The closest (i.e. most similar) objects are connected first and then other objects are gradually added.
• This process is represented graphically by a tree structure called a dendrogram. The guide tree built by UPGMA is a dendrogram.
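
The match/mismatch scoring of the feature vectors can be sketched in a few lines of Python. The vectors are taken as read from the slide; the names `objects`, `similarity` and `distance` are made up for this example.

```python
from itertools import combinations

# Feature vectors as read from the slide:
# (size, inner colour, border colour, border width)
objects = {
    "A": ("b", "g", "r", 6),
    "B": ("b", "b", "o", 6),
    "C": ("b", "b", "o", 3),
    "D": ("s", "b", "r", 6),
}


def similarity(u, v):
    """Number of features two objects share (match = 1, mismatch = 0)."""
    return sum(1 for x, y in zip(u, v) if x == y)


def distance(u, v):
    """Number of features in which two objects differ."""
    return sum(1 for x, y in zip(u, v) if x != y)


# Distance for every unordered pair: the input a clustering method works on.
pairwise = {
    p + q: distance(objects[p], objects[q])
    for p, q in combinations(sorted(objects), 2)
}

print(similarity(objects["A"], objects["B"]), pairwise["AB"])
```

Because every object has four features, similarity and distance always add up to 4, so either one can be derived from the other.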
• I will explain the construction of a guide tree (dendrogram) in UPGMA in the next section.

Distances between six cities (BA, FL, MI, NA, RM, TO)
      BA    FL    MI    NA    RM    TO
BA     0   662   877   255   412   996
FL   662     0   295   468   268   400
MI   877   295     0   754   564   138
NA   255   468   754     0   219   869
RM   412   268   564   219     0   669
TO   996   400   138   869   669     0

The closest pair, MI and TO (138), is merged into MI/TO. The distance from MI/TO to BA is
(877 + 996) / 2 = 936.5
and likewise 347.5 to FL, 811.5 to NA and 616.5 to RM.

After merging MI and TO
        BA     FL     MI/TO  NA     RM
BA      0      662    936.5  255    412
FL      662    0      347.5  468    268
MI/TO   936.5  347.5  0      811.5  616.5
NA      255    468    811.5  0      219
RM      412    268    616.5  219    0

After merging NA and RM
        BA     FL     MI/TO  NA/RM
BA      0      662    936.5  333.5
FL      662    0      347.5  268
MI/TO   936.5  347.5  0      714
NA/RM   333.5  268    714    0

After merging FL and NA/RM
           FL/NA/RM  BA      MI/TO
FL/NA/RM   0         497.75  530.75
BA         497.75    0       936.5
MI/TO      530.75    936.5   0

After merging BA and FL/NA/RM
              BA/FL/NA/RM  MI/TO
BA/FL/NA/RM   0            733.625
MI/TO         733.625      0

Dendrogram
• Torino → Milano (138)
• Rome → Naples (219) → Florence (268) → Bari (497.75)
• Join Torino–Milano and Rome–Naples–Bari–Florence (733.625)
[Figure: dendrogram with leaves RM, NA, FL, BA, MI, TO; vertical axis: dissimilarity; merge heights 138, 219, 268, 497.75, 733.625]

3rd stage
• The multiple sequence alignment is created in a series of steps based on the order presented in the guide tree.
• First select the two most closely related sequences from the guide tree and create a pairwise alignment.
• The next sequence is either added to the pairwise alignment (to generate an aligned group of three sequences, sometimes called a profile) or used in another pairwise alignment.
• At some point, profiles are aligned with profiles.
• The alignment continues progressively until the root of the tree is reached and all sequences have been aligned.

Gaps
• “once a gap, always a gap” rule
• The most closely related pair of sequences is aligned first.
• As further sequences are added to the alignment, there are many ways that gaps could be included.
• Gaps are often added to the first two (closest) sequences.

Iterative approaches
• Progressive alignment methods have the inherent limitation that once an error occurs in the alignment process it cannot be corrected.
• Iterative approaches can overcome this limitation.
• Create an initial alignment and then modify it to try to improve it.
• e.g. MUSCLE, IterAlign, Praline, MAFFT

MUSCLE
• Since its introduction in 2004, the MUSCLE program of Robert Edgar has become popular because of its accuracy and its exceptional speed, especially for multiple sequence alignments involving a large number of sequences.
• Multiple sequence comparison by log expectation
• Three stages

MUSCLE
1. Draft alignment
2. Improved alignment
3. Refinement
Edgar, R. C. Nucl. Acids Res. 2004 32:1792-1797; doi:10.1093/nar/gkh340

MUSCLE online
https://www.ebi.ac.uk/Tools/msa/muscle/
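
Returning to the guide-tree stage, here is a minimal Python sketch of one UPGMA merge step, applied to the city distance matrix from the slides. The function name `upgma_step` and the dictionary layout are assumptions made for this example, not part of ClustalW. The sketch uses the textbook size-weighted average, which for a merge of two single objects is simply the mean of the two distances.

```python
def upgma_step(dist, sizes):
    """Merge the two closest clusters once.

    dist  : dict mapping frozenset({a, b}) -> distance between clusters a and b
    sizes : dict mapping cluster name -> number of original objects it contains
    Returns (new_name, merge_height, new_dist, new_sizes).
    """
    pair = min(dist, key=dist.get)          # closest pair of clusters
    a, b = sorted(pair)
    height = dist[pair]
    new = f"{a}/{b}"
    na, nb = sizes[a], sizes[b]

    # Keep distances that do not involve the merged clusters.
    new_dist = {k: d for k, d in dist.items() if a not in k and b not in k}

    # Distance from the new cluster to every other cluster:
    # average of the old distances, weighted by cluster size.
    for c in sizes:
        if c in (a, b):
            continue
        d_ac = dist[frozenset((a, c))]
        d_bc = dist[frozenset((b, c))]
        new_dist[frozenset((new, c))] = (na * d_ac + nb * d_bc) / (na + nb)

    new_sizes = {c: n for c, n in sizes.items() if c not in (a, b)}
    new_sizes[new] = na + nb
    return new, height, new_dist, new_sizes


cities = ["BA", "FL", "MI", "NA", "RM", "TO"]
matrix = [
    [0, 662, 877, 255, 412, 996],
    [662, 0, 295, 468, 268, 400],
    [877, 295, 0, 754, 564, 138],
    [255, 468, 754, 0, 219, 869],
    [412, 268, 564, 219, 0, 669],
    [996, 400, 138, 869, 669, 0],
]
dist = {
    frozenset((cities[i], cities[j])): matrix[i][j]
    for i in range(len(cities))
    for j in range(i + 1, len(cities))
}
sizes = {c: 1 for c in cities}

new, height, dist1, sizes1 = upgma_step(dist, sizes)
print(new, height)                          # MI/TO 138
print(dist1[frozenset((new, "BA"))])        # 936.5
```

The first call merges MI and TO at 138 and gives 936.5, 347.5, 811.5 and 616.5 as the distances from MI/TO to BA, FL, NA and RM, matching the first merged table. Calling the function repeatedly until one cluster remains yields the merge order for a guide tree. Because this sketch weights by cluster size, later merges may not reproduce the later tables in the slides.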