Alignments and alignment reliability
The first critical step in sequence analysis: the know-how
Eyal Privman and Osnat Penn, Tel Aviv University
COST Training School, Rehovot, 2010

What are alignments good for?
• To compare sequences
  • Find homology
  • Similar sequence, similar function
• To learn about sequence evolution
  • Mismatch = point mutation
  • Gap = indel (insertion or deletion)
  • Reconstruct a phylogenetic tree
  • Infer selection forces, e.g., detecting positive selection

Sequence evolution
[Figure: a sequence evolving along a tree leading to human, chimp and mouse, with time points at 30 MYA, 5 MYA and today, followed by the resulting sequences aligned with gaps. Sequences shown include ATGAAATAA, ATGTTTTAA and ATGCCCAAATAA.]

Alignment and phylogeny are mutually dependent
[Figure: diagram linking unaligned sequences, sequence alignment, the MSA and phylogeny reconstruction; one step is labeled "Inaccurate tree building".]

Alignment and phylogeny are both challenging
• 25% of residues are aligned incorrectly (based on BAliBASE, a large representative set of proteins)
• 5% of tree branches are wrong (based on simulations of 100 protein sequences)

Making an alignment
• For 2 sequences: use exact methods.
• For more sequences: exact methods are not feasible (too slow), so we use heuristic methods, namely progressive alignment.

First step: compute pairwise distances
Compute the pairwise alignments for all against all (10 pairwise alignments for the five sequences A to E). The similarities are converted to distances and stored in a table:

       A    B    C    D
  B    8
  C   15   17
  D   16   14   10
  E   32   31   31   32

Second step: build a guide tree
Cluster the sequences to create a tree (the guide tree), in which:
• the tree represents the order in which pairs of sequences are to be aligned
• similar sequences are neighbors in the tree
• distant sequences are distant from each other in the tree
The guide tree is imprecise and is NOT the tree that truly describes the evolutionary relationship between the sequences!

Third step: align sequences in a bottom-up order
1. Align the most similar (neighboring) pairs
2. Align pairs of pairs
3. Align sequences clustered to pairs of pairs deeper in the tree

Multiple sequence alignment (MSA)
[Figure: pairwise distance table → guide tree → MSA, labeled "Iterative progressive alignment".]
Several advanced MSA programs are available. Today we will use two:
• MAFFT: the fastest, and one of the most accurate
• PRANK: distinct from all other MSA programs because of its correct treatment of insertions/deletions

MAFFT
Kazutaka Katoh, Kazuharu Misawa, Kei-ichi Kuma and Takashi Miyata. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research, 2002, Vol. 30, No. 14, 3059-3066. © 2002 Oxford University Press
Web server & download: http://align.bmr.kyushu-u.ac.jp/mafft/online/server/
Efficiency-tuned variants: quick & dirty, or slow but accurate

Choosing a MAFFT strategy
[Figure: MAFFT strategies arranged on a scale from "quick & dirty" to "slow but accurate".]
The schematic alignments below show the kind of data each strategy suits (X: alignable residues; o: unalignable residues; -: gaps).

E-INS-i
  oooooooooXXX------XXXX---------------------------------XXXXXXXXXXX-XXXXXXXXXXXXXXXooooooooooooo
  ---------XXXXXXXXXXXXXooo------------------------------XXXXXXXXXXXXXXXXXX-XXXXXXXX-------------
  -----ooooXXXXXX---XXXXooooooooooo----------------------XXXXX----XXXXXXXXXXXXXXXXXXooooooooooooo
  ---------XXXXX----XXXXoooooooooooooooooooooooooooooooooXXXXX-XXXXXXXXXXXX--XXXXXXX-------------
  ---------XXXXX----XXXX---------------------------------XXXXX---XXXXXXXXXX--XXXXXXXooooo--------

L-INS-i
  ooooooooooooooooooooooooooooooooXXXXXXXXXXX-XXXXXXXXXXXXXXX------------------
  --------------------------------XX-XXXXXXXXXXXXXXX-XXXXXXXXooooooooooo-------
  ------------------ooooooooooooooXXXXX----XXXXXXXX---XXXXXXXooooooooooo-------
  --------ooooooooooooooooooooooooXXXXX-XXXXXXXXXX----XXXXXXXoooooooooooooooooo
  --------------------------------XXXXXXXXXXXXXXXX----XXXXXXX------------------

G-INS-i
  XXXXXXXXXXX-XXXXXXXXXXXXXXX
  XX-XXXXXXXXXXXXXXX-XXXXXXXX
  XXXXX----XXXXXXXX---XXXXXXX
  XXXXX-XXXXXXXXXX----XXXXXXX
  XXXXXXXXXXXXXXXX----XXXXXXX

MAFFT output
A colored view of the alignment

Saving the output
• Choose a format: Clustal, Fasta, or click "Reformat" to convert to a selection of other formats
• Save the page as a text file
• E.g., save as a "phylip" file and upload it to PhyML to reconstruct the tree

PhyML: tree reconstruction
The most widely used maximum likelihood (ML) program
Web server & download: http://www.atgc-montpellier.fr/phyml/

PRANK
[Figure: classical alignment errors for HIV env.]
Web server: http://www.ebi.ac.uk/goldman-srv/webPRANK/

PRANK output
If you need a different format, copy the results to the READSEQ sequence converter: http://www-bimas.cit.nih.gov/molbio/readseq/

1. Download and save the sequences file from Osnat's homepage (you can google "Osnat Penn" and look for the workshop materials under "Teaching"). Save the file as "trim5a.AA.fas" (File > "Save page as"). This file contains 20 protein sequences in FASTA format.
2. Run the PRANK web server to create a protein alignment:
   a. In the "Default alignment" section, browse for "trim5a.AA.fas".
   b. Run (press the "Start alignment" button).
3. While you wait: copy the sequences into the MAFFT web server and run the "automatic" "moderately accurate" strategy. Which strategy did MAFFT choose for you? Click on the "Fasta format" link, save the result as "trim5a.AA.mafft.aln" (File > "Save page as"), and try the "Jalview" button.
4. When PRANK finishes, click on the "Show Fasta file" button and save the MSA under the name "trim5a.AA.prank.aln".

Sources of alignment errors
Progressive alignment algorithms are greedy heuristics. Two sources of error, and the scores that address them:
• Co-optimal solutions: Heads-or-Tails (HoT) scores (Landan & Graur 2007)
• Guide-tree errors: GUIDANCE scores (Penn, Privman et al. MBE 2010)

GUIDANCE: Guide-tree based alignment confidence scores
Base MSA → bootstrap sampling of NJ trees (Tree 1, Tree 2, …, Tree 99, Tree 100) → progressive alignment along each tree (MSA 1, MSA 2, …, MSA 99, MSA 100) → GUIDANCE scores (1 = confident, 0 = uncertain)
Penn, Privman et al. MBE 2010
http://guidance.tau.ac.il

[Figure, panels (a) and (b): GUIDANCE score per alignment column, colored from confident to uncertain, across the extracellular, transmembrane and cytoplasmic domains. Panel (a) sequences: HIV1 group M, SIV chimp, HIV1 group N, HIV1 group O, SIV gorilla, SIV cerco. Panel (b) sequences: HIV1 group M, SIV chimp, HIV1 group O.]

1. Run the GUIDANCE web server to calculate confidence scores for the MAFFT alignment:
   a. In the "Upload your sequence file" window, browse for "trim5a.AA.fas".
   b. Choose "Amino Acids" in the "Sequences Type" option.
   c. To speed up the run, change the "Number of bootstrap repeats" in the "Advanced options" section to 30. Note that this is not recommended for real analyses.
   d. Run (press the "Submit" button).
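
The idea behind the GUIDANCE scores can be made concrete with a small sketch. It is a simplified illustration under stated assumptions, not the GUIDANCE implementation: it assumes the alternative MSAs (one per bootstrap guide tree) have already been produced by an alignment program, that every MSA lists the same sequences in the same row order, and that a column's confidence is the average fraction of alternative MSAs that reproduce each residue pair aligned in that column. The function names `residue_pairs_by_column` and `column_scores` are hypothetical.

```python
from itertools import combinations


def residue_pairs_by_column(msa):
    """For each column, return the set of residue pairs aligned in it.

    A residue is identified as (row index, position in the ungapped sequence),
    so the same residue can be recognized in differently gapped alignments.
    """
    next_position = [0] * len(msa)
    columns = []
    for col in range(len(msa[0])):
        residues = []
        for row_index, row in enumerate(msa):
            if row[col] != "-":
                residues.append((row_index, next_position[row_index]))
                next_position[row_index] += 1
        columns.append(set(combinations(residues, 2)))
    return columns


def column_scores(base_msa, alternative_msas):
    """Score each base-MSA column between 0 (uncertain) and 1 (confident).

    Columns with fewer than two residues have no pairs and are scored None.
    """
    alternative_pairs = [
        set().union(*residue_pairs_by_column(msa)) for msa in alternative_msas
    ]
    scores = []
    for pairs in residue_pairs_by_column(base_msa):
        if not pairs:
            scores.append(None)
            continue
        support = sum(pair in alt for pair in pairs for alt in alternative_pairs)
        scores.append(support / (len(pairs) * len(alternative_pairs)))
    return scores


# Toy example: two sequences (ACG and AGT), two alternative alignments.
base = ["ACG-", "A-GT"]
alternatives = [["ACG-", "A-GT"], ["ACG-", "AG-T"]]
print(column_scores(base, alternatives))  # [1.0, None, 0.5, None]
```

In the toy example the first column is reproduced by both alternative alignments, while the G/G column is reproduced by only one of them, so it receives a lower score. This is the kind of column that would be flagged as uncertain.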
Detecting selection forces
Positive selection

Empirical findings: variation among genes
"Important" proteins evolve slower than "unimportant" ones
[Figure: Histone 3 protein.]

Empirical findings: variation among sites
Functional sites evolve slower than nonfunctional sites

Silent and non-silent mutations
• Silent: UUU -> UUC (both encode phenylalanine)
• Non-silent: UUU -> CUU (phenylalanine to leucine)
For most proteins, the rate of silent substitutions is much higher than the non-silent rate. This is called purifying selection (= conservation).
There are rare cases where the non-silent rate is much higher than the silent rate. This is called positive selection.

Positive selection
Examples:
• Pathogen proteins evading the host immune system
• Proteins of the immune system detecting pathogen proteins
• Pathogen proteins that are drug targets
• Proteins that are products of gene duplication
• Proteins involved in the reproductive system

http://selecton.tau.ac.il
Selecton results

False positive predictions
Selecton uses an MSA as input. The MSA may contain unreliable regions, which lead to errors in the Selecton computations and hence to errors in the positive selection inference.

1. Go to the GUIDANCE results of the last exercise.
2. Which columns are not well aligned? Are these sites also predicted to evolve under positive selection? See the Selecton results at: http://selecton.tau.ac.il/results/1268662868/colors.html

Summary
• Different alignment programs may produce different MSAs.
• Alignment uncertainty may cause errors in downstream analyses such as positive selection analysis.
• GUIDANCE can detect alignment errors.

Thanks for your attention!
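
As a supplementary illustration of the silent versus non-silent distinction from the selection slides, the sketch below classifies codon changes using the standard genetic code. It is a naive count for teaching purposes only and is not how Selecton estimates selection: it assumes two already aligned, in-frame coding sequences, treats each differing codon as a single change, and skips codons containing gaps. The names `classify_substitution` and `count_substitutions` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order ("*" = stop).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def _as_rna(sequence):
    """Normalize case and accept DNA (T) as well as RNA (U) input."""
    return sequence.upper().replace("T", "U")


def classify_substitution(codon_before, codon_after):
    """Return "silent" if both codons encode the same amino acid."""
    before = CODON_TABLE[_as_rna(codon_before)]
    after = CODON_TABLE[_as_rna(codon_after)]
    return "silent" if before == after else "non-silent"


def count_substitutions(seq_a, seq_b):
    """Count silent and non-silent codon differences between aligned sequences."""
    seq_a, seq_b = _as_rna(seq_a), _as_rna(seq_b)
    counts = {"silent": 0, "non-silent": 0}
    for start in range(0, min(len(seq_a), len(seq_b)) - 2, 3):
        codon_a = seq_a[start:start + 3]
        codon_b = seq_b[start:start + 3]
        if "-" in codon_a or "-" in codon_b or codon_a == codon_b:
            continue  # skip gapped or identical codons
        counts[classify_substitution(codon_a, codon_b)] += 1
    return counts


print(classify_substitution("UUU", "UUC"))  # silent (both phenylalanine)
print(classify_substitution("UUU", "CUU"))  # non-silent (phenylalanine to leucine)
print(count_substitutions("ATGAAATAA", "ATGTTTTAA"))  # {'silent': 0, 'non-silent': 1}
```

The last call compares two of the sequences from the sequence evolution slide: the AAA to TTT change replaces lysine with phenylalanine, so it is counted as non-silent.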