Growing Trees on the Right Compost
Cédric Notredame
Comparative Bioinformatics Group
Bioinformatics and Genomics Program

Mangel M., Samaniego F.J., Abraham Wald's Work on Aircraft Survivability, J. American Statistical Association, 79, 259-270 (1984)

What's in a Multiple Sequence Alignment
– Selection
– Important features are preserved
– Evolution inertia
– Functional constraint
– Common ancestry shows up in the sequences
– Phylogenetic footprint, evolutionary trace…
– Same function, same sequence
– Convergence

Why So Much Interest For Multiple Alignments?
– Extrapolation
– Structure Prediction
– Motifs/Patterns
– SNP Analysis
– Profiles
– Regulatory Elements
– Phylogeny
– Reactivity Analysis

What's in a Multiple Alignment?
The MSA contains what you put inside:
– Structural similarity
– Evolutionary similarity
– Sequence similarity
You can view your MSA as:
– A record of evolution
– A summary of a protein family
– A collection of experiments made for you by Nature…

Producing The Right Alignment

Multiple Sequence Alignments Influence Phylogenetic Trees
Choice of method is not neutral
– Different methods
– Different alignments
– Different trees
Using the right models ensures producing the right tree

Model Based Alignments vs Naïve Alignments
Naïve Alignment
– Lexicographic alignment
– Maximizing the number of identities
– At best using a substitution matrix
Model Based Alignments
– Using a model
– Protein structure information
– RNA structure information
– Combining/confronting modeling methods
Template Based Alignments
– Model based alignments through the use of templates

T-Coffee and Model Based Alignments
– T-Coffee Algorithm
– Expresso: Aligning Protein Structures
– R-Coffee: Aligning RNA Structures
– M-Coffee: Combining Methods

T-Coffee: An Extension of the Progressive Alignment Algorithm

T-Coffee and Consistency…
Input sequences:
SeqA  GARFIELD THE LAST FAT CAT
SeqB  GARFIELD THE FAST CAT
SeqC  GARFIELD THE VERY FAST CAT
SeqD  THE FAT CAT
Multiple alignment:
SeqA  GARFIELD THE LAST FA-T CAT
SeqB  GARFIELD THE FAST CA-T ---
SeqC  GARFIELD THE VERY FAST CAT
SeqD  -------- THE ---- FA-T CAT

T-Coffee and Consistency… (primary library of pairwise alignments)
SeqA  GARFIELD THE LAST FAT CAT
SeqB  GARFIELD THE FAST CAT ---        Prim. Weight = 88

SeqA  GARFIELD THE LAST FA-T CAT
SeqC  GARFIELD THE VERY FAST CAT       Prim. Weight = 77

SeqA  GARFIELD THE LAST FAT CAT
SeqD  -------- THE ---- FAT CAT        Prim. Weight = 100

SeqB  GARFIELD THE ---- FAST CAT
SeqC  GARFIELD THE VERY FAST CAT       Prim. Weight = 100

SeqC  GARFIELD THE VERY FAST CAT
SeqD  -------- THE ---- FA-T CAT       Prim. Weight = 100

T-Coffee and Consistency… (aligning SeqA and SeqB directly and through SeqC and SeqD)
SeqA  GARFIELD THE LAST FAT CAT
SeqB  GARFIELD THE FAST CAT ---        Weight = 88

SeqA  GARFIELD THE LAST FA-T CAT
SeqC  GARFIELD THE VERY FAST CAT
SeqB  GARFIELD THE ---- FAST CAT       Weight = 77

SeqA  GARFIELD THE LAST FA-T CAT
SeqD  -------- THE ---- FA-T CAT
SeqB  GARFIELD THE ---- FAST CAT       Weight = 100

When Sequences Are not Enough
3D-Coffee and Expresso

3D-Coffee: Combining Sequences and Structures Within Multiple Sequence Alignments

Expresso: Finding the Right Structure
[Flowchart labels: Sources; BLAST; Templates; SAP; Template Alignment; Source Template Alignment; Remove Templates; Library]

Incorporating RNA Information Within the T-Coffee Algorithm

ncRNAs Can Evolve Rapidly
[Figure: two hairpin secondary structures drawn from the sequences below]
CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
**-------*--**---*-**------**

R-Coffee: Modifying T-Coffee at the Right Place
Incorporation of secondary structure information within the library
Two extra components for the T-Coffee scoring scheme
– A new library
– A new scoring scheme

R-Coffee Extension
[Figure: TC Library entries for aligned C/C and G/G residues, with Score X and Score Y]
Goal: embedding RNA structures within the T-Coffee libraries
The R-extension can be added on top of any existing method.

R-Coffee + Structural Aligners
Method          Avg Braliscore           Net Improv.
                direct   +T     +R       +T      +R
-----------------------------------------------------
Stemloc         0.62     0.75   0.76     104     113
Mlocarna        0.66     0.69   0.71     101     133
Murlet          0.73     0.70   0.72     -132    -73
Pmcomp          0.73     0.73   0.73     142     145
T-Lara          0.74     0.74   0.69     -36     -8
Foldalign       0.75     0.77   0.77     72      73
-----------------------------------------------------
Dyalign         --       0.63   0.62     --      --
Consan          --       0.79   0.79     --      --
-----------------------------------------------------
Improvement = # R-Coffee wins - # R-Coffee losses over 170 test sets

R-Coffee + Regular Aligners
Method          Avg Braliscore           Net Improv.
                direct   +T     +R       +T      +R
-----------------------------------------------------
Poa             0.62     0.65   0.70     48      154
Pcma            0.62     0.64   0.67     34      120
Prrn            0.64     0.61   0.66     -63     45
ClustalW        0.65     0.65   0.69     -7      83
Mafft_fftnts    0.68     0.68   0.72     17      68
ProbConsRNA     0.69     0.67   0.71     -49     39
Muscle          0.69     0.69   0.73     -17     42
Mafft_ginsi     0.70     0.68   0.72     -49     39
-----------------------------------------------------
Improvement = # R-Coffee wins - # R-Coffee losses over 388 test sets

Choosing the Right Modeling Method
M-Coffee

Combining Many MSAs into ONE
ClustalW, MAFFT, T-Coffee, MUSCLE → ???????
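
One way to picture "combining many MSAs into one" is to ask, for every pair of residues, how many of the input alignments place them in the same column. The sketch below covers only that counting step and is not the M-Coffee implementation. It assumes each alignment is a dict of equal-length gapped strings with `-` as the gap character; the function names `aligned_pairs` and `agreement_library`, and the toy alignments, are hypothetical and chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def aligned_pairs(msa):
    """Return the set of residue pairs that share a column in one MSA.

    msa maps sequence name -> gapped string (all strings the same length).
    A residue is identified by (sequence name, index in the ungapped sequence).
    """
    names = sorted(msa)
    next_index = {name: 0 for name in names}
    pairs = set()
    for col in range(len(msa[names[0]])):
        residues = []
        for name in names:
            if msa[name][col] != "-":
                residues.append((name, next_index[name]))
                next_index[name] += 1
        # Every two residues found in the same column count as aligned.
        pairs.update(combinations(residues, 2))
    return pairs


def agreement_library(msas):
    """Weight each residue pair by the fraction of input MSAs containing it."""
    counts = Counter()
    for msa in msas:
        counts.update(aligned_pairs(msa))
    return {pair: n / len(msas) for pair, n in counts.items()}


# Hypothetical output of three aligners on the same two sequences.
method_1 = {"SeqB": "FAST", "SeqD": "FA-T"}
method_2 = {"SeqB": "FAST", "SeqD": "FA-T"}
method_3 = {"SeqB": "FAST", "SeqD": "F-AT"}

library = agreement_library([method_1, method_2, method_3])
for pair, weight in sorted(library.items()):
    print(pair, round(weight, 2))
```

Pairs with a weight of 1.0 are those on which every method agrees; lower weights mark the regions where the methods disagree and the alignment deserves less trust.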
Comparing Methods
MAFFT

Where to Trust Your Alignments
– Most methods disagree
– Most methods agree

What To Do Without Structures

Conclusion
– Model based alignments give the best accuracy
– Template based alignment is a very efficient way to turn naïve aligners into model based aligners
– Sequence alignments are not necessarily reliable over their entire lengths

www.tcoffee.org
Fabrice Armougom (CNRS, FR)
Sebastien Moretti (CNRS, FR)
Olivier Poirot (CNRS, FR)
Frederic Reinier (CRS4, IT)
Karsten Suhre (CNRS, FR)
Vladimir Saudek (Sanofi-Aventis, FR)
Des Higgins (UCD, IE)
Orla O'Sullivan (UCD, IE)
Iain Wallace (UCD, IE)
Victor Jongeneel (SIB/VitalIT, CH)
Bruno Nyfler (VitalIT, CH)
Roger Hersch (EPFL, CH)
Pierre Dumas (EPFL, CH)
Basile Schaeli (EPFL, CH)
cedric.notredame@europe.com

Building and Using Models
35.67 Angstrom

Computing the Correct Alignment is a Complicated Problem

Stochastic Optimization
Exploration of complex optimization problems with multiple constraints
– Genomic alignments
– RNA alignments
Generation of a population of suboptimal solutions
– Quality = f(optimality)

Specification of Consistency Objective Function of T-Coffee

Three Types of Algorithms
– Progressive: ClustalW
– Iterative: Muscle
– Consistency based: T-Coffee and ProbCons

T-Coffee and Consistency…
– Each library line is a soft constraint (a wish)
– You can't satisfy them all
– You must satisfy as many as possible (the easy ones)

Consistency Based Algorithms: T-Coffee
Gotoh (1990)
– Iterative strategy using consistency
Martin Vingron (1991)
– Dot matrices multiplications
– Accurate but too stringent
Dialign (1996, Morgenstern)
– Consistency
– Agglomerative assembly
T-Coffee (2000, Notredame)
– Consistency
– Progressive algorithm
ProbCons (2004, Do)
– T-Coffee with a Bayesian treatment

How Good Is My Method?
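
A common way to approach that question is to compare a test alignment with a trusted reference and count how many aligned residue pairs are reproduced. The sketch below computes such a sum-of-pairs ratio, the same kind of quantity as the SPS defined in the BraliBase slides of this deck. It assumes alignments are dicts of equal-length gapped strings and that the denominator is the number of aligned pairs in the reference; the function names and the toy alignments are hypothetical.

```python
from itertools import combinations


def column_pairs(msa):
    """Return the set of residue pairs that share a column in an MSA.

    A residue is identified by (sequence name, index in the ungapped sequence).
    """
    names = sorted(msa)
    next_index = {name: 0 for name in names}
    pairs = set()
    for col in range(len(msa[names[0]])):
        residues = []
        for name in names:
            if msa[name][col] != "-":
                residues.append((name, next_index[name]))
                next_index[name] += 1
        pairs.update(combinations(residues, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of the reference's aligned pairs that the test reproduces."""
    reference_pairs = column_pairs(reference)
    if not reference_pairs:
        return 0.0
    shared = column_pairs(test) & reference_pairs
    return len(shared) / len(reference_pairs)


# Hypothetical reference and test alignments of the same two sequences.
reference = {"SeqB": "FAST", "SeqD": "FA-T"}
test = {"SeqB": "FAST", "SeqD": "F-AT"}

score = sum_of_pairs_score(test, reference)
print(f"SPS = {score:.2f}")
```

Here the test alignment reproduces two of the three aligned pairs in the reference, so the score is 2/3. A score of 1.0 means every reference pair is recovered.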
Structures Vs Sequences

Validation Using BaliBase

T-Coffee Results

Too Many Methods for ONE Alignment
M-Coffee

Estimating the Accuracy of your MSA

What To Do Without Structures

3D-Coffee: Combining Sequences and Structures Within Multiple Sequence Alignments

Expresso: Finding the Right Structure

Why Not Use Structure Based Alignments

Template Based Multiple Sequence Alignments
[Flowchart labels: Sources; Templates (Structure, Profile, …); Template Aligner (Structure, Profile, …); Template Alignment; Source Template Alignment; Remove Templates; Library]

Method       Score         Templates    Prefab   Homstrad
----------------------------------------------------------
ClustalW     Matrix        ---          61.80    ---
Kalign       Matrix        ---          63.00    ---
MUSCLE       Matrix        ---          68.00    45.0
----------------------------------------------------------
T-Coffee     Consistency   ---          69.97    44.0
ProbCons     Consistency   ---          70.54    ---
Mafft        Consistency   ---          72.20    ---
M-Coffee     Consistency   ---          72.91    ---
MUMMALS      Consistency   ---          73.10    ---
----------------------------------------------------------
Clustal-db   Matrix        Profiles     ---      ---
PRALINE      Matrix        Profiles     ---      50.2
PROMALS      Consistency   Profiles     79.00    ---
SPEM         Matrix        Profiles     77.00    ---
----------------------------------------------------------
EXPRESSO     Consistency   Structures   ---      71.9 *
T-Lara       Consistency   Structures   ---      ---
----------------------------------------------------------
Table 1. Summary of all the methods described in the review. Validation figures were compiled from several sources and selected for their compatibility. Prefab refers to validation made on Prefab Version 3. The HOMSTRAD validation was made on datasets having less than 30% identity. The source of each figure is indicated by a reference.
* The EXPRESSO figure comes from a slightly more demanding subset of HOMSTRAD (HOM39) made of sequences less than 25% identical.

Improving The Evaluation

How Do We Perform In The Twilight Zone?
Consistency Based Methods Have an Edge

Hard to Tell Methods Apart

Sequence Alignment is NOT Solved

More Than Structure Based Alignments
Structural correctness is only the easy side of the coin.
In practice MSAs are intermediate models used to generate other models:
Data         Model Type      Benchmark
Homology     Profile         Yes
Evolution    Trees           No
Structure    3D-Structure    CASP
Function     Annotation      No

Conclusion
Template based Multiple Sequence Alignments
– Need for new evaluation procedures
– Projecting any relevant information onto the sequences
Using this information
– Functional analysis
– Phylogenetic analysis
– Homology search (profiles)
– Homology modelling
Integrating data
– Making sure your bits of data can fight with one another

Turning Data into Models
Data
– Columbus considered that the landmass occupied 225°, leaving only 135° of water (Marinus of Tyre, 70 AD).
– Columbus believed that 1° represented only 56 miles (Alfraganus, IXth century).
– He knew there was an island named Japan off the coast of China…
Model
– Circumference of the Earth: 25,255 km at most
– Canary Islands to Japan: 3,700 km (Reality: 12,000 km)

The More Structures The Merrier
[Plot: Average Improvement over T-Coffee vs Struc/Seq Ratio]

The Right Mix of Methods

3D-Coffee: Combining Sequences and Structures Within Multiple Sequence Alignments

Applications
Looking Up The DNA Behind The Sequences: PROTOGENE

SAR Analysis
– Correlate alignment variations with reactivity
– Application to the human kinome
– Collaboration with Sanofi-Aventis
Main issue:
– Training problem
Proper benchmarking

ncRNA Multiple Alignments with R-Coffee
Laundering the Genome Dark Matter
Cédric Notredame
Comparative Bioinformatics Group
Bioinformatics and Genomics Program

No Plane Today…

ncRNAs Comparison
And ENCODE said…
"nearly the entire genome may be represented in primary transcripts that extensively overlap and include many non-protein-coding regions"
Who Are They?
– tRNA, rRNA, snoRNAs
– microRNAs, siRNAs
– piRNAs
– long ncRNAs (Xist, Evf, Air, CTN, PINK…)
How Many of Them?
– Open question
– 30,000 is a common guess
– Harder to detect than proteins

ncRNAs Can Have Different Sequences and Similar Structures

ncRNAs are Difficult to Align
– Same structure, low sequence identity
– Small alphabet, short sequences
– Alignments often non-significant

Obtaining the Structure of a ncRNA is Difficult
– Hard to align the sequences without the structure
– Hard to predict the structures without an alignment

The Holy Grail of RNA Comparison: Sankoff's Algorithm
Simultaneous folding and alignment
– Time complexity: O(L^{2n})
– Space complexity: O(L^{3n})
In practice, for two sequences:
Length             Time       Space
50 nucleotides     1 min.     6 M.
100 nucleotides    16 min.    256 M.
200 nucleotides    4 hours    4 G.
400 nucleotides    3 days     3 T.
Forget about
– Multiple sequence alignments
– Database searches

The Next Best Thing: Consan
Consan = Sankoff + a few constraints
Use of Stochastic Context Free Grammars
– Tree-shaped HMMs
– Made sparse with constraints
The constraints are derived from the most confident positions of the alignment
Equivalent of banded DP

Going Multiple…
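
To see why going multiple is so costly, it helps to take the time complexity on the Sankoff slide at face value, read as O(L^{2n}) for n sequences of length L, and extrapolate from the 1 minute quoted for two sequences of 50 nucleotides. The sketch below does only that arithmetic; the function names and the choice of the 50-nucleotide run as the baseline are assumptions made for illustration, not measurements.

```python
def projected_minutes(length, n_seqs=2, base_length=50, base_minutes=1.0):
    """Scale a baseline run time assuming time grows as L ** (2 * n_seqs)."""
    return base_minutes * (length / base_length) ** (2 * n_seqs)


def doubling_factor(n_seqs):
    """Slow-down caused by doubling the sequence length for n_seqs sequences."""
    return 2 ** (2 * n_seqs)


# Pairwise case: compare with the 1 min / 16 min / 4 hours / 3 days on the slide.
for length in (50, 100, 200, 400):
    minutes = projected_minutes(length)
    print(f"{length:>4} nt: {minutes:6.0f} min (about {minutes / 60:.1f} h)")

# Each extra sequence raises the exponent by two.
for n_seqs in (2, 3, 4):
    print(f"{n_seqs} sequences: doubling L costs x{doubling_factor(n_seqs)}")
```

Under this assumption the pairwise projections are 1, 16, 256 and 4096 minutes, which is roughly 4 hours and just under 3 days for the last two and matches the slide. With three or four sequences, doubling the length costs a factor of 64 or 256 instead of 16.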
Structural Aligners

Game Rules
Using structural predictions
– Produces better alignments
– Is computationally expensive
Use as much structural information as possible while doing as little computation as possible…

Adapting T-Coffee To RNA Alignments

T-Coffee and Consistency…

Consistency: Conflicts and Information
[Figure: pairings among sequences W, X, Y and Z. Conflicting cases are labelled "Y is unhappy" and "X is unhappy". Fully consistent: more reliable. Partly consistent: less reliable.]

[Flowchart labels: RNA Sequences; Consan or Mafft / Muscle / ProbCons; RNAplfold; Primary Library; Secondary Structures; R-Coffee Extension; R-Coffee Extended Primary Library; R-Score; Progressive Alignment Using The R-Score]

R-Coffee Scoring Scheme
R-Score(CC) = MAX(TC-Score(CC), TC-Score(GG))
[Figure: aligned pairs C C and G G]

Validating R-Coffee
RNA alignments are harder to validate than protein alignments
Protein alignments
– Use of structure based reference alignments
RNA alignments
– No real structure based reference alignments
– The structures are mostly predicted from sequences
– Circularity

BraliBase and the BraliScore
Database of reference alignments
– 388 multiple sequence alignments
– Evenly distributed between 35 and 95 percent average sequence identity
– Contain 5 sequences selected from the RNA family database Rfam
– The reference alignment is based on a SCFG model based on the full Rfam seed dataset (~100 sequences)

BraliBase SPS Score
Rfam MSA
SPS = Number of Identically Aligned Pairs / Number of Aligned Pairs

BraliBase: SCI Score
[Figure: RNApfold gives a structure (((…)))…((..)) and a ΔG for each of Seq1 to Seq6; RNAalifold gives a consensus structure (((…)))…((..)) and a ΔG for the alignment (ALN); Covariance]
SCI = (ΔG ALN / Average ΔG Seq) × Cov

BraliScore
BraliScore = SCI × SPS

RM-Coffee + Regular Aligners
Method          Avg Braliscore           Net Improv.
                direct   +T     +R       +T      +R
-----------------------------------------------------
Poa             0.62     0.65   0.70     48      154
Pcma            0.62     0.64   0.67     34      120
Prrn            0.64     0.61   0.66     -63     45
ClustalW        0.65     0.65   0.69     -7      83
Mafft_fftnts    0.68     0.68   0.72     17      68
ProbConsRNA     0.69     0.67   0.71     -49     39
Muscle          0.69     0.69   0.73     -17     42
Mafft_ginsi     0.70     0.68   0.72     -49     39
-----------------------------------------------------
RM-Coffee4      0.71     /      0.74     /       84

How Best is the Best…
Method          vs. R-Coffee-Consan    vs. RM-Coffee4
Poa             241 ***                217 ***
T-Coffee        241 ***                199 ***
Prrn            232 ***                198 ***
Pcma            218 ***                151 ***
Proalign        216 ***                150 **
Mafft fftns     206 ***                148 *
ClustalW        203 ***                136 ***
Probcons        192 ***                128 *
Mafft ginsi     170 ***                115
Muscle          169 ***                111
M-Locarna       234 ***                183 **
Stral           169 ***                62
FoldalignM      146                    61
Murlet          130 *                  -12
Rnasampler      129 *                  -27
T-Lara          125 *                  -30

Range of Performances

Effect of Compensated Mutations

Conclusion/Future Directions
– T-Coffee/Consan is currently the best MSA protocol for ncRNAs
– Testing how important the accuracy of the secondary structure prediction is
– Going deeper into Sankoff's territory: predicting and aligning simultaneously

Credits and Web Servers
Andreas Wilm
Des Higgins
Sebastien Moretti
Ioannis Xenarios
Cedric Notredame
CGR, SIB, UCD
www.tcoffee.org
cedric.notredame@europe.com