L - T-Coffee

Growing Trees on the Right
Compost
Cédric Notredame
Comparative Bioinformatics Group
Bioinformatics and Genomics Program
Mangel M., Samaniego F.J.,
Abraham Wald’s Work on Aircraft Survivability,
J. American Statistical Association. 79, 259-270, (1984)
What’s in a Multiple Sequence
Alignment
Selection
Important Features
Are Preserved
Evolution Inertia
Functional Constraint
Common Ancestry
Shows up
In the sequences
Phylogenetic Footprint, Evolutionary Trace …
Same Function
Same Sequence
Convergence
Why So Much Interest For Multiple
Alignments ?
Extrapolation
Structure Prediction
Motifs/Patterns
SNP Analysis
Profiles
Regulatory Elements
Phylogeny
Reactivity Analysis
What’s in a Multiple Alignment ?

The MSA contains what you put inside:
– Structural Similarity
– Evolutionary Similarity
– Sequence Similarity

You can view your MSA as:
– A record of evolution
– A summary of a protein family
– A collection of experiments made for you by Nature…
Producing The Right Alignment

Multiple Sequence Alignments Influence Phylogenetic Trees

Choice of Method is not Neutral
– Different Methods
– Different Alignments
– Different Trees

Using The Right Models ensures Producing the right Tree
Model Based Alignments vs Naïve
Alignments

Naïve Alignment
– Lexicographic Alignment
– Maximizing the number of identities
– At best using a substitution matrix

Model Based Alignments
– Using a model
– Protein structure information
– RNA Structure information
– Combining/Confronting Modeling methods

Template based Alignments
– Model based Alignments through the use of Templates
T-Coffee and Model Based Alignments

T-Coffee Algorithm

Expresso: Aligning Protein Structures

R-Coffee: Aligning RNA structures

M-Coffee: Combining methods
T-Coffee: An extension of the
progressive Alignment Algorithm
T-Coffee and Consistency…
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT
SeqC GARFIELD THE VERY FAST CAT
SeqD THE FAT CAT

SeqA GARFIELD THE LAST FA-T CAT
SeqB GARFIELD THE FAST CA-T ---
SeqC GARFIELD THE VERY FAST CAT
SeqD -------- THE ---- FA-T CAT
T-Coffee and Consistency…
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT ---
Prim. Weight =88
SeqA GARFIELD THE LAST FA-T CAT
SeqC GARFIELD THE VERY FAST CAT
Prim. Weight =77
SeqA GARFIELD THE LAST FAT CAT
SeqD -------- THE ---- FAT CAT
Prim. Weight =100
SeqB GARFIELD THE ---- FAST CAT
SeqC GARFIELD THE VERY FAST CAT
Prim. Weight =100
SeqC GARFIELD THE VERY FAST CAT
SeqD -------- THE ---- FA-T CAT
Prim. Weight =100
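The primary weights above can be read as percent identities of the pairwise alignments. The sketch below is a minimal illustration, assuming the weight is the truncated percent identity over columns where both sequences carry a residue; the function name `primary_weight` is hypothetical. Under that assumption it gives 88 for SeqA/SeqB and 100 for SeqA/SeqD, but 80 rather than 77 for SeqA/SeqC, so the slide values should be taken as illustrative rather than as the output of this exact formula.

```python
def primary_weight(row_a: str, row_b: str) -> int:
    """Truncated percent identity over columns where both rows have a residue.

    Assumption: spaces only separate words on the slide and are ignored;
    '-' marks a gap.
    """
    a = row_a.replace(" ", "")
    b = row_b.replace(" ", "")
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    # Keep only columns without a gap in either sequence.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0
    matches = sum(1 for x, y in pairs if x == y)
    return 100 * matches // len(pairs)


if __name__ == "__main__":
    print(primary_weight("GARFIELD THE LAST FAT CAT",
                         "GARFIELD THE FAST CAT ---"))
    print(primary_weight("GARFIELD THE LAST FAT CAT",
                         "-------- THE ---- FAT CAT"))
```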
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT ---
Weight =88
SeqA GARFIELD THE LAST FA-T CAT
SeqC GARFIELD THE VERY FAST CAT
SeqB GARFIELD THE ---- FAST CAT
Weight =77
SeqA GARFIELD THE LAST FA-T CAT
SeqD -------- THE ---- FA-T CAT
SeqB GARFIELD THE ---- FAST CAT
Weight =100
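The triplets above suggest how a pair of residues from SeqA and SeqB collects support through SeqC and SeqD: each path contributes the smaller of its two weights (77 through SeqC, 100 through SeqD). The sketch below is one way to write this down, assuming the extended weight is the direct weight plus the sum of these minima. All function names are hypothetical, and the SeqB/SeqD alignment is an assumption, since it is implied by the third triplet but not listed in the primary library.

```python
def aligned_pairs(row_a, row_b):
    """Yield (i, j) residue indices matched in a gapped pairwise alignment."""
    i = j = 0
    for x, y in zip(row_a.replace(" ", ""), row_b.replace(" ", "")):
        if x != "-" and y != "-":
            yield i, j
        if x != "-":
            i += 1
        if y != "-":
            j += 1


def build_library(alignments):
    """alignments: {(name_a, name_b): (row_a, row_b, weight)}."""
    lib = {}
    for (a, b), (row_a, row_b, w) in alignments.items():
        for i, j in aligned_pairs(row_a, row_b):
            lib.setdefault((a, b), {})[(i, j)] = w
            lib.setdefault((b, a), {})[(j, i)] = w
    return lib


def extended_weight(lib, a, i, b, j, names):
    """Direct weight plus, for each third sequence, min of the two path weights."""
    total = lib.get((a, b), {}).get((i, j), 0)
    for c in names:
        if c in (a, b):
            continue
        for (ia, k), w_ac in lib.get((a, c), {}).items():
            if ia != i:
                continue
            w_cb = lib.get((c, b), {}).get((k, j))
            if w_cb is not None:
                total += min(w_ac, w_cb)
    return total


PRIMARY = {
    ("A", "B"): ("GARFIELD THE LAST FAT CAT", "GARFIELD THE FAST CAT ---", 88),
    ("A", "C"): ("GARFIELD THE LAST FA-T CAT", "GARFIELD THE VERY FAST CAT", 77),
    ("A", "D"): ("GARFIELD THE LAST FAT CAT", "-------- THE ---- FAT CAT", 100),
    ("B", "C"): ("GARFIELD THE ---- FAST CAT", "GARFIELD THE VERY FAST CAT", 100),
    ("C", "D"): ("GARFIELD THE VERY FAST CAT", "-------- THE ---- FA-T CAT", 100),
    # Assumed entry: not shown in the primary library on the slide.
    ("B", "D"): ("GARFIELD THE FAST CAT", "-------- THE FA-T CAT", 100),
}

if __name__ == "__main__":
    lib = build_library(PRIMARY)
    names = ["A", "B", "C", "D"]
    # F of FAT in SeqA (index 15) against F of FAST in SeqB (index 11).
    print(extended_weight(lib, "A", 15, "B", 11, names))
    # L of LAST in SeqA (index 11) against F of FAST in SeqB (index 11).
    print(extended_weight(lib, "A", 11, "B", 11, names))
```

With these inputs the F/F pairing, absent from the direct SeqA/SeqB alignment, ends up with more support (77 + 100) than the L/F pairing that the direct alignment proposes (88), which is the point of the consistency step.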
When Sequences Are not Enough
3D-Coffee and Expresso
3D-Coffee:
Combining Sequences and Structures
Within Multiple Sequence Alignments
Expresso: Finding the Right Structure
[Flowchart: Sources → BLAST → Templates → SAP → Template Alignment → Source Template Alignment → Remove Templates → Library]
Incorporating RNA Information Within
the T-Coffee Algorithm
ncRNAs Can Evolve Rapidly
[Figure: two hairpin structures with the same stem pairing but different sequences]
CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
**-------*--**---*-**------**
R-Coffee: Modifying T-Coffee at the Right Place

Incorporation of Secondary Structure information within the Library

Two Extra Components for the T-Coffee Scoring Scheme
– A new Library
– A new Scoring Scheme
R-Coffee Extension
[Figure: TC Library entries for two base-paired columns: G G with Score X and C C with Score Y]
Goal: Embedding RNA Structures Within The T-Coffee Libraries
The R-extension can be added on top of any existing method.
R-Coffee + Structural Aligners
Method         Avg Braliscore            Net Improv.
               direct   +T      +R       +T      +R
----------------------------------------------------------
Stemloc        0.62     0.75    0.76     104     113
Mlocarna       0.66     0.69    0.71     101     133
Murlet         0.73     0.70    0.72     -132    -73
Pmcomp         0.73     0.73    0.73     142     145
T-Lara         0.74     0.74    0.69     -36     -8
Foldalign      0.75     0.77    0.77     72      73
----------------------------------------------------------
Dynalign       --       0.63    0.62     --      --
Consan         --       0.79    0.79     --      --
----------------------------------------------------------
Improvement = # R-Coffee wins - # R-Coffee losses over 170 test sets
R-Coffee + Regular Aligners
Method         Avg Braliscore            Net Improv.
               direct   +T      +R       +T      +R
----------------------------------------------------------
Poa            0.62     0.65    0.70     48      154
Pcma           0.62     0.64    0.67     34      120
Prrn           0.64     0.61    0.66     -63     45
ClustalW       0.65     0.65    0.69     -7      83
Mafft_fftnts   0.68     0.68    0.72     17      68
ProbConsRNA    0.69     0.67    0.71     -49     39
Muscle         0.69     0.69    0.73     -17     42
Mafft_ginsi    0.70     0.68    0.72     -49     39
----------------------------------------------------------
Improvement = # R-Coffee wins - # R-Coffee losses over 388 test sets
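The "Net Improv." columns count, over all test sets, how often the combined protocol beats the original method minus how often it does worse. A minimal sketch of that count follows; the function name and the per-dataset scores are hypothetical and are not taken from the benchmark.

```python
def net_improvement(base_scores, combined_scores):
    """Number of test sets won by the combined protocol minus number lost.

    Ties count for neither side.
    """
    if len(base_scores) != len(combined_scores):
        raise ValueError("both methods must be scored on the same test sets")
    wins = sum(1 for b, c in zip(base_scores, combined_scores) if c > b)
    losses = sum(1 for b, c in zip(base_scores, combined_scores) if c < b)
    return wins - losses


if __name__ == "__main__":
    # Hypothetical per-dataset Braliscores for a method alone and with +R.
    alone = [0.60, 0.70, 0.80, 0.50]
    with_r = [0.70, 0.70, 0.75, 0.60]
    print(net_improvement(alone, with_r))
```

A positive value means the combination helps on more test sets than it hurts, which says nothing about the size of each individual gain.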
Choosing the right modeling method
M-Coffee
Combining Many MSAs into ONE
ClustalW
MAFFT
T-Coffee
MUSCLE
???????
Comparing Methods
MAFFT
Where to Trust Your Alignments
Most Methods Disagree
Most Methods Agree
What To Do Without Structures
Conclusion

Model Based Alignments Give the best Accuracy

Template based alignment is a very efficient way to
turn Naïve aligners into model based aligners

Sequence Alignments are not necessarily reliable
over their entire lengths
www.tcoffee.org














Fabrice Armougom (CNRS, FR)
Sebastien Moretti (CNRS, FR)
Olivier Poirot (CNRS, FR)
Frederic Reinier (CRS4, IT)
Karsten Suhre (CNRS, FR)
Vladimir Saudek (Sanofi-Aventis, FR)
Des Higgins (UCD, IE)
Orla O’Sullivan (UCD, IE)
Iain Wallace (UCD, IE)
Victor Jongeneel (SIB/VitalIT, CH)
Bruno Nyfler (VitalIT, CH)
Roger Hersch (EPFL, CH)
Pierre Dumas (EPFL, CH)
Basile Schaeli (EPFL, CH)
www.tcoffee.org
cedric.notredame@europe.com
Building and Using Models
35.67 Angstrom
Computing the Correct Alignment is a
Complicated Problem
Stochastic Optimization

Exploration of Complex Optimization Problems With Multiple Constraints
– Genomic Alignments
– RNA Alignments

Generation of Population of Suboptimal Solutions
– Quality=f( optimality )
Specification of Consistency
Objective Function of TCoffee
Three Types of Algorithms

Progressive: ClustalW

Iterative: Muscle

Consistency Based: T-Coffee and Probcons
T-Coffee and Consistency…

Each Library Line is a Soft Constraint (a wish)

You can’t satisfy them all

You must satisfy as many as possible (The easy ones)
Consistency Based Algorithms:
T-Coffee

Gotoh (1990)
– Iterative strategy using consistency

Martin Vingron (1991)
– Dot Matrices Multiplications
– Accurate but too stringent

Dialign (1996, Morgenstern)
– Consistency
– Agglomerative Assembly

T-Coffee (2000, Notredame)
– Consistency
– Progressive algorithm

ProbCons (2004, Do)
– T-Coffee with a Bayesian Treatment
How Good Is My Method ?
Structures Vs Sequences
Validation Using BaliBase
T-Coffee Results
Too Many Methods for ONE Alignment
M-Coffee
Estimating the Accuracy of your
MSA
What To Do Without Structures
3D-Coffee:
Combining Sequences and Structures
Within Multiple Sequence Alignments
Expresso: Finding the Right Structure
Why Not Using
Structure Based
Alignments
Template Based Multiple
Sequence Alignments
[Flowchart: Sources → Template Aligner → Templates (Structure, Profile, …) → Template Alignment → Source Template Alignment → Remove Templates → Library]
Method       Score         Templates    Prefab   Homstrad
-------------------------------------------------------------
ClustalW     Matrix        ---          61.80    ---
Kalign       Matrix        ---          63.00    ---
MUSCLE       Matrix        ---          68.00    45.0
-------------------------------------------------------------
T-Coffee     Consistency   ---          69.97    44.0
ProbCons     Consistency   ---          70.54    ---
Mafft        Consistency   ---          72.20    ---
M-Coffee     Consistency   ---          72.91    ---
MUMMALS      Consistency   ---          73.10    ---
-------------------------------------------------------------
Clustal-db   Matrix        Profiles     ---      ---
PRALINE      Matrix        Profiles     ---      50.2
PROMALS      Consistency   Profiles     79.00    ---
SPEM         Matrix        Profiles     77.00    ---
-------------------------------------------------------------
EXPRESSO     Consistency   Structures   ---      71.9 *
T-Lara       Consistency   Structures   ---      ---
-------------------------------------------------------------
Table 1. Summary of all the methods described in the review. Validation figures were compiled from several sources and selected for their compatibility. Prefab refers to validation made on Prefab Version 3. The HOMSTRAD validation was made on datasets having less than 30% identity. The source of each figure is indicated by a reference.
*The EXPRESSO figure comes from a slightly more demanding subset of HOMSTRAD (HOM39) made of sequences less than 25% identical.
Improving The Evaluation
How Do We Perform In The Twilight
Zone?



Consistency Based Methods Have an Edge
Hard to tell Methods Apart
Sequence Alignment is NOT solved
More Than Structure based Alignments

Structural Correctness Is Only the Easy Side of the Coin.

In practice MSA are intermediate models used to generate other models:
Data         Model Type     Benchmark
Homology     Profile        Yes
Evolution    Trees          No
Structure    3D-Structure   CASP
Function     Annotation     No
Conclusion

Template based Multiple Sequence Alignments



Need for new evaluation procedures





Projecting any relevant information onto the sequences
Using this Information
Functional Analysis
Phylogenetic Analysis
Homology Search (Profiles)
Homology Modelling
Integrating data → Making sure your bits of data can fight with one another
Turning Data into Models
Data
Columbus considered that the landmass occupied 225°, leaving
only 135° of water (Marinus of Tyre, 70 AD).
Columbus believed that 1° represented only 56 miles (Alfraganus,
IXth century)
He knew there was an island named Japan off the coast of China…
Model
Circumference of the Earth as 25,255 km at most,
Canary Island to Japan : 3,700 km (Reality: 12,000 km.)
The More Structures The Merrier
[Plot: Average Improvement over T-Coffee vs. Struc/Seq Ratio]
The Right Mix of Methods
3D-Coffee:
Combining Sequences and Structures
Within Multiple Sequence Alignments
Applications
Looking-Up The DNA Behind The
Sequences: PROTOGENE
SAR Analysis

Correlate Alignment Variations with Reactivity
Application to the Human Kinome
Collaboration with Sanofi-Aventis

Main Issue:
– Training problem → Proper Benchmarking
ncRNA Multiple Alignments with
R-Coffee
Laundering the Genome Dark Matter
Cédric Notredame
Comparative Bioinformatics Group
Bioinformatics and Genomics Program
No Plane Today…
ncRNAs Comparison

And ENCODE said…
“nearly the entire genome may be represented in primary transcripts
that extensively overlap and include many non-protein-coding regions”

Who Are They?
– tRNA, rRNA, snoRNAs,
– microRNAs, siRNAs
– piRNAs
– long ncRNAs (Xist, Evf, Air, CTN, PINK…)

How Many of them
– Open question
– 30,000 is a common guess
– Harder to detect than proteins
ncRNAs can have different sequences
and Similar Structures
ncRNAs are Difficult to Align

Same Structure, Low Sequence Identity

Small Alphabet, Short Sequences → Alignments often Non-Significant
Obtaining the Structure of a ncRNA is difficult

Hard to Align The Sequences Without the Structure

Hard to Predict the Structures Without an Alignment
The Holy Grail of RNA Comparison:
Sankoff’s Algorithm

Simultaneous Folding and Alignment
– Time Complexity: O(L^(2n))
– Space Complexity: O(L^(3n))

In Practice, for Two Sequences:
– 50 nucleotides: 1 min., 6 M.
– 100 nucleotides: 16 min., 256 M.
– 200 nucleotides: 4 hours, 4 G.
– 400 nucleotides: 3 days, 3 T.

Forget about
– Multiple sequence alignments
– Database searches
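The running times quoted for two sequences follow from the stated time complexity: with n = 2 the cost grows as the fourth power of the length, so doubling the length multiplies the time by 16. The sketch below extrapolates from the 50-nucleotide figure under that assumption; the function name and its defaults are hypothetical. The memory figures are not reproduced here, since they do not follow a single power law.

```python
def extrapolated_minutes(length, n_seqs=2, base_length=50, base_minutes=1.0):
    """Scale a reference running time assuming cost proportional to L^(2n)."""
    return base_minutes * (length / base_length) ** (2 * n_seqs)


if __name__ == "__main__":
    for length in (50, 100, 200, 400):
        minutes = extrapolated_minutes(length)
        print(f"{length} nt: {minutes:.0f} min ({minutes / 60:.1f} h)")
```

Under this scaling, 200 nucleotides come out at 256 minutes and 400 nucleotides at 4096 minutes, close to the 4 hours and 3 days listed above.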
The next best Thing: Consan

Consan = Sankoff + a few constraints

Use of Stochastic Context Free Grammars
– Tree-shaped HMMs
– Made sparse with constraints

The constraints are derived from the most confident positions of the alignment

Equivalent of Banded DP
Going Multiple….
Structural Aligners
Game Rules

Using Structural Predictions
– Produces better alignments
– Is Computationally expensive

Use as much structural information as possible while doing as little computation as possible…
Adapting T-Coffee
To
RNA Alignments
Consistency: Conflicts and Information
[Figure: residues W, X, Y and Z under conflicting pairwise constraints; in one arrangement "Y is unhappy", in the other "X is unhappy"]
Fully Consistent → More Reliable
Partly Consistent → Less Reliable
[Flowchart: RNA Sequences → Consan or Mafft / Muscle / ProbCons → Primary Library; RNA Sequences → RNAplfold → Secondary Structures; Primary Library + Secondary Structures → R-Coffee Extension → R-Coffee Extended Primary Library → R-Score → Progressive Alignment Using The R-Score]
R-Coffee Scoring Scheme
R-Score(CC) = MAX(TC-Score(CC), TC-Score(GG))
[Figure: two aligned C residues, each base-paired with a G]
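Read literally, the R-Score of matching two C residues is the larger of their own TC-Score and the TC-Score of their base-paired G partners. The sketch below illustrates that rule; the data layout (dictionaries keyed by residue indices), the function name and the fallback to the plain TC-Score for unpaired residues are assumptions made for illustration, not the R-Coffee implementation.

```python
def r_score(tc_score, pairing_a, pairing_b, i, j):
    """R-Score of matching residue i of sequence A with residue j of sequence B.

    tc_score:  {(i, j): TC-Score of matching residue i of A with residue j of B}
    pairing_a: {residue index in A: index of its base-paired partner in A}
    pairing_b: same for sequence B
    """
    direct = tc_score.get((i, j), 0)
    partner_i = pairing_a.get(i)
    partner_j = pairing_b.get(j)
    if partner_i is None or partner_j is None:
        # Assumption: without a predicted partner on both sides, keep the TC-Score.
        return direct
    return max(direct, tc_score.get((partner_i, partner_j), 0))


if __name__ == "__main__":
    # Hypothetical scores: the C/C match is weak, its G/G partners match strongly.
    tc = {(2, 3): 10, (20, 18): 40}
    pairs_a = {2: 20, 20: 2}
    pairs_b = {3: 18, 18: 3}
    print(r_score(tc, pairs_a, pairs_b, 2, 3))
```

The effect is that a well-supported match on one side of a stem lifts the score of the match on the other side, even when the sequences there have diverged.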
Validating R-Coffee
RNA Alignments are harder to validate than Protein Alignments

Protein Alignments → Use of Structure based Reference Alignments

RNA Alignments → No Real structure based reference alignments
– The structures are mostly predicted from sequences
– Circularity
BraliBase and the BraliScore

Database of Reference Alignments

388 multiple sequence alignments.

Evenly distributed between 35 and 95 percent average sequence identity

Contain 5 sequences selected from the RNA family database Rfam

The reference alignment is based on a SCFG model based on the full Rfam seed dataset (~100 sequences).
BraliBase SPS Score
[Figure: Rfam reference alignment compared with the test MSA]
SPS = (Number of Identically Aligned Pairs) / (Number of Aligned Pairs)
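The SPS ratio can be computed by listing every pair of residues that share a column in each alignment. The sketch below is a minimal version, assuming the denominator is the number of residue pairs aligned in the reference and that both alignments list the same sequences in the same order; the function names are hypothetical.

```python
from itertools import combinations


def residue_pairs(alignment):
    """Set of residue pairs sharing a column; residues are (sequence, index)."""
    pairs = set()
    counters = [0] * len(alignment)
    for column in zip(*alignment):
        present = []
        for seq_id, char in enumerate(column):
            if char != "-":
                present.append((seq_id, counters[seq_id]))
                counters[seq_id] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sps(test, reference):
    """Fraction of reference residue pairs that are also aligned in the test MSA."""
    ref_pairs = residue_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(residue_pairs(test) & ref_pairs) / len(ref_pairs)


if __name__ == "__main__":
    reference = ["AC-GU", "ACGGU"]
    test = ["ACG-U", "ACGGU"]
    print(sps(test, reference))
```

In this toy case three of the four reference pairs are recovered, so the score is 0.75.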
BraliBase: SCI Score
Covariance
RNApfold
(((…)))…((..)) DG Seq1
(((…)))…((..)) DG Seq2
(((…)))…((..)) DG Seq3
(((…)))…((..)) DG Seq4
(((…)))…((..)) DG Seq5
(((…)))…((..)) DG Seq6
RNAalifold
SCI=
(((…)))…((..)) ALN DG
Average DG Seq X Cov
DG ALN
BRaliScore
Braliscore= SCI*SPS
RM-Coffee + Regular Aligners
Method         Avg Braliscore            Net Improv.
               direct   +T      +R       +T      +R
----------------------------------------------------------
Poa            0.62     0.65    0.70     48      154
Pcma           0.62     0.64    0.67     34      120
Prrn           0.64     0.61    0.66     -63     45
ClustalW       0.65     0.65    0.69     -7      83
Mafft_fftnts   0.68     0.68    0.72     17      68
ProbConsRNA    0.69     0.67    0.71     -49     39
Muscle         0.69     0.69    0.73     -17     42
Mafft_ginsi    0.70     0.68    0.72     -49     39
----------------------------------------------------------
RM-Coffee4     0.71     /       0.74     /       84
How Best is the Best….
Method         vs. R-Coffee-Consan   vs. RM-Coffee4
Poa            241 ***               217 ***
T-Coffee       241 ***               199 ***
Prrn           232 ***               198 ***
Pcma           218 ***               151 ***
Proalign       216 ***               150 **
Mafft fftns    206 ***               148 *
ClustalW       203 ***               136 ***
Probcons       192 ***               128 *
Mafft ginsi    170 ***               115
Muscle         169 ***               111
M-Locarna      234 ***               183 **
Stral          169 ***               62
FoldalignM     146                   61
Murlet         130 *                 -12
Rnasampler     129 *                 -27
T-Lara         125 *                 -30
Range of Performances
Effect of
Compensated
Mutations
Conclusion/Future Directions

T-Coffee/Consan is currently the best MSA
protocol for ncRNAs

Testing how important is the accuracy of the
secondary structure prediction

Going deeper into Sankoff’s territory:
predicting and aligning simultaneously
Credits and Web Servers

Andreas Wilm
Des Higgins
Sebastien Moretti
Ioannis Xenarios
Cedric Notredame

CGR, SIB, UCD




www.tcoffee.org
cedric.notredame@europe.com