Clustal Ω for Protein Multiple Sequence Alignment
Des Higgins (Conway Institute, University College Dublin,
Ireland), “Clustal Omega for Protein Multiple Sequence
Alignment,” presentation at ISMB/ECCB 2011.
Sievers et al., “Fast, scalable generation of high quality protein
multiple sequence alignments using Clustal Omega,”
unpublished manuscript, 2011.
Presented by Hershel Safer in Ron Shamir’s group meeting on
17.8.2011.
Clustal Omega for Protein Multiple Sequence Alignment – Hershel Safer
17 August 2011
Page 1
Outline
Background on multiple sequence alignment (MSA)
Considerations for a new MSA tool
Clustal Ω
Benchmarking: Methods and issues
Benchmarking results
References
Example of MSA: Globins
From Higgins 2011
Example continued: Red columns are alpha helices
From Higgins 2011
Approaches to finding MSAs
Exact solution using dynamic programming: Finding “optimal”
MSA for N sequences of length L takes time O(L^N)
Progressive alignment: Greedy heuristic that mimics evolution.
• Start by creating guide tree that specifies “evolutionary
closeness.” Complexity is O(N^2) for fixed L.
• Build increasingly large sub-alignments in the order specified
by the guide tree. Complexity is O(N).
• Works for up to a few thousand sequences
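To make the O(L^N) figure concrete: for two sequences the exact method fills an (L+1) x (L+1) table, and for N sequences the table has one dimension per sequence. The sketch below is a minimal two-sequence case (Needleman-Wunsch scoring only). The match, mismatch, and gap values and the helper dp_cells are illustrative assumptions, not taken from the slides.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b (Needleman-Wunsch)."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] holds the best score for aligning a[:i] with b[:j].
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                table[i - 1][j] + gap,      # gap in b
                table[i][j - 1] + gap,      # gap in a
            )
    return table[-1][-1]


def dp_cells(length, n_sequences):
    """Number of table cells for n_sequences sequences of the given length."""
    return (length + 1) ** n_sequences
```

For example, dp_cells(100, 2) is about 10^4 cells, while dp_cells(100, 5) is already about 10^10, which is why the exact method is limited to a handful of sequences.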
Example of progressive alignment
From Higgins 2011
Example of progressive alignment, cont’d.
From Higgins 2011
Example of progressive alignment, cont’d.
From Higgins 2011
Features of progressive alignment
Advantages
• Fast
• Gives pretty good results on large problems
• Provides good basis for manual tweaking
Disadvantages
• Hard to know if a solution is good – no objective function
• Errors are not corrected. Once two sequences are aligned,
they keep the same relative alignment (e.g., later indels
apply identically to both sequences).
Consistency criterion
Addresses problem of errors introduced by early mis-alignments
Use library of pairwise alignments that is created for building the
guide tree
For each pair of aligned residues in the library, check their
alignment in other pairwise alignments.
Scores for progressive alignment are modified to reflect
consistency across the entire library of pairwise comparisons.
Helps avoid early mis-alignment.
Complexity: worst case O(N^3 L^2), in practice O(N^3 L).
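A rough sketch of the consistency idea, in the spirit of T-Coffee's library extension: the weight of an aligned residue pair is increased by the support it gets through residues of a third sequence. The data layout (residues as (sequence_name, index) tuples, a dict of weighted pairs) and the function name extend_library are assumptions made for illustration. This simplified version only re-weights pairs already in the library and does not create new ones.

```python
from collections import defaultdict


def extend_library(pairs):
    """Re-weight aligned residue pairs by their support via third sequences.

    pairs maps (residue_1, residue_2) to a weight, where each residue is a
    (sequence_name, index) tuple taken from a pairwise alignment.
    """
    # Index every residue's aligned partners in both directions.
    neighbours = defaultdict(dict)
    for (r1, r2), weight in pairs.items():
        neighbours[r1][r2] = weight
        neighbours[r2][r1] = weight

    extended = {}
    for (r1, r2), weight in pairs.items():
        support = weight
        for mid, w1 in neighbours[r1].items():
            if mid[0] in (r1[0], r2[0]):
                continue  # the intermediate must lie in a third sequence
            w2 = neighbours[r2].get(mid)
            if w2 is not None:
                # Both r1 and r2 align to the same residue elsewhere.
                support += min(w1, w2)
        extended[(r1, r2)] = support
    return extended
```

Every pair of sequences is checked against every third sequence, which is where the cubic dependence on N comes from.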
Two kinds of popular MSA tools
Fast (<10,000 sequences)
• Clustal W
• MAFFT (with --partree, can handle >>10,000 sequences)
• Muscle
• Kalign
Accurate but slow (up to hundreds of sequences)
• T-Coffee
• ProbCons
• MSAProbs
Why a new MSA tool?
Starting to see uses for MSAs with hundreds of thousands of
sequences
• Metagenomics
• Next-generation sequencing
Goals for a new MSA tool
Want a tool that scales well (time and space) to hundreds of
thousands of sequences and still gives accurate results
Scalability: Up to several hours to align hundreds of thousands of
sequences on a desktop computer
Accuracy: Similar to Clustal W
Clustal Ω: Possibly the last MSA tool you will need
Building guide tree: Use mBed to cluster in time O(N log(N))
Progressive alignment: Use HHalign to sequentially align pairs of
profile HMMs
Take advantage of existing alignments
• External profile alignment: Use an existing profile HMM of
sequences homologous to input set to help align input set
• Iterate guide tree construction and/or progressive alignment
• Add sequences to existing alignments without starting from
scratch
Building guide tree using mBed
Reduces quadratic time/space of clustering and guide-tree
construction to O(N log(N))
1. Cluster sequences
a. Select log2(N) seed sequences
b. Compute distance from each sequence to all seeds, using
k-tuple distance measure (k=2) for unaligned sequences.
c. Cluster sequences using k-means
2. Build guide tree
a. Construct UPGMA sub-tree separately for each cluster
(use UPGMA code from Muscle)
b. Link sub-trees using distances between clusters
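The embedding in step 1 can be sketched as follows: each sequence becomes a short vector of distances to the seeds, so only N x (number of seeds) distances are computed instead of all N(N-1)/2 pairs. The distance formula below is a crude stand-in for the k-tuple measure, and the seed-selection rule (evenly spaced after sorting by length) and the function names are assumptions for illustration. The number of seeds is left to the caller.

```python
from collections import Counter


def ktuple_distance(a, b, k=2):
    """Crude k-tuple distance: 1 - shared k-mers / k-mers in the shorter sequence."""
    kmers_a = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    kmers_b = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    shortest = min(sum(kmers_a.values()), sum(kmers_b.values()))
    if shortest == 0:
        return 1.0  # a sequence shorter than k shares nothing
    shared = sum((kmers_a & kmers_b).values())
    return 1.0 - shared / shortest


def pick_seeds(sequences, n_seeds):
    """Pick n_seeds sequences, evenly spaced after sorting by length (assumed rule)."""
    ordered = sorted(sequences, key=len)
    step = max(1, len(ordered) // n_seeds)
    return ordered[::step][:n_seeds]


def embed(sequences, seeds, k=2):
    """Represent each sequence by its vector of distances to the seeds."""
    return [[ktuple_distance(s, seed, k) for seed in seeds] for s in sequences]
```

The resulting vectors can then be handed to any k-means implementation for step 1c.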
Progressive alignment using HHalign
HHalign is a method for pairwise alignment of profile HMMs
It was designed to search HMM databases to identify remote
homologs (sequence identity <20%)
In Clustal Ω, sequences and sub-alignments are converted to
profile HMMs. Transition, insertion, and deletion probabilities
are computed, and pseudo-counts are added as needed.
HHalign is used to align sub-alignments, in the order defined by
the guide tree.
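As a simplified illustration of turning a sub-alignment into a profile, the sketch below computes per-column residue probabilities with a flat additive pseudo-count. It covers emission probabilities only and does not model transitions, insertions, or deletions. The flat pseudo-count is an assumption chosen for simplicity and is not a description of what HHalign does.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(alignment, pseudocount=1.0):
    """Per-column amino-acid probabilities for a list of equal-length gapped rows."""
    profile = []
    for col in range(len(alignment[0])):
        # Start every residue at the pseudo-count so unseen residues
        # keep a small non-zero probability.
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for row in alignment:
            residue = row[col]
            if residue in counts:  # gaps and unknown symbols are skipped
                counts[residue] += 1
        total = sum(counts.values())
        profile.append({aa: count / total for aa, count in counts.items()})
    return profile
```

Two such profiles can be compared column by column, which is the kind of input a profile-profile aligner works on.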
External profile alignment (EPA)
Take advantage of existing HMMs to guide pairwise alignment in
early stages – avoid seemingly good alignments that are bad in
the context of the entire MSA
If the kinds of sequences are known, can often find a relevant
HMM in Pfam.
Contribution of external profile decreases as sub-alignments get
larger, as larger sub-alignments contain the information that
would come from the external profile.
Overhead: Can triple the alignment time
EPA performance
Iteration instead of EPA
Can bootstrap profile information if external profile is not
available or not desired
MSA of original sequences can be converted to HMM and used
as in EPA
MSA can also be used to rebuild guide tree
Can iterate this process
Can decouple iteration of guide-tree construction and HMM
construction – can freeze one and just iterate the other, or
iterate both
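On the command line, EPA and iteration map onto a few options of the clustalo program. The helper below only assembles an argument list and does not run anything. The function name and file names are hypothetical, and the option spellings (-i, -o, --hmm-in, --iter, --max-guidetree-iterations, --max-hmm-iterations) are given as recalled from the Clustal Omega help text, so they should be checked against clustalo --help for the installed version.

```python
def clustalo_command(in_fasta, out_fasta, iterations=0, hmm_file=None,
                     max_guidetree_iterations=None, max_hmm_iterations=None):
    """Build an argument list for a clustalo run (nothing is executed here)."""
    cmd = ["clustalo", "-i", in_fasta, "-o", out_fasta]
    if hmm_file is not None:
        # External profile alignment: supply an existing HMM, e.g. from Pfam.
        cmd.append(f"--hmm-in={hmm_file}")
    if iterations:
        # Combined guide-tree/HMM iterations.
        cmd.append(f"--iter={iterations}")
    if max_guidetree_iterations is not None:
        # Cap guide-tree iterations separately (decoupled iteration).
        cmd.append(f"--max-guidetree-iterations={max_guidetree_iterations}")
    if max_hmm_iterations is not None:
        # Cap HMM iterations separately (decoupled iteration).
        cmd.append(f"--max-hmm-iterations={max_hmm_iterations}")
    return cmd
```

The returned list can be passed to subprocess.run once the options have been verified.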
Iteration performance
Availability of Clustal Ω
Download a copy (Unix/Linux, Windows, Mac)
EBI website
Galaxy analysis system (coming soon?)
Benchmark databases for MSA
BAliBASE
• Collection of manually refined MSAs based on 3D structural
superposition
• Annotated core blocks: highly conserved regions that can be
reliably aligned
• Occasionally updated to represent kinds of complex
sequences encountered in real problems, as kinds of
alignments attempted change.
• Divided into reference sets that represent different kinds of
alignment challenges
Other MSA benchmark DBs: Prefab, Homstrad, Oxbench,
SABmark, IRMbase
Clustal Ω benchmarking approach
Compared to 11 other MSA programs
Score is fraction of columns identical in generated and reference
alignments
Used 3 benchmark databases
• BAliBASE: Consider only core regions of alignments
• Prefab
• HomFam: Created for this work to test scalability to many
sequences. Combined Homstrad families with corresponding
Pfam families. Only tested with “fast” tools.
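A minimal sketch of the column score described above, assuming both alignments are given as lists of equal-length gapped strings with the sequences in the same order. A column counts as identical only if it places the same residue (by position within its sequence) or a gap in every row. Restricting the comparison to core blocks is not modelled here, and the function names are illustrative.

```python
def columns(alignment):
    """Describe each column by the residue index (or None for a gap) in every row."""
    counters = [0] * len(alignment)
    result = []
    for col in range(len(alignment[0])):
        entry = []
        for row_index, row in enumerate(alignment):
            if row[col] == "-":
                entry.append(None)
            else:
                entry.append(counters[row_index])
                counters[row_index] += 1
        result.append(tuple(entry))
    return result


def column_score(test, reference):
    """Fraction of reference columns reproduced exactly in the test alignment."""
    test_columns = set(columns(test))
    reference_columns = columns(reference)
    matched = sum(col in test_columns for col in reference_columns)
    return matched / len(reference_columns)
```

With reference ["AC-GT", "ACAGT"] and test ["ACG-T", "ACAGT"], three of the five reference columns are reproduced, giving a score of 0.6.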
Problems with benchmarking databases
DBs include questionable alignments
DBs have biased coverage of fold families and kinds of proteins
Test results may be biased if similar methods used to construct
DB and in MSA tool (e.g., pairwise alignment method)
Focus on core blocks over-estimates accuracy because these
regions are more easily aligned
Including gaps is problematic: Gap position is not considered,
and a misplaced gap can improve the accuracy score.
Amount of sequence divergence in DB alignments: twilight zone
(20-35% identity) vs. higher or lower
Sum-of-pairs vs. column scores
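For comparison with the column score, here is a sketch of a sum-of-pairs score: the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. The input layout (lists of equal-length gapped strings, same sequence order) and the function names are assumptions for illustration.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Set of residue pairs that share a column; a residue is (row, index in sequence)."""
    counters = [0] * len(alignment)
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for row_index, row in enumerate(alignment):
            if row[col] != "-":
                present.append((row_index, counters[row_index]))
                counters[row_index] += 1
        # Every two residues in the same column form an aligned pair.
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs that are also aligned in the test."""
    reference_pairs = aligned_pairs(reference)
    return len(reference_pairs & aligned_pairs(test)) / len(reference_pairs)
```

Because it credits partially correct columns, this score is generally more forgiving than the column score: for reference ["AC-GT", "ACAGT"] and test ["ACG-T", "ACAGT"] it gives 0.75.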
Problems with benchmarking databases, cont’d.
How representative is the benchmark?
• Method may behave well on benchmark, not in real world
• Method may behave well in real world, not on benchmark
Conclusion of Edgar (2010): “protein alignment assessment is more
challenging than generally realized, and skepticism is appropriate
for claims that method rankings or advances can be reliably
measured by current benchmarks.”
BAliBASE benchmark
Prefab benchmark
HomFam benchmark
Scalability of running time
Additional references
Notredame et al. (2000), “T-Coffee: A novel method for fast and
accurate multiple sequence alignment,” J Mol Biol 302:205.
[Introduced notion of consistency]
Blackshields et al. (2010), “Sequence embedding for fast
construction of guide trees for multiple sequence alignment,”
Algorithms for Mol Biol 5:21. [mBed algorithm]
Söding (2005), “Protein homology detection by HMM-HMM
comparison,” Bioinformatics 21:951. [HHalign algorithm]
Thompson et al. (2005), “BAliBASE 3.0: Latest developments of
the multiple sequence alignment benchmark,” Proteins 61:127.
Additional references, cont’d.
Mizuguchi et al. (1998), “HOMSTRAD: A database of protein
structure alignments for homologous families,” Protein Sci
7:2469.
Edgar (2004), “MUSCLE: Multiple sequence alignment with high
accuracy and high throughput,” Nucleic Acids Res 32:1792.
[Introduced PREFAB benchmarking DB]
Edgar (2010), “Quality measures for protein alignment
benchmarks,” Nucleic Acids Res 38:2145.
Aniba et al. (2010), “Issues in bioinformatics benchmarking: The
case study of multiple sequence alignment,” Nucleic Acids Res
38:7353.