7. Sequence Clustering - Microbial Genome Program

Advancing Science with DNA Sequence
Sequence Clustering
Reducing Search Space in Protein and
DNA/RNA Sequence Analysis
Denis Kaznadzey, Prokaryotic Super Program
MGM Workshop
January 30, 2011
Sequence Clustering
Classification of Sequences
General Problem of Clustering
Distance Measures
Ab Initio Clustering and Pledging
Performance Considerations
Case Study: Transcriptomics
Introduction to Protein Clustering
Classification as Research Tool
To deal with a huge variety of individual objects:
Classify into groups of essentially similar objects
When new data arrives, assign objects to
existing groups
Classify ‘leftovers’
Occasionally review the entire classification
Problem: What is ‘essentially similar’?
• Finding properties that are important
(ontological relevancy)
• Does classification reflect reality in any way?
Ways to classify objects:
Spectral methods
Parametric decomposition
Sequence Data Abundance
In modern biology:
The most abundant type of data is sequence:
• Environmental samples (16S rDNA)
• RNA (cDNA libraries; RNA-Seq)
• Derived proteins
How to compare sequences?
- Criteria depend on application, e.g. GC
content vs. order of bases.
Sequence Clustering
Select Applications in Genomic Sciences:
Genome Assembly: Binning, Scaffolding
Transcriptomics: RNAseq (read)
Protein Function and Evolution studies:
Protein families
Phylogenetic profiling: OTUs
Clustering is Crucial for
Thousands of samples
Hundreds of millions reads per sample
Trillions of base pairs
Billions of genes
impossible to observe/analyze individually
Clustering becomes a strict requirement:
- Find what classes of sequences are seen
- Analyze classes rather than individual sequences
MetaGenomics Analysis Tasks
Primary tasks:
Based on sequence comparison.
Assess diversity
Find genes
Predict functions
Predict pathways
Clustering in General
Any clustering is based on distance in some metric
Initial clustering is based on pair-wise distances
Subsequent classification is based on distances from objects to clusters: Pledging
Similarity Metrics
What is “similar”:
A similarity measure should reflect “reality”
This “reality” depends on the application:
- Assembly: find identical substrings
- Orthology detection: identify homologous proteins across species (measure: matrix based)
- Functional prediction: identify proteins with similar evolutionarily conserved motifs (measure: match to HMM or PSSM)
Similarity Measure
Computing similarity measure:
Edit distance or (ungapped) statistics P-value: BLAST, FASTA, needle, water, etc.
Adjusted edit distance through progressive alignment: Clustal, MUSCLE, T-Coffee
K-mer statistics: CD-HIT, USEARCH, MUSCLE
Suffix trees (and probabilistic suffix trees): MUMmer, REPuter, CLUSEQ
Suffix arrays: Bowtie, BWT
Position-specific scoring matrix: PSI-BLAST, IMPALA
Hidden Markov Models: HMMER, HHsearch/HHpred, SAM
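As a rough illustration of the k-mer statistics idea in the list above, the sketch below scores two sequences by the Jaccard index of their k-mer sets. This is a generic alignment-free measure, not the scoring actually used by CD-HIT, USEARCH, or MUSCLE; the function names and the default k=3 are assumptions made for the example.

```python
def kmer_set(seq, k):
    """Return the set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(a, b, k=3):
    """Jaccard index of the k-mer sets of two sequences (0.0 to 1.0)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka and not kb:
        return 0.0  # both sequences shorter than k: no evidence either way
    return len(ka & kb) / len(ka | kb)


if __name__ == "__main__":
    print(kmer_jaccard("ACGTACGTAC", "ACGTACGTTC"))
    print(kmer_jaccard("ACGTACGTAC", "TTTTTTTTTT"))
```

No alignment is computed, which is why measures of this kind scale to large data sets; the price is lower sensitivity than alignment-based scores.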
Assembling Clusters
There is a HUGE variety of clustering methods (clustering /
classification is a very elaborate methodology):
Average linkage
Complete linkage
Single linkage
Fitting function-based
Non-linear classifiers (SOM, etc.)
Greedy methods (iterative, suboptimal)
Linkage-Based Clustering
Average linkage
Complete linkage
Single linkage
Hierarchical Clustering
Build a tree representation of the data
Cut the branches using some quantitative criteria
Building the Tree
Criterion: more similar sequences appear on closer branches
This goal is not achievable for practical distance measures
- Approximation methods: neighbor joining, UPGMA
- Search for the optimal tree by explicit criteria (maximum parsimony, maximum likelihood, etc.)
Suboptimal Tree Building
Neighbor joining (corresponds to single-linkage clustering):
Order edges by distance
Join in order from short to long, merging branches as needed
Unweighted Pair Group Method with Arithmetic Mean (UPGMA) (corresponds to average-linkage clustering):
For every pair of clusters (A, B), starting with all singletons:
- Compute the average of the distances between every object in A and every object in B
- Merge the two clusters with the smallest average distance
- Repeat until a single cluster remains
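The average-linkage steps above can be sketched as follows. This is a deliberately naive version that recomputes all averages on every round and returns flat clusters instead of a tree; the stopping threshold `max_dist` and the function name are assumptions added for the example, not part of the slide.

```python
from itertools import combinations


def average_linkage(items, dist, max_dist):
    """Naive UPGMA-style agglomeration.

    items: list of objects; dist(a, b): symmetric distance function;
    max_dist: stop merging once the closest pair of clusters is farther apart.
    """
    clusters = [[x] for x in items]  # start with all singletons

    def avg(a, b):
        # Average of the distances between every object in a and every object in b.
        return sum(dist(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: avg(clusters[p[0]], clusters[p[1]]))
        if avg(clusters[i], clusters[j]) > max_dist:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters


if __name__ == "__main__":
    print(average_linkage([1, 2, 3, 10, 11, 12], lambda a, b: abs(a - b), 3))
```

Stopping at `max_dist` plays the role of cutting the tree at a fixed height.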
Global Fitting-Function Based
K-means clustering
Pre-define the number of clusters
Find an assignment of objects to clusters so that the sum of distances to the cluster means is minimal
Computationally hard
Heuristics are used, possibly application-specific ones
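One widely used heuristic for this problem is Lloyd's algorithm, sketched below for points given as tuples of numbers. It minimizes the sum of squared Euclidean distances to the means, which is the standard k-means formulation, and it only finds a local optimum. Taking the first k points as initial means is a simplifying assumption made here to keep the example deterministic.

```python
def kmeans(points, k, iterations=100):
    """Lloyd's heuristic for k-means on points given as tuples of floats."""
    means = [tuple(p) for p in points[:k]]  # naive init: first k points (assumption)
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest mean.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, means[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved: converged to a local optimum
        assignment = new_assignment
        # Update step: recompute each mean; keep the old one if a cluster is empty.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                means[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, means


if __name__ == "__main__":
    print(kmeans([(0.0,), (10.0,), (1.0,), (11.0,)], 2))
```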
Non-Linear Methods
Self-Organizing Maps:
“self-learning” method
A neural network trained using
unsupervised learning to produce
a low-dimensional, discretized
representation of the input space
Pledging
Based on distance to cluster:
- Set of representatives (all members, at the extreme)
- Other measure, which may be unrelated to the initial one (profile, model)
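A minimal sketch of pledging against a set of representatives, assuming each cluster keeps a non-empty list of representative objects and that `dist` is any distance function. The names `pledge`, `representatives`, and `max_dist` are hypothetical; objects that match no cluster are returned as `None` so they can be clustered later as 'leftovers'.

```python
def pledge(obj, representatives, dist, max_dist):
    """Return the id of the closest cluster, or None if nothing is close enough.

    representatives: dict mapping cluster id -> non-empty list of representatives.
    """
    best_id, best_d = None, max_dist
    for cluster_id, reps in representatives.items():
        d = min(dist(obj, r) for r in reps)  # distance to the nearest representative
        if d <= best_d:
            best_id, best_d = cluster_id, d
    return best_id


if __name__ == "__main__":
    reps = {"A": [1, 2], "B": [10]}
    print(pledge(3, reps, lambda a, b: abs(a - b), 2))
    print(pledge(6, reps, lambda a, b: abs(a - b), 2))
```

Pledging costs one comparison per representative instead of one per stored object, which is where the reduction in search space comes from.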
Performance Considerations
Distance computing is harder than clustering
For large data sets, only k-mer and suffix array measures are practical
However, incremental/greedy approaches can be used to avoid computing the entire distance matrix. This makes the use of sensitive similarity measures possible.
[Figure: 33 objects, 528 pairs; 4 groups, 127 pairs]
For boolean distance, iterative similarity
detection is possible (no off-the-shelf
Binning: pre-clustering by rough and fast
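The pair counts quoted on this slide follow from the number of unordered pairs among n objects, n(n-1)/2. The short check below reproduces 528 for 33 objects. The slide does not give the sizes of the 4 groups, so the bin sizes used here are a hypothetical split that happens to yield 127 within-group pairs.

```python
def n_pairs(n):
    """Number of unordered pairs among n objects: n * (n - 1) / 2."""
    return n * (n - 1) // 2


all_pairs = n_pairs(33)

# Hypothetical bin sizes summing to 33; the slide does not state them.
bin_sizes = [5, 9, 9, 10]
within_bin_pairs = sum(n_pairs(s) for s in bin_sizes)

print(all_pairs, within_bin_pairs)
```

With these assumed sizes, comparing only within bins needs 127 distance computations instead of 528.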
Single Linkage is Fast
Time- and space-efficient clustering method: transitive closure-based
• Requires ‘boolean’ distances (two sequences can be linked or not linked)
• Requires the number of nodes to be known
• Space ~ NodesNo
• Run-time (worst) ~ EdgesNo * AveClustSize
• Run-time (average) ~ EdgesNo * log2(AveClustSize)
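One common way to compute a transitive closure over boolean links is a union-find (disjoint-set) structure, sketched below. It is shown only as an illustration of the idea; the run-time estimates above refer to the presenter's method, which may differ. Nodes are assumed to be numbered 0 to n_nodes-1, matching the requirement that the number of nodes is known in advance.

```python
def single_linkage(n_nodes, edges):
    """Cluster nodes 0..n_nodes-1 by transitive closure over boolean links."""
    parent = list(range(n_nodes))  # one slot per node: space ~ number of nodes

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in edges:  # each link is processed once; no distance matrix is stored
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    clusters = {}
    for node in range(n_nodes):
        clusters.setdefault(find(node), []).append(node)
    return list(clusters.values())


if __name__ == "__main__":
    print(single_linkage(6, [(0, 1), (1, 2), (4, 5)]))
```

Links can be streamed from a similarity search one at a time, so memory grows with the number of sequences rather than with the number of pairs.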
Single Linkage is Prone to Aggregation
Single-linkage clustering killer:
In large clusters, even a small number of random links leads to huge conglomerates.
Case Study: RNA-Seq Pipeline
Reads/EST clusters
1. Compute transcript
Reads / clones attributed to particular source/condition
2. Compute expression profiles (“virtual”)
Source/condition-specific expression
Counting reads originating from different sources
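The counting step can be sketched as below, assuming that every read already has a cluster assignment and a source/condition label. The two input dictionaries and the function name are hypothetical; a real pipeline would also normalize the counts, for example by library size.

```python
from collections import Counter, defaultdict


def expression_profiles(read_cluster, read_source):
    """Count reads per source/condition within each cluster.

    read_cluster: dict read id -> cluster id
    read_source:  dict read id -> source/condition label
    """
    profiles = defaultdict(Counter)
    for read, cluster in read_cluster.items():
        profiles[cluster][read_source[read]] += 1
    return {cluster: dict(counts) for cluster, counts in profiles.items()}


if __name__ == "__main__":
    clusters = {"r1": "c1", "r2": "c1", "r3": "c1", "r4": "c2"}
    sources = {"r1": "leaf", "r2": "leaf", "r3": "root", "r4": "root"}
    print(expression_profiles(clusters, sources))
```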
RNAseq Analysis Solutions
Source: bioinfo.org, Macquarie University, Sydney
RNAseq Clustering
Approach Outline:
1. Detect identities (common segments)
- Compute similarities
- Select the “good” ones
2. Merge sequences into groups with shared segments: SINGLE LINKAGE
Result: the biggest cluster contains more than 60% of all sequences (selection by better similarity does not help)
What causes aggregation and how to fight it?
Aggregation in RNA-Seq Clustering
“Bad” identities:
- Pieces of vector constructs / adaptors
- Repeats
- Redundant sequences
- Spurious matches (short infrequent repeats)
- Chimeras (if pre-amplification is used)
Similarities Selection
Computing ‘boolean’ distances:
• Threshold-based
• Additional rules (match arrangement)
% identity + length + arrangement
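A minimal sketch of turning match properties into a boolean link, assuming each pairwise match is summarized by its identity, its aligned length, and a single number describing its arrangement. The `Match` fields, the use of an overhang count as the arrangement rule, and all threshold values are hypothetical choices for illustration; the slide does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Match:
    identity: float  # fraction of identical positions in the aligned region
    length: int      # length of the aligned region
    overhang: int    # unaligned bases inside the expected overlap (arrangement)


def is_linked(m, min_identity=0.95, min_length=40, max_overhang=10):
    """Boolean 'distance': True only if the match passes all three rules."""
    return (m.identity >= min_identity
            and m.length >= min_length
            and m.overhang <= max_overhang)


if __name__ == "__main__":
    print(is_linked(Match(identity=0.98, length=120, overhang=2)))
    print(is_linked(Match(identity=0.98, length=25, overhang=2)))
```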
Trimming / Masking
Fighting aggregation
- Vector / adapter trimming:
Lucy, Figaro, etc. – integrated in many assembly suites
(newbler, velvet, AMOS, CLCbio, etc.)
- Low complexity detection / masking:
SEG, DUST, FastQC, WindowMasker etc. – often integrated
in search tools
Repeat Elimination
Regular (tandem) repeats:
• Pre-search masking: Based on
structure (IMEx, SRF) or on
database (TRDB)
• Post-search detection based on
similarity properties (multiple
parallel threads)
Repeat Elimination
Irregular (long) repeats:
• Database based:
• De-novo:
PILER, etc.
(Require genome as input,
construct database)
Detecting Chimeras
Detecting chimeric sequences:
• Abundance-based: Perseus, UCHIME
Chimeras undergo fewer amplification cycles, so chimera segments in their native arrangements are more frequent
• Specific to 16S: ChimeraSlayer, Bellerophon
Chimera ‘arms’ are closer to their originating clades than the entire chimera
Detecting Chimeras
Similarity coverage-based: Mira assembler
Detecting Chimeras
Similarity graph topology-based: dchim
Alignment view
Connectivity view
Protein Clustering
Direct similarity measure by edit distance is not sensitive enough for evolutionarily distant species
Position-specific scoring matrices and profile HMMs provide better sensitivity, but are SLOW
Similar problems as for RNA-Seq/EST clustering, but their causes are harder to fight
No ‘one size fits all’ solution: manual tuning and curation are required for comprehensive results, especially at a large scale
The results of clustering are precious; they are kept as databases (Pfam, COGs, KOGs, eggNOG)
Protein Clustering at JGI
Functional annotation of metagenome genes
through protein clusters (IMG):
Build a set of functionally homogeneous clusters of similar proteins for annotated genomes
Build an HMM for each cluster, compose a model database
Pledge metagenome proteins to clusters by matching to models
Cluster unpledged proteins, build models, update the model database
Protein Clustering
Use of Protein Clusters reduces search space, but adds
another level of indirection, which is a source of errors,
and adds complexity that consumes effort
However, for proteins, which form dense relationship
networks, clustering is a great tool
Konstantinos Mavrommatis will elaborate on protein
clustering techniques
Thank you!