7. Sequence Clustering - Microbial Genome Program

Advancing Science with DNA Sequence
Sequence Clustering
Reducing Search Space in Protein and DNA/RNA Sequence Analysis
Denis Kaznadzey, Prokaryotic Super Program
MGM Workshop
January 30, 2011
Sequence Clustering
Outline
Classification of Sequences
General Problem of Clustering
Distance Measures
Ab Initio Clustering and Pledging
Performance Considerations
Case Study: Transcriptomics
Introduction to Protein Clustering
Classification as Research Tool
To deal with a huge variety of individual objects:
- Classify into groups of essentially similar objects
- When new data arrives, assign objects to existing groups
- Classify ‘leftovers’
- Occasionally review the entire classification
Problem: What is ‘essentially similar’?
• Finding properties that are important (ontological relevancy)
• Does classification reflect reality in any way?
Classification
Ways to classify objects:
- Spectral methods
- Parametric decomposition
- Clustering
Sequence Data Abundance
In modern biology, the most abundant type of data is sequence:
• DNA
  • Genomic
  • Meta-Genomic
  • Environmental Samples (16S rDNA)
• RNA (cDNA libraries; RNA-Seq)
• Derived Proteins
How to compare sequences?
- Criteria depend on application, e.g. GC content vs. order of bases.
Sequence Clustering
Select Applications in Genomic Sciences:
- Genome Assembly: Binning, Scaffolding
- Transcriptomics: RNAseq (read) clustering
- Protein Function and Evolution studies: Protein families
- Phylogenetic profiling: OTUs
Clustering is Crucial for MetaGenomics
Metagenomics:
• Thousands of samples
• Hundreds of millions of reads per sample
• Trillions of base pairs
• Billions of genes
Impossible to observe/analyze individually
Clustering becomes a strict requirement:
- Find what classes of sequences are seen
- Analyze classes rather than individual sequences
MetaGenomics Analysis Tasks
Primary tasks (based on sequence comparison):
• Assess diversity
• Find genes
• Predict functions
• Predict pathways
• Estimate capabilities
Clustering in General
- Any clustering is based on the distance in some metric
- Initial clustering is based on pair-wise distances
- Subsequent classification is based on distances from objects to clusters: Pledging
Similarity Metrics
What is “similar”:
• Similarity measure should better reflect “reality”
• This “reality” depends on the application:
  • Assembly: find identical substrings. Measure is: identity percentage
  • Orthology detection: identify homologous proteins across the species. Measure is: substitution matrix based
  • Functional prediction: identify proteins with similar evolutionarily conserved motifs. Measure is: match to HMM or PSSM
Similarity Measure
Computing similarity measure:
- Edit distance or (ungapped) statistics P-value: BLAST, Fasta, needle, water, etc.
- Adjusted edit distance through progressive alignment: Clustal, MUSCLE, T-coffee
- K-mer statistics: CD-HIT, USEARCH, MUSCLE
- Suffix trees (and probabilistic suffix trees): MUMmer, Reputer, CLUSEQ
- Suffix arrays: Bowtie, BWT
- Position-specific scoring matrix: PSI-Blast, Impala
- Hidden Markov Models: HMMer, HHSearch/HHPred, SAM
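To make two of the measures above concrete, here is a minimal Python sketch of a plain edit distance and a k-mer based distance. It is not the implementation used by any of the tools named above; the function names and the default k = 3 are assumptions chosen for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def kmer_set(seq: str, k: int = 3) -> set:
    """All distinct substrings of length k (hypothetical helper)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard_distance(a: str, b: str, k: int = 3) -> float:
    """1 - |shared k-mers| / |all k-mers|; 0.0 means identical k-mer content."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka and not kb:
        return 0.0
    return 1.0 - len(ka & kb) / len(ka | kb)
```

The edit distance needs time proportional to the product of the two lengths, while the k-mer distance needs only set operations, which is one reason k-mer measures scale to large data sets.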
Assembling Clusters
There is a HUGE variety of clustering methods (clustering / classification is a very elaborate methodology):
• Linkage-based
  • Average linkage
  • Complete linkage
  • Single linkage
• Hierarchy-based
• Fitting function-based (K-means)
• Non-linear classifiers (SOM, etc.)
• Greedy methods (iterative, suboptimal)
Linkage-Based Clustering
Average linkage
Complete linkage
Single linkage
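The three linkage rules differ only in how the pairwise distances between two clusters are summarized. A minimal Python sketch, assuming a caller-supplied `dist` function (the name `linkage_distance` is hypothetical):

```python
from itertools import product


def linkage_distance(cluster_a, cluster_b, dist, method="average"):
    """Distance between two clusters under a given linkage rule."""
    pair_distances = [dist(x, y) for x, y in product(cluster_a, cluster_b)]
    if method == "single":      # closest pair decides
        return min(pair_distances)
    if method == "complete":    # farthest pair decides
        return max(pair_distances)
    if method == "average":     # mean over all cross-cluster pairs
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage method: {method}")


# Toy example: points on a line, distance = absolute difference.
a, b = [0, 1], [4, 6]
line_dist = lambda x, y: abs(x - y)
```

For the toy clusters above the cross-cluster distances are 4, 6, 3 and 5, so single linkage gives 3, complete linkage 6 and average linkage 4.5.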
Hierarchical Clustering
- Build a tree representation of relationships
- Cut the branches using some quantitative criteria
Building the Tree
Criteria: More similar sequences appear at closer branches
This goal is not achievable for practical distance measures
[Figure: four sequences A, B, C, D with pairwise distances, and alternative tree topologies built from them]
Solutions:
- Approximation methods: neighbor joining, UPGMA
- Search for the optimal tree by explicit criteria (maximum parsimony, maximum likelihood, etc.)
Suboptimal Tree Building
Neighbor joining (corresponds to single-linkage clustering):
- Order edges by distance
- Join in order from short to long, merging branches as needed
Unweighted Pair Group Method with Arithmetic Mean (UPGMA) (corresponds to average-linkage clustering):
For every pair of clusters (A, B), starting with all singletons:
- Compute average of distances between every object in A and every object in B
- Merge the pair of clusters with the smallest average distance
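The UPGMA steps above can be sketched in a few lines of Python. This is a naive illustration that recomputes all averages on every round, not an optimized implementation; the function name, the `frozenset`-keyed distance table and the example distances are assumptions.

```python
from itertools import combinations


def upgma(labels, dist):
    """Return the merges as (cluster_a, cluster_b, average_distance) tuples.

    dist maps frozenset({x, y}) to the distance between labels x and y.
    """
    def average(a, b):
        total = sum(dist[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    clusters = [(label,) for label in labels]  # start with all singletons
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        merges.append((a, b, average(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical distances for four sequences.
example = {frozenset(p): d for p, d in {
    ("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
    ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}.items()}
merges = upgma(["A", "B", "C", "D"], example)
```

With these distances A and B merge first, C joins them next, and D joins last; the order of the merges defines the tree.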
Global Fitting-Function Based
K-means clustering
- Pre-define the number of clusters
- Find a distribution so that the sum of distances to the means is minimal
- Computationally hard
- Heuristics are used; application-specific heuristics may be efficient
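One widely used heuristic is Lloyd's iteration, sketched below in Python. It alternates between assigning points to the nearest mean and recomputing the means, and it minimizes the sum of squared distances, which is the usual K-means objective. The initialization with the first k points is an assumption made to keep the example deterministic, and all names are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(group):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(group) for coords in zip(*group))


def kmeans(points, k, max_iterations=100):
    centers = list(points[:k])  # naive initialization (assumption)
    groups = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: squared_distance(p, centers[c]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        new_centers = [centroid(g) if g else centers[i]
                       for i, g in enumerate(groups)]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, groups


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, groups = kmeans(points, k=2)
```

The result depends on the starting centers, so practical implementations usually restart from several initializations and keep the best outcome.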
Non-Linear Methods
Self-Organizing Maps:
“self-learning” method
A neural network trained using
unsupervised learning to produce
a low-dimensional, discretized
representation of the input space
Pledging
Based on distance to cluster:
- Representative
- Set of representatives (all at extreme)
- Other measure, may be unrelated to the initial one (profile, model)
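The simplest case, pledging by distance to a single representative per cluster, might look like the following Python sketch. The mismatch-fraction distance, the cutoff value and all names are assumptions for illustration; a profile or model score could be substituted for `dist`.

```python
def mismatch_fraction(a: str, b: str) -> float:
    """Fraction of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pledge(obj, representatives, dist, max_distance):
    """Return the id of the nearest cluster, or None if nothing is close enough.

    representatives maps cluster id -> representative object (non-empty).
    """
    best_id, best_rep = min(representatives.items(),
                            key=lambda item: dist(obj, item[1]))
    return best_id if dist(obj, best_rep) <= max_distance else None


representatives = {"c1": "ACGTACGT", "c2": "TTTTGGGG"}
```

Objects for which `pledge` returns `None` are the 'leftovers' that have to be clustered among themselves later.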
Performance Considerations
Distance computing is harder than clustering
(Bundled solutions: BLASTCLUST, CD-HIT, UCLUST, CLUSEQ)
• For large data sets only k-mer and suffix array measures are practical
• However: incremental / greedy approaches can be used to avoid computing the entire distance matrix. This makes the use of sensitive similarity measures possible.
• For boolean distance, iterative similarity detection is possible (no off-the-shelf implementations)
• Binning: pre-clustering by rough and fast methods
[Figure: 33 objects, 528 pairs; 4 groups, 127 pairs]
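The pair counts follow from simple combinatorics: 33 objects give 33 × 32 / 2 = 528 pairs, while comparing only within bins needs far fewer. The Python sketch below reproduces the arithmetic and shows a greedy incremental clustering that compares each new item to cluster representatives only. The bin sizes (5, 9, 9, 10) are hypothetical, chosen because they add up to 33 objects and 127 pairs; the source gives only the totals. The greedy scheme is in the spirit of the bundled tools, not a description of how any of them works.

```python
from math import comb


def pairs_all(n: int) -> int:
    """Number of pairwise comparisons for a full distance matrix."""
    return comb(n, 2)


def pairs_within_bins(bin_sizes) -> int:
    """Comparisons needed when only objects in the same bin are compared."""
    return sum(comb(size, 2) for size in bin_sizes)


def greedy_cluster(items, dist, threshold):
    """Assign each item to the first cluster whose representative is close
    enough; otherwise the item founds a new cluster and represents it."""
    clusters = []  # list of (representative, members)
    for item in items:
        for representative, members in clusters:
            if dist(item, representative) <= threshold:
                members.append(item)
                break
        else:
            clusters.append((item, [item]))
    return clusters


toy = greedy_cluster([1, 2, 10, 11, 3], lambda a, b: abs(a - b), threshold=2)
```

The greedy result depends on input order, which is why such tools typically process sequences in a fixed order, for example longest first.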
Single Linkage is Fast
Time- and space-efficient clustering method: transitive closure-based
• Requires ‘boolean’ distances (two sequences can be linked or not linked)
• Requires the number of nodes to be known
• Space ~ NodesNo
• Run-time (worst) ~ EdgesNo * AveClustSize
• Run-time (average) ~ EdgesNo * log2(AveClustSize)
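One common way to compute the transitive closure of boolean links is a disjoint-set (union-find) structure, sketched below in Python. This is a standard technique swapped in for illustration; the source does not say which data structure was used, and the function name is hypothetical. Memory is proportional to the number of nodes, and each link is processed once.

```python
def single_linkage(num_nodes, edges):
    """Group nodes 0..num_nodes-1 into clusters connected by the given links."""
    parent = list(range(num_nodes))
    size = [1] * num_nodes

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # already in the same cluster
        if size[root_a] < size[root_b]:
            root_a, root_b = root_b, root_a
        parent[root_b] = root_a  # attach the smaller tree to the larger one
        size[root_a] += size[root_b]

    clusters = {}
    for node in range(num_nodes):
        clusters.setdefault(find(node), []).append(node)
    return sorted(clusters.values())
```

For example, `single_linkage(6, [(0, 1), (1, 2), (3, 4)])` groups nodes 0, 1 and 2 together, puts 3 and 4 in a second cluster, and leaves node 5 as a singleton.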
Single Linkage is Prone to Aggregation
Single-linkage clustering killer: CLUSTER AGGREGATION
In large clusters, even a small number of random links leads to huge conglomerates.
Case Study: RNA-Seq Pipeline
Goals:
1. Compute transcript structures: from the reads/EST database to reads/EST clusters
2. Compute expression profiles (“virtual”): from reads / clones attributed to a particular source/condition to source / condition specific expression profiles, by counting reads originating from different sources
RNAseq Analysis Solutions
Source: bioinfo.org, Macquarie University, Sydney
RNAseq Clustering
Approach Outline:
1. Detect identities (common segments):
   • Compute similarities
   • Select the “good” ones
2. Merge sequences into groups with shared segments: SINGLE LINKAGE
Outcome:
One biggest cluster contains more than 60% of all sequences (selection by better similarity does not help)
What causes aggregation and how to fight it?
Aggregation in RNA-Seq Clustering
“Bad” identities:
- Pieces of vector constructs / adaptors
- Repeats
- Redundant sequences
- Spurious matches (short infrequent repeats)
- Chimeras (if pre-amplification is used)
Similarities Selection
Computing ‘boolean’ distances:
• Threshold-based
• Additional rules (match arrangement)
Criteria combined: % identity + length + arrangement
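A boolean link decision that combines the three criteria could be sketched in Python as follows. The thresholds are hypothetical defaults, and the arrangement rule used here (the match must extend close to a sequence end on both sides, as in a true overlap or containment, rather than sit in the middle of both sequences) is an assumption about what 'arrangement' means.

```python
from dataclasses import dataclass


@dataclass
class Match:
    identity: float  # percent identity over the aligned region (0-100)
    q_start: int     # 0-based start of the match on the query
    q_end: int       # exclusive end of the match on the query
    q_len: int       # full query length
    s_start: int     # same coordinates on the subject sequence
    s_end: int
    s_len: int


def is_linked(m: Match, min_identity=95.0, min_length=50, max_overhang=10):
    """True if the match passes identity, length and arrangement checks."""
    aligned = min(m.q_end - m.q_start, m.s_end - m.s_start)
    if m.identity < min_identity or aligned < min_length:
        return False
    # Unaligned sequence on each side; small if at least one sequence ends there.
    left_overhang = min(m.q_start, m.s_start)
    right_overhang = min(m.q_len - m.q_end, m.s_len - m.s_end)
    return left_overhang <= max_overhang and right_overhang <= max_overhang
```

A match buried in the interior of both sequences fails the arrangement check even at high identity, which is the typical signature of a shared repeat and not of a common transcript.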
Trimming / Masking
Fighting aggregation
- Vector / adapter trimming:
  - Lucy, Figaro, etc. (integrated in many assembly suites: newbler, velvet, AMOS, CLCbio, etc.)
- Low complexity detection / masking:
  - SEG, DUST, FastQC, WindowMasker, etc. (often integrated in search tools)
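As a rough illustration of adapter trimming, the Python sketch below removes a known adapter from the 3' end of a read using exact matching only. It does not reproduce the algorithms of Lucy, Figaro or any other named tool, which tolerate sequencing errors; the function name, the adapter string used in the example and the `min_overlap` default are assumptions.

```python
def trim_3prime_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Cut the read at a full adapter copy, or at a partial copy at its end."""
    # Case 1: the whole adapter occurs inside the read.
    position = read.find(adapter)
    if position != -1:
        return read[:position]
    # Case 2: the read ends with a prefix of the adapter (longest first).
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read  # no adapter found


adapter = "AGATCGGAAGAGC"  # example adapter sequence (assumption)
```

The `min_overlap` guard matters: without it, any read that happens to end with the first base or two of the adapter would be shortened by chance.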
Repeat Elimination
Regular (tandem) repeats:
• Pre-search masking: based on structure (IMEx, SRF) or on database (TRDB)
• Post-search detection based on similarity properties (multiple parallel threads)
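Structure-based detection of short tandem repeats can be illustrated with a simple periodicity scan: a region repeats with period p wherever each base equals the base p positions later. The Python sketch below is a toy version of that idea, not the method of IMEx or SRF; the function name and the parameter defaults are assumptions.

```python
def find_tandem_repeats(seq: str, max_period: int = 6, min_copies: int = 3):
    """Return (start, end, period) for exact tandem repeats; end is exclusive.

    A repeat with period p may also be reported for multiples of p
    when it is long enough.
    """
    hits = []
    for period in range(1, max_period + 1):
        run_start = None
        # The final index never matches, so an open run is always closed.
        for i in range(len(seq) - period + 1):
            same = i < len(seq) - period and seq[i] == seq[i + period]
            if same and run_start is None:
                run_start = i
            elif not same and run_start is not None:
                end = i + period
                if end - run_start >= min_copies * period:
                    hits.append((run_start, end, period))
                run_start = None
    return hits


hits = find_tandem_repeats("GGACACACACTT")
```

In the example the scan reports the region `ACACACAC` (positions 2 to 10) with period 2; such regions would be masked before the similarity search.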
Repeat Elimination
Irregular (long) repeats:
• Database based: RepeatMasker
• De-novo (require genome as input, construct database):
  • RepeatScout
  • orrb
  • PILER, etc.
Detecting Chimeras
Detecting chimeric sequences:
• Abundance-based: Perseus, UCHIME
  • Chimeras undergo fewer amplification cycles, so chimera segments in their native arrangements are more frequent
• Specific to 16S: ChimeraSlayer, Bellerophon
  • Chimera ‘arms’ are closer to originating clades than the entire chimera
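The observation that chimera 'arms' sit closer to their parents than the whole sequence does can be turned into a toy test, sketched below in Python. It assumes pre-aligned sequences of equal length, a single breakpoint at the midpoint, and a hypothetical `min_gain` cutoff; ChimeraSlayer and Bellerophon are considerably more elaborate, and this sketch does not describe either of them.

```python
def mismatches(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def looks_chimeric(query: str, references: dict, min_gain: int = 2) -> bool:
    """Flag the query if its two halves, matched to references separately,
    fit much better than the whole query fits any single reference."""
    mid = len(query) // 2
    whole_best = min(mismatches(query, ref) for ref in references.values())
    left_best = min(mismatches(query[:mid], ref[:mid])
                    for ref in references.values())
    right_best = min(mismatches(query[mid:], ref[mid:])
                     for ref in references.values())
    return whole_best - (left_best + right_best) >= min_gain


references = {"parent_a": "AAAAAAAA", "parent_b": "CCCCCCCC"}
```

A real detector would scan candidate breakpoints along the sequence and restrict the parents to plausible candidates, but the scoring idea is the same.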
Detecting Chimeras
• Similarity coverage-based: Mira assembler
Detecting Chimeras
• Similarity graph topology-based: dchim
[Figure panels: Alignment view; Connectivity view]
Protein Clustering
Direct similarity measure by edit distance is not sensitive enough for evolutionarily distant species
Position-specific scoring matrices and profile-HMMs provide better sensitivity, but are SLOW
Similar problems as for RNA-Seq/EST clustering, but their causes are harder to fight
No ‘one fits all’ solution: manual tuning and curation required for comprehensive results, especially at a large scale
The results of clustering are precious; they are kept as databases (PFAM, COGs, KOGs, eggNOG)
Protein Clustering at JGI
Functional annotation of metagenome genes through protein clusters (IMG):
- Build a set of functionally homogeneous clusters of similar proteins for annotated genomes
- Build an HMM for each cluster, compose a model database
- Pledge metagenome proteins to clusters by matching to models
- Cluster unpledged proteins, build models, update the model database
Protein Clustering
Use of Protein Clusters reduces search space, but adds
another level of indirection, which is a source of errors,
and adds complexity that consumes effort
However, for proteins, which form dense relationship
networks, clustering is a great tool
Konstantinos Mavrommatis will elaborate on protein
clustering techniques
Thank you!