Advancing Science with DNA Sequence

Sequence Clustering
Reducing Search Space in Protein and DNA/RNA Sequence Analysis
Denis Kaznadzey, Prokaryotic Super Program
MGM Workshop, January 30, 2011

Sequence Clustering Outline
- Classification of Sequences
- General Problem of Clustering
- Distance Measures
- Ab Initio Clustering and Pledging
- Performance Considerations
- Case Study: Transcriptomics
- Introduction to Protein Clustering

Classification as Research Tool
To deal with a huge variety of individual objects:
- Classify into groups of essentially similar objects
- When new data arrives, assign objects to existing groups
- Classify ‘leftovers’
- Occasionally review the entire classification
Problem: What is ‘essentially similar’?
• Finding properties that are important (ontological relevancy)
• Does classification reflect reality in any way?

Classification
Ways to classify objects:
- Spectral methods
- Parametric decomposition
- Clustering

Sequence Data Abundance
In modern biology, the most abundant type of data is sequence:
• DNA
  • Genomic
  • Meta-genomic
  • Environmental samples (16S rDNA)
• RNA (cDNA libraries; RNA-Seq)
• Derived proteins
How to compare sequences?
- Criteria depend on application, e.g. GC content vs. order of bases.
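To make the "GC content vs. order of bases" contrast concrete, here is a minimal Python sketch. The two toy sequences and both helper functions are illustrative assumptions, not part of the original slides. They show that two sequences can have the same composition while sharing few positions, so the choice of criterion decides whether they count as similar.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence (composition only)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def percent_identity(a: str, b: str) -> float:
    """Percentage of matching positions between two equal-length sequences.

    This is an ungapped, position-by-position comparison, so it depends on
    the order of bases and not only on composition.
    """
    if len(a) != len(b):
        raise ValueError("this toy comparison expects equal-length sequences")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


# Hypothetical toy sequences chosen for illustration only.
seq_a = "GGCCAATT"
seq_b = "ATGCATGC"

print(gc_content(seq_a), gc_content(seq_b))  # same composition
print(percent_identity(seq_a, seq_b))        # but few shared positions
```

By composition the two sequences are indistinguishable, while by order of bases they are mostly different.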
Sequence Clustering
Select applications in genomic sciences:
- Genome assembly: binning, scaffolding
- Transcriptomics: RNA-Seq (read) clustering
- Protein function and evolution studies: protein families
- Phylogenetic profiling: OTUs

Clustering is Crucial for MetaGenomics
METAGENOMICS
• Thousands of samples
• Hundreds of millions of reads per sample
• Trillions of base pairs
• Billions of genes
Impossible to observe/analyze individually.
Clustering becomes a strict requirement:
- Find what classes of sequences are seen
- Analyze classes rather than individual sequences

MetaGenomics Analysis Tasks
Primary tasks, all based on sequence comparison:
• Assess diversity
• Find genes
• Predict functions
• Predict pathways
• Estimate capabilities

Clustering in General
- Any clustering is based on the distance in some metric
- Initial clustering is based on pair-wise distances
- Subsequent classification is based on distances from objects to clusters: pledging

Similarity Metrics
What is “similar”:
• The similarity measure should reflect “reality”
• This “reality” depends on the application, and so does the measure:
  • Assembly: find identical substrings. Measure: identity percentage
  • Orthology detection: identify homologous proteins across species. Measure: substitution matrix based
  • Functional prediction: identify proteins with similar evolutionarily conserved motifs. Measure: match to HMM or PSSM

Similarity Measure
Computing the similarity measure:
- Edit distance or (ungapped) statistics P-value: BLAST, FASTA, needle, water, etc.
- Adjusted edit distance through progressive alignment: Clustal, MUSCLE, T-Coffee
- k-mer statistics: CD-HIT, USEARCH, MUSCLE
- Suffix trees (and probabilistic suffix trees): MUMmer, REPuter, CLUSEQ
- Suffix arrays: Bowtie, BWT
- Position-specific scoring matrix: PSI-BLAST, IMPALA
- Hidden Markov models: HMMER, HHsearch/HHpred, SAM

Assembling Clusters
There is a HUGE variety of clustering methods (clustering / classification is a very elaborate methodology):
• Linkage-based
  • Average linkage
  • Complete linkage
  • Single linkage
• Hierarchy-based
• Fitting function-based (k-means)
• Non-linear classifiers (SOM, etc.)
• Greedy methods (iterative, suboptimal)

Linkage-Based Clustering
[Figure panels: Average linkage; Complete linkage; Single linkage]

Hierarchical Clustering
- Build a tree representation of relationships
- Cut the branches using some quantitative criteria

Building the Tree
Criteria: more similar sequences appear at closer branches.
This goal cannot be met exactly for practical distance measures.
[Figure: example pairwise distances among four sequences A, B, C, D and candidate trees]
Solutions:
- Approximation methods: neighbor joining, UPGMA
- Search for the optimal tree by explicit criteria (maximum parsimony, maximum likelihood, etc.)
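As an illustration of the UPGMA approximation named under Solutions, here is a minimal Python sketch. It repeatedly merges the two clusters with the smallest average pairwise distance and records the height of each merge, which is the quantity one would threshold to "cut the branches". The distance values below are hypothetical and are not taken from the slide's figure. The function name `upgma` and its data layout are assumptions made for this example, and the naive recomputation of averages is chosen for readability, not speed.

```python
from itertools import combinations


def upgma(labels, dist):
    """Naive UPGMA (average-linkage) tree building.

    labels: list of object names.
    dist:   dict mapping frozenset({a, b}) -> distance between a and b.
    Returns (tree, merges): tree is a nested tuple, merges is a list of
    (average_distance, sorted_members) in the order clusters were joined.
    """
    # Each cluster is (subtree, list_of_member_labels); start with singletons.
    clusters = [(label, [label]) for label in labels]
    merges = []
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            members_i, members_j = clusters[i][1], clusters[j][1]
            # Average of distances between every object in one cluster
            # and every object in the other.
            total = sum(dist[frozenset((a, b))]
                        for a in members_i for b in members_j)
            avg = total / (len(members_i) * len(members_j))
            if best is None or avg < best[0]:
                best = (avg, i, j)
        avg, i, j = best
        tree = (clusters[i][0], clusters[j][0])
        members = clusters[i][1] + clusters[j][1]
        merges.append((avg, sorted(members)))
        # Replace the two merged clusters with the new one.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((tree, members))
    return clusters[0][0], merges


# Hypothetical distances for four sequences (illustration only).
pairs = {("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
         ("B", "C"): 8, ("B", "D"): 10, ("C", "D"): 4}
dist = {frozenset(k): v for k, v in pairs.items()}

tree, merges = upgma(["A", "B", "C", "D"], dist)
print(tree)
for height, members in merges:
    print(height, members)
```

Cutting this tree at any height between the second and third merge would leave two clusters, {A, B} and {C, D}.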
Suboptimal Tree Building
Neighbor joining, in the simple nearest-neighbor form described here (corresponds to single-linkage clustering):
- Order edges by distance
- Join in order from short to long, merging branches as needed
Unweighted Pair Group Method with Arithmetic Mean (UPGMA; corresponds to average-linkage clustering):
For every pair of clusters (A, B), starting with all singletons:
- Compute the average of distances between every object in A and every object in B
- Merge the pair of clusters with the smallest average distance

Global Fitting-Function Based
k-means clustering
- Pre-define the number of clusters
- Find a partition such that the sum of distances from objects to their cluster means is minimal
- Computationally hard
- Heuristics are used; application-specific heuristics may be efficient

Non-Linear Methods
Self-Organizing Maps: “self-learning” method
A neural network trained using unsupervised learning to produce a low-dimensional, discretized representation of the input space

Pledging
Based on distance to cluster:
- Representative
- Set of representatives (all, at the extreme)
- Other measure, which may be unrelated to the initial one (profile, model)

Performance Considerations
Distance computing is harder than clustering
(Bundled solutions: BLASTCLUST, CD-HIT, UCLUST, CLUSEQ)
• For large data sets, only k-mer and suffix array measures are practical
• However, incremental/greedy approaches can be used to avoid computing the entire distance matrix. This makes the use of sensitive similarity measures possible.
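The incremental/greedy idea can be sketched in a few lines of Python. Each sequence is compared only against the current cluster representatives and is either pledged to the first one that is similar enough or becomes a new representative. This is a minimal sketch under stated assumptions, not the algorithm of any particular tool. The `identity` function is a toy ungapped stand-in for a real similarity measure, the 0.8 threshold is arbitrary, and processing longest sequences first is an assumed ordering.

```python
def identity(a: str, b: str) -> float:
    """Toy similarity: fraction of matching positions over the shorter
    sequence (a hypothetical stand-in for a real alignment score)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0


def greedy_cluster(sequences, threshold=0.8, similarity=identity):
    """Greedy incremental clustering against cluster representatives.

    Returns (clusters, comparisons), where comparisons counts how many
    similarity computations were actually performed.
    """
    representatives = []  # one representative sequence per cluster
    clusters = []         # clusters[i] holds members pledged to representatives[i]
    comparisons = 0
    # Assumed ordering: longest first, so representatives tend to be long.
    for seq in sorted(sequences, key=len, reverse=True):
        for idx, rep in enumerate(representatives):
            comparisons += 1
            if similarity(seq, rep) >= threshold:
                clusters[idx].append(seq)  # pledge to the existing cluster
                break
        else:
            # No representative was similar enough: start a new cluster.
            representatives.append(seq)
            clusters.append([seq])
    return clusters, comparisons


# Hypothetical reads for illustration only.
reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA", "ACGTACGTAC"]
clusters, comparisons = greedy_cluster(reads)
all_pairs = len(reads) * (len(reads) - 1) // 2
print(clusters)
print(comparisons, "comparisons instead of", all_pairs)
```

Because the number of comparisons scales with the number of representatives and not with the number of sequences squared, a slower but more sensitive similarity function can be afforded.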
[Figure: 33 objects, 528 pairs; 4 groups, 127 pairs]
• For boolean distance, iterative similarity detection is possible (no off-the-shelf implementations)
• Binning: pre-clustering by rough and fast methods

Single Linkage is Fast
Time- and space-efficient clustering method: transitive closure-based
• Requires ‘boolean’ distances (two sequences are either linked or not linked)
• Requires the number of nodes to be known
• Space ~ NodesNo
• Run-time (worst) ~ EdgesNo * AveClustSize
• Run-time (average) ~ EdgesNo * log2(AveClustSize)

Single Linkage is Prone to Aggregation
Single-linkage clustering killer: CLUSTER AGGREGATION
In large clusters, even a small number of random links leads to huge conglomerates.

Case Study: RNA-Seq Pipeline
Goals:
1. Compute transcript structures
2. Compute expression profiles (“virtual”)
[Diagram labels: Reads/EST db; Reads/EST clusters; reads/clones attributed to a particular source/condition; source/condition-specific expression profiles (counting reads originating from different sources)]

RNAseq Analysis Solutions
[Figure] Source: bioinfo.org, Macquarie University, Sydney

RNAseq Clustering Approach
Outline:
1. Detect identities (common segments):
   • Compute similarities
   • Select the “good” ones
2. Merge sequences into groups with shared segments: SINGLE LINKAGE
Outcome: the single biggest cluster contains more than 60% of all sequences (selecting by better similarity does not help)
What causes aggregation and how to fight it?
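The transitive-closure flavor of single linkage, and the aggregation problem it suffers from, can be illustrated with a small union-find sketch in Python. This is a generic textbook technique used here as an assumption. The slides do not say how the JGI implementation works, and the node numbering and links below are hypothetical. Memory is proportional to the number of nodes, and links are consumed one at a time without storing a distance matrix.

```python
def single_linkage(node_count, links):
    """Single-linkage clustering over boolean links via union-find.

    node_count: number of sequences (must be known in advance).
    links:      iterable of (a, b) index pairs that are 'linked'.
    Returns a sorted list of clusters, each a sorted list of node indices.
    """
    parent = list(range(node_count))  # space proportional to node count
    size = [1] * node_count

    def find(x):
        # Follow parent pointers to the cluster root, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # already in the same cluster
        if size[root_a] < size[root_b]:
            root_a, root_b = root_b, root_a
        parent[root_b] = root_a  # attach the smaller cluster to the larger
        size[root_a] += size[root_b]

    clusters = {}
    for node in range(node_count):
        clusters.setdefault(find(node), []).append(node)
    return sorted(clusters.values())


# Two genuine groups of reads (hypothetical indices 0-2 and 3-5).
good_links = [(0, 1), (1, 2), (3, 4), (4, 5)]
print(single_linkage(6, good_links))

# One spurious link (e.g. a shared adaptor fragment) fuses them.
print(single_linkage(6, good_links + [(2, 3)]))
```

A single bad link is enough to merge two otherwise unrelated clusters, which is why the quality of the boolean links matters more than the clustering step itself.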
Aggregation in RNA-Seq Clustering
“Bad” identities:
- Pieces of vector constructs / adaptors
- Repeats
- Redundant sequences
- Spurious matches (short infrequent repeats)
- Chimeras (if pre-amplification is used)

Similarities Selection
Computing ‘boolean’ distances:
• Threshold-based
• Additional rules (match arrangement)
% identity + length + arrangement:
[Figure]

Trimming / Masking
Fighting aggregation:
- Vector / adapter trimming:
  - Lucy, Figaro, etc. (integrated in many assembly suites: Newbler, Velvet, AMOS, CLCbio, etc.)
- Low-complexity detection / masking:
  - SEG, DUST, FastQC, WindowMasker, etc. (often integrated in search tools)

Repeat Elimination
Regular (tandem) repeats:
• Pre-search masking: based on structure (IMEx, SRF) or on database (TRDB)
• Post-search detection based on similarity properties (multiple parallel threads)

Repeat Elimination
Irregular (long) repeats:
• Database-based: RepeatMasker
• De novo (require a genome as input, construct a database):
  • RepeatScout
  • orrb
  • PILER, etc.

Detecting Chimeras
Detecting chimeric sequences:
• Abundance-based: Perseus, UCHIME
  • Chimeras undergo fewer amplification cycles, so chimera segments in their native arrangements are more frequent
• Specific to 16S: ChimeraSlayer, Bellerophon
  • Chimera ‘arms’ are closer to their originating clades than the entire chimera is
• Similarity coverage-based: Mira assembler
• Similarity graph topology-based: dchim
  [Figure panels: Alignment view; Connectivity view]

Protein Clustering
- A direct similarity measure by edit distance is not sensitive enough for evolutionarily distant species
- Position-specific scoring matrices and profile HMMs provide better sensitivity, but are SLOW
- Similar problems as for RNA-Seq/EST clustering, but their causes are harder to fight
- No ‘one fits all’ solution: manual tuning and curation are required for comprehensive results, especially at a large scale
- The results of clustering are precious; they are kept as databases (PFAM, COGs, KOGs, eggNOG)

Protein Clustering at JGI
Functional annotation of metagenome genes through protein clusters (IMG):
- Build a set of functionally homogeneous clusters of similar proteins for annotated genomes
- Build an HMM for each cluster, compose a model database
- Pledge metagenome proteins to clusters by matching to models
- Cluster unpledged proteins, build models, update the model database

Protein Clustering
Use of protein clusters reduces the search space, but adds another level of indirection, which is a source of errors, and adds complexity that consumes effort.
However, for proteins, which form dense relationship networks, clustering is a great tool.
Konstantinos Mavrommatis will elaborate on protein clustering techniques.
Thank you!