Classification vs. Clustering • Data Mining Lecture Notes • Dr. Yousef Qawqzeh

Outline
• Clustering and Classification Concepts
• Bayesian Classification
• Clustering Algorithms
• Applications
  – Classifying Mitochondrial Proteins
  – Clustering Gene Expression Data
• Lab Practical
  – Clustering and Classification of Gene Expression Data
  – TB Expression Data Compendium (Boshoff et al.)

The Basic Idea
• Objects are characterized by one or more features (e.g. for proteins or genes: expression, conservation, co-expression; figure: points plotted as Feature Y / Exp 2 vs. Feature X / Exp 1)
• Classification
  – We have labels for some points
  – We want a "rule" that will accurately assign labels to new points
  – Supervised learning
• Clustering
  – No labels
  – Group points into clusters based on how "near" they are to one another
  – Identify structure in data
  – Unsupervised learning

Clustering and Classification in Genomics
• Classification
  – Microarray data: classify cell state (e.g. AML vs. ALL) using expression data
  – Protein/gene sequences: predict function, localization, etc.
• Clustering
  – Microarray data: groups of genes that share similar function have similar expression patterns – identify regulons
  – Protein sequences: group related proteins to infer function
  – EST data: collapse redundant sequences

Classification: Two Different Approaches
• Generative
  – Bayesian classification and Naïve Bayes
  – Example: mitochondrial protein prediction
• Discriminative
  – Support Vector Machines

Bayesian Classification
• We will pose the classification problem in probabilistic terms
• Create models for how features are distributed for objects of different classes
• We will use probability calculus to make classification decisions

Classifying Mitochondrial Proteins (Maestro)
• Derive 7 features for all human proteins: targeting signal, protein domains, co-expression, mass spec, homology, induction, motifs
• Predict nuclear-encoded mitochondrial genes

Let's Look at Just One Feature
• Each object can be associated with multiple features (e.g. conservation, co-expression)
• We will look at the case of just one feature for now
• We are going to define two key concepts…

The First Key Concept
• Features for each class are drawn from class-conditional probability distributions (CCPDs): P(X|Class1) and P(X|Class2)
• Our first goal will be to model these distributions

This Seems Familiar… Modeling Pathogenicity Islands
• P(Si|PI) is a class-conditional probability distribution
  – Objects: nucleotides Si; Class: pathogenicity island (PI) vs. background (B); Feature: A, T, G, C
  – Example model: A: 0.15, T: 0.13, G: 0.30, C: 0.42
• Makes sense: we built a simple classifier for sequences (PI vs. B)

The Second Key Concept
• We model prior probabilities to quantify the expected a priori chance of seeing a class: P(Class1) and P(Class2)
• P(Mito) = how likely is the next protein to be a mitochondrial protein, before I see any features to help me decide
• We expect ~1,500 mitochondrial genes out of ~21,000 total, so P(Mito) = 1500/21000 and P(~Mito) = 19500/21000

But How Do We Classify?
• So we have priors defining the a priori probability of each class: P(Class1), P(Class2)
• We also have models for the probability of a feature given each class: P(X|Class1), P(X|Class2)
• But we want the probability of the class given a feature – how do we get P(Class1|X)?

Bayes Rule

  P(Class | Feature) = P(Feature | Class) · P(Class) / P(Feature)

• Posterior (belief after evidence) = Likelihood × Prior (belief before evidence) / Evidence
• Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370–418
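To make Bayes' rule concrete, here is a minimal Python sketch (not from the slides) that turns the priors above and two class-conditional likelihoods into a posterior P(Mito | X). Only the 1500/21000 priors come from the notes; the likelihood values for the observed feature are invented placeholders.

```python
# Minimal sketch of Bayes' rule for a two-class problem (illustrative numbers only).

def posterior(p_x_given_c1, p_x_given_c2, p_c1, p_c2):
    """P(Class1 | X) via Bayes' rule; the evidence P(X) is the denominator."""
    evidence = p_x_given_c1 * p_c1 + p_x_given_c2 * p_c2
    return (p_x_given_c1 * p_c1) / evidence

# Priors from the notes: ~1,500 mitochondrial genes out of ~21,000 total.
p_mito, p_not_mito = 1500 / 21000, 19500 / 21000

# Hypothetical class-conditional likelihoods for one observed feature value X.
p_x_mito, p_x_not_mito = 0.30, 0.05

print(posterior(p_x_mito, p_x_not_mito, p_mito, p_not_mito))  # P(Mito | X)
```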
Bayes Decision Rule
• If we observe an object with feature X, how do we decide if the object is from Class 1?
• The Bayes Decision Rule is simply: choose Class 1 if

  P(Class1 | X) > P(Class2 | X)
  P(X | Class1) P(Class1) / P(X) > P(X | Class2) P(Class2) / P(X)

• P(X) is the same number on both sides, so choose Class 1 if

  P(X | Class1) P(Class1) > P(X | Class2) P(Class2)

Discriminant Function
• We can create a convenient representation of the Bayes Decision Rule:

  P(X | Class1) P(Class1) > P(X | Class2) P(Class2)
  ⇔ P(X | Class1) P(Class1) / [ P(X | Class2) P(Class2) ] > 1
  ⇔ G(X) = log { P(X | Class1) P(Class1) / [ P(X | Class2) P(Class2) ] } > 0

• If G(X) > 0, we classify as Class 1

Stepping Back: What Do We Have So Far?
• We have defined the two components: class-conditional distributions and priors
  – P(X|Class1), P(X|Class2)
  – P(Class1), P(Class2)
• We have used Bayes Rule to create a discriminant function for classification from these components:

  G(X) = log { P(X | Class1) P(Class1) / [ P(X | Class2) P(Class2) ] } > 0

• Given a new feature X, we plug it into this equation… and if G(X) > 0 we classify as Class 1

Flashback to First Lecture
• We defined a scoring function for classifying sequences as pathogenicity islands or background DNA:

  Score = log [ P(Sequence | PathogenicityIsland) / P(Sequence | BackgroundDNA) ] = log [ P(S | PI) / P(S | B) ]

• Can we justify this choice in Bayesian terms?
• Pathogenicity islands: A: 0.42, T: 0.30, G: 0.13, C: 0.15; Background DNA: A: 0.25, T: 0.25, G: 0.25, C: 0.25

From the Bayesian Approach
• We can write out our Bayes discriminant function:

  G(S) = log [ P(PI | S) / P(B | S) ] = log { P(S | PI) P(PI) / [ P(S | B) P(B) ] }

• If P(PI) = P(B), i.e. P(PI)/P(B) = 1, this reduces to log [ P(S | PI) / P(S | B) ]
• The log-likelihood ratio is the Bayes classifier assuming equal priors

Unequal Priors
• Bayes rule lets us correct for the fact that islands (prior 15%) are not as common as background DNA (prior 85%):

  G(S) = log { P(S | PI) P(PI) / [ P(S | B) P(B) ] } = log [ P(S | PI) / P(S | B) ] + log(0.15 / 0.85) ≈ Score_original - 0.75 (with base-10 logs)

• We require a higher score to call a pathogenicity island
• (The HMM implicitly accounted for this in the transition probabilities)

Back to Classification
• We have two fundamental tasks
• We need to estimate the needed probability distributions
  – P(X|Mito) and P(X|~Mito)
  – P(Mito) and P(~Mito)
• We need to assess the accuracy of the classifier
  – How well does it classify new objects?

The All-Important Training Set
• Building a classifier requires a set of labeled data points called the training set
• The quality of the classifier depends on the number of training set data points
• How many data points you need depends on the problem
• You need data both to build and to test your classifier

Getting P(X|Class) from the Training Set
• How do we get P(X|Class1) from the training data? One simple approach: divide the X values into bins and simply count frequencies
• Example: with 13 training points binned into <1, 1–3, 3–5, 5–7, >7, the estimated frequencies are 7/13, 3/13, 2/13, 1/13, and 0
• This should feel familiar from last lecture (i.e. maximum likelihood)
• In general, and especially for continuous distributions, this can be a complicated problem (density estimation)

Getting Priors
Three general approaches:
1. Estimate priors by counting the fraction of each class in the training set (e.g. 13 Class 1 and 10 Class 2 points give P(Class1) = 13/23, P(Class2) = 10/23) – but sometimes the fractions in the training set are not representative of the world
2. Estimate from "expert" knowledge – example: P(Mito) = 1500/21000, P(~Mito) = 19500/21000
3. We have no idea – use equal (uninformative) priors: P(Class1) = P(Class2)
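Here is a minimal sketch of the two estimation steps just described: binning training values to estimate P(X|Class) and counting class fractions to estimate the priors, then plugging both into the discriminant G(X). The training values are invented (Class 1 is chosen to reproduce the 7/13, 3/13, 2/13, 1/13, 0 histogram above), and the pseudocount used to avoid zero-probability bins is an added assumption, not something the notes specify.

```python
import numpy as np

# Bin edges for <1, 1-3, 3-5, 5-7, >7; training data below are invented for illustration.
bins = [-np.inf, 1, 3, 5, 7, np.inf]
class1_x = [0.2, 0.5, 0.8, 0.9, 0.3, 0.6, 0.1, 1.5, 2.2, 2.8, 3.4, 4.1, 6.0]  # 13 points
class2_x = [2.5, 3.1, 4.0, 4.8, 5.5, 6.2, 6.9, 7.5, 8.1, 9.0]                 # 10 points

def binned_ccpd(values, bins, pseudocount=1.0):
    """Histogram estimate of P(X | Class); the pseudocount avoids zero-probability bins."""
    counts, _ = np.histogram(values, bins=bins)
    counts = counts + pseudocount
    return counts / counts.sum()

p_x_c1 = binned_ccpd(class1_x, bins)
p_x_c2 = binned_ccpd(class2_x, bins)

# Priors by counting the fraction of each class in the training set (13 vs. 10).
p_c1 = len(class1_x) / (len(class1_x) + len(class2_x))
p_c2 = 1 - p_c1

def G(x):
    """Bayes discriminant: classify as Class 1 if G(x) > 0."""
    b = np.digitize(x, bins[1:-1])            # which bin does x fall into?
    return np.log((p_x_c1[b] * p_c1) / (p_x_c2[b] * p_c2))

print(G(0.5), G(6.5))   # positive -> Class 1, negative -> Class 2
```

With real data there would be one such binned distribution per feature and per class, estimated from the labeled training set only.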
We Are Just About There…
• We have created the class-conditional distributions and priors: P(X|Class1), P(X|Class2), P(Class1), P(Class2)
• And we are ready to plug these into our discriminant function:

  G(X) = log { P(X | Class1) P(Class1) / [ P(X | Class2) P(Class2) ] } > 0

• But there is one more little complication…

But What About Multiple Features?
• We have focused on a single feature for an object
• But mitochondrial protein prediction (for example) has 7 features: targeting signal, protein domains, co-expression, mass spec, homology, induction, motifs
• So P(X|Class) becomes P(X1, X2, …, X7 | Class), and our discriminant function becomes

  G(X1, …, X7) = log { P(X1, X2, …, X7 | Class1) P(Class1) / [ P(X1, X2, …, X7 | Class2) P(Class2) ] } > 0

Distributions Over Many Features
• Estimating P(X1, X2, …, X7 | Class) can be difficult
  – Assume each feature is binned into 5 possible values
  – We then have 5^7 combinations of values whose frequencies we need to count
  – Generally we will not have enough data – we will have lots of nasty zeros

Naïve Bayes Classifier
• We are going to make the following assumption: all features are independent given the class

  P(X1, X2, …, Xn | Class) = P(X1 | Class) P(X2 | Class) … P(Xn | Class) = ∏_{i=1..n} P(Xi | Class)

• We can thus estimate individual distributions for each feature and just multiply them together!

Naïve Bayes Discriminant Function
• Thus, with the Naïve Bayes assumption, we can rewrite this:

  G(X1, …, X7) = log { P(X1, …, X7 | Class1) P(Class1) / [ P(X1, …, X7 | Class2) P(Class2) ] } > 0

as this:

  G(X1, …, X7) = log { [ ∏_i P(Xi | Class1) ] P(Class1) / ( [ ∏_i P(Xi | Class2) ] P(Class2) ) } > 0

Individual Feature Distributions
• Instead of a single big distribution, we have a smaller one (e.g. a binned histogram, as above) for each feature and each class:
  – P(Target|Mito), P(Target|~Mito)
  – P(Domain|Mito), P(Domain|~Mito)
  – P(CE|Mito), P(CE|~Mito)
  – P(Mass|Mito), P(Mass|~Mito)
  – P(Homology|Mito), P(Homology|~Mito)
  – P(Induc|Mito), P(Induc|~Mito)
  – P(Motif|Mito), P(Motif|~Mito)

Classifying a New Protein
• For a new protein, look up P(Xi|Mito) and P(Xi|~Mito) for all 7 features
• Plug these and the priors into the discriminant function:

  G(X1, …, X7) = log { [ ∏_i P(Xi | Mito) ] P(Mito) / ( [ ∏_i P(Xi | ~Mito) ] P(~Mito) ) } > 0

• If G > 0, we predict that the protein is mitochondrial (a code sketch of this follows at the end of this classification section)

Maestro Results
• Applying Maestro to the human proteome: 1,451 total predictions, 490 of them novel (Slide credit: S. Calvo)

How Good Is the Classifier?
• The rule: we must test our classifier on a different set from the training set – the labeled test set
• The task: we classify each object in the test set and count the number of each type of error

Binary Classification Errors

                  Predicted True   Predicted False
  True (Mito)     TP               FN
  False (~Mito)   FP               TN

• Sensitivity = TP / (TP + FN)
  – Fraction of all Class 1 (True) objects that we correctly predicted as Class 1
  – How good are we at finding what we are looking for?
• Specificity = TN / (TN + FP)
  – Fraction of all Class 2 (False) objects called Class 2
  – How many of the Class 2 objects do we filter out of our Class 1 predictions?
• In both cases, the higher the better

Maestro Outperforms Existing Classifiers
• Naïve Bayes (Maestro): (99%, 71%) (Slide credit: S. Calvo)
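The sketch below walks through the Naïve Bayes discriminant for the seven-feature mitochondrial example. The feature names and the 1500/21000 priors follow the notes; the per-feature probability tables and the "low"/"high" binning are invented placeholders, not Maestro's actual distributions.

```python
import math

# Sketch of G = log[ prod_i P(Xi|Mito) * P(Mito) / (prod_i P(Xi|~Mito) * P(~Mito)) ].
FEATURES = ["targeting_signal", "protein_domains", "co_expression",
            "mass_spec", "homology", "induction", "motifs"]

# P(feature value | class); each feature is binned into "low"/"high" purely for illustration.
p_given_mito = {f: {"low": 0.3, "high": 0.7} for f in FEATURES}
p_given_not  = {f: {"low": 0.6, "high": 0.4} for f in FEATURES}

p_mito, p_not = 1500 / 21000, 19500 / 21000   # priors from the notes

def g(observed):
    """observed: dict mapping feature name -> binned value. Returns the discriminant."""
    score = math.log(p_mito / p_not)
    for f in FEATURES:
        score += math.log(p_given_mito[f][observed[f]] / p_given_not[f][observed[f]])
    return score

protein = {f: "high" for f in FEATURES}        # a hypothetical new protein
score = g(protein)
print(score, "-> Mito" if score > 0 else "-> ~Mito")
```

Working in log space, as here, also avoids numerical underflow when many small per-feature probabilities are multiplied together.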
Clustering Expression Data
• Cluster genes – group genes by similar expression across different conditions
• Cluster experiments – group experiments by similar expression profiles
• (Figure: genes plotted in experiment space, and experiments plotted in gene space)

Why Cluster Genes by Expression?
• Organize data
  – Explore without getting lost in each data point
  – Enhance visualization
• Co-regulated genes
  – Common expression may imply common regulation (e.g. GCN4 targets)
  – Predict cis-regulatory promoter sequences
• Functional annotation
  – Infer similar function from similar expression (e.g. His2, His3, and an unknown gene clustering with amino-acid genes)

Clustering Methods
There are many clustering methods; we will look briefly at two:
1. K-means – a partitioning method
2. Hierarchical clustering – an agglomerative method

K-Means Clustering
• The basic idea: assume a fixed number of clusters, K
• Goal: create "compact" clusters

K-Means Algorithm (slide credit: Serafim Batzoglou)
• Randomly initialize the K clusters
• Assign data points to their nearest clusters
• Recalculate the clusters
• Repeat… until convergence
• (A short code sketch of this loop appears at the end of this clustering section)

(Dis)Similarity Measures
• Many distance/similarity measures can be used – see D'haeseleer (2005) Nat Biotech

But How Many Clusters?
• How do we select K?
  – We can always make clusters "more compact" by increasing K
  – e.g. what happens if K = the number of data points?
  – What is a meaningful improvement?
• Hierarchical clustering side-steps this issue

Hierarchical Clustering (slide credits: M. Kellis)
• The most widely used algorithm for expression data
• Start with each point in a separate cluster
• At each step:
  – Choose the pair of closest clusters
  – Merge them
• The result is a tree, as in phylogeny (e.g. UPGMA)

Visualization of Results
• Hierarchical clustering produces clusters for all possible numbers of clusters, from 1 up to the number of data points
• We can always select a "cut level" to create disjoint clusters
• But how do we define distances between clusters?

Distance Between Clusters (slide credits: M. Kellis)
• Single-link method: CD(X,Y) = min over x∈X, y∈Y of D(x,y)
• Complete-link method: CD(X,Y) = max over x∈X, y∈Y of D(x,y)
• Average-link method: CD(X,Y) = avg over x∈X, y∈Y of D(x,y)
• Centroid method: CD(X,Y) = D(avg(X), avg(Y))

What Does Clustered Data Look Like?
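Here is the K-means loop described above as a minimal Python/NumPy sketch. The two-dimensional toy data and K = 3 are illustrative only; on expression data each point would be a gene's expression profile, and a real analysis would typically try several random initializations.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 60 points drawn around three invented centers.
X = rng.normal(size=(60, 2)) + np.repeat([[0, 0], [5, 5], [0, 5]], 20, axis=0)

def kmeans(X, k, n_iter=100):
    centers = X[rng.choice(len(X), size=k, replace=False)]       # random initialization
    for _ in range(n_iter):
        # Assign each point to its nearest center (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recalculate each center as the mean of its assigned points
        # (keep the old center if a cluster happens to be empty).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                     # repeat until convergence
            break
        centers = new_centers
    return labels, centers

labels, centers = kmeans(X, k=3)
print(centers)
```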
Evaluating Cluster Performance
• In general, it depends on your goals in clustering
• Robustness
  – Select random samples from the data set and cluster them
  – Repeat
  – Robust clusters show up across the repeated runs
• Category enrichment
  – Look for categories of genes "over-represented" in particular clusters (a small sketch of one such test appears at the end of these notes)
  – Also used in motif discovery (next lecture)

Clusters and Motif Discovery
• Expression from 15 time points during the yeast cell cycle – Tavazoie & Church (1999)

Bringing Clustering and Classification Together: Semi-Supervised Learning
• Common scenario: few labeled points, many unlabeled points, structured data
• What if we cluster first? Then the clusters can help us classify

Tomorrow's Lab
• Expression clustering and classification
  – SVM
  – Cross-validation
  – Matlab
• TB Expression Compendium
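The notes describe looking for gene categories "over-represented" in a cluster but do not name a specific test; one common choice (an assumption here, not taken from the notes) is the hypergeometric tail probability, sketched below with invented counts.

```python
from scipy.stats import hypergeom

# Sketch of a category-enrichment check for one cluster; all counts are invented.
total_genes    = 6000   # genes measured on the array
category_genes = 120    # genes annotated to the category (e.g. an amino-acid pathway)
cluster_size   = 200    # genes in the cluster
in_both        = 15     # category genes that fall inside the cluster

# P(seeing >= in_both category genes in a random set of cluster_size genes)
p_value = hypergeom.sf(in_both - 1, total_genes, category_genes, cluster_size)
print(p_value)
```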