Machine Learning in Practice
Lecture 21
Carolyn Penstein Rosé
Language Technologies Institute / Human-Computer Interaction Institute

Plan for the Day
• Announcements; questions?
• No quiz and no new assignment today
• Weka helpful hints
• Clustering
• Advanced statistical models
• More on optimization and tuning

Weka Helpful Hints
• Remember SMOreg vs. SMO…

Setting the Exponent in SMO
* Note that an exponent larger than 1.0 means you are using a non-linear kernel.
(A code sketch of this setting appears after the clustering material below.)

Clustering

What is clustering?
• Finding natural groupings of your data
• Not supervised! No class attribute.
• Usually only works well if you have a huge amount of data!

InfoMagnets: Interactive Text Clustering

What does clustering do?
• Finds natural breaks in your data
• If there are obvious clusters, you can find them with a small amount of data
• If you have lots of weak predictors, you need a huge amount of data to make it work

Clustering in Weka
* You can pick which clustering algorithm you want to use and how many clusters you want.
* Clustering is unsupervised, so you want it to ignore your class attribute: select the class attribute in the list of attributes to ignore.
* You can evaluate the clustering in comparison with class attribute assignments.

Adding a Cluster Feature
* You should set it explicitly to ignore the class attribute.
* Set the class pulldown menu to No Class.

Why add cluster features?
(Figure: instances from Class 1 and Class 2 and the natural groupings among them.)

Clustering with Weka
• K-means and FarthestFirst: disjoint, flat clusters
• EM: statistical approach
• Cobweb: hierarchical clustering

K-Means
• You choose the number of clusters you want; you might need to play with this by looking at what kind of clusters you get out.
• K initial points are chosen randomly as cluster centroids.
• All points are assigned to the centroid they are closest to.
• Once the data is clustered, a new centroid is picked for each cluster based on the relationships within it.
• Then clustering occurs again using the new centroids.
• This continues until no changes in clustering take place.
• Clusters are flat and disjoint.
(See the clustering code sketches at the end of this section.)

EM: Expectation Maximization
• Does not base clustering on distance from a centroid; instead clusters based on the probability of cluster assignment.
• Overlapping clusters rather than disjoint clusters: every instance belongs to every cluster with some probability.
• Two important kinds of probability distributions:
  • Each cluster has an associated distribution of attribute values for each attribute, based on the extent to which instances are in the cluster.
  • Each instance has a certain probability of being in each cluster, based on how close its attribute values are to typical attribute values for the cluster.
• One iteration, illustrated on the slides with two clusters A and B whose membership percentages shift each step: probabilities of cluster membership are initialized; central tendencies are computed based on cluster membership; cluster membership is re-assigned probabilistically; central tendencies are re-assigned based on the new membership; and cluster membership is re-assigned again.
• Iterative like k-means, but guided by a different computation.
• Considered more principled than k-means, but much more computationally expensive.
• Like k-means, you pick the number of clusters you want.
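The SMO exponent setting mentioned under "Weka Helpful Hints" can also be configured programmatically. A minimal sketch, assuming the standard Weka 3.x Java API (SMO, PolyKernel); the data file name is a placeholder:

import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.PolyKernel;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SmoExponentDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder data set with a nominal class; SMO is the classifier, SMOreg handles numeric targets.
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        PolyKernel kernel = new PolyKernel();
        kernel.setExponent(2.0);   // larger than 1.0, so this is a non-linear (quadratic) kernel

        SMO smo = new SMO();
        smo.setKernel(kernel);
        smo.buildClassifier(data);
        System.out.println(smo);
    }
}

With setExponent(1.0) the kernel stays linear; this mirrors the exponent setting you reach through the object editor.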
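A minimal k-means sketch, again assuming the Weka 3.x Java API: SimpleKMeans is built on a copy of the data with the class attribute removed (clustering is unsupervised), and the cluster assignments are then compared with the held-back class labels, in the spirit of the classes-to-clusters evaluation mentioned above. The file name, the choice of k = 3, and the assumption that the class is the last nominal attribute are placeholders:

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KMeansClusteringDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder file; the class is assumed to be the last (nominal) attribute.
        Instances data = DataSource.read("mydata.arff");
        int classColumn = data.numAttributes() - 1;

        // Clustering is unsupervised: cluster on a copy with the class attribute removed.
        Remove removeClass = new Remove();
        removeClass.setAttributeIndices(String.valueOf(classColumn + 1));   // 1-based index
        removeClass.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, removeClass);

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(3);   // you pick k and inspect the resulting clusters
        kmeans.buildClusterer(noClass);

        // Compare cluster assignments with the class labels we held back.
        for (int i = 0; i < noClass.numInstances(); i++) {
            int cluster = kmeans.clusterInstance(noClass.instance(i));
            String label = data.instance(i).stringValue(classColumn);
            System.out.println("instance " + i + ": cluster " + cluster + ", class " + label);
        }
    }
}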
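Adding a cluster feature can likewise be scripted with the AddCluster filter. A sketch under the same assumptions as above; as on the slides, no class index is set while the filter runs ("No Class"), and the class column is named in the filter's ignored attributes. setIgnoredAttributeIndices is my reading of that ignore option:

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddCluster;

public class AddClusterFeatureDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder file; no class index is set yet, matching the "No Class" pulldown.
        Instances data = DataSource.read("mydata.arff");

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(2);   // hypothetical choice of two clusters

        AddCluster addCluster = new AddCluster();
        addCluster.setClusterer(kmeans);
        // Assumed ignore option: keep the class column (here, the last attribute) out of the clustering.
        addCluster.setIgnoredAttributeIndices(String.valueOf(data.numAttributes()));
        addCluster.setInputFormat(data);

        Instances withCluster = Filter.useFilter(data, addCluster);
        // The new "cluster" attribute is appended at the end; the original class is now second to last.
        withCluster.setClassIndex(withCluster.numAttributes() - 2);
        System.out.println(withCluster.numAttributes() + " attributes after adding the cluster feature");
    }
}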
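For EM the same pattern applies, but every instance gets a probability of membership in every cluster rather than a single assignment. A sketch with the same placeholder file and a hypothetical choice of two clusters:

import weka.clusterers.EM;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class EmSoftMembershipDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder file; the class is assumed to be the last attribute and is removed before clustering.
        Instances data = DataSource.read("mydata.arff");
        Remove removeClass = new Remove();
        removeClass.setAttributeIndices(String.valueOf(data.numAttributes()));
        removeClass.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, removeClass);

        EM em = new EM();
        em.setNumClusters(2);   // like k-means, you pick the number of clusters
        em.buildClusterer(noClass);

        // Every instance belongs to every cluster with some probability.
        double[] membership = em.distributionForInstance(noClass.firstInstance());
        for (int c = 0; c < membership.length; c++) {
            System.out.printf("cluster %d: probability %.2f%n", c, membership[c]);
        }
    }
}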
Advanced Statistical Models

Quick View of Bayesian Networks
(Figure: a network over the attributes Outlook, Temperature, Humidity, Windy, and Play.)
• Normally with Naïve Bayes you have simple conditional probabilities, e.g. P[Play = yes | Humidity = high].
• With Bayes nets, there are interactions between attributes, e.g. P[Play = yes & Temperature = hot | Humidity = high].
• The likelihood computation for an instance is similar: you will still have one conditional probability per attribute to multiply together, but they won't all be simple. Here, Humidity is related jointly to Temperature and Play.
• The learning algorithm needs to find the shape of the network; the probabilities then come from counts. Two stages, a similar idea to "kernel methods."

Doing Optimization in Weka

Optimizing Parameter Settings
(Figure: folds 1-5 with Train, Validation, and Test portions marked.)
• Use a modified form of cross-validation: within each training set, hold out a validation portion; iterate over settings, compare performance on the validation set, pick the optimal setting, and only then test on the test set.
• There are still N folds, but each fold has less training data than with standard cross-validation.
• Or you can have a single hold-out validation set that you use for all folds.

Remember!
• Cross-validation is for estimating your performance. If you want the model that achieves that estimated performance, train over the whole set.
• The same principle holds for optimization: estimate your tuned performance using cross-validation with an inner loop for optimization. When you build the model over the whole set, use the settings that work best in cross-validation over the whole set.

Optimization in Weka
• Divide your data into 10 train/test pairs.
• Tune parameters using cross-validation on the training set (this is the inner loop).
• Use those optimized settings on the corresponding test set.
• Note that you may have a different set of parameter settings for each of the 10 train/test pairs.
• You can do the optimization in the Experimenter.
(A code sketch of this procedure appears at the end of this section.)

Train/Test Pairs
* Use the StratifiedRemoveFolds filter.

Setting Up for Optimization
* Prepare to save the results.
• Load in training sets for all folds.
• We'll use cross-validation within the training folds to do the optimization.

What are we optimizing?
• Let's optimize the confidence factor. Let's try 0.1, 0.25, 0.5, and 0.75.

Add Each Algorithm to the Experimenter Interface

Look at the Results
* Note that the optimal setting varies across folds.

Apply the Optimized Settings on Each Fold
* Performance on Test1 using optimized settings from Train1.

What if the optimization requires work by hand?
• Do you see a problem with the following?
  • Do feature selection over the whole set to see which words are highly ranked.
  • Create user-defined features with subsets of these to see which ones look good.
  • Add those to your feature space and do the classification.
• The problem is that this is just like doing feature selection over your whole data set: you will overestimate your performance.
• So what's a better way of doing it? You could set aside a small subset of the data and do the same process on that subset alone. Then use those user-defined features with the other part of the data.

Take Home Message
• Instance-based learning and clustering both make use of similarity metrics.
• Clustering can be used to help you understand your data or to add new features to your data.
• Weka provides opportunities to tune all of its algorithms through the object editor.
• You can use the Experimenter to tune the parameter settings when you are estimating your performance using cross-validation.
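Relating back to the "Quick View of Bayesian Networks" slides, a small comparison sketch using the Weka 3.x Java API: NaiveBayes uses simple conditional probabilities, while BayesNet first learns a network structure and then fills in conditional probability tables from counts. Defaults are used throughout, and the file name is a placeholder:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.BayesNet;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BayesNetDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");   // placeholder file
        data.setClassIndex(data.numAttributes() - 1);

        // Naive Bayes: attributes treated as independent given the class.
        Evaluation nbEval = new Evaluation(data);
        nbEval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        // BayesNet: structure learning first, then probability estimates from counts.
        Evaluation bnEval = new Evaluation(data);
        bnEval.crossValidateModel(new BayesNet(), data, 10, new Random(1));

        System.out.printf("NaiveBayes: %.2f%%  BayesNet: %.2f%%%n",
                nbEval.pctCorrect(), bnEval.pctCorrect());
    }
}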
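The optimization procedure above (10 outer train/test pairs from StratifiedRemoveFolds, an inner cross-validation on each training set over confidence factors 0.1, 0.25, 0.5, and 0.75 for J48, then evaluation of the tuned tree on the matching test fold) can also be scripted rather than run through the Experimenter. A sketch assuming the Weka 3.x Java API; the file name, the use of accuracy as the comparison metric, and the 10 inner folds are placeholders, and invertSelection(true) is taken to return everything except the selected fold (the training portion):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.StratifiedRemoveFolds;

public class TuneConfidenceFactor {

    // Select one stratified fold: invert = true keeps everything except the
    // selected fold (training portion), invert = false keeps the fold itself.
    static Instances fold(Instances data, int foldNumber, boolean invert) throws Exception {
        StratifiedRemoveFolds f = new StratifiedRemoveFolds();
        f.setNumFolds(10);
        f.setFold(foldNumber);
        f.setInvertSelection(invert);
        f.setInputFormat(data);
        return Filter.useFilter(data, f);
    }

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");   // placeholder file; nominal class assumed last
        data.setClassIndex(data.numAttributes() - 1);

        double[] confidenceFactors = {0.1, 0.25, 0.5, 0.75};

        for (int i = 1; i <= 10; i++) {                     // the 10 outer train/test pairs
            Instances train = fold(data, i, true);
            Instances test  = fold(data, i, false);

            // Inner loop: cross-validate each setting on the training data only.
            double bestScore = -1;
            double bestC = confidenceFactors[0];
            for (double c : confidenceFactors) {
                J48 tree = new J48();
                tree.setConfidenceFactor((float) c);
                Evaluation inner = new Evaluation(train);
                inner.crossValidateModel(tree, train, 10, new Random(1));
                if (inner.pctCorrect() > bestScore) {
                    bestScore = inner.pctCorrect();
                    bestC = c;
                }
            }

            // Apply the optimized setting from Train_i to the corresponding Test_i.
            J48 tuned = new J48();
            tuned.setConfidenceFactor((float) bestC);
            tuned.buildClassifier(train);
            Evaluation outer = new Evaluation(train);
            outer.evaluateModel(tuned, test);
            System.out.printf("fold %d: best confidence factor %.2f, test accuracy %.2f%%%n",
                    i, bestC, outer.pctCorrect());
        }
    }
}

Each outer fold may end up with a different best confidence factor, which matches the observation above that the optimal setting varies across folds.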