Machine Learning in Practice
Lecture 22
Carolyn Penstein Rosé
Language Technologies Institute / Human-Computer Interaction Institute

Plan for the Day
- Announcements
- Questions?
- Multi-Level Cross-Validation
- Feature Selection
- Setting Up the Experimenter for Regression Problems
- Cascading Classifiers

Advanced Cross Validation for Hierarchical Models
[Diagram: raw features feed models for Animals and Plants; their predictions become new features for the Living Things target class]
- Let's say you want to train a model to predict whether something is a living thing
- Let's say you know that animals have a lot in common with each other, and plants have a lot in common with each other
- You think it would be easier to predict Living Thing if you first have models to predict Animal and Plant
- Describe the steps you'll go through to build the Living Thing model

Remember the cluster feature example
[Diagram: Class 1 versus Class 2; the added structure makes it easier to detect Class 1]

Advanced Cross Validation for Hierarchical Models
[Diagram: raw features feed Classifiers A and B; their predictions become new features for Classifier C, which predicts the target class]
- You will use the results of Classifiers A and B to train Classifier C
- You need labeled data for Class A and Class B to train Classifiers A and B, but you don't want to train Classifier C with those perfect labels
- You can use cross validation to get "noisy" versions of A and B in the training data for C

Advanced Cross Validation
[Diagram: data split into segments 1-7; each instance has raw features A, B, C, an intermediate label D (predicted by classifier d), and a target label F (predicted by classifier f)]
- Let's say each instance in your data has features A, B, and C
- You are trying to predict F
- You want to train feature D, which you think will help you detect F better

Advanced Cross Validation
- Fold 1: train a d classifier over segments 2-7
  - Use this model to apply D labels to segment 1
- Train an f classifier over 2-7
  - Since D labels in 1 will be noisy, you need to train your f classifier with noisy D labels in 2-7
  - You can get those noisy labels using cross validation within 2-7
- Use the f model to apply F labels to segment 1

Advanced Cross Validation
- Think about how to get those noisy labels using cross validation within 2-7
  - Train a d classifier on 3-7 to apply noisy D labels to 2
  - Train a d classifier on 2 plus 4-7 to apply noisy D labels to 3
  - Etc.
- Now you have noisy D labels for 2-7 in addition to the perfect D labels you started with
- You will use these noisy labels to train f (not the perfect ones!)

Remember: Dumping Labels from Weka
- Save the output buffer
- Pull the results section out
- Use the predicted column
- NOTE: if you do this using Weka's cross-validation, you won't be able to match up the instance numbers!
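This two-level scheme maps directly onto nested cross-validation. Below is a minimal sketch in Python with scikit-learn (rather than Weka, which the course uses). The seven-segment split, the logistic regression classifiers, and the names X, d_true, and f_true are illustrative assumptions, not part of the lecture.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

def evaluate_cascade(X, d_true, f_true, n_outer=7, n_inner=6):
    """Two-level cross-validation for a cascaded classifier.

    X: raw features A, B, C; d_true: hand-labeled D; f_true: target F labels.
    Labels are assumed to be numerically coded."""
    outer = KFold(n_splits=n_outer, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in outer.split(X):
        # Inner cross-validation over the training segments produces "noisy"
        # out-of-fold D predictions, which is all the f classifier gets to see.
        noisy_d = cross_val_predict(LogisticRegression(max_iter=1000),
                                    X[train_idx], d_true[train_idx], cv=n_inner)

        # A d classifier trained on all training segments supplies the D feature
        # for the held-out segment, so it is just as noisy as at deployment time.
        d_clf = LogisticRegression(max_iter=1000).fit(X[train_idx], d_true[train_idx])
        d_test = d_clf.predict(X[test_idx])

        # Train f on the raw features plus the NOISY D labels (never the perfect
        # ones), then score it on the held-out segment.
        f_clf = LogisticRegression(max_iter=1000).fit(
            np.column_stack([X[train_idx], noisy_d]), f_true[train_idx])
        scores.append(f_clf.score(np.column_stack([X[test_idx], d_test]),
                                  f_true[test_idx]))
    return np.mean(scores)
```

The key point is the inner cross_val_predict call: the f classifier only ever sees out-of-fold D labels, matching the noise it will face on the held-out segment.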
Feature Selection

Why do irrelevant features hurt performance?
- Divide-and-conquer approaches have the problem that the further down in the tree you get, the less data you are paying attention to, so it's easy for the classifier to get confused
- Naïve Bayes does not have this problem, but it has other problems, as we have discussed
- SVM is relatively good at ignoring irrelevant attributes, but it can still suffer
  - Also, it's very computationally expensive with large attribute spaces

Two Paradigms for Attribute Selection
- Wrapper Method
  - Evaluate a subset using the algorithm that will be used for the classification, in terms of how the classifier does with that subset
  - Use a search method like BestFirst
- Filter Method
  - Use an independent metric of feature goodness
  - Rank and then select
- Don't be confused: this is not the "standard" usage of filter versus wrapper

How do you evaluate feature goodness apart from the learning algorithm?
- Notice the combination of scoring heuristics and search methods

How do you evaluate feature goodness apart from the learning algorithm? (evaluating subsets of features)
- Look for the smallest set of attributes that distinguishes every training instance from every other training instance
  - A problem occurs if there are two instances with the same attribute values but different classes
- You could use decision tree learning to pick out a subset of attributes to use with a different algorithm
  - It will have no effect if you use it with decision trees
  - It might work well with instance-based learning, to avoid having it be confused by irrelevant attributes

How do you evaluate feature goodness apart from the learning algorithm? (evaluating individual features)
- You can rank attributes for decision trees using 1R to compensate for the bias towards selecting features that branch heavily
- Look at the correlation between each feature and the class attribute

Efficiently Navigating the Attribute Space
- Evaluating individual attributes and then ranking them is the most efficient approach for attribute selection
  - That's what we have been doing up until now with ChiSquaredAttributeEval
- Searching for the optimal subset of features based on evaluating subsets together is more complex
  - Exhaustive search for the optimal subset of attributes is not tractable
  - Use a greedy search such as BestFirst

Efficiently Navigating the Attribute Space
- Remember that greedy methods are efficient, but they sometimes get stuck in locally optimal solutions
- Forward selection: start with nothing and add attributes
  - On each round, pick the attribute that will have the biggest estimated positive effect on performance
- Backward elimination: start with the whole set and prune
  - On each round, select the attribute that seems to be dragging down performance the most
- Bidirectional search methods combine these approaches

Forward Selection
- Pick the most predictive feature.
- Then, on each round, pick the next most predictive feature, or the one that gives the growing set the most predictive power altogether (in the case of the wrapper method, using the classification algorithm you will eventually use).

Backwards Elimination
- On each round, pick the least predictive feature, i.e., the one you can drop without hurting the predictiveness of the remaining features, or even helping it.
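To make the wrapper idea concrete, here is a minimal sketch of greedy forward selection in Python with scikit-learn (the lecture itself uses Weka's BestFirst search). The Naïve Bayes scorer, the 10-fold evaluation, and the stop-when-nothing-improves rule are illustrative assumptions.

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def forward_select(X, y, estimator=None, cv=10):
    """Greedy forward selection, scored by the cross-validated accuracy of the
    classifier you will eventually use (the wrapper idea)."""
    estimator = estimator or GaussianNB()    # placeholder classifier
    remaining = list(range(X.shape[1]))
    selected, best_score = [], 0.0
    while remaining:
        # Try adding each remaining attribute to the current subset and score the result.
        trials = [(cross_val_score(estimator, X[:, selected + [j]], y, cv=cv).mean(), j)
                  for j in remaining]
        score, j = max(trials)
        if score <= best_score:              # greedy stop: no single addition helps
            break
        selected.append(j)
        remaining.remove(j)
        best_score = score
    return selected, best_score
```

Backward elimination is the mirror image: start from the full attribute set and, on each round, drop the attribute whose removal hurts the cross-validated score the least (or helps it the most). Recent versions of scikit-learn package both directions as SequentialFeatureSelector.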
Efficiently Searching the Attribute Space
- You can use a beam search method rather than selecting a single attribute on each round
- Race search: stop when you don't get any statistically significant increase from one round to the next
  - Option: schemata search is like race search except that you rank attributes first and then race the top-ranking attributes (more efficient!)
  - Sometimes you include a random selection of other attributes in with the selected attributes

Efficiently Searching the Attribute Space
- Different approaches will make different mistakes
- Backward elimination produces larger attribute sets and often better performance than forward selection
- Forward selection is good for eliminating redundant attributes or attributes with dependencies between them, which is good for Naïve Bayes
  - It also gives a better model if you want to be able to understand the trained model

Selecting an Attribute Selection Technique

Attribute Selection Options: Evaluating Subsets
- CfsSubsetEval: looks for a subset of attributes that are highly correlated with the predicted class but have low inter-correlation
- ClassifierSubsetEval: evaluates a subset using a selected classifier to compute performance
- ConsistencySubsetEval: evaluates the goodness of a subset of attributes based on the consistency of instances that are close to each other in the reduced attribute space
- SVMAttributeEval: a backwards-elimination, beam-search technique; you specify the number or percent of attributes to get rid of on each iteration, and it stops when eliminating more doesn't help
- WrapperSubsetEval: just like ClassifierSubsetEval

Attribute Selection Options: Evaluating Individual Attributes
- ChiSquaredAttributeEval: evaluates the worth of a feature by computing the chi-squared statistic of the attribute in relation to the predicted class
- GainRatioAttributeEval: like Chi-squared but uses Gain Ratio
- InfoGainAttributeEval: like Chi-squared but uses Information Gain
- OneRAttributeEval: like Chi-squared but looks at the accuracy of using single attributes for classification
- ReliefAttributeEval: evaluates the worth of an attribute by how well its values separate an instance from nearby instances of the same and different classes
- SymmetricalUncertAttributeEval: like Chi-squared but uses symmetrical uncertainty

Subsets in a new vector space
- PrincipalComponents: use principal components analysis to select a subset of eigenvectors from the diagonalized covariance matrix that accounts for a certain percentage of the variance in the data
- A cheaper method is to use a random projection onto a smaller vector space
  - Not as good as principal components analysis
  - But not that much worse either

Remember Matrix Multiplication
[Worked example: multiplying a data matrix by a narrower matrix]
- Notice you ended up with fewer attributes! What can we do with that?
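One answer, sketched below with NumPy rather than Weka, is to treat the multiplication as a projection of the data onto a smaller vector space. The matrix sizes, the Gaussian random matrix, and the choice of 50 output dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 500))          # 1000 instances, 500 original attributes

# Random projection: multiplying by a random 500 x 50 matrix yields 50 new attributes.
R = rng.normal(size=(500, 50)) / np.sqrt(50)
X_random = X @ R                           # shape (1000, 50)

# Principal components: project onto the top 50 directions that explain the most
# variance in the (centered) data, obtained here from a singular value decomposition.
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_pca = X_centered @ Vt[:50].T             # shape (1000, 50)

print(X.shape, X_random.shape, X_pca.shape)
```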
Using linear algebra
- Project one vector space onto a more compact one
- Then select the top N dimensions that, as a set, explain the most variance in your data

Selecting an Attribute Selection Technique (e.g., Forward Selection)
- Forward selection and backward selection are much slower than just ranking attributes based on a metric that can be applied to one attribute at a time
- So if you are using such a metric, just use a ranking selection technique rather than a search selection technique (a minimal sketch of the ranking approach appears after the take-home message below)
- Using search won't change the result, it will just waste a lot of time!

Does it matter which one you pick?
- Consider the spam data set
  - Decision trees worked well because we needed to consider interactions between attributes
- CfsSubsetEval and ChiSquaredAttributeEval consider attributes out of context
  - Both significantly reduce performance
  - In this case we're harming performance because we're ignoring interactions between attributes at the selection stage
- CfsSubsetEval is significantly better than ChiSquaredAttributeEval for the same number of features
  - But with ChiSquaredAttributeEval you can choose to keep more features, and then you do not get a degradation in performance even though you reduce the feature space

Take Home Message
- Multi-level cross-validation helps prevent over-estimating performance for tuned models
  - Needed for cascaded classification in addition to more typical tuning
- Wide range of feature selection techniques
  - Make sure your options are consistent with one another
  - Search-based approaches are much slower than simple ranking approaches
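To close, here is a minimal sketch of the ranking (filter) approach referred to above, using Python with scikit-learn as a rough stand-in for Weka's ChiSquaredAttributeEval plus the Ranker. The choice of k=200, the Multinomial Naïve Bayes classifier, and the assumption of non-negative count features (as in the spam data) are all illustrative.

```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Keep the 200 attributes with the highest chi-squared score against the class.
# The score never looks at the classifier, which is what makes ranking so much
# faster than wrapper-style search. chi2 requires non-negative features (e.g., counts).
model = make_pipeline(SelectKBest(chi2, k=200), MultinomialNB())
# model.fit(X_train, y_train); print(model.score(X_test, y_test))
```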