INF 5300 Exercises for Lecture 2
Asbjørn Berge

I. PRELIMINARIES

Familiarize yourself with classification and feature selection using PRTools (www.ifi.uio.no/~inf3300/grupper/1/PRTools3.2.pdf). Make sure you understand the concepts of, and are able to classify data using, Gaussian ML classification (see help ldc and help qdc). Make sure you know what a confusion matrix is and how to make a scatterplot. Make or choose a simple example and plot a classification boundary on a scatterplot; compare the results of different classification rules. (A minimal sketch of such an example is given after Exercise 2 below.) Look at the help files of the different feature selection commands: feateval, featrank, featselb, featself, featseli, featselm, featselo, and featselp.

II. FEATURE SELECTION

A problem in feature selection is that the two best individual features are in general not the best pair of features. To find the best two features, we can use brute force and try all possible combinations, but this is very expensive. When we want to select k features from a total set of d, the number of possibilities is

$\binom{d}{k} = \frac{d!}{(d-k)!\,k!}$

For large d and moderate k this becomes huge. For instance, if we want to select the 10 most informative pixels from a 32 × 32 image, we have to evaluate about $3.3 \times 10^{23}$ possibilities. Exhaustive evaluation is therefore often impossible. Feature selection therefore needs two elements:
• a criterion to evaluate the informativeness of a set of features,
• an efficient search routine that finds the most promising feature sets.

For a tractable search algorithm, we have to rely on heuristic approaches (we just saw that exhaustive search is not feasible). When we want to select just a few features from a large set, we can use forward selection: start with the best individual feature and keep adding the feature that gives the largest improvement until the final number k of features is reached. When we expect to keep most of the features, we can use backward selection: start with the complete set of features and remove the worst feature one at a time. Although these strategies give a very large speedup, their solutions are suboptimal. Remember: the two best features are in general not the best two features. Therefore a hybrid form of feature selection is often used, for example floating feature selection.

Exercise 1

First we create an artificial dataset for which we know there are just a few informative features. Find out what the characteristics of gendatd are and create a dataset containing 40 objects. Next, rotate this dataset 45° clockwise around the origin by multiplying it with the rotation matrix

$R = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$

Make a new dataset b from the old dataset by multiplying it with this rotation matrix (note the sizes of the matrices, and be careful about the order in which you multiply them!). Check your result by making scatterplots of the two datasets. Finally, add 4 extra non-informative features to dataset b. Use gendats to make two classes which lie exactly on top of each other (see its help file). Check the size of the new dataset (it should be 40 × 6 now!) and make scatterplots of features (1, 2), (1, 3) and (4, 5).

Exercise 2

Given the artificial dataset from Exercise 1, would you prefer forward or backward feature selection? Any ideas on the choice of criterion J? Try forward and backward selection on this dataset using different criteria. To find out which features have been selected, extract the feature indices from the mapping w by typing +w. Do you find the correct features in both cases? What results do you get with individual feature selection?
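As a warm-up for the preliminaries in Section I, here is a minimal sketch of such a classification example. It is one possible setup, not the only one: gendatb is simply a convenient two-class demo generator, and plotc is the PRTools 4 name for plotting a decision boundary (in PRTools 3.x the corresponding command is plotd).

    a = gendatb([50 50]);        % two-class banana-shaped demo dataset
    w1 = ldc(a);                 % Gaussian ML classifier, pooled covariance
    w2 = qdc(a);                 % Gaussian ML classifier, per-class covariances
    figure; scatterd(a);         % scatterplot of the data
    plotc(w1); plotc(w2);        % decision boundaries of both rules
    lab = a*w1*labeld;           % labels assigned by the linear rule
    confmat(getlab(a), lab)      % confusion matrix: true vs. assigned labels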
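A sketch of one way to carry out Exercises 1 and 2 follows. The criterion name 'maha-s' is just one of the options listed by help feateval; substitute others and compare.

    % Exercise 1: build the 6-feature artificial dataset
    a = gendatd([20 20]);             % 40 objects, 2 features, 2 classes
    R = [1 1; 1 -1];
    b = dataset(+a*R, getlab(a));     % data matrix is 40x2, so R must be
                                      % multiplied on the right
    c = gendats([20 20], 4, 0);       % two 4D classes with distance 0,
                                      % i.e. exactly on top of each other
    b = dataset([+b +c], getlab(a));  % append the 4 non-informative features
    size(b)                           % should be 40 x 6
    figure; scatterd(b(:,[1 2]));
    figure; scatterd(b(:,[1 3]));
    figure; scatterd(b(:,[4 5]));

    % Exercise 2: forward, backward and individual selection
    wf = featself(b, 'maha-s', 2);    % forward selection of 2 features
    wb = featselb(b, 'maha-s', 2);    % backward selection down to 2 features
    wi = featseli(b, 'maha-s', 2);    % individual (ranking-based) selection
    +wf, +wb, +wi                     % indices of the selected features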
Exercise 3

Load a well-known PR dataset, the "forensic glass" dataset, using the loader command glass. (You need to add the dataset path with addpath ~inf5300/data/.) Study the different features by plotting them pairwise, then choose a criterion and a feature selection method. Compare different search strategies and report your results, possibly for different classification strategies if needed.

Extra exercise

It is possible to make feateval use other measures of distance (for example Bhattacharyya or divergence) with a bit of creative Matlab programming. (Hint: you can modify the feateval routine or wrap your distance measure into a "classifier" mapping.)
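A possible starting point for Exercise 3 is sketched below. The glass loader and data directory are taken from the exercise text; featselo performs a branch-and-bound search, which is more expensive than the greedy searches but optimal for monotonic criteria.

    addpath ~inf5300/data/            % dataset path given in the text
    a = glass;                        % load the forensic glass dataset
    figure; scatterd(a(:,[1 2]));     % inspect feature pairs one at a time
    % compare search strategies with the same criterion and subset size
    wf = featself(a, 'maha-s', 3);    % forward search
    wb = featselb(a, 'maha-s', 3);    % backward search
    wo = featselo(a, 'maha-s', 3);    % branch-and-bound search
    +wf, +wb, +wo                     % do the strategies agree?
    % judge a subset by a classifier trained on the reduced data
    [tr, te] = gendat(a*wf, 0.5);     % random 50/50 train/test split
    testc(te * ldc(tr))               % test error on the selected features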
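For the extra exercise, one option is to write the criterion as a small function of a dataset and call it from your own search loop (or splice it into a modified copy of feateval). The sketch below, bhatcrit, is a hypothetical helper, not a PRTools routine; it assumes two Gaussian-distributed classes and computes the Bhattacharyya distance between them.

    function J = bhatcrit(a)
    % BHATCRIT  Bhattacharyya distance between the two classes of dataset a,
    % assuming Gaussian class-conditional densities. Hand-rolled criterion,
    % not part of PRTools.
    nlab = getnlab(a);                % numeric class labels (1 and 2)
    X1 = +a(find(nlab == 1), :);      % data matrix of class 1
    X2 = +a(find(nlab == 2), :);      % data matrix of class 2
    m  = mean(X1)' - mean(X2)';       % difference of the class means
    C1 = cov(X1); C2 = cov(X2); C = (C1 + C2)/2;
    J  = m'*(C\m)/8 + 0.5*log(det(C)/sqrt(det(C1)*det(C2)));

Calling bhatcrit(b(:,[1 2])) then scores the feature pair (1, 2); a forward search is simply a loop that, at each step, adds the feature whose inclusion gives the largest value of J.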