INF 5300 Exercises for Lecture 1
Asbjørn Berge
I. PRELIMINARIES
Familiarize yourself with classification and feature selection using PRTools (www.ifi.uio.no/~inf3300/PRTools4.0.pdf). Make
sure you understand the concepts of Gaussian ML classification and are able to classify data with it (see help ldc and
help qdc). Make sure you know what a confusion matrix is and how to plot a scatterplot. Make or choose a simple example
and plot a classification boundary on a scatterplot. Compare the results using different classification rules. Look at the help files
of the different feature selection commands feateval, featrank, featselb, featself, featseli, featselm,
featselo, and featselp.
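A minimal warm-up along these lines (a sketch; the banana-shaped dataset and the 50/50 split are just example choices):

    a = gendatb([50 50]);           % example two-class dataset
    [trn, tst] = gendat(a, 0.5);    % random 50/50 train/test split
    w1 = ldc(trn);  w2 = qdc(trn);  % linear and quadratic Gaussian ML
    figure; scatterd(a);            % scatterplot of the data
    plotc(w1); plotc(w2, 'r');      % overlay both decision boundaries
    confmat(tst*w1)                 % confusion matrix for the ldc result
    [tst*w1*testc  tst*w2*testc]    % compare test error estimates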
II. FEATURE SELECTION
A problem in feature selection is that sometimes the two best features are not the best two features. To find the best two
features, we can use brute force and try all possible combinations.
This is very expensive. When we want to extract k features from a total set of d, the number of possibilities becomes

    (d choose k) = d! / ((d − k)! k!)

For large d and moderate k this becomes huge. For instance, if we want to extract the 10 most informative pixels from a
32 × 32 image, we have to evaluate 3.3 · 10^23 possibilities. Exhaustive evaluation is therefore often impossible.
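As a quick sanity check of that number in plain Matlab:

    nchoosek(1024, 10)   % the 1024 pixels of a 32 x 32 image; approx. 3.3e23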
Feature selection therefore requires two elements:
• a criterion to evaluate the informativeness of a set of features,
• an efficient search routine which finds the most promising feature sets.
For a tractable search algorithm we have to rely on heuristic approaches, since, as we just saw, exhaustive search is not
feasible. When we want to extract just a few features from a large set of features, we can use the forward selection method.
This starts with the best individual feature and keeps adding the next best feature until the final number k of features is
reached.
When we expect to use most of the features, we can use backward selection. Here we start with the complete set of features
and remove the worst features one by one. Although these methods can give a very large speedup, their solutions are
suboptimal. Remember: the two best features are in general not the best two features. Therefore a hybrid form of feature
selection is often used, for example floating feature selection. A minimal sketch of the two basic strategies follows.
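In PRTools the two strategies look like this (a sketch; the dataset, the 'NN' criterion and the subset size are example choices):

    a = gendatd([30 30], 6);     % example: 6 features, only a few informative
    wf = featself(a, 'NN', 2);   % forward: grow from the best single feature
    wb = featselb(a, 'NN', 2);   % backward: shrink from the full feature set
    +wf, +wb                     % indices of the selected features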
Exercise 1
First we will create an artificial dataset for which we know there are just a few informative features. Find out what the
characteristics of gendatd are and create a dataset containing 40 objects. Next, we rotate this dataset clockwise around the
origin by 45°. We do this by multiplying the dataset by the rotation matrix:

    R = [ 1   1
          1  -1 ]

Make a new dataset b from the old dataset by multiplying it with this rotation matrix (note the sizes of the matrices, and
be careful about the order in which you multiply them!). Check your results by making a scatterplot of the two datasets.
Finally, add 4 extra non-informative features to dataset b. Use the procedure gendats to make two classes which lie exactly
on top of each other (see the help files). Check the size of the new dataset (it should be 40 × 6 now!) and make scatterplots of
features (1, 2), (1, 3) and (4, 5).
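One way to build this dataset (a sketch; the class sizes and the dataset/getlabels rebuild are one possible approach):

    a = gendatd([20 20]);               % 40 objects, 2 features, 2 classes
    R = [1 1; 1 -1];                    % the matrix from the text
    b = dataset(+a*R, getlabels(a));    % data is 40x2, so R multiplies on the right
    figure; scatterd(a); figure; scatterd(b);
    c = gendats([20 20], 4, 0);         % 4 features, class distance 0
    b = [b c];                          % append the non-informative features
    size(b)                             % should be 40 x 6
    figure; scatterd(b(:, [1 2]));
    figure; scatterd(b(:, [1 3]));
    figure; scatterd(b(:, [4 5]));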
Exercise 2
Given the artificial dataset from Exercise 1, would you prefer forward or backward feature selection? Any ideas on the
choice of criterion J to be used? Try forward and backward selection on this dataset using different criteria.
To find out which features have been selected, extract the feature indices from the mapping w with +w. Do you find the
correct features in both cases? What results do you get with individual feature selection?
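One possible way to run the comparison, with b the 40 × 6 dataset from Exercise 1 (the particular criteria are example
choices from feateval):

    for crit = {'maha-s', 'eucl-m', 'NN'}      % example criteria
        wf = featself(b, crit{1}, 2);          % forward
        wb = featselb(b, crit{1}, 2);          % backward
        wi = featseli(b, crit{1}, 2);          % individual ranking
        fprintf('%s: fwd %s  bwd %s  ind %s\n', crit{1}, ...
                mat2str(+wf), mat2str(+wb), mat2str(+wi));
    end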
Exercise 3
Load a well-known PR dataset, the "forensic glass" dataset, using the loader command glass. (You need to add
the dataset path with addpath ~inf5300/data/.) Study the different features by plotting them pairwise, and choose a criterion
and a feature selection method. Compare different search strategies and report your results, possibly for different classification
strategies if needed.
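A possible starting point (a sketch; the criterion, subset size, classifier and the crossval call are assumed example choices):

    addpath ~inf5300/data/
    a = glass;                      % the forensic glass dataset
    figure; scatterd(a(:, [1 2]));  % inspect feature pairs like this
    w = featself(a, 'maha-s', 3);   % e.g. forward selection of 3 features
    +w                              % which features were chosen?
    crossval(a*w, ldc, 5)           % rough 5-fold error estimate with ldc

Note that selecting features on the full dataset and cross-validating afterwards is optimistic; redoing the selection inside
each fold gives a fairer comparison.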
Extra exercise
It is possible to make feateval use other measures of distance (for example Bhattacharyya distance or divergence)
with a bit of creative Matlab programming. (Hint: you have two options; modify the feateval routine, or wrap your distance
measure into a "classifier" mapping.)
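For reference, the Gaussian Bhattacharyya distance itself is easy to compute directly; wrapping it into a mapping is the
actual exercise (a sketch, with b any two-class dataset):

    x1 = +seldat(b, 1);  x2 = +seldat(b, 2);   % data matrix per class
    m1 = mean(x1)';  m2 = mean(x2)';           % class means as column vectors
    S1 = cov(x1);  S2 = cov(x2);  S = (S1 + S2)/2;
    % Bhattacharyya distance between two Gaussian class models:
    B = (m1-m2)' * (S \ (m1-m2)) / 8 ...
        + 0.5*log(det(S) / sqrt(det(S1)*det(S2)))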