ASSIGNMENT 5
Datoriserad Mönsterigenkänning (Computerized Pattern Recognition) 2007
Feature Selection and Real-World Applications
© Signaler och System, Institutionen för Teknikvetenskaper, Uppsala Universitet

Objectives and preparations

The goal of this assignment is to give you deeper knowledge of how pattern recognition methods can be used to solve practical problems with the help of PRTools. After completing the assignment you should be able to build a complete computerized pattern recognition system for a real-world application. The preparations are the same as for assignment 1, i.e. download PRTools and create your own directory myfiles; see assignment 1. You also need to download the material in the file kodOchDataInlupp5.zip.

Reporting

See assignment 1.

1 IRIS DATA AND SUBSET FEATURE SELECTION

(a) Load the 4-dimensional three-class Iris dataset with a = iris. To avoid problems with different versions of PRTools, recreate the dataset with the command a = dataset(a.data, a.nlab). Make scatter plots of all feature combinations using scatterd with the option 'gridded', which produces subplots with all combinations of the four features.

(b) Plot the one-dimensional feature densities in a separate figure by means of plotf. Identify visually the best combination of two features. Create a new dataset b using seldat that contains just these two features. Open a new figure with the figure command and make a scatter plot of b. Compare the first five rows of a and b to confirm that b actually is the desired subset.

(c) Using the 4-dimensional Iris dataset a = iris again, design the three classifiers qdc, knnc and treec using 20 examples per class for design and test each of them with the remaining examples. Calculate Bayesian posterior distributions for the unknown probabilities of misclassification.

(d) Retrain the classifiers with the two features obtained by automatic selection using the function featself with the criterion crit = 'maha-s'. Hint: see help featself and help feateval for more information. Use the test set to obtain performance estimates. Calculate Bayesian posterior distributions of the unknown error probabilities and compare with (c). Do the distributions overlap, or is there an obvious improvement? A possible starting point for (c) and (d) is sketched below.
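The following is a minimal sketch of how (c) and (d) could be set up in PRTools. The per-class split sizes follow the task text, but the use of gendat, testc and the direct Beta-posterior computation are one possible choice, not the required solution.

a = iris; a = dataset(a.data, a.nlab);     % reload the Iris data
[trn, tst] = gendat(a, [20 20 20]);        % 20 design examples per class, rest for testing

w_qdc  = qdc(trn);                         % quadratic Bayes classifier
w_knn  = knnc(trn);                        % k-NN classifier (k optimized by leave-one-out)
w_tree = treec(trn);                       % decision tree classifier
e_qdc  = tst*w_qdc*testc;                  % estimated error rates on the test set
e_knn  = tst*w_knn*testc;
e_tree = tst*w_tree*testc;

% Bayesian posterior of the unknown error probability q, assuming a uniform prior:
% P(q | k errors out of N tests) is a Beta(k+1, N-k+1) density.
N = size(tst,1);
k = round(e_qdc*N);                        % number of misclassified test examples (here: qdc)
q = 0:0.001:1;
post = q.^k .* (1-q).^(N-k) / beta(k+1, N-k+1);
figure; plot(q, post); xlabel('q'); ylabel('P(q | k,N)');

% (d) select two features on the design set and retrain on the same split
w_fs   = featself(trn, 'maha-s', 2);       % forward selection, Mahalanobis criterion
trn2   = trn*w_fs;  tst2 = tst*w_fs;
e_qdc2 = tst2*qdc(trn2)*testc;             % repeat for knnc and treec, and redo the posterior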
2 CLASSIFICATION OF RINGS, NUTS AND BOLTS

In this task a classification system will be developed for classifying objects that belong to four different categories: ring, nut-6 (6-sided), nut-4 (4-sided), and bolt. A vision system is available that acquires images of these objects. Using digital image processing techniques (not part of this project) the images are segmented, after which each imaged object is represented by a connected component in the resulting binary (logical) image. Figure 1 shows already segmented images containing rings, nuts-6, nuts-4 and bolts. These images are available for training and evaluation, thus providing a labeled dataset of 121 objects per class.

The classification will be based on so-called normalized Fourier descriptors, which describe the shapes of the contours of the objects. The basic idea is to consider the contour as a periodic curve in 2D, which can be represented by a Fourier series. The Fourier descriptors are the coefficients of this series and can be made invariant to rotation and scale. The software provided with the project can produce many descriptors per object. The goal of your design is to find a classifier with minimal error rate.

The software provided with this project can calculate up to 64 descriptors, denoted Zk, where k ranges from −31 up to +32. The descriptors are normalized such that they are independent of the orientation and the size of the object. However, Z0 and Z1 should not be used, because Z0 does not depend on the shape (but rather on the position) and Z1 is always one (it is used for the normalization). The supplied Matlab function ut_contourfft offers the possibility to calculate only a selection of the available descriptors.

Each image in Figure 1 shows the segments of 121 objects. Thus, extraction of the boundary of each segment and subsequent determination of the normalized Fourier descriptors yields a training set of 4 × 121 = 484 labeled vectors, each vector having 62 elements. An image can be transformed into a set of measurement vectors with the following fragment of code:

fdlist = [-31:-1 2:32];                              % exclude Z0 and Z1
imrings = imread('rings.tif');                       % open and read the image file
figure; imshow(imrings); title('rings');
[BND,L,Nring,A] = bwboundaries(imrings,8,'noholes'); % extract the boundaries
FDS = ut_contourfft(BND,'fdlist',fdlist,'nmag');     % calculate the FDs
Zrings = zeros(Nring,length(fdlist));                % allocate space
for n = 1:Nring
    Zrings(n,:) = FDS{n}';                           % collect the vectors
end

Note that the function bwboundaries is part of the Image Processing Toolbox in Matlab, so you need a computer with access to this toolbox to be able to import your data. Similar pieces of code are needed to get the measurement vectors from the other classes. The file names of the four images are rings.tif, nuts6.tif, nuts4.tif and bolts.tif. The function ut_contourfft accompanies the images.

a) Write the code needed to import the data into Matlab and then into PRTools format. Hint: use repmat to create the array of labels. For instance, repmat('ring',[Nring 1]) creates an array of Nring entries containing the string 'ring'. Two datasets A and B may be merged into one new dataset using C = [A; B];

b) Use 5-fold cross-validation to estimate the average performance of the kNN classifier for k = 3 when the classifier uses the ten Fourier descriptors Z2, Z3, ..., Z11 (see the sketch at the end of this document for a possible starting point).

c) Keep 21 examples per class as a holdout test set and use the rest for classifier design. Use the result on the test set to calculate and plot the posterior P(q|k,N) for the unknown error probability q. Based on the plot, determine a Bayesian confidence interval where there is a visible difference between P(q|k,N) and the coordinate axis.

d) Use the function confmat to compute a confusion matrix for the reduced data, which gives an impression of which classes are likely to be confused.

(e*) Implement the greedy pairs feature selection algorithm used together with the kNN classifier for k = 3 and apply it to the Iris dataset. NOTE: You do not have to write a function that is compatible with the data structures in PRTools! Simply write your algorithm in Matlab, where in each iteration classification is performed using knnc with the pair of features selected. Determine the optimal number of pairs of features according to a five-fold cross-validation.

Good Luck!
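As a starting point for parts b), c) and d), here is a minimal sketch in PRTools. It assumes that the complete labeled dataset built in part a) is stored in a variable z (the name is our own choice), that its feature columns follow the order of fdlist = [-31:-1 2:32] so that Z2,...,Z11 sit in columns 32 to 41, and that your PRTools version provides crossval and confmat with the calling conventions used below; check help crossval and help confmat if not.

% b) 5-fold cross-validation of the 3-NN classifier on Z2,...,Z11
zred = z(:, 32:41);                          % columns 32:41 hold Z2..Z11 if the fdlist order is kept
e_cv = crossval(zred, knnc([],3), 5);        % average error over the 5 folds

% c) holdout: 21 test examples per class, the rest for classifier design
[tstset, trnset] = gendat(zred, [21 21 21 21]);
w = knnc(trnset, 3);                         % 3-NN trained on the design set
N = size(tstset,1);                          % 4*21 = 84 test examples
k = round(tstset*w*testc * N);               % number of misclassified test examples

% posterior P(q | k, N) of the unknown error probability q (uniform prior)
q = 0:0.001:1;
post = q.^k .* (1-q).^(N-k) / beta(k+1, N-k+1);
figure; plot(q, post); xlabel('q'); ylabel('P(q | k,N)');

% d) confusion matrix on the holdout test set
confmat(tstset*w);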