HAND-IN ASSIGNMENT 3 – DMI2005

The goal of this assignment is to learn how to build and evaluate classifiers in Matlab using PRTools4, a toolbox written in Matlab for statistical pattern recognition from Delft University of Technology in the Netherlands. To get started you must (if you have not already done so) first download and unpack the archive kodOchManualInlupp2.zip, which is available on the course homepage. After extracting the archive you should have a directory structure that contains the following:

- A pdf file called PRTools4.0.pdf, which is the manual of the toolbox.
- A subdirectory called prtools4.0download050905, which contains the toolbox.
- A subdirectory called glue, which contains files missing from version 4.0 of the toolbox that are needed for this assignment.

Before you continue: to do this assignment well you should first have learned how PRTools is organized and be familiar with the structures dataset and mapping and the operations allowed on them.

This assignment consists of two parts. In the first part the goal is to learn how to design and evaluate classifiers in PRTools based on simulated examples. The files needed to solve these tasks can be downloaded from the course homepage. In the second part the goal is to learn how PRTools can be used on real-world problems. The data and files needed to attack these real-world problems can also be downloaded from the course homepage.

PRTools - EXERCISE II

In Exercise I, you learned about how the toolbox is organized around a particular data structure called dataset and about basic operations called mappings that one may perform on dataset objects. Before starting this time, remember again to add the following lines at the beginning of your code:

clear all
close all
%Add paths to the list of matlab paths
path=pwd
disp(['addpath ''' path ''''])
eval(['addpath ''' path ''''])
cd .. %jump up from your personal catalog called myfiles
cd prtools4.0download050905
path=pwd
disp(['addpath ''' path ''''])
eval(['addpath ''' path ''''])
cd .. %jump up again
cd datasets
path=pwd
disp(['addpath ''' path ''''])
eval(['addpath ''' path ''''])
cd ..
cd glue %Extra files from PRTools3.2.5 not available in PRTools4.0
path=pwd
disp(['addpath ''' path ''''])
eval(['addpath ''' path ''''])
cd ..
cd myfiles % now back in the catalog myfiles

This clears all variables and closes all windows, and the catalogs needed (your own catalog myfiles, the toolbox, the datasets and the glue files) are added to the search path used by Matlab to look for functions to run and data to load. Finally, simply add the line

GridResolution=gridsize(100)

which increases the resolution in the contour plots below from the default grid resolution gridsize(30).

Task 1 – Classifier Design and Testing Using Simulated Data

A special case of a mapping is a classifier. It maps (transforms) a dataset into score values or class posterior probability estimates. Classifiers can be used in an untrained as well as in a trained mode. When applied to a dataset in the untrained mode, the dataset is used for training and a classifier is generated; when applied in the trained mode, the dataset is classified. Unlike mappings, fixed classifiers do not exist.
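As a minimal sketch of these two modes of use (gendath, qdc, labeld, map and testc are the PRTools functions used in the tasks below; the variable names are chosen here only for illustration):

D = gendath([20 20]);        % simulated two-class dataset, 20 examples per class
u = qdc;                     % untrained quadratic classifier (no data involved yet)
w = D*u;                     % untrained mode: D is used for training, w becomes a trained
                             % classifier (equivalent to w = qdc(D))
labels = labeld(map(D,w));   % trained mode: the dataset D is classified by w
e = testc(map(D,w))          % apparent (training) error of the trained classifier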
Some important classifiers are:

fisherc   Fisher classifier
qdc       Quadratic classifier assuming normal densities
udc       Quadratic classifier assuming normal uncorrelated densities
ldc       Linear classifier assuming normal densities with equal covariance matrices
nmc       Nearest mean classifier
parzenc   Parzen density based classifier
knnc      k-nearest neighbor classifier
treec     Decision tree
svc       Support vector classifier
lmnc      Neural network classifier trained by the Levenberg-Marquardt rule

In this task, simulated data sets will be generated to design and test different classification algorithms.

3.1 CLASSIFIER DESIGN, DECISION BOUNDARIES, APPARENT ERRORS
Let us begin with a simple example which illustrates classifier design and how to plot decision boundaries in 2D scatterplots with plotc. Generate a dataset, make a scatterplot, and train and plot some classifiers by

D = gendath([20 20]);
scatterd(D)
w1 = fisherc(D);
w2 = knnc(D,3);
w3 = qdc(D);
plotc({w1,w2,w3})

Plot, in a new scatterplot of the dataset D, a series of classifiers computed by the k-NN rule (knnc) for various values of k between 1 and 10. Look at the influence of the neighborhood size on the classification boundary.

Now we continue with another example. Generate a dataset A by gendath with 60 examples from each class. Use gendat to split A into one set with 30 examples per class called TrainSet and another set with the remaining examples called TestSet. Write help gendath to read about how the dataset is generated from two normal distributions. Compute the Fisher classifier by wFisher = fisherc(TrainSet). Make a scatterplot of TrainSet and plot the classifier decision boundary by plotc(wFisher). Apply the Fisher projection to the training set by mappedTrainSetFisher = map(TrainSet,wFisher). Compute class assignments by classAssignmentsTrainSetFisher = labeld(mappedTrainSetFisher)'. Extract the true assignments by trueAssignmentsTrainSet = TrainSet.nlab'. Calculate the number of apparent (training) errors by

noOfApparentErrorsFisher = sum(classAssignmentsTrainSetFisher ~= trueAssignmentsTrainSet)

Also use testc to compute the apparent error:

[e_apparentFisher,noOfErrorsPerClassTrainSetFisher] = testc(mappedTrainSetFisher)

Repeat for the support vector machine classifier svc, the kNN classifier knnc (k=3), and the model-based quadratic classifier qdc. After designing all the classifiers, make a scatterplot of the training set again and plot all the decision boundaries in the same figure using the commands

plotc(wFisher,'k')
plotc(wkNN,'m:')
plotc(wsvc,'b--')
plotc(wqc,'g-.')

Try to sort out which boundary is associated with which classifier. Insert small text boxes in the plot, close to each boundary, indicating which classifier it belongs to. Save the resulting graph for your report. Note that some of the classifiers have two or even more decision boundaries, whereas Fisher and the support vector machine have only one each.

3.2 CLASSIFIER DESIGN AND TEST ERRORS
Repeat 3.1 but test the classifiers using the test examples instead. Thus the apparent errors will be replaced by test errors, and the decision boundaries will be displayed together with the test examples.

3.3 TRUE ERRORS – TESTING USING MANY EXAMPLES
Generate 10000 test examples per class by

TestBig = gendath([10000 10000]);

and determine accurate estimates of the performance of, e.g., the Fisher classifier by

mappedTestBigFisher = map(TestBig,wFisher);
[e_trueFisher,noOfErrorsPerClassTrueFisher] = testc(mappedTestBigFisher)

Repeat this for all the classifiers designed.
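Instead of repeating these lines once per classifier, the trained classifiers can be collected in a cell array and tested in a loop. A minimal sketch, assuming wFisher, wkNN, wsvc and wqc are the classifiers trained earlier in this section:

TestBig = gendath([10000 10000]);                 % large independent test set
trainedClassifiers = {wFisher, wkNN, wsvc, wqc};  % classifiers trained in 3.1
classifierNames    = {'Fisher','kNN (k=3)','SVM','Quadratic'};
for i = 1:length(trainedClassifiers)
    e_true = testc(map(TestBig,trainedClassifiers{i}));   % estimated true error
    fprintf('%-12s estimated true error: %.4f\n', classifierNames{i}, e_true);
end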
3.4 RECEIVER OPERATOR CHARACTERISTICS
Write help roc to understand more about how to create receiver operating characteristic (ROC) curves, and then write code which makes it possible to calculate ROC curves for the classifiers designed. For example, to determine the ROC for the Fisher classifier one has to write something like

EFisher = roc(TestSet,wFisher,desiredClass,noOfPointsInROC);

This results in a structure EFisher that can be used to generate a ROC curve simply by

plotr(EFisher)

Try this and you will find that the ROC curves do not display the probability of detection as a function of the probability of false alarm. To obtain such a graph, one has to add the following lines:

EFisher.error = 1-EFisher.error;
EFisher.ylabel = 'P_{D}';
EFisher.xlabel = 'P_{FA}';

Now plotr(EFisher) will result in a familiar ROC curve with the expected quantities written out on the axes. Compute ROC curves for all the classifiers. Since the plotr command can take several ROC curves, it is possible to present them in the same graph using a command like

plotr({EFisher,EkNN,Eqc});

where EkNN and Eqc are the ROCs for kNN and the quadratic classifier, respectively. As you should note, the resulting ROC curves are not smooth at all and they are unstable: repeat the data generation a few times and study how the curves change. Repeat the above procedure but use the test set TestBig instead. Now the ROC curves should become very smooth.

3.5 MULTI-CLASS PROBLEMS
In many cases we consider binary (two-class) classification, but often there are more than two classes. Try to understand and run the following lines of code, which generate data from four classes and present the examples in a scatterplot together with decision boundaries and colored decision regions, one for each class:

A = gendath([20,20]);
dataA = getdata(A);
B = gendath([20,20]);
dataB = getdata(B);
dataC = [dataA;dataB+5]; %Adding 5 to all features
labsC = genlab([20 20 20 20],[1 2 3 4]');
C = dataset(dataC,labsC);
C = setname(C,'A four class data set');
figure
scatterd(C); %make scatter plot for right size
drawnow;
w = qdc(C);
plotc(w,'col')
hold on
scatterd(C)
hold off

3.6 CROSS VALIDATION
Try to understand how to use the function crossval by running the following code, which performs 5-fold cross validation:

ClassifiersToTest = {qdc,knnc([],3)}
noOfSplits = 5;
ProgressReportFlag = 1;
C = setprior(C,0); %All classes equally likely
%Priors needed to avoid warnings from PRTools
noOfReps = 1; %Note: Not possible to calculate std in this case
[WeightedAverageTestError,TestErrorsPerClass,AssignedNumericLabels] = crossval(C,ClassifiersToTest,noOfSplits,noOfReps,ProgressReportFlag);
TestErrorsPerClass_qdc = TestErrorsPerClass{1}
TestErrorsPerClass_kNN = TestErrorsPerClass{2}

See help crossval and try running crossval repeatedly, 30 times, to determine the standard deviation of the 5-fold cross-validation estimates.

3.7 LEARNING CURVES
A learning curve displays the estimated performance as a function of the number of training examples. Run the following code to determine learning curves for the Fisher, kNN (k=3) and quadratic classifiers:

A = gendath([200 200]); %200 examples per class, below up to 100 of them are used for training
noOfRep = 10;
ProgressFlag = 1;
trainingSizesPerClass = [4 6 8 10 12 14 16 18 20 30 40 50 60 70 80 85 90 95 100];
figure
eClassifiers = cleval(A,{fisherc, knnc([],3), qdc},trainingSizesPerClass,noOfRep,[],ProgressFlag);
plotr(eClassifiers,'errorbar')

Repeat the above calculations for the more demanding situation where only 20 examples per class are available. Determine and plot the learning curve for the training sizes [4 6 8 10 12 14 16 18]; a sketch of this experiment is given below. Why does the variance increase dramatically when almost all examples are used for training?
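A minimal sketch of this small-sample experiment, mirroring the cleval call above and reusing noOfRep and ProgressFlag from it (Asmall, trainingSizesSmall and eSmall are names chosen here for illustration):

Asmall = gendath([20 20]);                    % only 20 examples per class available
trainingSizesSmall = [4 6 8 10 12 14 16 18];  % training sizes cannot exceed 20 per class
figure
eSmall = cleval(Asmall,{fisherc, knnc([],3), qdc},trainingSizesSmall,noOfRep,[],ProgressFlag);
plotr(eSmall,'errorbar')

Note that with 18 of the 20 examples per class used for training, only 2 examples per class remain for testing, which is worth keeping in mind when answering the question above.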
Task 2 – Classifier Design and Testing Using Real World Data

In this task, you are going to design and test classifiers aimed at solving three real world classification problems.

2.1 CLASSIFICATION OF IRIS DATA AND SUBSET FEATURE SELECTION
Load the 4-dimensional Iris dataset by a = iris. Design three different types of classifiers using 20 examples per class for design, and test each of them with the remaining examples. Also determine an estimate of the performance after automatic selection of only two features using the function featself with the criterion variable set to crit='maha-s'; see help featself and help feateval for more information.
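A sketch of how this experiment could be set up is given below; ldc, qdc and knnc with k=3 are only example choices of classifiers, and the exact form of the featself call should be checked against help featself:

a = iris;                                   % 3 classes, 4 features, 50 examples per class
[TrainSet,TestSet] = gendat(a,[20 20 20]);  % 20 examples per class for design, rest for testing
classifiers = {ldc(TrainSet), qdc(TrainSet), knnc(TrainSet,3)};
for i = 1:length(classifiers)
    e_test = testc(map(TestSet,classifiers{i}))   % test error using all four features
end

crit = 'maha-s';
wSel = featself(TrainSet,crit,2);           % forward selection of two features
TrainSel = map(TrainSet,wSel);              % reduced training set
TestSel  = map(TestSet,wSel);               % reduced test set (same two features)
e_testSel = testc(map(TestSel,qdc(TrainSel)))   % performance after feature selection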
2.2 CLASSIFICATION OF RINGS, NUTS AND BOLTS
In this subtask a classification system will be developed for classifying objects that belong to four different categories: ring, nut-6 (6-sided), nut-4 (4-sided), and bolt. A vision system is available that acquires images of these objects. Using some digital image processing techniques (not part of this project) the images are segmented. After that, each imaged object is represented by a connected component in the resulting binary (logical) image. Figure 1 shows already segmented images containing rings, nuts-6, nuts-4 and bolts. These images are available for training and evaluation, thus providing us with a labeled dataset of 121 objects per class.

The classification will be based on so-called normalized Fourier descriptors, which describe the shapes of the contours of the objects. The basic idea is to consider the contour as a periodic curve in 2D which can be represented by a Fourier series. The Fourier descriptors are the coefficients of the Fourier series and may be used to create descriptors that are invariant to rotation and scale. The software provided with the project can produce many descriptors per object. The goal of your design is to find a classifier that strives for a minimal error rate.

The software provided within this project can calculate up to 64 descriptors, denoted by Zk, where k ranges from -31 up to +32. The descriptors are normalized such that they are independent of orientation and size. However, Z0 and Z1 should not be used, because Z0 does not depend on the shape (but rather on the position) and Z1 is always one (because it is used for the normalization). The given Matlab function ut_contourfft offers the possibility to calculate only a selection of the available descriptors.

Each image in Figure 1 shows the segments of 121 objects. Thus, extraction of the boundary of each segment and subsequent determination of the normalized Fourier descriptors yields a training set of 4×121 = 484 labelled vectors, each vector having 62 elements.

An image can be transformed into a set of measurement vectors with the following fragment of code:

fdlist = [-31:-1 2:32];                              % exclude Z0 and Z1
imrings = imread('rings.tif');                       % open and read the image file
figure; imshow(imrings); title('rings');
[BND,L,Nring,A] = bwboundaries(imrings,8,'noholes'); % extract the boundaries
FDS = ut_contourfft(BND,'fdlist',fdlist,'nmag');     % calculate the FDs
Zrings = zeros(Nring,length(fdlist));                % allocate space
for n=1:Nring
    Zrings(n,:) = FDS{n}';                           % collect the vectors
end

Note that the function bwboundaries is part of the Image Processing Toolbox in Matlab, so you need a computer that has access to this toolbox to be able to import your data. Similar pieces of code are needed to get the measurement vectors from the other classes. The filenames of the four images are: rings.tif, nuts6.tif, nuts4.tif and bolts.tif. The function ut_contourfft accompanies the images.

a) Write the code needed to import the data into Matlab and then into PRTools format. Hint: use repmat to create the array with labels. For instance, repmat('ring',[Nring 1]) creates an array of Nring entries containing the string 'ring'.
b) Use 5-fold cross validation to estimate the performance of the kNN classifier for k=3 when the classifier is using the ten Fourier descriptors Z2, Z3, ..., Z11, or some other subset of features.
c) Try to understand and use the function confmat to compute a confusion matrix, which gives an impression of which classes are likely to be confused. (A sketch combining a) to c) is given below.)
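The following sketch combines a) to c). It assumes that Zrings, Znuts6, Znuts4 and Zbolts (with the counts Nring, Nnut6, Nnut4 and Nbolt) have been produced by repeating the fragment above for each of the four images; the crossval call mirrors the one in Task 1, the label strings are padded to equal length so that they form a valid label matrix, and help confmat should be consulted for the exact form of the confusion matrix call:

Z = [Zrings; Znuts6; Znuts4; Zbolts];        % 484 x 62 matrix of Fourier descriptors
labs = [repmat('ring ',[Nring 1]);           % labels padded to equal length
        repmat('nut-6',[Nnut6 1]);
        repmat('nut-4',[Nnut4 1]);
        repmat('bolt ',[Nbolt 1])];
FDdata = dataset(Z,labs);
FDdata = setname(FDdata,'Rings, nuts and bolts');

cols = find(ismember(fdlist,2:11));          % columns holding Z2,...,Z11
FDsub = FDdata(:,cols);                      % dataset restricted to this feature subset
FDsub = setprior(FDsub,0);                   % as in Task 1, avoids prior warnings

noOfSplits = 5; noOfReps = 1; ProgressReportFlag = 1;
[AvgTestError,TestErrorsPerClass,AssignedNumericLabels] = crossval(FDsub,{knnc([],3)},noOfSplits,noOfReps,ProgressReportFlag);
TestErrorsPerClass_kNN = TestErrorsPerClass{1}

% Confusion matrix on a simple 50/50 train/test split, to see which classes get confused
[TrainFD,TestFD] = gendat(FDsub,0.5);
wknn = knnc(TrainFD,3);
confmat(getlabels(TestFD),labeld(map(TestFD,wknn)))

Whether Z2 to Z11 is the best subset is for you to investigate; the cols construction above makes it easy to try other subsets of fdlist.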
2.3 RECOGNITION OF VOICE COMMANDS
In this subtask a classification system will be developed for classification of your own voice commands. Using a headset with a microphone, you will be able to record and store training and test examples of your own voice commands in Matlab format. First you should use a headset available in our computer lab to create a database with recorded examples of the same words, and then use them for feature extraction, training, and testing, respectively.

One set of recorded and stored examples is used as prototypes in a first step to perform feature extraction of new inputs. As you know, the feature extraction step is often the most important in computerized pattern recognition. We use dynamic time warping (DTW) together with the prototype examples in the first database to create a six-dimensional informative feature vector for each new input pattern. DTW is a dynamic programming algorithm that calculates the distance between time profiles. (In DTW, an unknown time profile is aligned to prototype time profiles in an N x M grid, where N and M are the lengths of the time profiles. The distance is defined as the length of the shortest alignment path found.) DTW is not included in the PRTools toolbox but is available via the Matlab function featureextraction.

The second part of the database is used in a second step to train and test a classifier based on the features extracted from each example. In other words, each original sound pattern in the set of training and test examples is converted into a short feature vector by means of the feature extraction method discussed above.

Main Steps
Here follows a short summary of the different steps needed to complete this project.

Step 1: Preparations
Here we summarize the first important steps:
1. Download the zip-library and install it in one of your directories.
2. Get a headset to be used for the recording of sounds.
3. Write help featureextraction and figure out how to use the function.

Step 2: Sound Recording and Recall
See the help for the Matlab functions wavrecord and wavplay. Get a headset and implement a couple of lines of code that record and play back a 1.5 s soundbite. You need to use double as the datatype for wavrecord. Run the code and confirm that the program is able to record and play back the word you are expressing. If you are unable to get the sound to work properly from Matlab, or if you find any errors in the provided Matlab code, send an e-mail to canders@lcb.uu.se so that we can correct the problems as soon as possible.

Step 3: Recording and Storing Multiple Examples
An important task is to record and store multiple copies of each word of interest. Write an m-file called recordWords.m which can be used to record several examples (utterances) of a single word and store them as columns in a matrix called WordMatrix. Each column should correspond to a recorded time interval of 1.5 seconds. In other words, a call should look like

WordMatrix = recordWords(WORDS,noOfWordExamples)

where WORDS is a cell array containing the words written explicitly in letters and where noOfWordExamples defines the number of examples of each word to be recorded. You should also write a function playWords.m which makes it possible to listen to the words collected in a matrix.

Step 4: Create Three Word Databases
Create a database containing 7 examples each of the words RETURN, LEFT and RIGHT. The database should be in the form of a matrix called WordsForFeatureExtraction in which each of the 21 columns is an example. Also, the class membership of each word should be stored in a row vector called targetsForFeatureExtraction. Each element in the vector should be an integer from the set {1,2,3} corresponding to RETURN, LEFT, and RIGHT, respectively. Create another database containing 10 examples of each of the words. Use featureextraction to extract features from this database and use the output (6x30) to create a PRTools dataset with appropriately labeled classes.

a) Turn the training and test examples into PRTools format and design two classifiers of your choice using the training examples.
b) Use the pca mapping and scatterd in PRTools to visualize the samples in two and three dimensions. Does separation seem plausible?
c) Test the designed classifiers using your test examples and report the corresponding confusion matrix. (A sketch of how a) to c) could be combined is given at the end of this document.)

Good Luck!
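The exact call to featureextraction depends on the provided implementation (see help featureextraction), so the following sketch simply starts from a 6x30 feature matrix F assumed to have already been produced for the second database, with numeric labels 1, 2 and 3 for RETURN, LEFT and RIGHT as in Step 4; ldc and knnc are only example choices of classifiers:

% F: 6x30 matrix of DTW features for the second database, obtained from
%    featureextraction (see its help text for the exact call) -- assumed to exist here.
labs = genlab([10 10 10],[1 2 3]');      % 1=RETURN, 2=LEFT, 3=RIGHT, 10 examples each
Words = dataset(F',labs);                % one 6-dimensional feature vector per row
Words = setname(Words,'Voice commands');

% a) split into training and test examples and train two classifiers
[TrainW,TestW] = gendat(Words,0.7);      % roughly 70% per class for training, rest for testing
w1 = ldc(TrainW);                        % linear classifier (example choice)
w2 = knnc(TrainW,3);                     % 3-NN classifier (example choice)

% b) visualize the samples in two and three dimensions
wpca2 = pca(Words,2);  figure; scatterd(map(Words,wpca2));
wpca3 = pca(Words,3);  figure; scatterd(map(Words,wpca3),3);

% c) test errors and confusion matrices on the test examples
e1 = testc(map(TestW,w1))
e2 = testc(map(TestW,w2))
confmat(getlabels(TestW),labeld(map(TestW,w1)))
confmat(getlabels(TestW),labeld(map(TestW,w2)))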