ASSIGNMENT 2, DMI2005

The goal of this assignment is for you to get acquainted with PRTools4, a toolbox written in Matlab for statistical pattern recognition from Delft in the Netherlands. To get started, first download and unpack the archive kodOchManualInlupp2.zip, which is available on the course home page. After extracting the archive you should have a directory structure that contains the following:

A PDF file named PRTools4.0.pdf, which is the manual for the toolbox
A subdirectory named prtools4.0download050905, which contains the toolbox
A subdirectory named glue, which contains files that are missing from version 4.0 of the toolbox but are needed for the assignment

Finally, you must also download the directory datasetInlupp2 and save it in this directory. It contains images and other data used in the tasks you will carry out. If you run into memory problems when downloading this directory, it is also available in the directory W:/dmi05, which you can read directly from Matlab if you are working in the computer room at Magistern.

Note that the instructions below assume that you have arranged the files in a directory as described above, and that you have then created one more subdirectory named myfiles in which you place your own Matlab program files.

Hand in the assignment by sending in well-commented code and a short Word document with images and text showing that your code works.

PRTools - EXERCISE I

In this first exercise about PRTools, you will learn how the toolbox is organized around a particular data structure called dataset, and about basic operations called mappings that can be performed on dataset objects. Remember to add the following lines at the beginning of your code:

    clear all
    close all
    % Add paths to the list of Matlab paths
    p = pwd    % the variable is named p so that it does not shadow Matlab's path function
    disp(['addpath ''' p ''''])
    addpath(p)
    cd ..    % jump up from your personal directory called myfiles
    cd prtools4.0download050905
    p = pwd
    disp(['addpath ''' p ''''])
    addpath(p)
    cd ..
    % jump up again
    cd datasets
    p = pwd
    disp(['addpath ''' p ''''])
    addpath(p)
    cd ..
    cd glue    % extra files from PRTools3.2.5 not available in PRTools4.0
    p = pwd
    disp(['addpath ''' p ''''])
    addpath(p)
    cd ..
    cd myfiles    % now back in the directory myfiles

The first two lines clear all variables and close all windows. The remaining lines add the directory you start from, together with the toolbox, datasets and glue directories, to the search path that Matlab uses to find functions to run and data to load.

Task 1 – The data structure dataset

PRTools deals entirely with sets of objects represented by vectors in a feature space. The central data structure is called dataset. It consists of a matrix of size m x k: m row vectors representing the objects, each given by k features. Attached to this matrix is a set of m labels (strings or numbers), one for each object, and a set of k feature names (also strings or numbers), one for each feature. Moreover, a set of prior probabilities, one for each class, is stored. Objects with the same label belong to the same class. In most help files in PRTools, a dataset is denoted by A. Almost all routines can handle multiclass objects. Some useful routines for handling datasets are:

    dataset     Define dataset from data matrix and labels
    getdata     Retrieve data from dataset
    getlab      Retrieve object labels
    getfeat     Retrieve feature labels
    seldat      Select a subset of a dataset
    genlab      Generate dataset labels
    setdat      Define a new dataset from an old one by replacing its data
    renumlab    Convert labels to numbers

Sets of objects may be given externally or may be generated by one of the data generation routines in PRTools. Their labels may be given externally or may be the result of a classification or a cluster analysis.
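To make the description of a dataset above more concrete, the following plain-Matlab sketch collects the same pieces of information (data matrix, object labels, feature names and class priors) in an ordinary struct. It does not use PRTools at all; the struct s and its field names are hypothetical and chosen only for illustration, and the priors are simply set to the class frequencies.

```matlab
% Plain-Matlab sketch of the information held by a dataset (no PRTools calls).
% The struct s and its field names are hypothetical, for illustration only.
data = [0.1 0.2; 0.3 0.4; 0.5 0.6; 0.7 0.8; 0.9 1.0];  % m = 5 objects, k = 2 features
labels = [1; 1; 2; 2; 2];                               % one label per object (column vector)
featnames = {'area', 'perimeter'};                      % one name per feature

[m, k] = size(data);
classes = unique(labels);                               % distinct class labels, sorted
c = numel(classes);                                     % number of classes

% Count the objects in each class and turn the counts into class frequencies.
counts = zeros(1, c);
for i = 1:c
    counts(i) = sum(labels == classes(i));
end
prior = counts / m;

% Wrap the cell array in braces so that struct() stores it in a single field.
s = struct('data', data, 'labels', labels, 'featnames', {featnames}, 'prior', prior);
```

Objects with the same label end up in the same class, which is why counting labels is enough to obtain the class sizes.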
A dataset containing 10 objects with 5 random measurements can be generated by:

    >> data = [rand(4,5);randn(2,5);rand(4,5)+1]
    >> a = dataset(data)
    10 by 5 dataset with 1 classes: [10]

1.1 Write an m-file that runs these lines and then check the organization of a by means of

    >> struct(a)

In this example no labels are supplied, so only one class is detected, containing all 10 objects (examples).

1.2 Split the 10 examples into three classes by means of the following labeling:

    >> labs = [1 1 1 1 2 2 3 3 3 3]'; % labs should be a column vector
    >> a = dataset(a,labs)
    10 by 5 dataset with 3 classes: [4 2 4]

Note that the labels have to be supplied as a column vector.

1.3 A more convenient way to assign labels to a dataset is offered by the routine genlab in combination with the Matlab char command. Run the following lines to split the 10 examples into the 3 classes 'apple', 'pear', and 'banana':

    >> labs = genlab([4 2 4],char('apple','pear','banana'))
    >> a = dataset(a,labs)
    10 by 5 dataset with 3 classes: [4 4 2]

Note that the class names are sorted in alphabetical order (apple, banana, pear). Hence [4 4 2] denotes the number of objects in each class in that order.

1.4 Use the routine setfeatlab to give the five features in a the names 'feature1', 'feature2', 'feature3', 'feature4', and 'feature5', respectively. Then use the routines getlab and getfeat to retrieve the object class labels and the feature labels of a.

1.5 The fields of a dataset can be made visible by converting it to a structure, e.g.:

    >> struct(a)
    data: [10x5 double]
    lablist: [3x6 char]
    nlab: [10x1 double]
    labtype: 'crisp'
    targets: []
    featlab: [5x1 double]
    featdom: {[] [] [] [] []}
    prior: []
    objsize: 10
    featsize: 5
    ident: {10x1 cell}
    version: {[1x1 struct] '31-Jan-2003 10:02:55'}
    name: []
    user: []

Check the on-line information on datasets (help datasets, also printed in the PRTools manual), where the meaning of these fields is explained. Each field may be changed by a set-command, e.g.
one may replace the data:

    >> b = setdata(a,rand(10,5));

You should try this command and then try to retrieve a field value by a similar get-command, e.g.

    >> classnames = getlablist(a)

In nlab an index into the list of class names lablist is stored for each object. Note that this list is alphabetically ordered. The size of a dataset can be found by both size and getsize. Write:

    >> [m,k] = size(a);
    >> [m,k,c] = getsize(a);

Confirm that the number of objects is returned in m, the number of features in k and the number of classes in c. The class prior probabilities are stored in prior. If this field is empty, the class frequencies are used by default. The data in the dataset itself can also be retrieved by double(a) or, more simply, by +a. Try both.

1.6 Run the following commands once more:

    >> data = [rand(4,5);randn(2,5);rand(4,5)+1]
    >> a = dataset(data)
    >> labs = genlab([4 2 4],char('apple','pear','banana'))
    >> a = dataset(a,labs)

Have a look at the help information of seldat. Use the routine to extract the banana data from a and check this by inspecting the result with the + operator, as in +a. Note that classes and features need to be specified by their indices and that the class names are sorted in alphabetical order.

1.7 One way to inspect a dataset is to make a scatterplot of the objects in the dataset. For this the function scatterd is supplied. It plots each object of a dataset in a 2D graph, using a colored marker when class labels are supplied. When more than two features are present in the dataset, the first two are used. To obtain a scatterplot of two other features, they have to be explicitly extracted first, e.g. a1 = a(:,[2 5]); It is also possible to create 3D scatterplots. Use scatterd to make a scatterplot of features 2 and 5 of dataset a. Open a new figure using the figure command and also try scatterdui. Use its buttons to study what happens for different features, in particular when the same feature is used on both axes.
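The role of lablist and nlab, and the index-based selection of classes and features described above, can be sketched in plain Matlab as follows. This is only an illustration under stated assumptions: it uses the standard Matlab function unique rather than PRTools, the variable names mirror the dataset fields but are ordinary variables, and the data matrix is a made-up deterministic one so that the selected rows are easy to recognize.

```matlab
% Plain-Matlab sketch of class-name indexing and subset selection (no PRTools calls).
% One class name per object, in the order used in the text: 4 apples, 2 pears, 4 bananas.
labs = [repmat({'apple'}, 4, 1); repmat({'pear'}, 2, 1); repmat({'banana'}, 4, 1)];

% unique sorts the names alphabetically; nlab holds, for every object,
% the index of its class name in lablist.
[lablist, ~, nlab] = unique(labs);
nlab = nlab(:);                                   % force a column vector

% Made-up data: the entry in row i, column j equals 10*i + j.
data = repmat((1:10)', 1, 5) * 10 + repmat(1:5, 10, 1);

% 'banana' is the second name in the sorted list, so it has class index 2.
banana = data(nlab == 2, :);                      % rows of the banana objects only

% Extract features 2 and 5, as one would do before a two-feature scatterplot.
a1 = data(:, [2 5]);
```

Because the class list is sorted, the class index of 'banana' is 2 even though it was the third name given when the labels were generated.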
1.8 Open a new figure, make a 3-dimensional scatterplot by scatterd(a,3) and try to rotate it.

1.9 Load the 4-dimensional Iris dataset by a = iris and make scatterplots of all feature combinations using the gridded option of scatterd. In a separate figure, plot the one-dimensional feature densities by means of plotf. Visually identify the best combination of two features. Use seldat to create a new dataset b that contains just these two features. Create a new figure with the figure command and plot a scatterplot of b in it. Compare the first five rows of a and b to confirm that b actually is the desired subset.

1.10 Generate your own dataset that consists of two 2-D uniformly distributed classes of objects using the rand command. Transform the sets such that the following holds for the [xmin xmax; ymin ymax] intervals: [0 2; -1 1] for class 1 and [1 3; 1.5 3.5] for class 2. Generate 50 objects for each class. An easy way to do this is to generate the x and y coordinates separately and combine them afterwards. Label the features 'area' and 'perimeter'. Check the result by scatterd and by retrieving the object labels and feature labels.

1.11 Load a part of a face dataset by a = faces([1 3],[1:4]), which reads the first four pictures of subjects 1 and 3. They may be displayed by show(a). Include class labels by x = dataset(a,genlab([4 4])).

Task 2 – The data structure mapping

In PRTools datasets are transformed by mappings. These are procedures that map a set of objects from one space into another. Examples are feature selection, feature rescaling, rotations of the space, and classification. A mapping (we use the variable w for mappings almost everywhere) stores various information, such as the dimensionalities of the input and output spaces, the parameters that define the transformation, and the routine that is used for executing the transformation. Type struct(w) to see all fields, and print out the m-file mapping.m to get more information about the result of a mapping operation.
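As a rough illustration of the kind of information a mapping carries, the sketch below builds a hypothetical stand-in in plain Matlab: a struct holding the input and output dimensionalities, the parameters of the transformation and a function handle that executes it. This is not the PRTools mapping object, and the field names are invented for this example. The transformation chosen here is a simple feature selection that keeps the first two of three features.

```matlab
% Hypothetical plain-Matlab stand-in for a mapping (not the PRTools mapping object).
% All field names below are invented for illustration.
w.size_in  = 3;                      % dimensionality of the input space
w.size_out = 2;                      % dimensionality of the output space
w.params   = [1 0; 0 1; 0 0];        % 3-by-2 matrix that keeps features 1 and 2
w.routine  = @(x, p) x * p;          % routine that executes the transformation

x = [1 2 3; 4 5 6];                  % two objects with three features each
y = w.routine(x, w.params);          % the same two objects with two features each
```

The output y has one row per object and w.size_out columns, so objects stay as rows while the feature space changes.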
Often a mapping has to be trained, i.e. it has to be adapted to a training set by some estimation or training procedure that minimizes some error on the training set. An example is principal component analysis, which performs an orthogonal rotation according to the directions of largest variance in a given dataset.

2.1 Try the following code:

    B=randn(10,3);
    A=B*randn(3,100); % Generate 100 patterns of dimension 10 with 3 degrees of freedom
    a=dataset(A')     % Import data to data structure (100 objects, 10 features)
    w=pca(a,4)
    struct(w)

This just defines the mapping ('trains' it by a) for finding the first four principal components. More information on mappings can be found in the PRTools manual or by 'help mappings'.

2.2 The mapping w may be applied to a, or to any other 10-dimensional dataset, by:

    >> b1 = map(a,w)

Instead of the routine map, the '*' operator may also be used for applying mappings to datasets. Thus

    >> b2 = a*w

would yield the same result. One may extract the compressed data into a matrix B1 by B1=getdata(b1); Note that the sizes of the variables a (100 x 10) and w (10 x 4) are such that the inner dimensionalities cancel in the computation of b1, as in all Matlab matrix operations. The '*' operator may also be used for training: a*pca is equivalent to pca(a), and a*pca([],4) is equivalent to pca(a,4). As a result an 'untrained' mapping can be stored in a variable: w = pca([],4). Untrained mappings can thereby also be passed as arguments in a function call. The advantages of this possibility will be shown later.

2.3 Perform plot(pca(a,0)) to see a plot of the relative cumulative ordered eigenvalues (normalized sum of variances).

2.4 Eigenfaces. The linear mappings used in the example above may also be applied to image datasets in which each pixel is a feature, e.g. the face database containing images of 92*112 pixels. An image is then a point in a 10304-dimensional feature space. Load a subset of 10 classes (persons) by a = faces([1:10],[1:2]). The images can be displayed by show(a).
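The distinction between training a mapping and applying it, and the way the inner dimensions cancel, can be sketched with a standard PCA computed through the singular value decomposition in plain Matlab. This is a sketch under stated assumptions and not the PRTools implementation of pca: it centers the data, takes the first four right singular vectors as projection directions, and also computes the normalized cumulative sum of variances. All variable names other than A and B are chosen for this example.

```matlab
% Plain-Matlab PCA sketch via the SVD (not the PRTools pca routine).
B = randn(10, 3);
A = B * randn(3, 100);                 % 10-by-100 matrix with 3 degrees of freedom
X = A';                                % 100 objects (rows), 10 features (columns)

% Training: estimate the mean and the principal directions from X.
mu = mean(X, 1);                       % 1-by-10 mean vector
Xc = X - repmat(mu, size(X, 1), 1);    % centered data
[~, S, V] = svd(Xc, 'econ');           % columns of V are the principal directions
W = V(:, 1:4);                         % 10-by-4 projection matrix

% Applying: project the centered data onto the first four directions.
Y = Xc * W;                            % (100-by-10) * (10-by-4) gives 100-by-4

% Relative cumulative ordered variances (squared singular values, normalized).
variances = diag(S) .^ 2;
cumfrac = cumsum(variances) / sum(variances);
```

Plotting cumfrac against the number of components gives a curve of the same kind as the one described in 2.3, and inspecting where it levels off for this rank-limited data is a useful check before moving on to the face images.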
2.5 Plot the explained variance for the PCA as a function of the number of components. When does this curve reach the value 1, and why?

2.6 The PCA eigenvector mapping w points to positions in the original feature space called eigenfaces. These can be displayed by show(w). Display the first 19 eigenfaces computed by pca.

2.7 Cluster the faces using the following lines of code:

    % Hierarchical clustering
    disp('Hierarchical clustering')
    D=distm(a)
    dentro=hclust(D);
    figure
    plotdg(dentro)
    title('Dendrogram of Faces')

Here, a total of 20 pictures of 10 persons have been clustered, 2 pictures of each person. Do the two pictures of each person cluster together? Do the women cluster together?
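The clustering in 2.7 starts from a matrix of pairwise distances between objects. The plain-Matlab sketch below shows how such a matrix can be built for a tiny made-up dataset, assuming squared Euclidean distances (check help distm to see exactly what distm returns). It does not call PRTools, and the data and variable names are chosen only for illustration.

```matlab
% Plain-Matlab sketch of a pairwise squared Euclidean distance matrix (no PRTools calls).
X = [0 0; 3 4; 6 8];                 % three made-up objects with two features each
n = size(X, 1);
D = zeros(n, n);
for i = 1:n
    for j = 1:n
        d = X(i, :) - X(j, :);       % difference between object i and object j
        D(i, j) = d * d';            % squared Euclidean distance
    end
end
```

The matrix is symmetric with zeros on the diagonal, and small entries indicate objects that a hierarchical clustering would merge early.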