HAND-IN ASSIGNMENT 2, DMI2005
The goal of this hand-in assignment is to get you acquainted with PRTools4, a toolbox
written in Matlab for statistical pattern recognition, from Delft in the Netherlands. To get
started, first download and unpack the archive kodOchManualInlupp2.zip, which is
available for download on the course homepage. After extracting the archive you should have
a directory structure that contains the following:
A PDF file named PRTools4.0.pdf, which is the manual for the toolbox
A subdirectory named prtools4.0download050905, which contains the toolbox
A subdirectory named glue, which contains files that are missing from version 4.0 of the
toolbox but are needed for this assignment
Finally, you must also download the directory datasetInlupp2 and save it in this directory.
It contains images and other data used in the tasks you are to carry out. If you run into
memory problems when downloading this directory, it is also available in the directory
W:/dmi05. You can read it directly from Matlab if you are working in the computer lab Magistern.
Note, finally, that the instructions below assume that you have placed the files above in a
directory as described and that you have then created one more subdirectory named
myfiles, in which you place your own Matlab program files.
To report your work, submit well-commented code and a short Word document with
images and text showing that your code works.
PRTools - EXERCISE I
In this first exercise about PRTools, you will learn how the toolbox is organized
around a particular data structure called dataset, and about basic operations
called mappings that can be performed on dataset objects.
Remember to add the following lines at the beginning of your code:
clear all
close all
%Add paths to the list of matlab paths
path=pwd
disp(['addpath ''' path ''''])
eval(['addpath ''' path ''''])
cd .. %jump up from your personal catalog called myfiles
cd prtools4.0download050905
path=pwd
disp(['addpath ''' path ''''])
eval(['addpath ''' path ''''])
cd .. %jump up again
cd datasets
path=pwd
disp(['addpath ''' path ''''])
eval(['addpath ''' path ''''])
cd ..
cd glue %Extra files from PRTools3.2.5 not available in PRTtools4.0
path=pwd
disp(['addpath ''' path ''''])
eval(['addpath ''' path ''''])
cd ..
cd myfiles % now back in the catalog myfiles
These lines clear all variables and close all figure windows. They then add the directory you
start from (myfiles), the toolbox directory, the datasets directory, and the glue directory to
the search path that Matlab uses to find functions to run and data to load.
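The same path setup can be written without cd and eval. The following is only a sketch under the assumption that it is run from the myfiles directory and that the sibling directories carry the names used in the listing above; the variable names myDir, rootDir and subDirs are illustrative.

```matlab
% Sketch: add the course directories to the search path without cd or eval.
% Assumes the current directory is myfiles, next to the toolbox directories.
myDir   = pwd;                      % .../myfiles
rootDir = fileparts(myDir);         % parent directory holding the toolbox
subDirs = {'prtools4.0download050905', 'datasets', 'glue'};

addpath(myDir);
for i = 1:numel(subDirs)
    d = fullfile(rootDir, subDirs{i});
    if exist(d, 'dir') == 7         % 7 means "is a directory"
        addpath(d);
        disp(['added to path: ' d]);
    else
        warning('Directory not found, skipped: %s', d);
    end
end
```

This variant also avoids creating a variable named path, which would hide Matlab's built-in path function in the workspace.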
Task 1 – The data structure dataset
PRTools deals entirely with sets of objects represented by vectors in a feature space. The
central data structure is called dataset. It consists of a matrix of size m x k: m row
vectors representing the objects, each described by k features. Attached to this matrix is a set
of m labels (strings or numbers), one for each object, and a set of k feature names (also
strings or numbers), one for each feature. Moreover, a set of prior probabilities, one for
each class, is stored. Objects with the same label belong to the same class. In most help
files in PRTools, a dataset is denoted by A. Almost all routines can handle multi-class
object sets. Some useful routines for handling datasets are:
dataset    Define dataset from data matrix and labels
getdata    Retrieve data from dataset
getlab     Retrieve object labels
getfeat    Retrieve feature labels
seldat     Select a subset of a dataset
genlab     Generate dataset labels
setdat     Define a new dataset from an old one by replacing its data
renumlab   Convert labels to numbers
Sets of objects may be given externally or may be generated by one of the data generation
routines in PRTools. Their labels may be given externally or may be the results of a
classification or a cluster analysis. A dataset containing 10 objects with 5 random
measurements can be generated by:
>> data = [rand(4,5);randn(2,5);rand(4,5)+1]
>> a = dataset(data)
10 by 5 dataset with 1 classes: [10]
1.1 Write an m-file that runs these lines and then check the organization of a by means of
>> struct(a)
In this example no labels are supplied, therefore only one class is detected with all 10
objects (examples) included.
1.2 Split the 10 examples into three classes by means of the following labeling:
>> labs = [1 1 1 1 2 2 3 3 3 3]'; % labs should be a column vector
>> a = dataset(a,labs)
10 by 5 dataset with 3 classes: [4 2 4]
Note that the labels have to be supplied as a column vector.
1.3 A more convenient way to assign labels to a dataset is offered by the routine genlab
in combination with the Matlab char command. Run the following lines to split the 10
examples into the 3 classes ‘apple’, ‘pear’, and ‘banana’ :
>> labs = genlab([4 2 4],char('apple','pear','banana'))
>> a = dataset(a,labs)
10 by 5 dataset with 3 classes: [4 4 2]
Note that the class names are sorted in alphabetical order (apple, banana, pear). Hence
[4 4 2] denotes the number of objects in each class.
1.4 Use the routine setfeatlab to give the five features in a the names ‘feature1’,
‘feature2’, ‘feature3’, ‘feature4’, and ‘feature5’, respectively. Then use the routines
getlab and getfeat to retrieve the object class labels and the feature labels of a.
1.5 The fields of a dataset can be made visible by converting it to a structure, e.g.:
>> struct(a)
data: [10x5 double]
lablist: [3x6 char]
nlab: [10x1 double]
labtype: 'crisp'
targets: []
featlab: [5x1 double]
featdom: {[] [] [] [] []}
prior: []
objsize: 10
featsize: 5
ident: {10x1 cell}
version: {[1x1 struct] '31-Jan-2003 10:02:55'}
name: []
user: []
Check the on-line information on datasets (help datasets, also printed in the PRTools
manual), where the meaning of these fields is explained. Each field may be changed by a
set-command; for example, the data may be replaced by:
>> b = setdata(a,rand(10,5));
You should try this command and then retrieve a field value with the corresponding get-command, e.g.:
>> classnames = getlablist(a)
In nlab, an index into the list of class names lablist is stored for each object. Note that
this list is alphabetically ordered. The size of a dataset can be found with both size and
getsize. Write:
>> [m,k] = size(a);
>> [m,k,c] = getsize(a);
Confirm that the number of objects is returned in m, the number of features in k, and the
number of classes in c. The class prior probabilities are stored in prior. If this field is
empty, the class frequencies are used by default. The data itself can also be
retrieved by double(a) or, more simply, by +a; try both.
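The relation between lablist, nlab and the default priors can be illustrated without the toolbox. The following is a minimal plain-Matlab sketch, not PRTools code: the variable names mirror the dataset fields only for readability, and the labels are the apple/pear/banana example from 1.3.

```matlab
% Plain-Matlab illustration of lablist, nlab and the default priors.
% This does not call PRTools; it only mimics the bookkeeping described above.
names = char('apple', 'pear', 'banana');   % 3x6 char matrix, padded with blanks
idx   = [1 1 1 1 2 2 3 3 3 3]';            % 4 apples, 2 pears, 4 bananas
labs  = names(idx, :);                     % one label row per object (10x6 char)

% unique(..., 'rows') returns the distinct rows in sorted (alphabetical) order
% and, as third output, the index of each object's label in that sorted list.
[lablist, ~, nlab] = unique(labs, 'rows');
nlab = nlab(:);                            % force a column vector

classSizes = accumarray(nlab, 1)';         % objects per class, in lablist order
prior      = classSizes / numel(nlab);     % class frequencies

disp(lablist);
disp(classSizes);
disp(prior);
```

Because the class list is sorted, the pear objects get index 3 even though they were entered second.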
1.6 Run once more the commands
>> data = [rand(4,5);randn(2,5);rand(4,5)+1]
>> a = dataset(data)
>> labs = genlab([4 2 4],char('apple','pear','banana'))
>> a = dataset(a,labs)
Have a look at the help information for seldat. Use the routine to extract the banana
data from a, and check the result by inspecting it with the + operator. Note that classes and
features need to be specified by their indices and that the class names are sorted in alphabetical order.
1.7 One way to inspect a dataset is to make a scatterplot of the objects in the dataset. For
this the function scatterd is supplied. This plots each object in a dataset in a 2D graph,
using a colored marker when class labels are supplied. When more than two features are
present in the dataset, the first two are used. For obtaining a scatterplot of two other
features they have to be explicitly extracted first, e.g. a1 = a(:,[2 5]); It is also
possible to create 3D scatterplots. Use scatterd to make a scatterplot of the features
2 and 5 of dataset a. Open a new figure using the figure command and try also
scatterdui. Use its buttons to study what happens for different features, in particular
when the same feature is used on both axes.
1.8 Open a new figure and make a 3-dimensional scatterplot by scatterd(a,3) and
try to rotate it.
1.9 Load the 4-dimensional Iris dataset by a = iris and make scatterplots of all
feature combinations using the gridded option of scatterd. Plot in a separate figure
the one-dimensional feature densities by means of plotf. Identify visually the best
combination of two features. Create a new dataset b using seldat that contains just
these two features. Create a new figure by the figure command and plot in this a
scatterplot of b. Compare the first five rows of a and b to confirm that b actually is the
desired subset.
1.10 Generate your own dataset that consists of two 2-D uniformly distributed
classes of objects using the rand command. Transform the sets such that for the [xmin
xmax; ymin ymax] intervals the following holds: [0 2; -1 1] for class 1 and [1
3; 1.5 3.5] for class 2. Generate 50 objects for each class. An easy way is to do this
for x and y coordinates separately and combine them afterwards. Label the features by
'area' and 'perimeter'. Check the result by scatterd and by retrieving object labels
and feature labels.
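A uniform sample on an arbitrary interval can be obtained by scaling and shifting the output of rand. The sketch below uses a made-up interval [-2, 4] that is unrelated to the intervals in the task; lo, hi and n are illustrative names.

```matlab
% rand returns values in (0,1); scale by the interval width and shift by lo.
lo = -2;  hi = 4;  n = 50;           % hypothetical interval and sample size
x  = lo + (hi - lo) * rand(n, 1);    % n-by-1 column, uniform on (lo, hi)
disp([min(x) max(x)]);               % both values lie inside the interval
```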
1.11 Load a part of a face dataset by a = faces([1 3],[1:4]), which reads the first
four pictures of subjects 1 and 3. They may be displayed by show(a). Include class
labels by x = dataset(a,genlab([4 4])).
Task 2 – The data structure mapping
In PRTools datasets are transformed by mappings. These are procedures that map a set of
objects from one space into another. Examples are feature selection, feature rescaling,
rotations of the space, and classification. A mapping (we use the variable w for mappings
almost everywhere) stores various information, such as the dimensionalities of the input
and output spaces, the parameters that define the transformation, and the routine that is used
for executing the transformation. Type struct(w) to see all fields, and print out the
m-file mapping.m to get more information about the result of a mapping operation.
Often a mapping has to be trained, i.e. it has to be adapted to a training set by some
estimation or training procedures to minimize some error for the training set. An example
is the principal component analysis that performs an orthogonal rotation according to the
directions with main variance in a given dataset.
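The difference between training a mapping and applying it can be sketched in plain Matlab. The block below is an illustration of the idea only and makes no claim about how PRTools implements pca; the names mu, W and Y are illustrative, and the random data are a stand-in for a real training set.

```matlab
% Plain-Matlab sketch of "train, then apply" for PCA (no PRTools calls).
X = randn(100, 3) * randn(3, 10);      % 100 objects, 10 features, 3 degrees of freedom

% Training: estimate the mean and the principal directions from X.
mu        = mean(X, 1);
Xc        = bsxfun(@minus, X, mu);     % centre the training data
[~, ~, V] = svd(Xc, 'econ');           % columns of V: directions of decreasing variance
W         = V(:, 1:4);                 % keep the first four directions (10x4)

% Applying: project data onto the trained directions.
Y = bsxfun(@minus, X, mu) * W;         % (100x10) * (10x4) gives 100x4
disp(size(Y));
```

Once mu and W have been estimated, the last step can be repeated for any other data with the same 10 features, which is what makes the trained transformation reusable.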
2.1 Try the following code:
B=randn(10,3);
A=B*randn(3,100); %Generate 100 patterns (columns) of dimension 10 with 3 degrees of freedom
a=dataset(A') %Import data into the dataset structure, one object per row
w=pca(a,4)
struct(w)
This just defines the mapping (‘trains’ it by a) for finding the first four principal components. In
the PRTools-manual or by ’help mappings’ more information on mappings can be found.
2.2 The mapping w may be applied to a or to any 10-dimensional dataset a by:
>> b1 = map(a,w)
Instead of the routine map, the '*' operator may also be used for applying mappings to
datasets; thus b2 = a*w yields the same result. The compressed patterns can be extracted
into a matrix B1 by B1 = getdata(b1);
Note that the size of the variables a (100 x 10) and w (10 x 4) are such that the inner
dimensionalities cancel in the computation of b1, like in all Matlab matrix operations.
The '*' operator may also be used for training: a*pca is equivalent to pca(a), and
a*pca([],4) is equivalent to pca(a,4). As a result, an 'untrained' mapping can be
stored in a variable: w = pca([],4). Untrained mappings may thereby also be passed as
arguments in a function call. The advantages of this possibility will be shown later.
2.3 Perform plot(pca(a,0)) to see a plot of the relative cumulative ordered
eigenvalues (normalized sum of variances).
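The relative cumulative ordered eigenvalues can be written as the running sum of the sorted eigenvalues divided by their total. The plain-Matlab sketch below computes this quantity for random stand-in data; it is assumed, not verified here, that this corresponds to the curve drawn by plot(pca(a,0)).

```matlab
% Plain-Matlab sketch of the relative cumulative ordered eigenvalues.
X    = randn(100, 3) * randn(3, 10);      % stand-in data: 100 objects, 10 features
Xc   = bsxfun(@minus, X, mean(X, 1));     % centre the data
s    = svd(Xc);                           % singular values, in descending order
lam  = s.^2 / (size(X, 1) - 1);           % eigenvalues of the covariance matrix
frac = cumsum(lam) / sum(lam);            % fraction of variance in the first n components
disp(frac');
```

A call such as plot(frac, 'o-') displays the curve against the number of components.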
2.4 Eigenfaces. The linear mappings used in the example above may also be applied to image
datasets in which each pixel is a feature, e.g. the Face-database containing images of
92*112 pixels. An image is now a point in a 10304 dimensional feature space. Load a
subset of 10 classes (persons) by a = faces([1:10],[1:2]). The images can be
displayed by show(a).
2.5 Plot the explained variance for the PCA as a function of the number of components.
When and why does this curve reach the value 1?
2.6 The PCA eigenvector mapping w points to positions in the original feature space
called eigenfaces. These can be displayed by show(w). Display the first 19 eigenfaces
computed by pca.
2.7 Cluster the faces using the following lines of code:
%Hierarchical clustering
disp('Hierarchical clustering')
D=distm(a)
dentro=hclust(D);
figure
plotdg(dentro)
title('Dendrogram of Faces')
Here, a total of 20 pictures of 10 persons have been clustered, 2 pictures of each. Do the two
pictures of each person cluster together? Do the women cluster together?
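Hierarchical clustering starts from a matrix of pairwise distances between the objects. As a rough illustration of what such a matrix contains, the sketch below computes squared Euclidean distances in plain Matlab. Whether this matches the exact metric returned by distm should be checked with help distm; the matrix X holds random stand-ins for the image vectors.

```matlab
% Plain-Matlab sketch of a pairwise squared Euclidean distance matrix.
X  = rand(20, 50);                            % hypothetical stand-in: 20 objects, 50 features
sq = sum(X.^2, 2);                            % squared norm of each row (20x1)
D  = bsxfun(@plus, sq, sq') - 2 * (X * X');   % D(i,j) = |x_i - x_j|^2
D  = max(D, 0);                               % remove negative round-off
D(1:size(D, 1) + 1:end) = 0;                  % exact zeros on the diagonal
disp(size(D));
```

Small entries of D correspond to objects that are merged early in the dendrogram.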