HAND-IN ASSIGNMENT 3 DMI2005
The goal of this assignment is to learn how to build and evaluate classifiers in
Matlab using PRTools4, a toolbox written in Matlab for statistical pattern
recognition from Delft University of Technology in the Netherlands. To get
started you must (if you have not already done so) first download and unpack the
archive kodOchManualInlupp2.zip, which is available on the course homepage. After
extracting the archive you should have a directory structure containing the following:
A pdf file called PRTools4.0.pdf, which is the manual for the toolbox
A subdirectory called prtools4.0download050905, which contains the toolbox
A subdirectory called glue, which contains files missing from version 4.0 of the
toolbox that are needed for this assignment.
Before you continue: To do this assignment well, you should first have learned how
PRTools is organized and be familiar with the structures dataset and mapping and
the operations allowed on them. This assignment consists of two parts. In the first
part, the goal is to learn how to design and evaluate classifiers in PRTools based on
simulated examples. The files needed to solve these tasks can be downloaded from
the course homepage. In the second part, the goal is to learn how to use PRTools
on real-world problems. The data and files needed to attack these real-world
problems can also be downloaded from the course homepage.
PRTools - EXERCISE II
In Exercise I, you learned how the toolbox is organized around a particular
data structure called dataset and about the basic operations, called mappings, that
one may perform on dataset objects. Before starting this time, remember again to
add the following lines at the beginning of your code:
clear all
close all
%Add paths to the list of matlab paths
path=pwd
disp(['addpath ''' path ''''])
eval(['addpath ''' path ''''])
cd .. %jump up from your personal catalog called myfiles
cd prtools4.0download050905
path=pwd
disp(['addpath ''' path ''''])
eval(['addpath ''' path ''''])
cd .. %jump up again
cd datasets
path=pwd
disp(['addpath ''' path ''''])
eval(['addpath ''' path ''''])
cd ..
cd glue %Extra files from PRTools3.2.5 not available in PRTools4.0
path=pwd
disp(['addpath ''' path ''''])
eval(['addpath ''' path ''''])
cd ..
cd myfiles % now back in the catalog myfiles
This clears all variables, closes all windows, and adds the catalogs listed above to
the search path used by Matlab to check for functions to run and data to load.
Finally, simply add the line
GridResolution=gridsize(100)
which increases the resolution in the contour plots below from the default grid
resolution gridsize(30).
Task 1 – Classifier Design and Testing Using Simulated Data
A special case of a mapping is a classifier. It maps (transforms) a dataset into score
values or class posterior probability estimates. Classifiers can be used in an untrained
as well as in a trained mode. When applied to a dataset, in the first mode the dataset
is used for training and a classifier is generated, while in the second mode the dataset
is classified. Unlike mappings, fixed classifiers don't exist. Some important classifiers are:
fisherc Fisher classifier
qdc Quadratic classifier assuming normal densities
udc Quadratic classifier assuming normal uncorrelated densities
ldc Linear classifier assuming normal densities with equal covariance matrices
nmc Nearest mean classifier
parzenc Parzen density based classifier
knnc k-nearest neighbor classifier
treec Decision tree
svc Support vector classifier
lmnc Neural network classifier trained by the Levenberg-Marquardt rule
In this task, simulated data sets will be generated to design and test different
classification algorithms.
3.1 CLASSIFIER DESIGN, DECISION BOUNDARIES, APPARENT ERRORS
Let us begin with a simple example which illustrates classifier design and how to plot
decision boundaries in 2D scatterplots by plotc. Generate a dataset, make a scatterplot,
train and plot some classifiers by
D = gendath([20 20]);
scatterd(D)
w1 = fisherc(D);
w2 = knnc(D,3);
w3 = qdc(D);
plotc({w1,w2,w3})
In a new scatterplot of the dataset D, plot a series of classifiers computed by the
k-NN rule (knnc) for values of k between 1 and 10. Look at the influence of the
neighborhood size on the classification boundary.
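One possible way to do this is the minimal sketch below:
figure
scatterd(D)
for k=1:10
w=knnc(D,k); %train a k-NN classifier for this value of k
plotc(w) %overlay its decision boundary in the scatterplot
end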
Now we continue with another example: Generate a dataset A by gendath with 60
examples from each class. Use gendat to split A into one set of 30 examples per class called
TrainSet and another set with the remaining examples called TestSet. Write help
gendath to read about how the dataset is generated from two normal distributions.
Compute the Fisher classifier by wFisher = fisherc(TrainSet). Make a
scatterplot of TrainSet and plot the classifier decision boundary by plotc(wFisher).
Apply the Fisher projection to the training set by mappedTrainSetFisher =
map(TrainSet,wFisher). Compute class assignments by
classAssignmentsTrainSetFisher=labeld(mappedTrainSetFisher)'.
Extract the true assignments by trueAssignmentsTrainSet=TrainSet.nlab'. Calculate
the number of apparent (training) errors by:
noOfApparentErrorsFisher=sum(classAssignmentsTrainSetFisher~=trueAssignmentsTrainSet)
Also use testc to compute the apparent error:
[e_apparentFisher,noOfErrorsPerClassTrainSetFisher]=testc(mappedTrainSetFisher)
Repeat for the support vector machine classifier svc, the kNN classifier knnc (k=3), and
the model-based quadratic classifier qdc. After designing all the classifiers, make a
scatterplot of TrainSet again and plot all the decision boundaries in the same figure
using the commands
plotc(wFisher,'k')
plotc(wkNN,'m:')
plotc(wsvc,'b--')
plotc(wqc,'g-.')
Try to sort out which boundaries are associated with each classifier. Insert small text
boxes in the plot, close to each boundary, to indicate which classifier it is associated
with. Save the resulting graph for your report. Note that some of the classifiers have
two or even more decision boundaries, whereas Fisher and the support vector machine
have only one each.
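A minimal sketch of the design step, assuming the variable names used in the plotc
calls above:
wkNN=knnc(TrainSet,3); %k-NN classifier with k=3
wsvc=svc(TrainSet); %support vector classifier with default settings
wqc=qdc(TrainSet); %quadratic classifier assuming normal densities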
3.2 CLASSIFIER DESIGN AND TEST ERRORS
Repeat 3.1 but test the classifiers using the test examples instead. Thus the apparent
errors will be replaced by test errors and the decision boundaries will be displayed
together with the test examples.
3.3 TRUE ERRORS – TESTING USING MANY EXAMPLES
Generate 10000 test examples per class by TestBig=gendath([10000 10000]); and
determine accurate estimates of the performance of, e.g., the Fisher classifier by
mappedTestBigFisher = map(TestBig,wFisher);
[e_trueFisher,noOfErrorsPerClassTrueFisher]=testc(mappedTestBigFisher)
Repeat this for all the classifiers designed.
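One possible way to loop over all the designed classifiers (the cell array below
assumes the names introduced in 3.1):
W={wFisher,wkNN,wsvc,wqc}; %trained classifiers from 3.1
for i=1:length(W)
e_true=testc(TestBig*W{i}) %true-error estimate for each classifier
end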
3.4 RECEIVER OPERATING CHARACTERISTICS
Write help roc to understand more about how to create receiver operating characteristic
(ROC) curves, and then write code which makes it possible to calculate ROC curves for
the designed classifiers. For example, to determine the ROC for the Fisher classifier one
has to write something like
EFisher=roc(TestSet,wFisher,desiredClass,noOfPointsInROC);
This results in a structure EFisher that can be used to generate a ROC curve simply by
plotr(EFisher)
Try this and you will find that the ROC curves do not display the probability of detection
as a function of the probability of false alarm. To obtain such a graph, one has to write
the following lines:
EFisher.error=1-EFisher.error;
EFisher.ylabel='P_{D}';
EFisher.xlabel='P_{FA}';
Now plotr(EFisher) will result in a familiar ROC curve with the expected quantities
written out on the axes.
Compute ROC curves for all the classifiers. Since the plotr command can take several
ROC curves, it is possible to present them in the same graph using a command like
plotr({EFisher,EkNN,Eqc}), where EkNN and Eqc are the ROCs for the kNN and the
quadratic classifier, respectively.
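As a sketch, the remaining ROC structures can be computed in the same way as for
the Fisher classifier (desiredClass and noOfPointsInROC as chosen above):
EkNN=roc(TestSet,wkNN,desiredClass,noOfPointsInROC);
Eqc=roc(TestSet,wqc,desiredClass,noOfPointsInROC);
plotr({EFisher,EkNN,Eqc}) %all ROC curves in one graph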
As you should note, the resulting ROC curves are not smooth at all, and they are
unstable: repeat the data generation a few times and study how the curves change.
Repeat the above procedure but use the test set TestBig instead. Now the ROC curves
should become very smooth.
3.5 MULTI-CLASS PROBLEMS
In many cases we consider binary (two-class) classification, but often there are more
than two classes. Try to understand and run the following lines of code, which generate
data from four classes and present the examples in a scatterplot together with decision
boundaries and colored decision regions, one for each class:
A=gendath([20,20]);
dataA=getdata(A);
B=gendath([20,20]);
dataB=getdata(B);
dataC=[dataA;dataB+5]; %Adding 5 to all features
labsC=genlab([20 20 20 20],[1 2 3 4]');
C=dataset(dataC,labsC);
C=setname(C,'A four class data set');
figure
scatterd(C); %make scatter plot for right size
drawnow;
w=qdc(C);
plotc(w,'col')
hold on
scatterd(C)
hold off
3.6 CROSS VALIDATION
Try to understand how to use the function crossval by running the following code which
performs 5-fold cross validation:
ClassifiersToTest={qdc,knnc([],3)}
noOfSplits=5;
ProgressReportFlag=1;
C=setprior(C,0); %All classes equally likely
%Priors needed to avoid warnings from PRTools
noOfReps=1; %Note: Not possible to calculate std in this case
[WeightedAverageTestError,TestErrorsPerClass,AssignedNumericLabels] = crossval(C,ClassifiersToTest,noOfSplits,noOfReps,ProgressReportFlag);
TestErrorsPerClass_qdc=TestErrorsPerClass{1}
TestErrorsPerClass_kNN=TestErrorsPerClass{2}
See help crossval and try running crossval repeatedly 30 times to determine the
standard deviation of the 5-fold cross-validation error estimates; a sketch follows below.
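A minimal sketch of such a repetition, using C and ClassifiersToTest from above
(depending on the PRTools version, the first output may be a vector or a cell array,
so the collection line may need adjusting):
noOfRuns=30;
errs=zeros(noOfRuns,2); %one column per classifier
for r=1:noOfRuns
e=crossval(C,ClassifiersToTest,noOfSplits,1,0); %one 5-fold run
errs(r,:)=e(:)'; %use cell2mat(e) here if e is returned as a cell array
end
mean(errs) %average test error per classifier
std(errs) %standard deviation over the 30 repetitions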
3.7 LEARNING CURVES
A learning curve displays the estimated performance as a function of the number of
training examples. Run the following code to determine learning curves for the Fisher
and kNN (k=3) classifiers:
A=gendath([200 200]) %200 examples per class; below, up to 100 of them are used for training
noOfRep=10;
ProgressFlag=1;
trainingSizesPerClass=[4 6 8 10 12 14 16 18 20 30 40 50 60 70 80 85 90 95 100];
figure
eClassifiers=cleval(A,{fisherc,knnc([],3),qdc},trainingSizesPerClass,noOfRep,[],ProgressFlag);
plotr(eClassifiers,'errorbar')
Repeat the above calculations for the more demanding situation where only 20
examples per class are available. Determine and plot the learning curves for the
following training sizes: [4 6 8 10 12 14 16 18]. Why does the variance increase
dramatically when almost all examples are used for training?
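A possible sketch of this small-sample experiment:
A20=gendath([20 20]); %only 20 examples per class
smallSizes=[4 6 8 10 12 14 16 18];
figure
e20=cleval(A20,{fisherc,knnc([],3),qdc},smallSizes,noOfRep,[],ProgressFlag);
plotr(e20,'errorbar')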
Task 2 – Classifier Design and Testing Using Real World Data
In this task, you are going to design and test classifiers aimed at solving three real-world
classification problems.
2.1 CLASSIFICATION OF IRIS DATA AND SUBSET FEATURE SELECTION
Load the 4-dimensional Iris dataset by a = iris. Design three different types of
classifiers using 20 examples per class for design, and test each with the remaining
examples. Also determine an estimate of the performance after automatic selection of
only two features using the function featself with the variable crit set equal to
crit='maha-s'; see help featself and help feateval for more information.
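A minimal sketch, where the split sizes and the choice of classifier are assumptions:
a=iris;
[TrainIris,TestIris]=gendat(a,[20 20 20]); %20 design examples per class
wfs=featself(TrainIris,'maha-s',2); %automatically select two features
wIris=fisherc(TrainIris*wfs); %train on the selected features
e=testc(TestIris*wfs*wIris) %test error using only the two features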
2.2 CLASSIFICATION OF RINGS, NUTS AND BOLTS
In this subtask a classification system will be developed for classifying objects that
belong to four different categories: ring, nut-6 (6-sided), nut-4 (4-sided), and bolt. A
vision system is available that acquires images of these objects. Using some digital image
processing techniques (not part of this project) the images are segmented. After that, each
imaged object is represented by a connected component in the resulting binary (logical)
image. Figure 1 shows already segmented images containing rings, nuts-6, nuts-4 and
bolts. These images are available for training and evaluation, thus providing us with a
labeled dataset of 121 objects per class.
The classification will be based on so-called normalized Fourier descriptors, which
describe the shapes of the contours of the objects. The basic idea is to consider the
contour as a periodic curve in 2D which can be represented by a Fourier series. The
Fourier descriptors are the coefficients of the Fourier series and may be used to create
descriptors that are invariant to rotation and scale. The software provided with the
project can produce many descriptors per object. The goal of your design is to find a
classifier that strives for a minimal error rate.
The software that is provided within this project can calculate up to 64 descriptors
denoted by Zk, where k ranges from -31 up to +32. The descriptors are normalized such
that they are independent of orientation and size. However, Z0 and Z1 should not be
used, because Z0 does not depend on the shape (but rather on the position) and Z1
is always one (because it is used for the normalization). The given Matlab function,
ut_contourfft, offers the possibility to calculate only a selection of the available
descriptors.
Each image in Figure 1 shows the segments of 121 objects. Thus, extraction of the
boundary of each segment and subsequent determination of the normalized Fourier
descriptors yields a training set of 4×121=484 labelled vectors, each vector having 62
elements. An image can be transformed into a set of measurement vectors with the
following fragment of code:
fdlist = [-31:-1 2:32]; % exclude Z0 and Z1
imrings = imread('rings.tif'); % open and read the image file
figure; imshow(imrings); title('rings');
[BND,L,Nring,A] = bwboundaries(imrings,8,'noholes');% extract the boundaries
FDS = ut_contourfft(BND,'fdlist',fdlist,'nmag'); % calculate the FDs
Zrings = zeros(Nring,length(fdlist)); % allocate space
for n=1:Nring
Zrings(n,:) = FDS{n}'; % collect the vectors
end
Note that the function bwboundaries is part of the Image Processing Toolbox in Matlab,
so you need a computer that has access to this toolbox to be able to import your data.
Similar pieces of code are needed to get the measurement vectors from the other
classes. The filenames of the four images are: rings.tif, nuts6.tif, nuts4.tif and bolts.tif.
The function ut_contourfft accompanies the images.
a) Write the code needed to import the data into Matlab and then into PRTools format.
Hint: use repmat to create the array with labels. For instance, repmat('ring',[Nring 1])
creates an array of Nring entries containing the string ‘ring’.
b) Use 5-fold cross validation to estimate the performance of the kNN classifier for k=3
when the classifier uses the ten Fourier descriptors Z2,Z3,…,Z11 or some other subset
of features.
c) Try to understand and use the function confmat to compute a confusion matrix, which
gives an impression of which classes are likely to be confused. Possible sketches for
parts a)-c) follow below.
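For part a), one possible sketch, assuming Znuts6, Znuts4 and Zbolts (with counts
Nnut6, Nnut4 and Nbolt) have been computed with code like the fragment above:
data=[Zrings;Znuts6;Znuts4;Zbolts]; %484 x 62 measurement matrix
labs=[repmat('ring ',[Nring 1]);repmat('nut-6',[Nnut6 1]); ...
repmat('nut-4',[Nnut4 1]);repmat('bolt ',[Nbolt 1])]; %names padded to equal length
fds=dataset(data,labs);
fds=setname(fds,'Fourier descriptor data');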
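For part b), a sketch of how the columns holding Z2,...,Z11 could be selected (the
column order follows fdlist as defined above):
cols=find(ismember(fdlist,2:11)); %columns of Z2,...,Z11
fds10=fds(:,cols); %dataset restricted to the ten descriptors
fds10=setprior(fds10,0); %avoid prior warnings, as in Task 1
e=crossval(fds10,{knnc([],3)},5,1,1) %5-fold cross validation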
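For part c), a possible sketch based on a simple 50/50 split (an assumption;
cross-validated labels could be used instead):
[TrainFD,TestFD]=gendat(fds10,0.5); %random 50/50 split per class
wFD=knnc(TrainFD,3);
confmat(getlabels(TestFD),TestFD*wFD*labeld) %true vs. estimated labels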
2.3 RECOGNITION OF VOICE COMMANDS
In this subtask a classification system will be developed for classification of your own
voice commands. Using a headset and a microphone, you will be able to record and store
training and test examples of your own voice commands in Matlab format. First you
should use a headset available in our computer lab to create a database with recorded
examples of the same words and then use them for feature extraction, training, and
testing, respectively. One set of recorded and stored examples is used as prototypes in a
first step to perform feature extraction of new inputs. As you know, the feature
extraction step is often the most important in computerized pattern recognition. We use
dynamic time warping (DTW) together with the prototype examples in the first
database to create a six-dimensional informative feature vector for each new input
pattern. DTW is a dynamic programming algorithm that calculates the distance between
time profiles. (In DTW, an unknown time profile is aligned to prototype time profiles in
an N x M grid where N and M are the lengths of the time profiles. The distance is defined
as the length of the shortest alignment path found.) DTW is not included in the PRTools
toolbox but is available via the Matlab function featureextraction.
The second part of the database is used in a second step to train and test a classifier based
on the features extracted from each example. In other words, each original sound
pattern in the set of training and test examples is converted into a short feature vector by
means of the feature extraction method discussed above.
Main Steps
Here follows a short summary of the different steps needed to complete this project.
Step 1: Preparations
Here we summarize the first important steps:
1. Download the zip-library and install it in one of your directories.
2. Get a headset to be used for the recording of sounds.
3. Write help featureextraction and figure out how to use the function.
Step 2: Sound Recording and Recall
See the help for Matlab functions wavrecord and wavplay. Get a headset and
implement a couple of lines of code that record and play back a 1.5 s soundbite. You need
to use double as the datatype for wavrecord.
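A minimal sketch (the 8 kHz sampling rate is an assumption):
Fs=8000; %sampling frequency in Hz (assumed)
y=wavrecord(round(1.5*Fs),Fs,'double'); %record 1.5 s from the microphone
wavplay(y,Fs) %play the recording back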
Run the code and confirm that the program is able to record and play the word you are
expressing. If you are unable to get the sound to work properly from Matlab, or if you
find any errors in the provided Matlab code, send an e-mail to canders@lcb.uu.se so that
we can correct the problems as soon as possible.
Step 3: Recording and Storing Multiple Examples
An important task is to record and store multiple copies of each word of interest. Write an
m-file called recordWords.m which can be used to record several examples (utterances)
of a single word and store them as columns in a matrix called WordMatrix. Each
column should correspond to a recorded time interval of 1.5 seconds. In other words a
call should look like WordMatrix=recordWords(WORDS,noOfWordExamples)
where WORDS is a cell array containing the words written explicitly in letters and where
noOfWordExamples defines the number of examples of each word to be recorded. You
should also write a function playWords.m which makes it possible to listen to the words
collected in a matrix.
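A possible skeleton for recordWords.m (the sampling rate and the keypress prompt are
assumptions):
function WordMatrix=recordWords(WORDS,noOfWordExamples)
Fs=8000; %assumed sampling frequency
len=round(1.5*Fs); %number of samples in a 1.5 s interval
WordMatrix=zeros(len,length(WORDS)*noOfWordExamples);
col=0;
for w=1:length(WORDS)
for n=1:noOfWordExamples
disp(['Press a key and then say: ' WORDS{w}])
pause %wait for a keypress before recording
col=col+1;
WordMatrix(:,col)=wavrecord(len,Fs,'double');
end
end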
Step 4: Create Three Word Databases
Create a database containing 7 examples each of the words RETURN, LEFT and RIGHT.
The database should be in the form of a matrix called WordsForFeatureExtraction
in which each of the 21 columns is an example. Also, the class membership of each word
should be stored in a row vector called targetsForFeatureExtraction. Each element in
the vector should be an integer from the set {1,2,3} corresponding to RETURN, LEFT,
and RIGHT, respectively.
Create another database containing 10 examples of each of the words. Use
featureextraction to extract features from this database and use the output (6x30) to
create a PRTools dataset with appropriately labeled classes.
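Assuming the 6x30 output of featureextraction is stored in a matrix F (10 examples
each of RETURN, LEFT and RIGHT, in that order), the dataset could be built as in this
sketch:
labs=genlab([10 10 10],['RETURN';'LEFT  ';'RIGHT ']); %class names padded to equal length
voice=dataset(F',labs); %30 examples, 6 features each
voice=setname(voice,'Voice commands');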
a) Turn the training and test examples into PRTools format and design two classifiers of
your choice using the training examples.
b) Use the pca mapping and scatterd in PRTools to visualize the samples in two and
three dimensions. Does separation seem plausible?
c) Test the designed classifiers using your test examples and report the corresponding
confusion matrix. A possible sketch for parts a)-c) follows below.
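As a sketch for parts a)-c), where voiceTrain and voiceTest are assumed to be the
PRTools datasets built from your training and test recordings, and where ldc and knnc
are example classifier choices:
w1=ldc(voiceTrain); %first classifier choice (an assumption)
w2=knnc(voiceTrain,3); %second classifier choice (an assumption)
figure; scatterd(voiceTrain*pca(voiceTrain,2)) %2D view for part b)
figure; scatterd(voiceTrain*pca(voiceTrain,3),3) %3D view for part b)
confmat(getlabels(voiceTest),voiceTest*w1*labeld) %part c), first classifier
confmat(getlabels(voiceTest),voiceTest*w2*labeld) %part c), second classifier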
Good Luck!