# Exercice 1 – The dataset structrue

```INL&Auml;MNINGSUPPGIFT 5
S&auml;rdragsselektion och verkliga
till&auml;mpningar
&copy; Signaler och System, Institutionen f&ouml;r Teknikvetenskaper, Uppsala Universitet
M&aring;ls&auml;ttning och f&ouml;rberedelser
M&aring;let med denna inl&auml;mningsuppgift &auml;r att du ska f&aring; f&ouml;rdjupade kunskaper om hur man
kan anv&auml;nda m&ouml;nsterigenk&auml;ningsmetoder f&ouml;r att l&ouml;sa praktiska problem med hj&auml;lp av
PRTools. Efter avslutat uppgift ska du kunna bygga ett helt system f&ouml;r datoriserad
m&ouml;nsterigenk&auml;nning f&ouml;r en verklig till&auml;mpning.
F&ouml;rberedelserna &auml;r samma som f&ouml;r inl&auml;mningsuppgift 1 dvs att ladda ned PRTools och
skapa en egen katalog myfiles, se inl&auml;mningsuppgift 1. Du beh&ouml;ver ocks&aring; ladda ned
materialet i filen kodOchDataInlupp5.zip.
Redovisning
1 IRIS DATA AND SUBSET FEATURE SELECTION
(a) Load the 4-dimensional three class Iris dataset by a = iris. To avoid problems with
different versions of PRTools, recreate the dataset by the command: a=dataset(a.data,
a.nlab). Make make scatterplots of all feature combinations using scatterd with the
option ‘gridded’ which produces subplots with all combinations of the four features.
(b) Plot in a separate figure the one-dimensional feature densities by means of plotf.
Identify visually the best combination of two features. Create a new dataset b using
seldat that contains just these two features. Create a new figure by the figure
command and create a scatterplot of b. Compare the first five rows of a and b to confirm
that b actually is the desired subset.
(b) Using the 4-dimensional Iris dataset a = iris again, design the three different
classifiers qdc, knnc and treec using 20 examples per class for design and test each with
the remaining examples. Calculate Bayesian posterior distributions for the unknown
probabilities of misclassification.
1
(c) Retrain the classifiers with the two features obtained with automatic selection using
the function featself and the variable crit set equal to crit='maha-s'. Hint: See help
featself and help feateval for more information. Use the test set to obtain performance
estmates. Calculate Bayesian posterior distributions of the unknown probabilities and
compare with (b). Do the distributions overlap or is there an obvious improvement?!
2 CLASSIFICATION OF RINGS, NUTS AND BOLTS
In this subtask a classification system will be developed for classifying objects that
belong to four different categories: ring, nut-6 (6-sided), nut-4 (4-sided), and bolt. A
vision system is available that acquires images of these objects. Using some digital image
processing techniques (not part of this project) the images are segmented. After that, each
imaged object is represented by a connected component in the resulting binary (logical)
image. Figure 1 shows already segmented images containing rings, nuts-6, nuts-4 and
bolts. These images are available for training and evaluation, thus providing us with a
labeled dataset of 121 objects per class.
The classification will be based on so-called normalized Fourier descriptors. Which
describe the shapes of the contours of the objects. The basic idea to consider the contour
as a periodic curve in 2D which can be represented by a Fourier series. The Fourier
descriptors are the coefficients of the Fourier series and may be used to create descriptors
that are invariant to rotation and scale. The software provided with the project can
produce many descriptors per objects. The goal of your design is to find a classifier that
strives for minimal error rate.
The software that is provided within this project can calculate up to 64 descriptors
denoted by Zk ,where k ranges from −31 up to +32 . The descriptors are normalized such
that they are independent from the orientation and the size. However, Z0 and Z1 should
not be used, because Z0 does not depend on the shape (but rather on the position) and Z1
is always one (because it is used for the normalization). The given Matlab function,
ut_contourfft, offers the possibility to calculate only a selection of the available
descriptors.
2
Each image in Figure 1 shows the segments of 121 objects. Thus, extraction of the
boundary of each segments, and subsequent determination of the normalized Fourier
descriptors yields a training set of 4&times;121=484 labeled vectors, each vector having 62
elements. An image can be transformed into a set of measurement vectors with the
following fragment of code:
fdlist = [-31:-1 2:32]; % exclude Z0 and Z1
figure; imshow(imrings); title('rings');
[BND,L,Nring,A] = bwboundaries(imrings,8,'noholes');% extract the boundaries
FDS = ut_contourfft(BND,'fdlist',fdlist,'nmag'); % calculate the FDs
Zrings = zeros(Nring,length(fdlist)); % allocate space
for n=1:Nring
Zrings(n,:) = FDS{n}'; % collect the vectors
end
Note that the function bwboundaries is part of the image processing toolbox in matlab so
you need a computer that has access to this toolbox to be able to import your data.
Similar pieces of code are needed to get the measurement vectors from the other classes.
The filenames of the four images are: rings.tif, nuts6.tif, nuts4.tif and bolts.tif. The
function ut_contourfft accompanies the images.
a) Write the code needed to import the data into Matlab and then into PRTools format.
Hint: use repmat to create the array with labels. For instance, repmat('ring',[Nring 1])
creates an array of Nring entries containing the string ‘ring’. Two data sets A and B may
be merged into one new dataset using C=[A;B];
b) Use 5-fold cross validation to estimate the average performance of the kNN classifier
for k=3 when the classifier is using the ten Fourier descriptors Z2,Z3,…,Z11.
c) Keep 21 examples per class as a holdout test set and use the rest for classifier design.
Use the result from the test set to calculate and plot the posterior P(q|k,N) for the
unknown performance q. Based on the plot, determine a Bayesian confidence interval
where there is a visible difference between P(e,k.N) and the coordinate axis.
d) Use the function confmat to compute a confusion matrix for the reduced data which
gives an impression which classes are likely to be confused.
(e*) Implement the greedy pairs feature selection algorithm when used together with the
kNN classifer for k=3 and apply it to the iris data set. NOTE: You do not have to write a
function that is compatible with the data structures in PRTools!! Simply write your
algorithm in matlab where in each iteration classification is performed using knnc using
the pair of features selected. Determine the optimal number of pairs of features according
to a five fold cross validation.
Good Luck!
3
```