**************************************
%                                    %
%             TOOLDIAG               %
%                                    %
**************************************

An experimental pattern recognition package

Copyright (C) 1992, 1993, 1994 Thomas W. Rauber
Universidade Nova de Lisboa & UNINOVA - Intelligent Robotics Center
Quinta da Torre, 2825 Monte da Caparica, PORTUGAL
E-Mail: tr@fct.unl.pt

INTRODUCTION
------------

TOOLDIAG is an experimental package for the analysis and visualization
of sensor data. It permits the selection of features for a supervised
learning task, the error estimation of a classifier based on continuous
multidimensional feature vectors, and the visualization of the data.
Furthermore, it contains the Q* learning algorithm, which is able to
generate prototypes from the training set. The Q* module has the same
functionality as the LVQ package described below.

Its main purpose is to give the researcher in the field a feeling for
the usefulness of sensor data for classification purposes.

The visualization part of the program is executed by the program GNUPLOT.

The 'TOOLDIAG' package has an interface to the classifier program
'LVQ_PAK' of T. Kohonen of Helsinki University and his programming team.
TOOLDIAG uses exactly the same data file format as LVQ_PAK.

The Stuttgart Neural Network Simulator (SNNS) can also be interfaced.
It is possible to generate standard network definition files and pattern
files from the input files. TOOLDIAG can therefore be considered a
preprocessor for feature selection (use only the most important data for
the training of the network).
See the file 'other.systems' for how to get a copy of the interfaced
systems.

HOW TO OBTAIN TOOLDIAG
----------------------

The TOOLDIAG software package can be obtained via anonymous, binary FTP.

Server:    ftp.fct.unl.pt (192.68.178.2)
Directory: pub/di/packages
File:      tooldiag<version>.tar.Z

REFERENCE
---------

In the current version, a great emphasis is put on feature selection. Many
methods of the following book have been implemented:

Devijver, P. A., and Kittler, J., "Pattern Recognition --- A Statistical
Approach," Prentice/Hall Int., London, 1982.

Originally, only three feature selection algorithms were implemented. The
application area was the monitoring and supervision of production
processes, using inductive learning of process situations. The reference
for these three original algorithms is:

Rauber, T. W., Barata, M. M., and Steiger-Garção, A. S., "A Toolbox for
Analysis and Visualization of Sensor Data in Supervision," Proceedings of
the International Conference on Fault Diagnosis, April 1993, Toulouse,
France.

A PostScript file of the paper is also available on the same server
ftp.fct.unl.pt, in the directory pub/di/papers/Uninova, under the name
tooldiag.ps.Z. This file is a compressed UNIX PostScript version and
should print properly on every PostScript printer.

The Q* algorithm is described in:

Rauber, T. W., Coltuc, D., and Steiger-Garção, A. S., "Multivariate
discretization of continuous attributes for machine learning," in
K. S. Harber (Ed.), Proc. 7th Int. Symposium on Methodologies for
Intelligent Systems (Poster Session), Trondheim, Norway, June 15-18, 1993,
Oak Ridge National Laboratory, ORNL/TM-12375, Oak Ridge, TN, USA, 1993.
(Available as ismis93.ps.Z at the same site.)

(In order to avoid too much information, the algorithm can/should be
considered independent from the context of the paper.)

COPYRIGHT
---------

BECAUSE "TOOLDIAG" AS DOCUMENTED IN THIS DOCUMENT IS LICENSED FREE OF
CHARGE, I PROVIDE ABSOLUTELY NO WARRANTY, TO THE EXTENT PERMITTED BY
APPLICABLE STATE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING, I
THOMAS W. RAUBER PROVIDE THE "TOOLDIAG" PROGRAM "AS IS" WITHOUT
WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE.

THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH
YOU. SHOULD THE "TOOLDIAG" PROGRAMS PROVE DEFECTIVE, YOU ASSUME THE COST
OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

THE "TOOLDIAG" PROGRAMS ARE FOR NON-COMMERCIAL USE ONLY, EXCEPT WITH THE
WRITTEN PERMISSION OF THE AUTHOR.

INSTALLATION
------------

unix> uncompress tooldiag<version>.tar.Z
unix> tar xf tooldiag<version>.tar

COMPILATION
-----------

Edit the file "def.h" in the "src" directory and change the predefined
variable DATA_DIR to the directory to which all output of the package
should be directed (naming is different in DOS and UNIX).
Then type 'make' in UNIX.

Some machines need slight adjustments: e.g. if the C library has only
the functions 'rand()' and 'srand()', define the macro ONLY_RAND; the
memory allocation routine 'malloc()' may also be declared differently.

In DOS, create a project and include all .c files.
Use a 'large' memory model for compilation. Uncomment the line in the
file dos.h which contains the macro #define DOS.
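
As an illustration only, the ONLY_RAND adjustment could take the form of
a small portability shim like the following sketch; the actual macro and
function names in the TOOLDIAG sources may differ:

    /* Sketch: fall back to ANSI C rand()/srand() when the C library
     * lacks the BSD-style random()/srandom() functions. */
    #ifdef ONLY_RAND
    #include <stdlib.h>
    #define random()   rand()
    #define srandom(s) srand(s)
    #endif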

SETUP
-----

Ensure that the program(s) that are called from TOOLDIAG are included
in your current PATH variable.
E.g. in UNIX, adapt the file 'setup' and execute it (under UNIX and bash
with the command: . setup).
This is especially important for the GNUPLOT program.

EXECUTION
---------

The program can be called with or without command line options. If it is
called without options, it will automatically ask for the respective data
files.

The following command line options are available:

tooldiag [-v] [-dir <data-directory> | -file <data-file>]
         [-sel <selected-features-file>]
         [-fnam <feature-names-file>]

-v      TOOLDIAG outputs control messages (files saved etc.)
-sel    Load a set of already selected features from a file
-fnam   Description file of the features (optional)
-dir    Use the name of the following directory as the data input
        directory. <data-directory> is the name of the data directory.
        --- or ---
-file   Use the name of the following file for the input data.
        <data-file> is the name of that file (full path).

Example: tooldiag -dir /usr/users/tr/ai/tooldiag/universes/iris/
will call the program and load the data files in the specified directory.

FILEFORMAT OF THE DATA FILES
****************************

The example of the iris flower data provided by Fisher in 1936 is used
as an illustration of the file formats. You have 2 options:

1.) Load data from directory
----------------------------

The directory in which the data files are located must not contain files
other than the data files. E.g. for the example of the iris flower data,
the directory has only three files. This implies that each class is
represented by one file.

Each data file must have the following schema:

<class_name>
<dimension_of_the_feature_vector>
<number_of_samples>
<feature1_sample1 feature2_sample1 ... featureN_sample1>
...
<feature1_sampleM feature2_sampleM ... featureN_sampleM>

The data up to the feature values are ASCII characters. The feature
values can be ASCII characters or binary floating point numbers
(advantageous for very large data files). Note that the binary values
are machine dependent.

1.) The 'class_name' defines the name for each class. No two classes may
    have the same name. E.g. 'setosa' is the name for one class of the
    iris flower data.
2.) The dimension of the feature vector. In the iris case, 4 features
    are given.
3.) How many samples are provided in this file? 50 should be specified
    for each of the 3 flower classes.
4.) The values of the features for each sample. 50 lines with 4 real
    numbers each would have to be specified here. (ASCII case)

Example:
The directory /usr/users/tr/ai/tooldiag/universes/iris/samples/ contains
3 files: iris1.dat iris2.dat iris3.dat
The file iris3.dat looks like this (feature values in ASCII):

virginica
4
50
6.3 3.3 6.0 2.5
5.8 2.7 5.1 1.9
... (48 more lines)
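
As an aside, a minimal reader for this per-class file format (ASCII
feature values only) could look like the following sketch; the
identifiers are illustrative and not taken from the TOOLDIAG sources:

    #include <stdio.h>
    #include <stdlib.h>

    /* Read one class file: header (class name, dimension, number of
     * samples), then one row of 'dim' feature values per sample. */
    int read_class_file(const char *path)
    {
        char class_name[64];
        int dim, n_samples, i;
        double *samples;
        FILE *fp = fopen(path, "r");

        if (fp == NULL) return -1;
        if (fscanf(fp, "%63s %d %d", class_name, &dim, &n_samples) != 3) {
            fclose(fp);
            return -1;
        }
        samples = malloc((size_t)n_samples * dim * sizeof(double));
        for (i = 0; i < n_samples * dim; i++)
            if (fscanf(fp, "%lf", &samples[i]) != 1) break;

        printf("class %s: %d samples of dimension %d\n",
               class_name, n_samples, dim);
        free(samples);
        fclose(fp);
        return 0;
    }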

2.) Load data from a single file
--------------------------------

This file format is compatible with that of the LVQ_PAK package of
Kohonen. All input data is stored in a single file with the following
format:

<dimension_of_the_feature_vector>
<feature1_sample1 feature2_sample1 ... featureN_sample1> <class_name_i>
...
<feature1_sampleM feature2_sampleM ... featureN_sampleM> <class_name_j>

Example:
The file iris.dat looks like this:

4
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
...
5.0 3.3 1.4 0.2 setosa
7.0 3.2 4.7 1.4 versicolor
6.4 3.2 4.5 1.5 versicolor
...
5.1 2.5 3.0 1.1 versicolor
5.7 2.8 4.1 1.3 versicolor
6.3 3.3 6.0 2.5 virginica
5.8 2.7 5.1 1.9 virginica
...
6.2 3.4 5.4 2.3 virginica
5.9 3.0 5.1 1.8 virginica

Here too, the feature values can be ASCII characters or binary floating
point numbers. In the case of binary feature values, the class name
follows directly after the last byte of the last feature value for that
sample. After the class name, a 'new line' character must appear.

Example for a binary file:
.........<byte><byte>setosa<new-line><byte><byte>...
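
For the ASCII variant of this single-file format, a reader could be
sketched as follows; the identifiers are illustrative and not from the
TOOLDIAG or LVQ_PAK sources:

    #include <stdio.h>

    /* Read an LVQ_PAK-style file: the dimension on the first line,
     * then one sample per line: 'dim' feature values plus class name. */
    int read_single_file(const char *path)
    {
        int dim, k;
        double v;
        char class_name[64];
        FILE *fp = fopen(path, "r");

        if (fp == NULL) return -1;
        if (fscanf(fp, "%d", &dim) != 1) { fclose(fp); return -1; }
        for (;;) {
            for (k = 0; k < dim; k++)           /* feature values ...  */
                if (fscanf(fp, "%lf", &v) != 1) { fclose(fp); return 0; }
            if (fscanf(fp, "%63s", class_name) != 1) {
                fclose(fp);                     /* truncated sample    */
                return -1;
            }
            /* one sample (dim values + class_name) is now available */
        }
    }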

FILEFORMAT OF THE FEATURE SELECTION FILE
****************************************

After a subset of all features has been selected, the indices of the
features are stored in a text file. The first line contains the NAME of
the data file or directory from which the data comes. The second line
contains the NUMBER OF SELECTED FEATURES. Comment lines can be
introduced, starting with the comment character '#'. The next line
indicates whether the data was NORMALIZED to [0,1] before the features
were selected, or UNNORMALIZED. The following lines contain the feature
INDICES together with the score for the SELECTION CRITERION of the
feature selection algorithm.

Example: "iris.sel"

iris.dat
2
# Were the feature values normalized to [0,1] during selection ?
unnormalized
2
0.120000
3
0.046667
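
A parser for this selection file format could be sketched as below,
skipping the '#' comment lines; the identifiers are illustrative and not
from the TOOLDIAG sources:

    #include <stdio.h>

    /* Read the next non-comment line into buf (empty string on EOF). */
    static void next_line(FILE *fp, char *buf, int len)
    {
        buf[0] = '\0';
        while (fgets(buf, len, fp) != NULL && buf[0] == '#')
            ;
    }

    int read_selection_file(const char *path)
    {
        char line[256], data_name[256], norm[32];
        int n_selected, i;
        FILE *fp = fopen(path, "r");

        if (fp == NULL) return -1;
        next_line(fp, line, sizeof line);       /* data file/directory */
        sscanf(line, "%255s", data_name);
        next_line(fp, line, sizeof line);       /* number of features  */
        sscanf(line, "%d", &n_selected);
        next_line(fp, line, sizeof line);       /* (un)normalized flag */
        sscanf(line, "%31s", norm);

        for (i = 0; i < n_selected; i++) {      /* index / score pairs */
            int index; double score;
            next_line(fp, line, sizeof line); sscanf(line, "%d", &index);
            next_line(fp, line, sizeof line); sscanf(line, "%lf", &score);
        }
        fclose(fp);
        return 0;
    }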

FILEFORMAT OF THE FEATURE DESCRIPTION FILE
******************************************

This (optional) file contains the names of the features. In the first
line, the number of features must appear. This number must be equal to
the number of features of the data file(s). Then in each line one
feature name appears. The name must be a single connected string.
If available, the feature description is forwarded to the SNNS 'net'
file.

For example, in the case of the iris flowers we have the file "iris.nam":

4
sepal_length
sepal_width
petal_length
petal_width

MENUS
=====

In this section the functionality of the main menu and its submenus is
outlined.

Load universe from directory or file
------------------------------------
Load the feature data.

Normalize data to [0,1]
-----------------------
A normalization is applied to all features using the linear transform:

    new_value = (old_value - min) / (max - min)

This has the effect of scaling the values of all classes to the interval
0.0 to 1.0. This is especially useful if the differences in the ranges
of the different features are very big.
The normalization is done sequentially for each feature, i.e.
univariately.
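
As a sketch, the univariate min-max normalization described above could
be written like this; the identifiers are illustrative, not from the
TOOLDIAG sources:

    /* Normalize one feature column of n values to [0,1] in place. */
    void normalize_feature(double *values, int n)
    {
        double min = values[0], max = values[0];
        int i;

        for (i = 1; i < n; i++) {
            if (values[i] < min) min = values[i];
            if (values[i] > max) max = values[i];
        }
        if (max > min)                  /* avoid division by zero */
            for (i = 0; i < n; i++)
                values[i] = (values[i] - min) / (max - min);
    }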

Feature selection
-----------------
Different search strategies are combined with different selection
criteria. Besides these, auxiliary functions are provided, such as
loading or saving of selected features.
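
As an illustration, the Sequential Forward Search strategy mentioned
elsewhere in this README could be sketched as follows; the criterion is
passed as a function pointer, and all identifiers are illustrative, not
from the TOOLDIAG sources:

    /* Greedily add, one at a time, the not-yet-used feature that
     * maximizes the selection criterion on the current subset. */
    void sequential_forward_search(int n_features, int n_wanted,
                                   int *selected,
                                   double (*criterion)(const int *, int))
    {
        char used[256] = {0};           /* assumes n_features <= 256 */
        int step, f;

        for (step = 0; step < n_wanted; step++) {
            int best = -1;
            double best_score = 0.0, s;

            for (f = 0; f < n_features; f++) {
                if (used[f]) continue;
                selected[step] = f;     /* tentatively add feature f */
                s = criterion(selected, step + 1);
                if (best < 0 || s > best_score) {
                    best_score = s;
                    best = f;
                }
            }
            selected[step] = best;      /* keep the winning feature  */
            used[best] = 1;
        }
    }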

Feature extraction
------------------
Linear feature extraction is implemented here. A matrix is calculated
which maps the original samples to new samples of lower dimension. The
new features are linear combinations of the old features.
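
In other words, each new sample y is obtained from an old sample x by a
matrix product y = W x. A sketch (identifiers illustrative only):

    /* Apply an m x n extraction matrix W (stored row-major) to a
     * sample x of dimension n, giving a new sample y of dimension m. */
    void project_sample(const double *W, const double *x,
                        double *y, int m, int n)
    {
        int i, j;

        for (i = 0; i < m; i++) {
            y[i] = 0.0;
            for (j = 0; j < n; j++)
                y[i] += W[i * n + j] * x[j];    /* y = W x */
        }
    }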

Learning with the Q* algorithm
------------------------------
The totality of all samples is compressed to a small set of
representative prototypes. This module performs the same function as the
LVQ algorithm of Kohonen.
Q* learns a set of representative prototypes which compress the raw data
considerably. The effort to obtain an optimal prototype set is
proportional to the complexity of the statistical distribution of the
data. For instance, the "setosa" class of the iris data needs only one
prototype, since it is linearly separable from the other two classes.
The other two classes, "virginica" and "versicolor", need more
prototypes since they overlap.
The original algorithm described in the paper updates the new prototypes
as the MEAN of all samples that were classified correctly.
Now the MARGINAL MEDIAN and the VECTOR MEDIAN can also be chosen as the
updated prototype:

MARGINAL MEDIAN: The vector of the unidimensional medians of each
                 feature.
VECTOR MEDIAN:   The vector median of n multidimensional samples is the
                 sample with the minimum sum of Euclidean distances to
                 the other samples.
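
The VECTOR MEDIAN definition above translates directly into code; the
following sketch uses illustrative identifiers (compile with -lm):

    #include <math.h>

    /* Return the index of the sample (row of 'samples', n rows of
     * dimension 'dim') with the minimum sum of Euclidean distances
     * to all other samples. */
    int vector_median(const double *samples, int n, int dim)
    {
        int i, j, k, best = 0;
        double best_sum = -1.0;

        for (i = 0; i < n; i++) {
            double sum = 0.0;
            for (j = 0; j < n; j++) {
                double d2 = 0.0;
                for (k = 0; k < dim; k++) {
                    double d = samples[i * dim + k] - samples[j * dim + k];
                    d2 += d * d;
                }
                sum += sqrt(d2);
            }
            if (best_sum < 0.0 || sum < best_sum) {
                best_sum = sum;
                best = i;
            }
        }
        return best;
    }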

Error estimation
----------------
Perform an error estimation using the leave-one-out method. A nearest
neighbor classifier with Euclidean distance is used to detect whether a
sample was classified correctly or not.
A graph is generated with the error rate as a function of the number of
selected features. Note that the sequence of the selected features is
not relevant for some search strategies (e.g. Branch & Bound), but
important for other strategies (e.g. Sequential Forward Search).
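
A sketch of the leave-one-out estimation with a 1-nearest-neighbor
classifier and Euclidean distance (identifiers illustrative only):

    /* x: n samples of dimension dim (row-major); label: class index
     * per sample. Returns the estimated error rate. Squared distances
     * suffice for the nearest-neighbor comparison. */
    double loo_error(const double *x, const int *label, int n, int dim)
    {
        int i, j, k, errors = 0;

        for (i = 0; i < n; i++) {
            int nearest = -1;
            double best = 0.0;
            for (j = 0; j < n; j++) {
                double d2 = 0.0;
                if (j == i) continue;       /* leave sample i out */
                for (k = 0; k < dim; k++) {
                    double d = x[i * dim + k] - x[j * dim + k];
                    d2 += d * d;
                }
                if (nearest < 0 || d2 < best) { best = d2; nearest = j; }
            }
            if (label[nearest] != label[i]) errors++;
        }
        return (double)errors / n;
    }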

Identify from independent data
------------------------------
Use the K-Nearest-Neighbor classifier to test the accuracy of the data
set. Compare the unknown sample to *each* training sample (no data
compression).

Sammon plot for classes
-----------------------
Generate a 2-dimensional plot of the data, considering the selected
features. You have to specify the number of iterations for the procedure
that arranges the samples in the x-y-plane. The more iterations, the
better (up to a certain saturation).
Highly overlapping classes do not provide much information in the graph.
In this case you might repeat the experiment with only two classes, for
instance. Then a better separation may be visible.

Statistical analysis
--------------------
Only an analysis of the linear correlation between 2 features is
available. One or more classes can be considered.
The statistical parameters MEAN and STANDARD DEVIATION are calculated
for the two features and all included classes. Then the COVARIANCE and
finally the CORRELATION are calculated.
The 2-D graph is plotted with the first feature as the x-axis and the
second feature as the y-axis.
Consult a standard book on statistics for the definitions.
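
A sketch of this computation for two feature columns x and y of n values
each (identifiers illustrative only; compile with -lm):

    #include <math.h>

    /* Returns the linear correlation coefficient of x and y. */
    double correlation(const double *x, const double *y, int n)
    {
        double mx = 0.0, my = 0.0, sx = 0.0, sy = 0.0, cov = 0.0;
        int i;

        for (i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;                         /* means */

        for (i = 0; i < n; i++) {
            sx  += (x[i] - mx) * (x[i] - mx);
            sy  += (y[i] - my) * (y[i] - my);
            cov += (x[i] - mx) * (y[i] - my);     /* covariance (x n) */
        }
        return cov / sqrt(sx * sy);  /* undefined for constant features */
    }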

Interfacing to other systems
----------------------------

1.) SNNS
    1.1) Generate a network specification file for the SNNS program
         package (Stuttgart Neural Network Simulator).
         The net is a 3-layer fully connected feedforward net with the
         standard error backpropagation learning algorithm.
         The features are connected to the input layer. If the feature
         names are available, they are attached to the input neurons.
         The output layer contains one neuron for each class. The hidden
         layer has (2 * number_of_features + 1) neurons (Kolmogorov
         Mapping Neural Network Existence Theorem, Backpropagation
         Approximation Theorem).
         The user may later modify the topology and functionality of the
         net in the SNNS application.
         A pattern file for SNNS is also generated from the universe
         data.
    1.2) Generate a pattern file from independent data using only the
         selected features. The input file has the FORMAT mentioned
         above.

2.) LVQ
    2.1) Generate a file in the file format of the LVQ package from the
         data, using the selected feature set.
    2.2) Read a data file with the same feature dimension as the input
         file and generate an output file with only the selected
         features. Note that the order of the features is normally
         disturbed.

3.) Merge two data files
    This option allows merging two data files column by column. The new
    dimension of the output feature vector will be the sum of the two
    dimensions of the input feature vectors. The class names of each
    sample of the two input files must be identical.

4.) Split the actual data set into a training data set and a test data
    set. Sometimes one part of a single data set is used to induce a
    classifier and the other part is used for test purposes. This option
    allows you to specify a percentage for the splitting: x% training
    data, (100-x)% test data. The splitting is done randomly; see the
    sketch after this list.
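
A sketch of such a random split via a shuffle of the sample indices
(identifiers illustrative only):

    #include <stdlib.h>

    /* Shuffle the indices 0..n-1; the first *n_train of them form the
     * training set, the rest the test set. */
    void split_indices(int *idx, int n, int percent_train, int *n_train)
    {
        int i, j, t;

        for (i = 0; i < n; i++) idx[i] = i;
        for (i = n - 1; i > 0; i--) {       /* Fisher-Yates shuffle */
            j = rand() % (i + 1);
            t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        *n_train = n * percent_train / 100;
    }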

Batch demonstration
-------------------
This item loads the input data (if not already loaded), selects features
by the Sequential Forward Search strategy using the mutual Euclidean
interclass distance as the selection criterion, performs an error
estimation of the classifier using a K-Nearest-Neighbor classifier with
Leave-One-Out and Euclidean distance, generates a Sammon plot for the
selected features, and learns the prototypes using the Q* algorithm.

ACKNOWLEDGEMENTS
----------------

Special thanks to

- Dr. Hannu Ahonen, VTT (Technical Research Centre of Finland),
  Helsinki, Finland,
  for the implementation of the Branch and Bound feature selection
  algorithm.

- Dinu Coltuc, ICPE (Research Institute for Electrotechnology),
  Bucharest, Romania,
  for the conception of the Q* learning algorithm.