%
% **************************************
%                TOOLDIAG
% **************************************
%
% An experimental pattern recognition package
%
% Copyright (C) 1992, 1993, 1994 Thomas W. Rauber
% Universidade Nova de Lisboa & UNINOVA - Intelligent Robotics Center
% Quinta da Torre, 2825 Monte da Caparica, PORTUGAL
% E-Mail: tr@fct.unl.pt
%
%
% INTRODUCTION
% ------------
%
% TOOLDIAG is an experimental package for the analysis and visualization
% of sensor data. It permits the selection of features for a supervised
% learning task, the error estimation of a classifier based on continuous
% multidimensional feature vectors, and the visualization of the data.
% Furthermore, it contains the Q* learning algorithm, which generates
% prototypes from the training set. The Q* module has the same
% functionality as the LVQ package described below.
% The main purpose of the package is to give the researcher in the field
% a feeling for the usefulness of sensor data for classification purposes.
% The visualization part of the program is carried out by the external
% program GNUPLOT.
% TOOLDIAG has an interface to the classifier program 'LVQ_PAK' of
% T. Kohonen and his programming team at Helsinki University of
% Technology, and uses exactly the same data file format as LVQ_PAK.
% The Stuttgart Neural Network Simulator (SNNS) can also be interfaced:
% it is possible to generate standard network definition files and
% pattern files from the input files. TOOLDIAG can therefore be
% considered a preprocessor for feature selection (use only the most
% important data for the training of the network).
% See the file 'other.systems' for how to obtain copies of the
% interfaced systems.
%
% HOW TO OBTAIN TOOLDIAG
% ----------------------
%
% The TOOLDIAG software package can be obtained via anonymous FTP
% (binary mode).
% Server:    ftp.fct.unl.pt (192.68.178.2)
% Directory: pub/di/packages
% File:      tooldiag<version>.tar.Z
%
% >>>>>>>>>>>>>>>>>>>>>> END OF FILE "README" <<<<<<<<<<<<<<<<<<<<<


REFERENCE
---------
In the current version, great emphasis is put on feature selection.
Many methods of the following book have been implemented:

Devijver, P. A., and Kittler, J., "Pattern Recognition --- A Statistical
Approach," Prentice/Hall Int., London, 1982.

Originally only three feature selection algorithms were implemented. The
application area was the monitoring and supervision of production
processes, using inductive learning of process situations. The reference
for these three original algorithms is:

"A Toolbox for Analysis and Visualization of Sensor Data in Supervision"
T. W. Rauber, M. M. Barata and A. S. Steiger-Garção
Proceedings of the International Conference on Fault Diagnosis,
April 1993, Toulouse, France.

A PostScript file of the paper is available on the same server,
ftp.fct.unl.pt, in the directory pub/di/papers/Uninova under the name
tooldiag.ps.Z. It is a compressed UNIX PostScript version and should
print properly on any PostScript printer.

The Q* algorithm is described in:

Rauber, T. W., Coltuc, D., and Steiger-Garção, A. S., "Multivariate
discretization of continuous attributes for machine learning," in
K. S. Harber (Ed.), Proc. 7th Int. Symposium on Methodologies for
Intelligent Systems (Poster Session), Trondheim, Norway, June 15-18,
1993, Oak Ridge National Laboratory, ORNL/TM-12375, Oak Ridge, TN, USA,
1993. (Available as ismis93.ps.Z at the same site.)
The algorithm can be understood independently of the application
context of the paper.
COPYRIGHT
---------
BECAUSE "TOOLDIAG" AS DOCUMENTED IN THIS DOCUMENT IS LICENSED FREE OF
CHARGE, I PROVIDE ABSOLUTELY NO WARRANTY, TO THE EXTENT PERMITTED BY
APPLICABLE STATE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING, I,
THOMAS W. RAUBER, PROVIDE THE "TOOLDIAG" PROGRAM "AS IS" WITHOUT
WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE
OF THE PROGRAM IS WITH YOU. SHOULD THE "TOOLDIAG" PROGRAMS PROVE
DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR
CORRECTION.

THE "TOOLDIAG" PROGRAMS ARE FOR NON-COMMERCIAL USE ONLY, EXCEPT WITH
THE WRITTEN PERMISSION OF THE AUTHOR.


INSTALLATION
------------
unix> uncompress tooldiag<version>.tar.Z
unix> tar xf tooldiag<version>.tar


COMPILATION
-----------
Edit the file "def.h" in the "src" directory and change the predefined
variable DATA_DIR to the directory to which all output of the package
should be directed (the naming convention differs between DOS and
UNIX). Then type 'make' under UNIX. Some machines need slight
adjustments; e.g. if the C library provides only the functions 'rand()'
and 'srand()', define the macro ONLY_RAND. The memory allocation
routine 'malloc()' may also be declared differently.
Under DOS, create a project that includes all .c files and use the
'large' memory model for compilation. Uncomment the line in the file
dos.h which contains the macro #define DOS.


SETUP
-----
Ensure that the program(s) called from TOOLDIAG are included in your
current PATH variable, e.g. under UNIX by adapting the file 'setup' and
executing it (under UNIX and bash with the command: . setup). This is
especially important for the GNUPLOT program.


EXECUTION
---------
The program can be called with or without command line options. If it
is called without options, it will automatically ask for the respective
data files. The following command line options are available:

tooldiag [-v] [-dir <data-directory> | -file <data-file>]
         [-sel <selected-features-file>] [-fnam <feature-names-file>]

-v     TOOLDIAG outputs control messages (files saved etc.)
-sel   Load a set of already selected features from a file
-fnam  Description file of the features (optional)
-dir   Use the following directory as the data input directory.
       <data-directory> is the name of the data directory.
   --- or ---
-file  Use the following file as the input data.
       <data-file> is the name of that file (full path).

Example:
    tooldiag -dir /usr/users/tr/ai/tooldiag/universes/iris/
will call the program and load the data files in the specified
directory.


- FILE FORMAT OF THE DATA FILES
*******************************
The iris flower data provided by Fisher in 1936 is used as an
illustration of the file formats. You have 2 options:

1.) Load data from directory
----------------------------
The directory in which the data files are located must not contain any
files other than the data files. E.g. the directory for the iris flower
data contains only three files; each class is represented by one file.
Each data file must have the following schema:

<class_name>
<dimension_of_the_feature_vector>
<number_of_samples>
<feature1_sample1 feature2_sample1 ... featureN_sample1>
...
<feature1_sampleM feature2_sampleM ... featureN_sampleM>

The data up to the feature values are ASCII characters. The feature
values can be ASCII characters or binary floating point numbers
(advantageous for very large data files).
Note that the binary values are machine dependent.

1.) The <class_name> defines the name for each class. No two classes
    may have the same name. E.g. 'setosa' is the name of one class of
    the iris flower data.
2.) The dimension of the feature vector. In the iris case, 4 features
    are given.
3.) The number of samples provided in this file. 50 would be specified
    for each of the 3 flower classes.
4.) The values of the features for each sample. In the ASCII case, 50
    lines with 4 real numbers each would be specified here.

Example: The directory /usr/users/tr/ai/tooldiag/universes/iris/samples/
contains 3 files:

iris1.dat  iris2.dat  iris3.dat

The file iris3.dat looks like this (feature values in ASCII):

virginica
4
50
6.3 3.3 6.0 2.5
5.8 2.7 5.1 1.9
... (48 more lines) ...

2.) Load data from a single file
--------------------------------
This file format is compatible with that of the LVQ_PAK package of
Kohonen. All input data is stored in a single file with the following
format:

<dimension_of_the_feature_vector>
<feature1_sample1 feature2_sample1 ... featureN_sample1> <class_name_i>
...
<feature1_sampleM feature2_sampleM ... featureN_sampleM> <class_name_j>

Example: The file iris.dat looks like this:

4
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
...
5.0 3.3 1.4 0.2 setosa
7.0 3.2 4.7 1.4 versicolor
6.4 3.2 4.5 1.5 versicolor
...
5.1 2.5 3.0 1.1 versicolor
5.7 2.8 4.1 1.3 versicolor
6.3 3.3 6.0 2.5 virginica
5.8 2.7 5.1 1.9 virginica
...
6.2 3.4 5.4 2.3 virginica
5.9 3.0 5.1 1.8 virginica

Here too the feature values can be ASCII characters or binary floating
point numbers. In the binary case the class name follows directly after
the last byte of the last feature value of the sample, and a 'new line'
character must appear after the class name.

Example for a binary file:
.........<byte><byte>setosa<new-line><byte><byte>...


- FILE FORMAT OF THE FEATURE SELECTION FILE
*******************************************
After a subset of all features has been selected, the indices of the
features are stored in a text file. The first line contains the NAME of
the data file or directory from which the data comes. The second line
contains the NUMBER OF SELECTED FEATURES. Comment lines, starting with
the comment character '#', can be introduced. The next line indicates
whether the data was NORMALIZED to [0,1] before the features were
selected, or UNNORMALIZED. The following lines contain the feature
INDICES, each together with the score of the SELECTION CRITERION of the
feature selection algorithm.

Example: "iris.sel"

iris.dat
2
# Were the feature values normalized to [0,1] during selection ?
unnormalized
2 0.120000
3 0.046667


- FILE FORMAT OF THE FEATURE DESCRIPTION FILE
*********************************************
This (optional) file contains the names of the features. The first line
holds the number of features, which must be equal to the number of
features of the data file(s). Each following line contains one feature
name; the name must be a single connected string. If available, the
feature descriptions are forwarded to the SNNS 'net' file. For the iris
flowers, for example, we have the file "iris.nam":

4
sepal_length
sepal_width
petal_length
petal_width


MENUS
=====
In this section the functionality of the main menu and its submenus is
outlined.

Load universe from directory or file
------------------------------------
Load the feature data, in one of the two file formats described above
(a minimal loader sketch for the single-file format follows).
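As an illustration, the following is a minimal sketch of a loader for
the single-file ASCII format. It is not TOOLDIAG code; the program name
'loadfile' and the limit of 63 characters per class name are
assumptions made for the example.

    /* Minimal sketch: read an LVQ_PAK-style ASCII data file.
       Illustrative only, not part of TOOLDIAG. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        FILE *fp;
        double *vec;
        char name[64];          /* assumed class name limit */
        int dim, i, n = 0;

        if (argc != 2 || (fp = fopen(argv[1], "r")) == NULL) {
            fprintf(stderr, "usage: loadfile <data-file>\n");
            return 1;
        }
        /* first line: dimension of the feature vector */
        if (fscanf(fp, "%d", &dim) != 1 || dim <= 0) {
            fprintf(stderr, "bad dimension\n");
            return 1;
        }
        vec = (double *) malloc(dim * sizeof(double));
        /* each further line: dim feature values, then the class name */
        for (;;) {
            for (i = 0; i < dim; i++)
                if (fscanf(fp, "%lf", &vec[i]) != 1)
                    goto done;
            if (fscanf(fp, "%63s", name) != 1)
                break;
            n++;                /* one complete sample read */
        }
    done:
        printf("read %d samples of dimension %d\n", n, dim);
        free(vec);
        fclose(fp);
        return 0;
    }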
Normalize data to [0,1]
-----------------------
A normalization is applied to all features using the linear transform:

    new_value = (old_value - min) / (max - min)

This scales the values of all classes to the interval 0.0 to 1.0, which
is especially useful if the value ranges of the different features
differ greatly. The normalization is done separately for each feature,
i.e. univariately.

Feature selection
-----------------
Different search strategies are combined with different selection
criteria. Auxiliary functions, such as loading and saving sets of
selected features, are also provided.

Feature extraction
------------------
Linear feature extraction is implemented here. A matrix is calculated
which maps the original samples to new samples of lower dimension. The
new features are linear combinations of the old features.

Learning with the Q* algorithm
------------------------------
The totality of all samples is compressed into a small set of
representative prototypes; this module performs the same function as
the LVQ algorithm of Kohonen. The effort needed to obtain an optimal
prototype set is proportional to the complexity of the statistical
distribution of the data. For instance, the "setosa" class of the iris
data needs only one prototype, since it is linearly separable from the
other two classes; the classes "virginica" and "versicolor" need more
prototypes since they overlap.
The original algorithm described in the paper updates each new
prototype as the MEAN of all samples that were classified correctly.
Now the MARGINAL MEDIAN and the VECTOR MEDIAN can also be chosen as the
updated prototype:
MARGINAL MEDIAN: the vector of the one-dimensional medians of each
                 feature.
VECTOR MEDIAN:   the vector median of n multidimensional samples is the
                 sample with the minimum sum of Euclidean distances to
                 the other samples.

Error estimation
----------------
Perform an error estimation using the leave-one-out method. A nearest
neighbor classifier with Euclidean distance is used to decide whether a
sample is classified correctly or not. A graph is generated with the
error rate as a function of the number of selected features. Note that
the order of the selected features is irrelevant for some search
strategies (e.g. Branch & Bound) but important for others (e.g.
Sequential Forward Search).

Identify from independent data
------------------------------
Use the K-Nearest-Neighbor classifier to test the accuracy on an
independent data set. The unknown sample is compared to *each* training
sample (no data compression).

Sammon plot for classes
-----------------------
Generate a 2-dimensional plot of the data, considering the selected
features. You have to specify the number of iterations for the
procedure that arranges the samples in the x-y plane; the more
iterations, the better (up to a certain saturation). Highly overlapping
classes yield a graph with little information. In this case you might
repeat the experiment with only two classes, for instance; then a
better separation may become visible.

Statistical analysis
--------------------
Only an analysis of the linear correlation between 2 features is
available. One or more classes can be considered. The statistical
parameters MEAN and STANDARD DEVIATION are calculated for the two
features over all included classes; then the COVARIANCE and finally the
CORRELATION are calculated.
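As a minimal sketch of the computation performed here (illustrative
only, not TOOLDIAG code), assume x and y hold the values of the two
chosen features for the n samples of the included classes:

    #include <math.h>

    double correlation(const double *x, const double *y, int n)
    {
        double mx = 0.0, my = 0.0;
        double sxx = 0.0, syy = 0.0, sxy = 0.0;
        int i;

        for (i = 0; i < n; i++) {     /* means of both features */
            mx += x[i];
            my += y[i];
        }
        mx /= n;
        my /= n;
        for (i = 0; i < n; i++) {     /* variances and covariance */
            sxx += (x[i] - mx) * (x[i] - mx);
            syy += (y[i] - my) * (y[i] - my);
            sxy += (x[i] - mx) * (y[i] - my);
        }
        /* correlation = covariance / (std_dev(x) * std_dev(y));
           the common factor 1/n cancels out */
        return sxy / sqrt(sxx * syy);
    }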
The resulting 2-D graph is plotted with the first feature on the x-axis
and the second feature on the y-axis. Consult a standard book on
statistics for the definitions.

Interfacing to other systems
----------------------------
1.) SNNS
1.1) Generate a network specification file for the SNNS program
     package (Stuttgart Neural Network Simulator). The net is a
     3-layer, fully connected feedforward net with the standard error
     backpropagation learning algorithm. The features are connected to
     the input layer; if the feature names are available, they are
     attached to the input neurons. The output layer contains one
     neuron for each class. The hidden layer has
     (2 * number_of_features + 1) neurons (Kolmogorov Mapping Neural
     Network Existence Theorem, Backpropagation Approximation Theorem).
     The user may later modify the topology and functionality of the
     net in the SNNS application. A pattern file for SNNS is also
     generated from the universe data.
1.2) Generate a pattern file from independent data, using only the
     selected features. The input file has the format described above.

2.) LVQ
2.1) Generate a file in the file format of the LVQ package from the
     data, using the selected feature set.
2.2) Read a data file with the same feature dimension as the input
     file and generate an output file with only the selected features.
     Note that the original order of the features is normally not
     preserved.

3.) Merge two data files
This option allows merging two data files column by column. The
dimension of the output feature vector is the sum of the dimensions of
the two input feature vectors. The class names of corresponding samples
in the two input files must be identical.

4.) Split the actual data set into a training set and a test set
Often one part of a single data set is used to induce a classifier and
the other part is used for testing. This option allows specifying a
percentage for the splitting: x% training data, (100-x)% test data.
The splitting is done randomly.

Batch demonstration
-------------------
This item loads the input data (if not already loaded), selects
features with the Sequential Forward Search strategy using the mutual
Euclidean interclass distance as the selection criterion, performs an
error estimation of a K-Nearest-Neighbor classifier (leave-one-out,
Euclidean distance), generates a Sammon plot for the selected features,
and learns prototypes with the Q* algorithm.


ACKNOWLEDGEMENTS
----------------
Special thanks to
- Dr. Hannu Ahonen, VTT (Technical Research Centre of Finland),
  Helsinki, Finland, for the implementation of the Branch and Bound
  feature selection algorithm
- Dinu Coltuc, ICPE (Research Institute for Electrotechnology),
  Bucharest, Romania, for the conception of the Q* learning algorithm