The Program System for Intellectual Data Analysis, Recognition and Forecasting

YU.I. ZHURAVLEV, V.V. RYAZANOV, O.V. SENKO, A.S. BIRYUKOV, D.P. VETROV, A.A. DOKUKIN, D.A. KROPOTOV
Dorodnicyn Computing Centre of the Russian Academy of Sciences
Moscow, Vavilov str., 40
RUSSIAN FEDERATION

Abstract: In many spheres of human activity, such as medicine, economics, sociology, physics and biology, there arise data analysis tasks of recognition, classification and forecasting. The users, who are usually not experts in pattern recognition theory, prefer ready-made software. This paper presents the first version of a universal program system for intellectual data analysis, recognition and forecasting. It combines different methods and approaches with a friendly graphical user interface and can serve as an effective instrument both for specialists in applied tasks and for experts in pattern recognition.

Key-Words: Pattern recognition, data-mining, knowledge discovery, intellectual software, logical regularities, classifier fusion

1 Introduction

Tasks of classification, forecasting or taxonomy on multivariate empirical data sets arise quite often in many branches of human activity. These tasks can be solved with computer methods of pattern recognition and unsupervised classification (cluster analysis). Pattern recognition and cluster analysis tools exist in many program systems currently offered on the market. For example, such methods are included in well-known statistical packages (STATISTICA, SPSS, STADIA, Forecast Expert). In addition to statistical tools, there are packages that solve pattern recognition tasks with neural networks and with methods developed by Artificial Intelligence researchers. Examples are the neural network packages BrainMaker and NeuroShell, the systems of reasoning by precedents (case-based reasoning) KATE tools and Pattern Recognition WorkBench, and the package WizWhy, which is based on the idea of limited enumeration.
The main drawback of most existing packages is their overly narrow specialization. Often they offer only one method, or several methods that belong to the same approach (statistical, neural network or other). At the same time, methods or even whole approaches that are successful on certain types of tasks are not sufficiently represented on the market. This situation limits the possibility of selecting the truly best of the existing methods or of constructing aggregation schemes. This work presents the first version of a universal program system for intellectual data analysis, recognition and forecasting that avoids these drawbacks and considerably facilitates research in the pattern recognition field.

The article is organized as follows. Section 2 states the main principles of such a system, Section 3 enumerates the integrated methods, Section 4 discusses quality control procedures, Section 5 is devoted to visualization and Section 6 presents the experimental results.

2 Basic concepts of the system

The common requirements for the system are based on the ideas of universality and intellectuality. Universality is understood as the possibility of applying the system to the widest variety of task types (in terms of dimension, type, quality, data structure and output values). Intellectuality is understood as the presence of auto-tuning elements and the ability of an unqualified user to solve tasks successfully.
In this context the software product should include the following options: joint implementation of different recognition and classification methods; construction of classifier fusion algorithms based on the solutions produced by different methods [17], [11]; a modern, friendly graphical user interface; a unified user interface for the different methods; visual presentation of the data and of the training results; presentation of training and classification reports in a unified, convenient form; automatic training; ready-to-use training scenarios; and methods for estimating training quality [12]. In addition, during development particular emphasis was placed on the possibility of upgrading the system further without changing the modules already developed.

3 Integrated methods

For the purposes mentioned above, a large variety of methods from different algorithmic families were implemented and integrated into the system. They are the pattern recognition methods "q-nearest neighbors" [2], [13], "linear Fisher discriminant" [2], [13], "linear machines" [10], "multilayer perceptron", "test algorithm" [17], "support vector machines" [1], "method of statistically weighted syndromes" [5], "estimations evaluation algorithm" [17], [11] and "logical regularities" [9], [18], and the clustering techniques "k-means", "iterative optimization", "hierarchical grouping" and "method of local optimization" [2]. In addition, detailed research was undertaken on possible modifications of these methods. Most of the integrated methods either modify the classical versions or were designed in-house. For example, the implementation of the linear Fisher discriminant includes regularization of the covariance matrix and fuzzy classification, and the q-nearest neighbors method has regularized estimates for small data sets and can take a priori class probabilities into account.
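The paper does not specify how the covariance matrix is regularized in the system's linear Fisher discriminant. As a minimal sketch of the general idea, assuming a simple shrinkage of the pooled covariance towards a scaled identity matrix, a two-class version could look as follows in Python with NumPy. The function names and the `shrinkage` parameter are hypothetical and are not taken from the system.

```python
import numpy as np


def fit_fisher(X0, X1, shrinkage=0.1):
    """Two-class Fisher discriminant with a shrunk pooled covariance.

    X0, X1: arrays of shape (n_objects, n_features), one per class.
    shrinkage: assumed regularization weight in (0, 1]; hypothetical parameter.
    Returns the projection vector w and a decision threshold.
    """
    X0 = np.asarray(X0, dtype=float)
    X1 = np.asarray(X1, dtype=float)
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    d = X0.shape[1]

    # Pooled within-class covariance estimate.
    S = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    S /= max(len(X0) + len(X1) - 2, 1)

    # Assumed regularization: shrink towards a scaled identity matrix so that
    # the matrix stays invertible when features outnumber objects.
    S_reg = (1.0 - shrinkage) * S + shrinkage * (np.trace(S) / d) * np.eye(d)

    # Fisher direction and a threshold halfway between the projected means.
    w = np.linalg.solve(S_reg, m1 - m0)
    threshold = 0.5 * (w @ m0 + w @ m1)
    return w, threshold


def predict_fisher(X, w, threshold):
    """Return 1 for objects projected above the threshold, otherwise 0."""
    return (np.asarray(X, dtype=float) @ w > threshold).astype(int)


# Synthetic example: 10 features but only 5 objects per class, so the
# unregularized pooled covariance would be singular.
rng = np.random.default_rng(0)
class0 = rng.normal(0.0, 1.0, size=(5, 10))
class1 = rng.normal(3.0, 1.0, size=(5, 10))
w, threshold = fit_fisher(class0, class1, shrinkage=0.2)
labels = predict_fisher(np.vstack([class0, class1]), w, threshold)
```

With a positive shrinkage weight and nonzero within-class variance, the regularized matrix is positive definite, so the linear system can be solved even when the number of features exceeds the number of objects.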
Since the system is highly upgradeable and new methods can be added in a simple plug-in manner, the list of integrated methods can be extended in the future.

The key feature of the system is the implementation of several methods of classifier fusion for recognition and clustering. They are different committee methods (such as taking the maximum, minimum, product, mean and simple majority (mode) of posterior probabilities) [6], the dynamic Woods method [15], the Naive Bayes approach [16], decision templates [8], the clustering and selection method [7], the convex stabilizer [14] and a method for collective unsupervised learning decisions. The standardized output provides a high level of universality, allowing the results of completely different approaches to be joined.

4 The quality control procedures

The "Recognition" software system goes beyond a simple collection of different methods; its next feature worth mentioning is its quality control procedures. It is well known that a pattern recognition task cannot be reduced to separating sets in feature space. It is important for the researcher not only to create an algorithm that realizes the required separation of the data, but also to obtain an estimate of the reliability of such a decision [3]. That is why the software system contains procedures for quality control estimation: different variants of cross-validation and the construction of confidence intervals for the error estimate, based on a statistical approach.

5 Visualization in the system

To make work with the system more efficient and productive, the program provides a flexible, friendly user interface for presenting samples and various reports (the main window of the system is shown in Figure 1).

Fig. 1. The main window of the system

Visualization of data sets includes 1D and 2D projections onto feature subspaces and a special kind of 2D projection that preserves the n-dimensional distances between objects as much as possible.
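The paper does not name the distance-preserving projection used by the system. One standard technique with the same goal is classical multidimensional scaling (MDS), shown here only as a stand-in under that assumption. The sketch below uses Python with NumPy; the function name `classical_mds` is hypothetical.

```python
import numpy as np


def classical_mds(X, n_components=2):
    """Classical (Torgerson) MDS: embed objects so that Euclidean distances
    in the low-dimensional space approximate the original ones."""
    X = np.asarray(X, dtype=float)
    n = len(X)

    # Squared Euclidean distances between all pairs of objects.
    sq = np.sum(X ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)

    # Double centering turns squared distances into a Gram matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J

    # Keep the directions with the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


# Four objects described by three features; the result has two coordinates
# per object that could be drawn as a scatter plot.
objects = np.array([[0.0, 0.0, 0.0],
                    [3.0, 0.0, 0.0],
                    [0.0, 4.0, 0.0],
                    [3.0, 4.0, 0.0]])
coords_2d = classical_mds(objects)
```

When the objects actually lie in a two-dimensional plane, as in this small example, the projection reproduces the pairwise distances exactly; otherwise it gives a least-squares style approximation of the Gram matrix.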
The presentation of all reports is also made in a user-friendly manner. All information is grouped and colored so that it is easy both to grasp the overall picture and to find specific details.

6 Performance in real-life environment

So far the system has been successfully tested on 20 model and real practical tasks. The results for three of them are given here (see also Figure 2).

Fig. 2. Task Examples

6.1 Recognition of melanoma by the set of geometrical and radiological features

The task is to distinguish patients with malignant formations from patients with nonmalignant ones. A set of 33 features, both geometrical and radiological, is used. The training sample consists of 17 descriptions of patients with nonmalignant formations and 11 with malignant ones. The peculiarity of the task is the very small number of objects combined with a relatively high number of features. This situation is considered troublesome for statistical and large parametric models. The results of diagnosing 22 test patients confirm this. The best result was achieved with a logical method ("Test Algorithm"). Note that the classifier fusion results were close to the optimal one: 91% of correct answers, with a confidence interval of [0.02, 0.28] for the error rate at the 95% confidence level.

6.2 Estimation of housing costs in Boston's suburbs

The data was taken from the UCI Repository of Machine Learning Databases and Domain Theories, originally from [4]. The task of automatic estimation of housing costs was solved as a recognition task with five classes ("very low", "low", "average", "high" and "very high"). A set of 13 ecological, social and technical features was used, among them nitric oxides concentration, index of accessibility to radial highways, etc. The training and testing samples contained 242 and 264 objects, respectively. Recognition errors occur only when a description is attributed to a neighboring class, which is natural because the classes were designed artificially. This also explains why the best results were shown by methods that construct some kind of optimal separating surface.

6.3 Recognition of structural peculiarities in ionosphere

The objects of recognition are echoed radar signals carrying information about the presence or absence of certain special structures in the ionosphere. Every signal is encoded with 34 features. The training sample for this task contains 181 objects, while testing is performed on a sample of 170 signals. There are two peculiarities in this task. First, there is a great number of unknown feature values. Second, the second class contains twice as many objects as the first one. This task clearly shows the importance of classifier fusion solutions, whose results are close to those of the single best methods or even surpass them.

These examples and other successfully solved problems allow us to claim that the goals set for the first version of the product have been accomplished. Its universality allows choosing the most appropriate method for a problem, and its intellectuality allows doing so automatically. We hope that the end users of the system will appreciate its performance quality and usability at its true value.

7 Acknowledgements

This work was supported by the Russian Foundation for Basic Research (grants №02-01-00558, 02-07-90134, 02-07-90137, 03-01-00580, 04-01-08045), programs №7, 17 of fundamental research of the RAS presidium, the Foundation of Assistance to Small Innovative Enterprises (contract №1680p/3566) and INTAS (grants №00-626, 00-397, 03-56-182INNO).

References:

[1] C. Burges. A Tutorial on Support Vector Machines. Data Mining and Knowledge Discovery, Vol.2, 1998, pp.121–167.
[2] R. Duda and P. Hart. Pattern Classification and Scene Analysis. John Wiley and Sons, 1973.
[3] S. Gurov. Reliability Estimation of Classifiers. Moscow University Press, 2002.
[4] D. Harrison and D. Rubinfeld. Hedonic prices and the demand for clean air. J. Environ. Economics & Management, Vol.5, 1978, pp.81–102.
[5] A. Jackson, A. Ivshina, O. Senko, A. Kuznetsova, et al. Prognosis of intravesical bacillus Calmette-Guerin therapy for superficial bladder cancer by immunological urinary measurements: Statistically weighted syndromes analysis. Journal of Urology, Vol.159, No.3, 1998, pp.1054–1063.
[6] J. Kittler, M. Hatef, R. Duin, and J. Matas. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.20, No.3, 1998, pp.226–239.
[7] A. Kopustinkas and A. Lipnickas. Classifiers fusion with data dependent aggregation schemes. Proceedings of the 7th International Conference on Information Networks, Systems and Technologies, Vol.1, 2001, pp.147–153.
[8] L. I. Kuncheva, J. S. Bezdek, and R. P. W. Duin. Decision templates for multiple classifier fusion: an experimental comparison. Pattern Recognition, Vol.34, No.2, 2001, pp.299–314.
[9] V. V. Ryazanov. On finding of clusters in the form of hyperparallelepipeds. Proceedings of the 6th International Conference on Pattern Recognition and Information Processing, Minsk, Belarus, 2001.
[10] V. V. Ryazanov and A. S. Obukhov. On using of relaxation algorithm for optimization of linear decision rules (in Russian). Proceedings of the 10th All-Russian Conference on Mathematical Methods of Pattern Recognition, Vol.1, 2001, pp.102–104.
[11] V. V. Ryazanov, O. V. Senko, and Y. I. Zhuravlev. Mathematical methods for pattern recognition: Logical, optimization, algebraic approaches. Proceedings of the 14th International Conference on Pattern Recognition, Brisbane, Australia, 1998.
[12] O. V. Senko. A method for estimating adequacy of approximation models. Pattern Recognition and Image Analysis, Vol.1, No.1, 2001, pp.85–86.
[13] J. Tou and R. Gonzalez. Pattern Recognition Principles. Addison Wesley Publishing Co, Reading, MA, 1974.
[14] D. P. Vetrov. On the stability of the pattern recognition algorithms. Pattern Recognition and Image Analysis, Vol.13, No.3, 2003, pp.470–475.
[15] K. Woods, W. Kegelmeyer, and K. Bowyer. Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.19, 1997, pp.405–410.
[16] L. Xu, A. Krzyzak, and C. Suen. Methods of combining multiple classifiers and their application to handwriting recognition. IEEE Transactions on Systems, Man, and Cybernetics, Vol.22, 1992, pp.418–435.
[17] Y. I. Zhuravlev. Selected Scientific Works (in Russian). Magistr, Moscow, Russia, 1998.
[18] Y. I. Zhuravlev and V. V. Ryazanov. On knowledge extracting from sets of precedents in classification models based on partial precedents principles. Proceedings of the 14th International Conference on Pattern Recognition, Brisbane, Australia, 1998.