The Program System for Intellectual Data Analysis, Recognition and
Forecasting
YU.I. ZHURAVLEV, V.V. RYAZANOV, O.V. SENKO, A.S. BIRYUKOV, D.P. VETROV, A.A. DOKUKIN,
D.A. KROPOTOV
Dorodnicyn Computing Centre of the Russian Academy of Sciences
Moscow, Vavilov str., 40
RUSSIAN FEDERATION
Abstract: In many fields of human activity, such as medicine, economics, sociology, physics and biology, there
arise data analysis tasks of recognition, classification and forecasting. Users in these fields are usually not
experts in pattern recognition theory and prefer ready-to-use software. This paper presents the first version of
a universal program system for intellectual data analysis, recognition and forecasting that combines different
methods and approaches with a friendly graphical user interface. It can serve as an effective instrument both
for application specialists and for experts in pattern recognition.
Key-Words: Pattern recognition, data-mining, knowledge discovery, intellectual software, logical regularities,
classifier fusion
1 Introduction
Tasks of classification, forecasting or
taxonomy on multivariate empirical data sets arise
quite often in many branches of human activity.
These tasks can be solved with computer methods
of pattern recognition and unsupervised
classification (cluster analysis). Pattern recognition
and cluster analysis tools exist in many program
systems now on the market. For example, such
methods are included in well-known statistical
packages (STATISTICA, SPSS, STADIA, Forecast
Expert). In addition to statistical tools there are
packages that solve pattern recognition tasks with
neural networks and with methods developed by
Artificial Intelligence researchers. Examples are the
neural network packages BrainMaker and
NeuroShell, the systems of reasoning by precedents
(case-based reasoning) KATE tools and Pattern
Recognition WorkBench, and the package WizWhy,
which is based on the idea of limited enumeration.
The main drawback of most existing
packages is their overly narrow specialization.
Often they offer only one method, or several
methods that belong to the same approach
(statistical, neural or other). At the same time,
methods and even whole approaches that succeed
on certain types of tasks are poorly represented on
the market. This situation makes it hard to select
the best of the existing methods for a given task or
to construct aggregation schemes.
This work presents the first version of a
universal program system for intellectual data
analysis, recognition and forecasting which avoids
the drawbacks mentioned above and significantly
facilitates research in the pattern recognition field.
The article is organized as follows. Section 2
states the main principles of the system, Section 3
enumerates the integrated methods, Section 4
discusses quality control procedures, Section 5 is
devoted to visualization, and Section 6 presents the
experimental results.
2 Basic concepts of the system
The general requirements for the system are
based on the ideas of universality and intellectuality.
Universality is understood as applicability to the
widest possible variety of tasks (in terms of
dimension, type, quality, data structure and output
values). Intellectuality is understood as the presence
of autotuning elements and the ability of an
unqualified user to solve tasks successfully.
In this context the software product should
provide the following: joint implementation of
different recognition and classification methods,
construction of classifier fusion algorithms based on
solutions produced by different methods [17], [11],
a modern, friendly graphical user interface that is
unified across methods, visual presentation of the
data and of the training results, reports on training
and classification in a unified, convenient form,
automatic training, ready-to-use training scenarios,
and methods for estimating training quality [12].
During development, emphasis was also placed on
the possibility of upgrading the system further
without changing the modules already developed.
3 Integrated methods
For the purposes mentioned above, a large
variety of methods from various algorithmic
families were implemented and integrated into the
system. The pattern recognition methods are
"q-nearest neighbors" [2], [13], "linear Fisher
discriminant" [2], [13], "linear machines" [10],
"multilayer perceptron", "test algorithm" [17],
"support vector machines" [1], "method of
statistically weighted syndromes" [5], "estimations
evaluation algorithm" [17], [11] and "logical
regularities" [9], [18]. The clustering techniques are
"k-means", "iterative optimization", "hierarchy
grouping" and "method of local optimization" [2].
In addition, possible modifications of these
methods were studied in detail. Most of the
integrated methods either modify the classical
versions or were designed in-house. For example,
the implementation of the linear Fisher discriminant
includes regularization of the covariance matrix and
fuzzy classification, and the q-nearest neighbors
method regularizes its estimates for small data sets
and can take a priori class probabilities into account.
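The paper does not specify how the q-nearest neighbors estimates are regularized or how the a priori probabilities enter. The following is only a minimal sketch of one plausible reading, assuming additive smoothing of the neighbor votes and reweighting by the ratio of an assumed prior to the training class frequency. The function name `qnn_posteriors` and the parameters `alpha` and `priors` are hypothetical and are not taken from the system.

```python
import math
from collections import Counter

def qnn_posteriors(train, labels, x, q=3, priors=None, alpha=1.0):
    """Smoothed class estimates for x from its q nearest training objects."""
    classes = sorted(set(labels))
    # Indices of the q training objects closest to x (Euclidean distance).
    nearest = sorted(range(len(train)), key=lambda i: math.dist(train[i], x))[:q]
    votes = Counter(labels[i] for i in nearest)
    # Assumed regularization: additive smoothing keeps the estimates away
    # from 0 and 1 when the sample is small.
    est = {c: (votes[c] + alpha) / (q + alpha * len(classes)) for c in classes}
    if priors is not None:
        # Assumed prior handling: reweight by prior / training frequency.
        freq = Counter(labels)
        est = {c: est[c] * priors[c] / (freq[c] / len(labels)) for c in classes}
    total = sum(est.values())
    return {c: v / total for c, v in est.items()}

train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
labels = ["a", "a", "a", "b", "b"]
print(qnn_posteriors(train, labels, (0.2, 0.2)))
print(qnn_posteriors(train, labels, (0.2, 0.2), priors={"a": 0.1, "b": 0.9}))
```

With smoothing, a unanimous vote of three neighbors gives an estimate below 1, and a strong prior for the other class can overturn the decision.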
Since the system is highly upgradeable and
new methods can be added in a simple plug-in
manner, the list of integrated methods can be
extended in the future.
The key feature of the system is the
implementation of several classifier fusion methods
for recognition and clustering. These are various
committee methods (taking the maximum,
minimum, product, mean or simple majority of
posterior probabilities) [6], the dynamic Woods
method [15], the Naive Bayes approach [16],
decision templates [8], the clustering and selection
method [7], the convex stabilizer [14] and a method
for collective decisions in unsupervised learning.
The standardized output provides a high level of
universality, allowing the results of completely
different approaches to be combined.
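The committee rules listed above can be illustrated with a short sketch. It assumes that every classifier returns a dictionary of posterior probabilities over the same classes, which is one way to read the "standardized output". The function name `fuse` and the rule names are hypothetical, and this is not the system's actual interface.

```python
from collections import Counter
from math import prod

def fuse(posteriors, rule="mean"):
    """Combine per-classifier posteriors (list of dicts class -> probability)."""
    classes = sorted(posteriors[0])
    if rule == "majority":
        # Each classifier votes for its own most probable class.
        votes = Counter(max(p, key=p.get) for p in posteriors)
        scores = {c: votes[c] / len(posteriors) for c in classes}
    else:
        combine = {"max": max, "min": min, "product": prod,
                   "mean": lambda v: sum(v) / len(v)}[rule]
        scores = {c: combine([p[c] for p in posteriors]) for c in classes}
    total = sum(scores.values())
    if total == 0:
        # Degenerate case (e.g. all products are zero): fall back to uniform.
        return {c: 1 / len(classes) for c in classes}
    return {c: s / total for c, s in scores.items()}

def decide(posteriors, rule="mean"):
    """Class with the highest fused score."""
    fused = fuse(posteriors, rule)
    return max(fused, key=fused.get)

outputs = [{"a": 0.9, "b": 0.1}, {"a": 0.4, "b": 0.6}, {"a": 0.45, "b": 0.55}]
for rule in ("max", "min", "product", "mean", "majority"):
    print(rule, decide(outputs, rule))
```

In this example one confident classifier outweighs two hesitant ones under the mean rule, while the majority vote sides with the two, so the rules can disagree on the same outputs.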
4 The quality control procedures
The "Recognition" software system goes
beyond a simple collection of different methods.
Its next feature worth mentioning is its quality
control procedures.
It is well known that the pattern recognition
task cannot be reduced to separating sets in feature
space. It is important for the researcher not only to
create an algorithm that performs the required
separation of the data, but also to estimate the
reliability of its decisions [3]. That is why the
software system contains procedures for quality
control estimation.
These are different variants of cross-validation
and the construction of confidence intervals for
error estimates, based on a statistical approach.
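The paper does not say which variants of cross-validation are implemented. As an illustration only, here is a minimal sketch of plain k-fold cross-validation with contiguous folds. The names `k_fold_indices`, `cross_val_error` and `one_nn` are hypothetical, and the 1-nearest-neighbor rule on one-dimensional data merely stands in for any of the integrated methods.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k nearly equal contiguous folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_error(objects, labels, fit_predict, k=5):
    """Fraction of objects misclassified when each fold is held out in turn."""
    errors = 0
    for fold in k_fold_indices(len(objects), k):
        held = set(fold)
        train_x = [o for i, o in enumerate(objects) if i not in held]
        train_y = [y for i, y in enumerate(labels) if i not in held]
        predicted = fit_predict(train_x, train_y, [objects[i] for i in fold])
        errors += sum(p != labels[i] for p, i in zip(predicted, fold))
    return errors / len(objects)

def one_nn(train_x, train_y, test_x):
    """Stand-in classifier: label of the nearest training value."""
    return [train_y[min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))]
            for x in test_x]

objects = [0.0, 10.0, 1.0, 11.0, 2.0, 12.0, 3.0, 13.0, 4.6, 14.0]
labels = ["a", "b", "a", "b", "a", "b", "a", "b", "b", "b"]
print(cross_val_error(objects, labels, one_nn, k=5))
```

The object at 4.6 carries a label that its neighbors contradict, so it is the only one misclassified when held out.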
5 Visualization in the system
To make work with the system more efficient
and productive, the program provides a flexible,
friendly user interface for presenting samples and
reports (the main window of the system is shown
in Figure 1).
Fig. 1. The main window of the system
Visualization of data sets includes 1D and 2D
projections onto subspaces of features and a special
kind of 2D projection that preserves the
n-dimensional distances between objects as far as
possible.
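The paper does not name the algorithm behind the distance-preserving 2D projection. One common way to obtain such a projection is to minimize the squared difference between the original and the projected pairwise distances (a raw stress criterion, as in metric multidimensional scaling). The sketch below does this by gradient descent with step halving. It is an assumption about the general idea and not the system's method, and the names `pairwise`, `stress` and `project_2d` are hypothetical.

```python
import math
import random

def pairwise(points):
    """Matrix of Euclidean distances between all pairs of points."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def stress(target, coords):
    """Sum of squared differences between target and projected distances."""
    d = pairwise(coords)
    n = len(coords)
    return sum((d[i][j] - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def project_2d(points, steps=300, lr=0.05, seed=0):
    """Gradient descent on the stress, starting from random 2D coordinates."""
    rng = random.Random(seed)
    target = pairwise(points)
    n = len(points)
    coords = [[rng.random(), rng.random()] for _ in range(n)]
    best = stress(target, coords)
    for _ in range(steps):
        d = pairwise(coords)
        trial = []
        for i in range(n):
            gx = gy = 0.0
            for j in range(n):
                if i != j:
                    scale = 2 * (d[i][j] - target[i][j]) / max(d[i][j], 1e-9)
                    gx += scale * (coords[i][0] - coords[j][0])
                    gy += scale * (coords[i][1] - coords[j][1])
            trial.append([coords[i][0] - lr * gx, coords[i][1] - lr * gy])
        s = stress(target, trial)
        if s < best:
            coords, best = trial, s   # accept the improving step
        else:
            lr /= 2                   # step too long: retry with a shorter one
    return coords, best

objects = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0), (0.5, 0.5, 1)]
coords, final_stress = project_2d(objects)
print(final_stress)
```

Because a step is accepted only when it lowers the stress, the result is never worse than the random starting layout.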
Reports are also presented in a user-friendly
manner. All information is grouped and colored so
that it is easy both to grasp the overall picture and
to find specific details.
6 Performance in real-life environment
To date the system has been successfully
tested on 20 model and real practical tasks. Results
for three of them are given here (see also Figure 2).
6.1 Recognition of melanoma by a set of geometrical and radiological features
The task is to recognize malignant and
non-malignant formations in patients. A set of 33
features, both geometrical and radiological, is used.
The training sample consists of 17 descriptions of
patients with non-malignant formations and 11 with
malignant ones.
The peculiarity of the task is a very small
number of objects combined with a relatively high
number of features. This situation is considered
troublesome for statistical models and for models
with many parameters. The diagnostic results on 22
test patients confirm this. The best result was
achieved with a logical method ("Test Algorithm").
Note that the classifier fusion results were close to
the best one: 91% correct answers, with a
confidence interval of [0.02, 0.28] for the error rate
at the 95% confidence level.
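The paper does not state how the confidence interval is constructed. As a rough plausibility check, the sketch below computes a Wilson score interval for the error rate, assuming that 91% correct answers on 22 test patients means 2 errors out of 22. Both the choice of the Wilson interval and the error count are assumptions made here, and `wilson_interval` is a hypothetical helper.

```python
from statistics import NormalDist

def wilson_interval(errors, n, confidence=0.95):
    """Wilson score interval for an error rate estimated on n test objects."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = errors / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * ((p * (1 - p) / n + z * z / (4 * n * n)) ** 0.5) / denom
    return centre - half, centre + half

# Assumption: 91% correct on 22 test patients means 20 correct and 2 errors.
low, high = wilson_interval(errors=2, n=22)
print(f"95% interval for the error rate: [{low:.3f}, {high:.3f}]")
```

Under these assumptions the interval comes out at roughly [0.03, 0.28], close to but not identical with the reported [0.02, 0.28], so the authors probably used a somewhat different construction.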
Fig. 2. Task Examples
6.2 Estimation of habitation costs in Boston's suburbs
The data were taken from the UCI Repository of
Machine Learning Databases and Domain Theories,
originally from [4].
The task of automatic estimation of habitation
costs was solved as a recognition task with five
classes ("very low", "low", "average", "high" and
"very high"). A set of 13 ecological, social and
technical features was used, among them nitric
oxides concentration, index of accessibility to radial
highways, etc. The training and testing samples
contained 242 and 264 objects respectively.
Recognition errors occur only when a
description is assigned to a neighboring class, which
is natural because the classes were designed
artificially. This also explains why the best results
were shown by methods that construct some kind of
optimal separating surface.
6.3 Recognition of structural peculiarities in ionosphere
The objects of recognition are echoed radar
signals carrying information about the presence or
absence of certain special structures in the
ionosphere. Every signal is encoded with 34
features. The training sample for this task contains
181 objects, and testing is performed on a sample of
170 signals.
There are two peculiarities in this task. First,
there is a great number of unknown feature values.
Second, the number of objects in the second class is
twice as large as in the first one. This task clearly
shows the importance of classifier fusion solutions,
whose results are close to those of the single best
methods or even surpass them.
These examples and other successfully solved
problems allow us to claim that the goals set for the
first version of the product have been accomplished.
Its universality allows the most appropriate method
to be chosen for a problem, and its intellectuality
allows this to be done automatically. We hope that
the end users of the system will appreciate its
performance quality and usability.
7 Acknowledgements
This work was supported by the Russian
Foundation for Basic Research (grants
№02-01-00558, 02-07-90134, 02-07-90137,
03-01-00580, 04-01-08045), programs №7, 17 of
fundamental research of the RAS presidium, the
Foundation of Assistance to Small Innovative
Enterprises (contract №1680p/3566) and INTAS
(grants №00-626, 00-397, 03-56-182INNO).
References:
[1] C. Burges, A Tutorial on Support Vector
Machines, Data Mining and Knowledge Discovery,
Vol.2, 1998, pp.121-167.
[2] R. Duda, P. Hart, Pattern Recognition and Scene
Analysis. John Wiley and Sons, 1973.
[3] S. Gurov, Reliability Estimation of Classifiers.
Moscow University Press, 2002.
[4] D. Harrison and D. Rubinfeld. Hedonic prices
and the demand for clean air. J. Environ. Economics
& Management, Vol.5, 1978, pp.81–102.
[5] A. Jackson, A. Ivshina, O. Senko, A.
Kuznetsova, et al. Prognosis of intravesical
bacillus Calmette-Guerin therapy for superficial
bladder cancer by immunological urinary
measurements: Statistically weighted syndromes
analysis. Journal of Urology, Vol.159, No.3, 1998,
pp.1054–1063.
[6] J. Kittler, M. Hatef, R. Duin, and J. Matas. On
combining classifiers. IEEE Transactions on Pattern
Analysis and Machine Intelligence, Vol.20, No.3,
1998, pp.226–239.
[7] A. Kopustinkas and A. Lipnickas. Classifiers
fusion with data dependent aggregation schemes.
Proceedings of the 7th International Conference on
Information Networks, Systems and Technologies,
Vol.1, 2001, pp.147–153.
[8] L. I. Kuncheva, J. S. Bezdek, and R. P. W. Duin.
Decision templates for multiple classifier fusion: an
experimental comparison. Pattern Recognition,
Vol.34, No.2, 2001, pp.299–314.
[9] V. V. Ryazanov. On finding of clusters in the
form of hyperparallelepipeds. Proceedings of the 6th
International Conference on Pattern Recognition
and Information Processing, Minsk, Belarus, 2001.
[10] V. V. Ryazanov and A. S. Obukhov. On using
of relaxation algorithm for optimization of linear
decision rules (in Russian). Proceedings of the 10th All-Russian Conference on Mathematical Methods of
Pattern Recognition, Vol.1, 2001, pp.102–104.
[11] V. V. Ryazanov, O. V. Senko, and Y. I.
Zhuravlev. Mathematical methods for pattern
recognition: Logical, optimization, algebraic
approaches. Proceedings of the 14th International
Conference on Pattern Recognition, Brisbane,
Australia, 1998.
[12] O. V. Senko. A method for estimating adequacy
of approximation models. Pattern Recognition And
Image Analysis, Vol.1, No.1, 2001, pp.85–86.
[13] J. Tou and R. Gonzalez. Pattern Recognition
Principles. Addison Wesley Publishing Co,
Reading, MA, 1974.
[14] D. P. Vetrov. On the stability of the pattern
recognition algorithms. Pattern Recognition And
Image Analysis, Vol.13, No.3, 2003, pp.470–475.
[15] K. Woods, W. Kegelmeyer, and K. Bowyer.
Combination of multiple classifiers using local
accuracy estimates. IEEE Transactions on Pattern
Analysis and Machine Intelligence, Vol.19, 1997,
pp.405–410.
[16] L. Xu, A. Krzyzak, and C. Suen. Methods of
combining multiple classifiers and their application
to handwriting recognition. IEEE Transactions on
Systems, Man, Cybernetics, Vol.22, 1992, pp.418–435.
[17] Y. I. Zhuravlev. Selected Scientific Works (in
Russian). Magistr, Moscow, Russia, 1998.
[18] Y. I. Zhuravlev and V. V. Ryazanov. On
knowledge extracting from sets of precedents in
classification models based on partial precedents
principles. Proceedings of the 14th International
Conference on Pattern Recognition, Brisbane,
Australia, 1998.