Knowledge Discovery in Databases: A Comparison of Different Views. Eva Andrássyová, MSc., Dept. of Cybernetics and Artificial Intelligence, Technical University of Košice, Slovakia, andrassy@tuke.sk. Ján Paralič, PhD., Dept. of Cybernetics and Artificial Intelligence, Technical University of Košice, Slovakia, paralic@tuke.sk. The field of knowledge discovery in databases (KDD) has grown rapidly and become very popular. The large amounts of data collected and stored may contain useful information, but that information is neither obvious to recognise nor trivial to extract. No human is capable of sifting through such amounts of data, and even some existing algorithms are inefficient at this task. KDD systems therefore incorporate techniques from a large variety of related fields and exploit their strengths in the process of discovering knowledge. While working on the international project GOAL (Geographic Information On-Line Analysis: GIS - Data Warehouse Integration, INCO-COPERNICUS Project 977091), we studied several publications to form an idea of what the KDD process is, and what it is not; which techniques are applicable in this process; which tasks are to be solved; and which particular steps the process should take. Because of the interdisciplinary nature of KDD, the terminology used varies from source to source. The aim of this paper is to compare the notions and definitions of KDD across the sources we studied and to point out the similarities and the differences. Of the particular steps of the KDD process we focus on the data mining step. (KDD is often misleadingly called data mining.) An attempt to link together the techniques, methods and tasks listed in the sources under different names is presented here in the form of tables. We conclude by selecting the views best suited for our further work.
1 Introduction

Huge amounts of data collected in various processes, whether manufacturing or business (often as a side effect of computerisation), should be thoroughly analysed, as they may contain precious information for decision support. There is nothing new about analysing data; it is the amount of data at which traditional methods become inefficient. It is often misleadingly believed that data mining is a brand-new, powerful technology. "The new is the confluence of (fairly) mature offshoots of such technologies as visualisation, statistics, machine learning and deductive databases, at a time when the world is ready to see their value." [11]

2 The process of KDD

2.1 Definition

When studying the data mining literature, we encountered terms such as data mining, knowledge discovery in databases, and the abbreviation KDD. Different sources explain these terms in rather different ways; some of the explanations are listed below to show their variety. A quite clear definition of data mining is presented in [12] (Simoudis):

Data mining - the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions.

A different view is presented in [10] (Mannila), where the definition is as follows:

Knowledge discovery in databases (often called data mining) aims at the discovery of useful information from large collections of data.

In addition, the author puts special stress on the fact that the task of KDD is inherently interactive and iterative: it is a process of several steps, of which data mining is one. (In the rest of that article, however, it is difficult to distinguish between KDD and DM.) According to [8] (Hedberg), KDD is an abbreviation of "knowledge discovery and data mining", which may lead to confusion.
In our opinion the most elaborate definition is the one in [5] (Fayyad et al.), where the authors state that knowledge discovery in databases is an interactive and iterative process with several steps, of which data mining is one part. The process of KDD is defined as:

The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

The terms of the above definition are explained as follows:
pattern - a model or structure in data (in the traditional sense); an expression in some language describing a subset of the data, or a model applicable to that subset (data comprises a set of facts);
process - implies there are many steps repeated in multiple iterations;
nontrivial (process) - it must involve search for structure, models, patterns, or parameters;
valid - discovered patterns should hold for new data with some degree of certainty;
novel - at least to the system, and preferably to the user;
potentially useful - for the user or task;
understandable - discovered patterns should be understandable, if not immediately then after some postprocessing.

The authors suggest that this definition implies a way of defining quantitative measures for the evaluation of extracted patterns. For validity we can define a measure of certainty; for usefulness, a measure of utility (e.g. gain in some currency due to better prediction). Notions such as novelty and understandability are more subjective; in some cases understandability can be estimated by simplicity (the number of bits needed to describe a pattern). Interestingness is the name of the overall measure, which combines validity, novelty, usefulness and simplicity. Interestingness functions can be defined explicitly, or manifested implicitly through the ordering of discovered patterns by the KDD system. The remaining sources we studied (namely [12]) consider no measures for the evaluation of discovered patterns.
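The idea of a single interestingness measure combining validity, novelty, usefulness and simplicity can be sketched as a weighted sum. The weights and the component scores below are illustrative assumptions, not values taken from [5]:

```python
# Illustrative sketch only: the weighting scheme and the example pattern
# scores are invented; [5] does not prescribe a concrete formula.

def interestingness(validity, novelty, usefulness, simplicity,
                    weights=(0.4, 0.2, 0.3, 0.1)):
    """Combine component scores (each in [0, 1]) into one overall measure."""
    components = (validity, novelty, usefulness, simplicity)
    return sum(w * c for w, c in zip(weights, components))

# A KDD system could use such a score to order discovered patterns:
patterns = {
    "rule A": interestingness(0.9, 0.3, 0.8, 0.7),
    "rule B": interestingness(0.6, 0.9, 0.4, 0.9),
}
ranking = sorted(patterns, key=patterns.get, reverse=True)
print(ranking)  # the highly valid and useful "rule A" ranks first
```

An implicitly manifested interestingness function would correspond to the system producing such a ranking without ever exposing the numeric scores to the user.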
In some views ([10] for instance), evaluation should be left to the user, who is given all patterns that satisfy the user's specifications and have sufficient frequency in the data. In the author's opinion this is an advantage of such a system, as every user has a different subjective measure of interestingness according to his prior knowledge. In most sources the term data mining (DM) is used to name the whole field of knowledge discovery. This confusing use of the terms KDD and DM has historical reasons, and is also due to the fact that most of the work focuses on the refinement of ML and AI algorithms, and on experiments with their applicability, for the data mining step. Preprocessing is often included in this step as a part of the mining algorithm.

2.2 Steps of the KDD process

According to the definition above, KDD is an interactive and iterative process. This means that at any stage the user should have the possibility to make changes (for instance to choose a different task or technique) and repeat the following steps to achieve better results. Tab. 1 lists the particular steps of KDD, comparing the terms of the different sources; it is organised so that the terms in one row refer to the same action.

Simoudis [12] | Mannila [10] | Fayyad et al. [5] | Brachman & Anand [1]
- | understanding the domain | learning the application domain | task discovery
data selection | preparing the data set | creating a target dataset | data discovery
data transformation | preparing the data set | data cleaning and preprocessing; data reduction and projection | data cleaning
- | - | choosing the function of data mining; choosing the data mining algorithm(s) | model development
data mining | discovering patterns (data mining) | data mining | data analysis
result interpretation | postprocessing of discovered patterns | interpretation | output generation
- | putting the results into use | using discovered knowledge | -

Tab. 1 The process of KDD - list of steps.

Data mining receives the most attention in research, and therefore in publications as well.
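The step sequences compared in Tab. 1 share one overall shape; the skeleton below is a hypothetical sketch of that iterative flow. The function names loosely follow Fayyad et al.'s terminology, and the bodies are toy placeholders, not real mining algorithms:

```python
# Hypothetical skeleton of the iterative KDD process; step names loosely
# follow Fayyad et al. [5], bodies are trivial placeholders.

def select_target_data(raw):
    # creating a target dataset: drop missing records
    return [r for r in raw if r is not None]

def clean_and_preprocess(data):
    # data cleaning and preprocessing: deduplicate and order
    return sorted(set(data))

def mine(data):
    # data mining: here, a toy "pattern" (the value range)
    return {"min": min(data), "max": max(data)}

def interpret(pattern):
    # interpretation of the mined pattern for the user
    return f"values range from {pattern['min']} to {pattern['max']}"

def kdd_process(raw, iterations=2):
    """The process is iterative: earlier steps may be revisited."""
    result = None
    for _ in range(iterations):
        data = clean_and_preprocess(select_target_data(raw))
        result = interpret(mine(data))
    return result

print(kdd_process([3, None, 1, 3, 7]))  # prints "values range from 1 to 7"
```

In a real system, each loop iteration would be driven by the user's feedback, e.g. choosing a different mining function or algorithm before repeating the later steps.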
Publications are mostly focused on learning algorithms; some methods combine data mining with the preceding data preparation, which is usually dataset reduction. The KDD process according to [1] is outlined in Fig. 1. The first two steps, task discovery and data discovery, produce the first input (the goal of the KDD process). They are followed by data cleaning, model development, data analysis and output generation. Below, the inputs and steps of a KDD process according to [1] are described in more detail.

Task Discovery is one of the first steps of KDD. The client has to state the problem or goal, which often only seems to be clear. Further investigation is recommended: getting acquainted with the customer's organisation by spending some time on site, and sifting through the raw data (to understand its form, content, organisational role and sources). Only then will the real goal of the discovery be found.

Data Discovery is complementary to task discovery. In this step we have to decide whether the quality of the data is satisfactory for the goal (what the data does or does not cover).

[Fig. 1 Schema of the KDD process: the process steps task discovery, data discovery, data cleaning, model development, data analysis and output generation; the inputs goal, domain model, data dictionary and database; the supporting tools (query, statistics & AI, visualisation, presentation and data transformation tools); and the outputs report, action and monitor.]

The Domain Model plays an important role in the KDD process, though it often remains only in the mind of the expert. A data dictionary, integrity constraints and various forms of metadata from the DBMS can contribute to the retrieval of background knowledge for KDD purposes, as can some analysis techniques.
These techniques can take advantage of formally represented knowledge when fitting data to a model (for example, ML techniques such as explanation-based learning integrated with inductive learning techniques).

Data Cleaning is often necessary, though it may happen that something removed by cleaning was in fact an indicator of an interesting domain phenomenon (outlier or key data point?). The analyst's background knowledge is crucial in data cleaning based on comparisons of multiple sources. Another approach is to clean the data with editing procedures before it is loaded into the database. Recently, the data for KDD often comes from data warehouses, which contain data that has already been cleaned to some degree.

Model Development is an important phase of KDD that must precede the actual analysis of the data. Interaction with the data leads analysts to the formation of hypotheses, often based on experience and background knowledge. The sub-processes of model development are: data segmentation (unsupervised learning techniques, for example clustering); model selection (choosing the best type of model after exploring several different types); and parameter selection (setting the parameters of the chosen model).

Data Analysis is, in general, the ambition to understand why certain groups of entities behave the way they do; it is the search for laws or rules of such behaviour. The parts where such groups are already identified should be analysed first. The sub-processes of data analysis are: model specification - some formalism is used to denote a specific model; model fitting - where necessary, the specific parameters are determined (in some cases the model is independent of the data, in others it has to be fitted to training data); evaluation - the model is evaluated against the data; model refinement - the model is refined in iterations according to the evaluation results. As mentioned above, model development and data analysis are complementary, which often leads to oscillation between these two steps.
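The fit-evaluate-refine loop spanning model development and data analysis can be illustrated with a minimal sketch. The two-means segmentation below is entirely illustrative (a toy stand-in for the clustering-based data segmentation mentioned above, not an algorithm from [1]):

```python
# Minimal illustrative two-means segmentation of 1-D data, showing the
# fit / evaluate / refine loop described above. Not taken from [1].

def two_means(values, iterations=10):
    # model specification: two cluster centres, initialised at the extremes
    c1, c2 = min(values), max(values)
    for _ in range(iterations):          # model refinement loop
        a = [v for v in values if abs(v - c1) <= abs(v - c2)]
        b = [v for v in values if abs(v - c1) > abs(v - c2)]
        # model fitting: re-estimate each centre from its segment
        c1, c2 = sum(a) / len(a), sum(b) / len(b)
    # evaluation: within-cluster squared error for the final segments
    sse = sum((v - c1) ** 2 for v in a) + sum((v - c2) ** 2 for v in b)
    return (c1, c2), sse

centres, error = two_means([1.0, 1.2, 0.8, 9.0, 9.5, 8.5])
print(centres)  # approximately (1.0, 9.0)
```

Here the evaluation result (the within-cluster error) is exactly the kind of feedback that would send an analyst back to model development, e.g. to try a different number of segments or a different model type.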
Output Generation - the output can take various forms. The simplest form is a report with the analysis results. Other, more complicated forms are graphs, or, in some cases, action descriptions that might be acted on directly. The output may also be a monitor, which triggers an alarm or action under certain conditions. The output requirements might determine the task of the designed KDD application.

3 Tasks of DM

For the title of this section we adopted the term tasks of DM, though we encountered several different terms, as shown in Tab. 2. The table is organised so that the tasks in one row (from different sources) refer to the same task; this organisation is based on the descriptions of the table items available in the particular sources.

Freitas [6] tasks | Simoudis [12] operations | Fayyad et al. [5] model functions | Mannila [10] problems | Fayyad et al. [4] tasks
discovery of SQO rules | query and reporting | - | - | -
discovery of database dependencies | multidimensional analysis | - | finding keys or functional dependencies | -
discovery of association rules | link analysis (associations or relations between the records) | association rules | link analysis | -
dependence modeling | - | dependency modeling | - | dependency modeling
causation modeling | - | causation modeling | - | -
deviation detection | deviation detection | change and deviation detection | deviation detection | -
clustering | database segmentation (clustering) | clustering | clustering | clustering
- | sequence analysis | - | finding episodes from sequences | -
classification | predictive modelling (C4.5, NN) | classification | - | classification
regression | predictive modelling (C4.5, NN) | regression | - | regression
summarisation | statistical analysis (EDA, i.e. exploratory data analysis) | summarisation | - | summarisation (EDA)

Tab. 2 Tasks of DM.
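One task compared in Tab. 2, the discovery of association rules, can be made concrete with a small sketch: the support and confidence of a rule X -> Y computed over a transaction list (the market-basket data below is invented for illustration):

```python
# Support and confidence of an association rule X -> Y, computed from a
# small illustrative list of market-basket transactions.

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, db):
    # fraction of transactions containing every item of the itemset
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(x, y, db):
    # of the transactions containing x, the fraction also containing y
    return support(x | y, db) / support(x, db)

print(support({"bread", "milk"}, transactions))      # 2/4 = 0.5
print(confidence({"bread"}, {"milk"}, transactions))  # 0.5 / 0.75 = 2/3
```

A rule such as bread -> milk would be reported only if both factors exceed user-chosen thresholds, which is how the "sufficient frequency" criterion mentioned in Section 2.1 is typically operationalised.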
We accepted the list of tasks in the first column as a standard; a brief description of the DM tasks follows (for more details see [6]):

discovery of SQO rules - a syntactical transformation of an incoming query to produce a more efficient query by adding or removing conjuncts; characteristic of SQO rules is that the query processing time (derived from the access method and indexing scheme of the DBMS) is taken into account as the cost of an attribute;
discovery of database dependencies - here the term refers to relationships among the attributes of relations;
discovery of association rules - relationships between sets of items, characterised by support and confidence factors;
dependence modeling - dependencies among attributes in the form of if-then rules, such as "if (A is true) then (C is true)";
deviation detection - focuses on the discovery of significant deviations between the actual contents of a data subset and its expected contents;
clustering - a classification scheme where the classes are not known in advance;
causation modeling - cause-and-effect relationships among attributes;
classification - each tuple belongs to one class out of a pre-defined set of classes;
regression - similar to classification, but the predicted value is continuous;
summarisation - a summary describing properties shared by most of the tuples belonging to the same class (related to exploratory data analysis, EDA).

4 Conclusion

In this paper we aimed to give an introductory overview of the field of knowledge discovery in databases, with a focus on one part of it - data mining. Most researchers agree that KDD is a process of several steps, in which data preparation is as important as the knowledge extraction itself. Less attention is given to the evaluation and usage of the extracted knowledge, so there is potential for further work here.
5 Acknowledgements

This work is supported by the European Commission within the INCO-COPERNICUS Project 977091 "Geographic Information On-Line Analysis: GIS - Data Warehouse Integration (GOAL)".

6 References

[1] Brachman, R.J.; Anand, T. (1996): The Process of Knowledge Discovery in Databases. In Advances in Knowledge Discovery & Data Mining, Fayyad, U.M.; Piatetsky-Shapiro, G.; Smyth, P.; Uthurusamy, R., Eds. AAAI/MIT Press, Cambridge, Massachusetts.
[2] Chen, M.S.; Han, J.; Yu, P.S. (1996): Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, pp. 866-883.
[3] Fayyad, U.M. (1996): Data Mining and Knowledge Discovery: Making Sense Out of Data. IEEE Expert, Vol. 11, No. 5, pp. 20-25.
[4] Fayyad, U.M.; Piatetsky-Shapiro, G.; Smyth, P. (1996): From Data Mining to Knowledge Discovery: An Overview. In Advances in Knowledge Discovery & Data Mining, Fayyad, U.M.; Piatetsky-Shapiro, G.; Smyth, P.; Uthurusamy, R., Eds. AAAI/MIT Press, Cambridge, Massachusetts.
[5] Fayyad, U.M.; Piatetsky-Shapiro, G.; Smyth, P. (1996): The KDD Process for Extracting Useful Knowledge from Volumes of Data. Communications of the ACM, Vol. 39, No. 11, pp. 27-34.
[6] Freitas, A.A. (1997): Generic, Set-Oriented Primitives to Support Data-Parallel Knowledge Discovery in Relational Database Systems. Ph.D. Thesis, University of Essex, UK.
[7] Freitas, A.A.; Lavington, S.H. (1998): Mining Very Large Databases with Parallel Processing. Kluwer Academic Publishers, chapter Knowledge Discovery Paradigms. Table of contents at: http://www.ppgia.pucpr.br/~alex/book.html
[8] Hedberg, S.R. (1996): Searching for the Mother Lode: Tales of the First Data Miners. IEEE Expert, Vol. 11, No. 5, pp. 4-7.
[9] Kohavi, R.; John, G. (1998): The Wrapper Approach. In Feature Selection for Knowledge Discovery and Data Mining (Kluwer International Series in Engineering and Computer Science), Liu, H.; Motoda, H., Eds.
[10] Mannila, H. (1997): Methods and Problems in Data Mining. In Proceedings of the International Conference on Database Theory, Afrati, F.; Kolaitis, P., Eds., Delphi, Springer-Verlag.
[11] Mark, B. (1996): Data Mining - Here We Go Again? IEEE Expert, Vol. 11, No. 5.
[12] Simoudis, E. (1996): Reality Check for Data Mining. IEEE Expert, Vol. 11, No. 5.
[13] Weiss, S.M.; Indurkhya, N. (1998): Predictive Data Mining. Morgan Kaufmann Publishers, Inc., San Francisco.