Knowledge Discovery in Databases: A Comparison of Different Views.
Eva Andrássyová, MSc.
Dept. of Cybernetics and Artificial Intelligence
Technical University of Košice, Slovakia
andrassy@tuke.sk
Ján Paralič, PhD.
Dept. of Cybernetics and Artificial Intelligence
Technical University of Košice, Slovakia
paralic@tuke.sk
The field of knowledge discovery in databases (KDD) has become very popular and has grown rapidly in recent years. The large amounts of data collected and stored may contain useful information, but this information is neither obvious to recognise nor trivial to obtain. No human is capable of sifting through such amounts of data, and even some existing algorithms are inefficient at this task. KDD systems therefore incorporate techniques from a large variety of related fields and exploit their strengths in the process of discovering knowledge.
While working on the international project GOAL (Geographic Information On-Line Analysis: GIS - Data Warehouse Integration, INCO-COPERNICUS Project 977091), we studied several publications to form an idea of what the KDD process is, and also of what it is not. We examined which techniques are applicable in this process, which tasks are to be solved, and which particular steps the process should consist of. Because of the interdisciplinary nature of KDD, the terminology used varies from source to source.
The aim of this paper is to compare the notions and definitions of KDD across the sources we studied and to point out the similarities and differences. Of the particular steps of the KDD process, we focus on the data mining step. (KDD is often misleadingly called data mining.) An attempt to link together the techniques, methods and tasks listed in each source under different names is presented here in the form of tables. We conclude by selecting the views best suited to our later use.
1 Introduction
The huge amounts of data collected in various processes, whether manufacturing or business (often as a side effect of computerisation), should be thoroughly analysed, as they might contain precious information for decision support. There is nothing new about analysing data; it is the sheer amount of data at which traditional methods become inefficient. It is often misleadingly believed that data mining itself is the new, powerful technology. "What is new is the confluence of (fairly) mature offshoots of such technologies as visualisation, statistics, machine learning and deductive databases, at a time when the world is ready to see their value." [11]
2 The process of KDD

2.1 Definition
When studying the literature on data mining, we encountered terms such as data mining, knowledge discovery in databases, and the abbreviation KDD. Different sources explain these terms in rather different ways. Some of them are listed below to show their variety.
A quite clear definition of data mining is presented in [12] (Simoudis):
Data mining is the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions.
A different view is presented in [10] (Mannila), where the definition reads:
Knowledge discovery in databases (often called data mining) aims at the discovery of useful information from large collections of data.
In addition, the author puts special stress on the fact that the task of KDD is inherently interactive and iterative, and that it is a process consisting of several steps, data mining being one of them. (In the rest of that article, however, it is difficult to distinguish between KDD and DM.)
According to [8] (Hedberg), KDD is an abbreviation of knowledge discovery and data mining, which may lead to confusion.
In our opinion, the most sophisticated definition is the one given in [5] (Fayyad et al.), where the authors establish that knowledge discovery in databases is an interactive and iterative process with several steps, data mining being one part of it. The KDD process is defined as:
The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
The terms of the above definition are explained as follows:
pattern
- a model or structure in data (in the traditional sense)
- an expression in some language describing a subset of the data, or a model applicable to that subset (data comprises a set of facts)
process
- implies that there are many steps repeated in multiple iterations
nontrivial (process)
- it must involve search for structure, models, patterns, or parameters
valid
- discovered patterns should be valid for new data with some degree of certainty
novel
- at least to the system, and preferably to the user
potentially useful
- for the user or the task
understandable
- discovered patterns should be understandable, if not immediately then after some postprocessing.
The authors suggest that this definition naturally leads to quantitative measures for evaluating extracted patterns. For validity we can define a measure of certainty, or of utility (gain in some currency due to better prediction). Notions such as novelty and understandability are more subjective; in some cases understandability can be estimated by simplicity (the number of bits needed to describe a pattern). Interestingness is the name given to an overall measure combining validity, novelty, usefulness and simplicity. Interestingness functions can be defined explicitly, or manifested implicitly through the ordering of discovered patterns by the KDD system.
The remaining sources consider no measures for evaluating discovered patterns (namely [12]). In some views ([10], for instance) the evaluation should be left to the user, who is given all patterns that satisfy the user's specification and have sufficient frequency in the data. In the author's opinion this is an advantage of such a system, as every user has a different subjective measure of interestingness according to his prior knowledge.
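To make the idea concrete, the following minimal sketch shows one possible explicit interestingness function, a weighted combination of the four component measures, used to order discovered patterns; the weights, the simplicity estimate and the toy patterns are our own illustrative assumptions, not taken from any of the cited sources.

# Hypothetical explicit interestingness function: a weighted combination of
# validity, novelty, usefulness and simplicity, each normalised to [0, 1].
# Weights and the simplicity estimate are illustrative assumptions.

def simplicity(description_bits: int, max_bits: int = 256) -> float:
    """Estimate simplicity from the number of bits needed to describe a pattern."""
    return max(0.0, 1.0 - description_bits / max_bits)

def interestingness(validity: float, novelty: float, usefulness: float,
                    description_bits: int,
                    weights=(0.4, 0.2, 0.3, 0.1)) -> float:
    """Overall measure combining validity, novelty, usefulness and simplicity."""
    components = (validity, novelty, usefulness, simplicity(description_bits))
    return sum(w * c for w, c in zip(weights, components))

# A KDD system could apply the measure implicitly, ordering discovered patterns:
patterns = [
    {"rule": "A -> B", "validity": 0.9, "novelty": 0.2, "usefulness": 0.5, "bits": 64},
    {"rule": "C -> D", "validity": 0.7, "novelty": 0.8, "usefulness": 0.6, "bits": 96},
]
patterns.sort(key=lambda p: interestingness(p["validity"], p["novelty"],
                                            p["usefulness"], p["bits"]),
              reverse=True)
print([p["rule"] for p in patterns])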
In most sources the term data mining (DM) is used to name the whole field of knowledge discovery. This confusing use of the terms KDD and DM has historical reasons, and stems from the fact that most of the work has focused on refining ML and AI algorithms for the data mining step and on experiments with their applicability. Preprocessing is often included in this step as part of the mining algorithm.
2.2 Steps of the KDD process
According to the definition above, KDD is an interactive and iterative process. This means that at any stage the user should have the possibility to make changes (for instance, to choose a different task or technique) and to repeat the following steps in order to achieve better results. Tab. 1 lists the particular steps of KDD as named in the different sources; the table is organised so that the terms in one row refer to the same action.
| Simoudis [12] | Mannila [10] | Fayyad et al. [5] | Brachman & Anand [1] |
|---|---|---|---|
| | understanding the domain | learning the application domain | task discovery |
| data selection | | creating a target dataset | data discovery |
| data transformation | preparing the data set | data cleaning and preprocessing; data reduction and projection | data cleaning |
| | | choosing the function of data mining; choosing the data mining algorithm(s) | model development |
| data mining | discovering patterns (data mining) | data mining | data analysis |
| result interpretation | postprocessing of discovered patterns | interpretation | output generation |
| | putting the results into use | using discovered knowledge | |

Tab. 1 The process of KDD - list of steps.
Data mining (its row in Tab. 1) receives the most attention in research, and therefore also in publications. These are mostly focused on learning algorithms; some methods combine data mining with the preceding data preparation step, which is usually dataset reduction.
The KDD process according to [1] is outlined in Fig. 1. The first two steps, task discovery and data discovery, produce the first input (the goal of the KDD process). The following steps are data cleaning, model development, data analysis and output generation. Below, the inputs and steps of a KDD process according to [1] are described in more detail.
Task discovery is one of the first steps of KDD. The client has to state the problem or goal, which often only seems to be clear. Further investigation is recommended, such as getting acquainted with the customer's organisation by spending some time on site and sifting through the raw data (to understand its form, content, organisational role and sources). Only then will the real goal of the discovery be found.
Data discovery is complementary to task discovery. In this step we have to decide whether the quality of the data is satisfactory for the goal (what the data does or does not cover).
[Figure: task discovery and data discovery (with the database and data dictionary as inputs) produce the goal and domain model; data cleaning, model development, data analysis and output generation follow, supported by query, data transformation, statistics & AI, visualisation and presentation tools; the outputs are a report, an action model and a monitor, connected by process flows, data flows and tool-usage links.]
Fig. 1 Schema of the KDD process.
The domain model plays an important role in the KDD process, though it often remains only in the mind of the expert. A data dictionary, integrity constraints and various forms of metadata from the DBMS can contribute to retrieving background knowledge for KDD purposes, as can some analysis techniques. The latter can take advantage of formally represented knowledge when fitting data to a model (for example, ML techniques such as explanation-based learning integrated with inductive learning techniques).
Data cleaning is often necessary, although it may happen that something removed by cleaning was in fact an indicator of an interesting domain phenomenon (outlier or key data point?). The analyst's background knowledge, applied through comparisons of multiple sources, is crucial in data cleaning. Another approach is to clean the data with editing procedures before it is loaded into the database. Recently, the data for KDD increasingly comes from data warehouses, which contain data that has already been cleaned to some degree.
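As a minimal illustration of the point that cleaning should not silently discard potential key data points, the sketch below flags suspicious values instead of deleting them, leaving the decision to the analyst; the robust median-based rule, the threshold and the toy readings are our own illustrative assumptions.

# Minimal sketch: flag outliers rather than delete them, so the analyst can
# decide whether a value is noise or an interesting domain phenomenon.
# The MAD-based rule and the threshold of 3.5 are illustrative assumptions.
import statistics

def flag_outliers(values, threshold=3.5):
    """Flag values whose robust (median/MAD-based) deviation exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return [(v, False) for v in values]
    return [(v, abs(v - med) / mad > threshold) for v in values]

readings = [9.8, 10.1, 10.0, 9.9, 42.0, 10.2]
for value, suspicious in flag_outliers(readings):
    print(value, "<- review with domain expert" if suspicious else "")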
Model development is an important phase of KDD that must precede the actual analysis of the data. Interaction with the data leads analysts to the formation of hypotheses (often based on experience and background knowledge). The sub-processes of model development are:
• data segmentation (unsupervised learning techniques, for example clustering; see the sketch after this list);
• model selection (choosing the best type of model after exploring several different types);
• parameter selection (the parameters of the chosen model).
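The following sketch illustrates the data segmentation sub-process with a simple clustering step; the two-dimensional toy data and the choice of k-means with k = 2 are our own illustrative assumptions, not a method prescribed by the cited sources.

# Illustrative data segmentation via clustering (unsupervised learning).
# The toy data and k = 2 are assumptions made for the example only.
from sklearn.cluster import KMeans

# Two obvious groups of customers described by (age, yearly spending).
X = [[25, 300], [27, 320], [24, 280],
     [60, 1200], [62, 1150], [58, 1300]]

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1] -- two discovered segments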
Data analysis is, in general, the ambition to understand why certain groups of entities behave the way they do; it is a search for laws or rules of such behaviour. Those parts of the data where such groups have already been identified should be analysed first. The sub-processes of data analysis are (a fitting-and-evaluation sketch follows this list):
• model specification - some formalism is used to denote the specific model;
• model fitting - where necessary, the specific parameters are determined (in some cases the model is independent of the data, in other cases the model has to be fitted to training data);
• evaluation - the model is evaluated against the data;
• model refinement - the model is refined in iterations according to the evaluation results.
As mentioned above, model development and data analysis are complementary, which often leads to oscillation between these two steps.
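As a minimal sketch of the fit-evaluate-refine cycle, the loop below fits polynomial models of increasing complexity and keeps the one with the lowest error on held-out data; the synthetic data, the model family and the selection rule are illustrative assumptions of ours.

# Minimal fit-evaluate-refine loop: try increasingly complex models and keep
# the one that evaluates best on held-out data. Data and model family are
# illustrative assumptions, not taken from the cited sources.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = 2 * x**2 + rng.normal(0, 0.05, x.size)    # hidden quadratic law plus noise
x_train, y_train = x[::2], y[::2]             # even points for model fitting
x_test, y_test = x[1::2], y[1::2]             # odd points for evaluation

best_degree, best_error = None, float("inf")
for degree in range(1, 6):                    # model refinement iterations
    coeffs = np.polyfit(x_train, y_train, degree)                 # model fitting
    error = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)   # evaluation
    if error < best_error:
        best_degree, best_error = degree, error

print(f"selected degree: {best_degree}, held-out MSE: {best_error:.5f}")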
Output generation - the output can take various forms. The simplest form is a report with the analysis results. Other, more complicated forms are graphs, or, in some cases, action descriptions that can be acted on directly. The output may also be a monitor, which triggers an alarm or an action when some condition is met. The output requirements may determine the task of the designed KDD application.
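A monitor output might look like the following sketch: a condition is checked over an incoming stream of values and an alarm is raised when it holds; the threshold rule and the toy stream are illustrative assumptions of ours.

# Minimal sketch of a monitor output: watch a stream of values and trigger an
# alarm when a condition holds. The threshold rule is an illustrative assumption.
def monitor(stream, threshold=100.0):
    for timestamp, value in stream:
        if value > threshold:                 # the monitored condition
            print(f"ALARM at {timestamp}: value {value} exceeds {threshold}")

daily_values = [("Mon", 87.0), ("Tue", 93.5), ("Wed", 112.4), ("Thu", 95.1)]
monitor(daily_values)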
3 Tasks of DM
For the title of this section we adopted the term tasks of DM, although we went through several different terms, as shown in Tab. 2. The table is organised so that the tasks in one row (from different sources) refer to the same task. This organisation is based on the descriptions of the table items available in the particular sources.
| Freitas [6] (tasks) | Simoudis [12] (operations) | Fayyad et al. [5] (model functions) | Mannila [10] (problems) | Fayyad et al. [4] (tasks) |
|---|---|---|---|---|
| discovery of SQO rules | query and reporting | | | |
| discovery of database dependencies | multidimensional analysis | | finding keys or functional dependencies | |
| discovery of association rules | link analysis | link analysis (associations or relations between the records) | association rules | |
| dependence modeling | | dependency modeling | | dependency modeling |
| causation modeling | | | | |
| deviation detection | deviation detection | | | change and deviation detection |
| clustering | database segmentation (clustering) | clustering | | clustering |
| classification | predictive modelling (C4.5, NN) | classification | | classification |
| regression | | regression | | regression |
| summarisation | statistical analysis (EDA, i.e. exploratory data analysis) | summarisation (EDA) | | summarisation (EDA) |
| | | sequence analysis | finding episodes from sequences | |

Tab. 2 Tasks of DM.
We accepted the list of tasks in the first column as a standard; a brief description of the DM tasks is as follows (for more details see [6]):
• discovery of SQO rules - performing a syntactical transformation of an incoming query, adding or removing conjuncts, to produce a more efficient query; a characteristic of SQO rules is that the query processing time (derived from the access method and indexing scheme of the DBMS) is taken into account as the cost of an attribute;
• discovery of database dependencies - in this case the term refers to relationships among the attributes of relations;
• discovery of association rules - relationships between sets of items, characterised by a support and a confidence factor (see the sketch after this list);
• dependence modeling - dependencies among attributes in the form of if-then rules, such as "if (A is true) then (C is true)";
• deviation detection - focuses on the discovery of significant deviations between the actual contents of a data subset and its expected contents;
• clustering - a classification scheme where the classes are not known in advance;
• causation modeling - relationships of cause and effect among attributes;
• classification - each tuple belongs to a class from a pre-defined set of classes;
• regression - similar to classification, but the predicted value is continuous;
• summarisation - a kind of summary describing properties shared by most of the tuples belonging to the same class.
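To make the support and confidence factors concrete, here is a small sketch computing both for a hypothetical rule over toy market-basket transactions; the data and the rule are our own illustrative assumptions, not taken from the cited sources.

# Support and confidence of an association rule A -> C over transactions.
# The toy transactions and the rule {bread} -> {butter} are illustrative
# assumptions made for the example only.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(A union C) / support(A)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

rule_a, rule_c = {"bread"}, {"butter"}
print("support:", support(rule_a | rule_c, transactions))       # 2/4 = 0.5
print("confidence:", confidence(rule_a, rule_c, transactions))  # 2/3 = 0.67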
4 Conclusion
In this paper we aimed to give an introductory overview of the field of knowledge discovery in databases, with a focus on one part of it - data mining. As most researchers agree, KDD is a process of several steps, in which data preparation is as important as the knowledge extraction itself. Less attention has been given to the evaluation and usage of the extracted knowledge, so these remain a potential source of further research issues.
5 Acknowledgements
This work is supported by the European Commission within the INCO-COPERNICUS Project
977091 "Geographic Information On-Line Analysis: GIS - Data Warehouse Integration (GOAL)".
6 References
[1] Brachman, R.J.; Anand, T. (1996): The Process of Knowledge Discovery in Databases. In Advances in Knowledge Discovery & Data Mining, Fayyad, U.M.; Piatetsky-Shapiro, G.; Smyth, P.; Uthurusamy, R., Eds. AAAI/MIT Press, Cambridge, Massachusetts.
[2] Chen, M.S.; Han, J.; Yu, P.S. (1996): Data Mining: An Overview from a Database Perspective.
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, Vol.8, No.6,
pp.866-883.
[3] Fayyad, U.M. (1996): Data Mining and Knowledge Discovery: Making Sense Out of Data.
IEEE EXPERT, Vol.11, No.5, pp. 20-25.
[4] Fayyad, U.M.; Piatetsky-Shapiro, G.; Smyth, P. (1996): From Data Mining to Knowledge
Discovery: An Overview. In Advances in Knowledge Discovery & Data Mining, Fayyad, U.M.;
Piatetsky-Shapiro, G.; Smyth, P.; Uthurusamy, R., Eds. AAAI/MIT Press, Cambridge,
Massachusetts.
[5] Fayyad, U.M.; Piatetsky-Shapiro, G.; Smyth, P. (1996): The KDD Process for Extracting Useful
Knowledge from Volumes of Data. COMMUNICATIONS OF THE ACM, Vol.39, No.11, pp.
27-34.
[6] Freitas, A.A. (1997): Generic, Set-Oriented Primitives to Support Data-Parallel Knowledge
Discovery in Relational Database Systems. Ph.D. Thesis, University of Essex, UK.
[7] Freitas, A.A.; Lavington, S.H. (1998): Mining Very Large Databases with Parallel Processing.
Kluwer Academic Publishers, 1998, chapter Knowledge Discovery Paradigms. Table of
contents on: http://www.ppgia.pucpr.br/~alex/book.html
[8] Hedberg, S. R. (1996): Searching for the mother lode: tales of the first data miners. IEEE
EXPERT, Vol.11, No.5, pp. 4-7.
[9] Kohavi, R.; John, G. (1998): The Wrapper Approach. Book Chapter in Feature Selection for
Knowledge Discovery and Data Mining. (Kluwer International Series in Engineering and
Computer Science), Huan Liu and Hiroshi Motoda, editors.
[10] Mannila, H. (1997): Methods and Problems in Data Mining. In Proceedings of the International Conference on Database Theory, Afrati, F.; Kolaitis, P., Eds., Delphi, Springer-Verlag.
[11] Mark, B. (1996): Data mining - Here we go again? IEEE EXPERT, Vol. 11, No.5.
[12] Simoudis, E. (1996): Reality Check for Data Mining. IEEE EXPERT, Vol.11, No.5.
[13] Weiss, S.M.; Indurkhya, N. (1998): Predictive Data Mining. Morgan Kaufmann Publishers, Inc.,
San Francisco.