Knowledge Discovery in Databases

Alaaeldin M. Hafez
Department of Computer Science and Automatic Control
Faculty of Engineering
Alexandria University
Abstract. Business information received from advanced data analysis and data mining is a critical success factor for companies wishing to maximize competitive advantage. The use of traditional tools and techniques to discover knowledge falls short and does not give the right information at the right time. Knowledge discovery is defined as "the non-trivial extraction of implicit, unknown, and potentially useful information from data". There are many knowledge discovery methodologies in use and under development. Some of these techniques are generic, while others are domain-specific. In this paper, we present a review outlining the state-of-the-art techniques in knowledge discovery in database systems.
1. Introduction
Recent years have seen an enormous increase in the amount of information stored in electronic format. It has been estimated that the amount of collected information in the world doubles every 20 months, that the size and number of databases are increasing even faster, and that the ability to rapidly collect data has outpaced the ability to analyze it. Information is crucial for decision making, especially in business operations. As a response to those trends, the term 'Data Mining' (or 'Knowledge Discovery') has been coined to describe a variety of techniques for identifying nuggets of information or decision-making knowledge in bodies of data and extracting them in such a way that they can be put to use in areas such as decision support, prediction, forecasting and estimation.
Automated tools must be developed to help extract meaningful information from a flood
of information. Moreover, these tools must be sophisticated enough to search for
correlations among the data unspecified by the user, as the potential for unforeseen
relationships to exist among the data is very high. A successful tool set to accomplish
these goals will locate useful nuggets of information in the otherwise chaotic data space,
and present them to the user in a contextual format.
There is an urgent need for a new generation of techniques for automating data mining and knowledge discovery in databases (KDD). KDD is a broad area that
integrates methods from several fields including statistics, databases, AI, machine
learning, pattern recognition, machine discovery, uncertainty modeling, data
visualization, high performance computing, optimization, management information
systems (MIS), and knowledge-based systems.
The term “Knowledge discovery in databases” is defined as the process of identifying
useful and novel structure (model) in data [1, 3, 4, 17, 31]. It could be viewed as a multistage process. Those stages are summarized as follows:
• Data gathering, e.g., databases, data warehouses, Web crawling.
• Data cleansing: eliminate errors, e.g., GPA = 7.3.
• Feature extraction: obtaining only the interesting attributes of the data.
• Data mining: discovering and extracting meaningful patterns.
• Visualization of data.
• Verification and evaluation of results: drawing conclusions.
[Figure: the knowledge discovery process. Data gathering (from databases, data warehouses, and Web crawlers) is followed by data cleansing, feature extraction, data mining, visualization, and verification and evaluation.]
Data mining is considered the main step in the knowledge discovery process; it is
concerned with the algorithms used to extract potentially valuable patterns, associations,
trends, sequences and dependencies in data [1, 3, 4, 7, 22, 33, 34, 37, 39]. Key business
examples include web site access analysis for improvements in e-commerce advertising,
fraud detection, screening and investigation, retail site or product analysis, and customer
segmentation.
Data mining techniques can discover information that many traditional business analysis
and statistical techniques fail to deliver. Additionally, the application of data mining
techniques further exploits the value of data warehouse by converting expensive volumes
of data into valuable assets for future tactical and strategic business development.
Management information systems should provide advanced capabilities that give users the power to ask more sophisticated and pertinent questions, empowering the right people by providing the specific information they need.
Data mining techniques could be categorized either by tasks or by methods.
1.1 Data Mining Tasks:
• Association Rule Discovery.
• Classification.
• Clustering.
• Sequential Pattern Discovery.
• Regression.
• Deviation Detection.
Most researchers refer to the first three data mining tasks as the main data mining tasks. The rest are either more closely related to other fields, such as regression, or considered sub-fields of one of the main tasks, such as sequential pattern discovery and deviation detection.
1.2 Data Mining Methods:
• Algorithms for mining spatial, textual, and other complex data.
• Incremental discovery methods and re-use of discovered knowledge.
• Integration of discovery methods.
• Data structures and query evaluation methods for data mining.
• Parallel and distributed data mining techniques.
• Issues and challenges for dealing with massive or small data sets.
• Fundamental issues from statistics, databases, optimization, and information processing in general as they relate to problems of extracting patterns and models from data.
Because of limited space, we limit our discussion to the main techniques used in the data mining tasks.
In section 2, we briefly discuss and give examples of the various data mining tasks. In
sections 3, 4 and 5, we discuss the main features and techniques in association mining,
classification and clustering, respectively. The paper is concluded in section 6.
2. Data Mining Tasks
2.1 Association Rule Discovery
Given a set of records, each of which contains some set of items from a given collection of items, produce dependency rules that predict the occurrence of one item based on the occurrences of other items [1, 3, 27, 28].
Transaction ID    Items
1                 Milk, Bread, Diaper
2                 Milk, Bread
3                 Coke, Milk, Bread, Diaper
4                 Coke, Coffee, Bread
5                 Bread, Coke
6                 Milk, Coke, Coffee
Some of the discovered rules are:
Bread → Milk
Milk, Bread → Diaper
Example 2.1: (Supermarket shelf management)
Identify items that are bought together by sufficiently many customers.
Approach:
• Process the point-of-sale data collected with barcode scanners to find dependencies among items.
• Stack items that are most likely to be bought together next to each other.
Example 2.2: (Inventory Management)
A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households.
Approach:
• Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.
2.2 Classification
Given a collection of records (the training set), where each record contains a set of attributes [10, 12, 30, 32, 35]:
• one of the attributes is the class (classifier),
• find a model for the class attribute as a function of the values of the other attributes,
• previously unseen records should be assigned a class as accurately as possible,
• a test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
[Figure: classification. A classifier builds a model from the training set; the model is then evaluated on the test set.]
Example 2.3: (Marketing)
Reduce cost of mailing by targeting a set of consumers likely to buy a new car.
Approach:
• Use the data for a similar product introduced before.
• Define the class attribute: which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
• For all such customers, collect various lifestyle, demographic, and business-related information such as salary, type of business, city, etc.
• Use this information as input attributes to learn a classifier model.
Example 2.4: (Fraud Detection)
Predict fraudulent cases in credit card transactions.
Approach:
• Use credit card transactions and the information on the account-holder as attributes, such as when a customer buys, what he buys, how often he pays on time, etc.
• Label past transactions as fraud or fair transactions. This forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card transactions on an account.
2.3 Clustering
Given a set of data points, each having a set of attributes, and a similarity measure among them [24, 25, 28, 41], find clusters such that
• data points in one cluster are more similar to one another, and
• data points in separate clusters are less similar to one another,
where the similarity measures are
• Euclidean distance, if attributes are continuous, or
• other problem-specific measures.
Example 2.5: (Market Segmentation)
Subdivide a market into distinct subsets of customers where any subset may conceivably
be selected as a market target to be reached with a distinct marketing mix.
Approach:
• Collect different attributes of customers based on their geographical and lifestyle-related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
Example 2.6: (Document Clustering)
Find groups of documents that are similar to each other based on the important terms
appearing in them.
Approach:
• Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
• Information retrieval can utilize the clusters to relate a new document or search term to clustered documents.
2.4 Sequential Pattern Discovery
Given a set of event sequences, find rules that predict strong sequential dependencies among different events [4, 20, 37]. Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.
Example 2.7: (Telecommunications Alarm)
In telecommunication alarm logs, Rectifier_Alarm → Fire_Alarm.
2.5 Regression
Regression techniques are important in statistics and neural network fields [10].
Assuming a linear or nonlinear model of dependency, they are used in predicting a value
of a given continuous valued variable based on the values of other variables.
Example 2.8:
• Predicting sales amounts of a new product based on advertising expenditure.
• Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
• Time series prediction of stock market indices.
2.6 Deviation Detection
Deviation detection deals with discovering the most significant changes in data from previously measured values; such changes are often infrequent.
Example 2.9:
Outlier detection in statistics.
3. Association Rule Discovery
Association rule discovery, or association mining, which discovers dependencies among values of an attribute, was introduced by Agrawal et al. [1] and has emerged as a prominent research area. The association mining problem, also referred to as the market basket problem, can be formally defined as follows. Let I = {i1, i2, ..., in} be a set of items and S = {s1, s2, ..., sm} be a set of transactions, where each transaction si ∈ S is a set of items, that is, si ⊆ I. An association rule, denoted by X → Y, where X, Y ⊆ I and X ∩ Y = ∅, describes the existence of a relationship between the two itemsets X and Y.
Several measures have been introduced to define the strength of the relationship between itemsets X and Y, such as support, confidence, and interest. The definitions of these measures, from a probabilistic model, are given below.
• Support(X → Y) = P(X, Y): the percentage of transactions in the database that contain both X and Y.
• Confidence(X → Y) = P(X, Y) / P(X): the percentage of transactions containing Y among the transactions that contain X.
• Interest(X → Y) = P(X, Y) / (P(X) P(Y)): represents a test of statistical independence.
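To make these measures concrete, the following minimal Python sketch (the function names and the in-memory transaction list are illustrative, not part of any mining library) computes support, confidence, and interest for the supermarket transactions listed in Section 2.1:

```python
# Transactions from the table in Section 2.1
transactions = [
    {"Milk", "Bread", "Diaper"},
    {"Milk", "Bread"},
    {"Coke", "Milk", "Bread", "Diaper"},
    {"Coke", "Coffee", "Bread"},
    {"Bread", "Coke"},
    {"Milk", "Coke", "Coffee"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    """P(X, Y) / P(X): how often Y appears in transactions that contain X."""
    return support(set(X) | set(Y)) / support(X)

def interest(X, Y):
    """P(X, Y) / (P(X) * P(Y)): values near 1 suggest statistical independence."""
    return support(set(X) | set(Y)) / (support(X) * support(Y))

print(support({"Bread", "Milk"}))       # 0.5
print(confidence({"Bread"}, {"Milk"}))  # 0.6
print(interest({"Bread"}, {"Milk"}))    # 0.9
```

For the rule Bread → Milk these reproduce Support = 0.5, Confidence = 0.6, and Interest = 0.9 on that table.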
The problem of finding all association rules that have support and confidence greater than some user-specified minimum support and confidence over a database D can be viewed as two steps:
• Find all sets of items (itemsets) that have transaction support above the minimum support. The support for an itemset is the number of transactions that contain the itemset. Itemsets with minimum support are called large itemsets. An itemset of size k is a k-itemset.
• Use the large itemsets to generate the desired association rules with minimum confidence.
Most association mining algorithms concentrate on finding large itemsets, and consider
generating association rules as a straightforward procedure.
3.1 Spectrum of Association Mining Techniques
Many algorithms [1, 2, 3, 5, 7, 11, 22, 23, 33, 34, 37, 39] have been proposed to generate association rules that satisfy certain measures. A close examination of those algorithms reveals that the spectrum of techniques that generate association rules has two extremes:
• A transaction data file is repeatedly scanned to generate large itemsets. The scanning process stops when there are no more itemsets to be generated.
• A transaction data file is scanned only once to build a complete transaction lattice. Each node of the lattice represents a possible large itemset, and a count is attached to each node to reflect the frequency of the itemset it represents.
In the first case, since the transaction data file is traversed many times, the cost of generating large itemsets is high. In the latter case, while the transaction data file is traversed only once, the maximum number of nodes in the transaction lattice is 2^n, where n is the cardinality of I, the set of items. Maintaining such a structure is expensive.
In association mining, the subset relation ⊆ defines a partial order on the set of itemsets. Moreover, frequency is downward closed under the subset relation: if B is a frequent itemset, then all subsets A ⊆ B are also frequent. Association mining algorithms differ in the way they search the itemset lattice spanned by the subset relation. Most approaches use a level-wise or bottom-up search of the lattice to enumerate the frequent itemsets. If long frequent itemsets are expected, a pure top-down approach might be preferred. Some have proposed a hybrid search, which combines top-down and bottom-up approaches.
3.2 New Candidates Generation
Association mining algorithms can differ in the way they generate new candidates. A
complete search, the dominant approach, guarantees that we can generate and test all
frequent subsets. Here, complete doesn’t mean exhaustive; we can use pruning to
eliminate useless branches in the search space. Heuristic generation sacrifices
completeness for the sake of speed. At each step, it only examines a limited number of
“good” branches. Random search to locate the maximal frequent itemsets is also possible.
Methods that can be used here include genetic algorithms and simulated annealing.
Because of a strong emphasis on completeness, association mining literature has not
given much attention to the last two methods.
Also, association mining algorithms differ depending on whether they generate all frequent subsets or only the maximal ones. Identifying the maximal itemsets is the core task, because an additional database scan can generate all other subsets. Nevertheless, the majority of algorithms list all frequent itemsets.
3.3 Association Mining Algorithms
3.3.1 The Apriori Algorithm
The Apriori algorithm [1] is considered the most famous algorithm in the area of association mining. Apriori starts with a seed set of itemsets found to be large in the
previous pass, and uses it to generate new potentially large itemsets (called candidate
itemsets). The actual support for these candidates is counted during the pass over the
data, and non-large candidates are thrown out. The main outlines of the Apriori algorithm
are described below.
Apriori algorithm:
• First pass: count item occurrences to determine the large 1-itemsets.
• Second and subsequent passes:
    for (k = 2; Lk-1 not empty; k++)
        Ck = apriori-gen(Lk-1)        // new candidates
        forall transactions t in D do
            Ct = subset(Ck, t)        // candidates contained in t
            forall candidates c in Ct do
                c.count++
        Lk = {c in Ck | c.count >= minsupport}
• Answer = Unionk(Lk)
where
• Lk: the set of large k-itemsets (i.e., those with minimum support). Each member of the set has an itemset and a support count.
• Ck: the set of candidate k-itemsets (potentially large). Each member has an itemset and a support count.
• apriori-gen accepts the set of all large (k-1)-itemsets Lk-1 and returns a superset of the set of all large k-itemsets. First, it performs a join of Lk-1 with itself to generate Ck, and then it prunes from Ck all itemsets for which some (k-1)-subset is not in Lk-1.
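A compact, runnable Python sketch of this level-wise procedure is given below; it is a simplified illustration using plain dictionaries, sets, and absolute support counts, not the original implementation.

```python
from itertools import combinations

def apriori(transactions, minsupport):
    """Level-wise search for large itemsets, following the outline above.

    transactions: list of sets of items; minsupport: minimum absolute count.
    Returns a dict mapping each large itemset (frozenset) to its support count.
    """
    # First pass: count item occurrences to determine the large 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {s: c for s, c in counts.items() if c >= minsupport}
    answer = dict(large)

    k = 2
    while large:
        # apriori-gen: join L(k-1) with itself, then prune candidates that
        # have some (k-1)-subset which is not large.
        prev = set(large)
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in prev for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)

        # Count candidate occurrences in one pass over the data.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        large = {s: c for s, c in counts.items() if c >= minsupport}
        answer.update(large)
        k += 1
    return answer
```

Applied to the transactions of Section 2.1 with a minimum support count of 3, for example, it returns {Bread}, {Milk}, {Coke}, {Bread, Milk}, and {Bread, Coke}.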
In Apriori algorithm, the transaction data file is repeatedly scanned to generate large
itemsets. The scanning process stops when there are no more itemsets to be generated.
For large databases, the Apriori algorithm requires extensive access of secondary storage
that can become a bottleneck for efficient processing. Several Apriori based algorithms
have been introduced to overcome the excessive use of I/O devices.
3.3.2 Apriori Based Algorithms
The AprioriTid algorithm[4] is a variation of the Apriori algorithm. The AprioriTid
algorithm also uses the "apriori-gen" function to determine the candidate itemsets before
the pass begins. The main difference from the Apriori algorithm is that the AprioriTid
algorithm does not use the database for counting support after the first pass. Instead, the
set <TID, {Xk}> is used for counting. (Each Xk is a potentially large k-itemset in the
transaction with identifier TID.) The benefit of using this scheme for counting support is
that at each pass other than the first pass, the scanning of the entire database is avoided.
But the downside of this is that the set <TID, {Xk}> that would have been generated at
each pass may be huge. Another algorithm, called AprioriHybrid, is introduced in [3].
The basic idea of the AprioriHybrid algorithm is to run the Apriori algorithm initially, and then switch to the AprioriTid algorithm when the generated database (i.e., <TID, {Xk}>) would fit in memory.
3.3.3. The Partition Algorithm
In [34], Savasere et al. introduced the Partition algorithm. The Partition algorithm
logically partitions the database D into n partitions, and only reads the entire database at
most two times to generate the association rules. The reason for using the partition
scheme is that any potential large itemset would appear as a large itemset in at least one
of the partitions. The algorithm consists of two phases. In the first phase, the algorithm
iterates n times, and during each iteration, only one partition is considered. At any given
iteration, the function "gen_large_itemsets" takes a single partition and generates local
large itemsets of all lengths from this partition. All of these local large itemsets of the
same lengths in all n partitions are merged and then combined to generate the global
candidate itemsets. In the second phase, the algorithm counts the support of each global
candidate itemset and generates the global large itemsets. Note that the database is read twice during the process: once in the first phase and once in the second phase, where the support counting requires a scan of the entire database. Taking a minimal number of passes through the entire database drastically reduces the time spent on I/O.
3.3.4. The Dynamic Itemset Counting Algorithm
In the Dynamic Itemset Counting (DIC) algorithm [11], the database is divided into p
equal-sized partitions so that each partition fits in memory. For partition 1, DIC gathers
the supports of single items. Items found to be locally frequent (only in this partition)
generate candidate 2-itemsets. Then DIC reads partition 2 and obtains supports for all
current candidates—that is, the single items and the candidate 2-itemsets. This process
repeats for the remaining partitions. DIC starts counting candidate k-itemsets while
processing partition k in the first database scan. After the last partition p has been
processed, the processing wraps around to partition 1 again. A candidate’s global support
is known once the processing wraps around the database and reaches the partition where
it was first generated.
DIC is effective in reducing the number of database scans if most partitions are
homogeneous (have similar frequent itemset distributions). If data is not homogeneous,
DIC might generate many false positives (itemsets that are locally frequent but not
globally frequent) and scan the database more than Apriori does. DIC proposes a random
partitioning technique to reduce the data partition skew.
3.4 New Trends in Association Mining
Many knowledge discovery applications, such as on-line services and the World Wide Web, are dynamic and require accurate mining of information from data that changes on a regular basis. On the World Wide Web, hundreds of remote sites are created and removed every day. In such an environment, frequent or occasional updates may change the status of some rules discovered earlier. Performance is the main consideration in data mining.
Discovering knowledge is an expensive operation. It requires extensive access of
secondary storage that can become a bottleneck for efficient processing. Two solutions
have been proposed to improve the performance of the data mining process:
• Dynamic data mining
• Parallel data mining
3.4.1 Dynamic Data Mining
Using previously discovered knowledge along with new data updates to maintain discovered knowledge could solve many problems that data mining techniques have faced, namely database updates, the accuracy of data mining results, gaining more knowledge, interpretation of the results, and performance.
In [33], a dynamic approach that dynamically updates knowledge obtained from the
previous data mining process was introduced. Transactions over a long duration are
divided into a set of consecutive episodes. Information gained during the current episode
depends on the current set of transactions and the discovered information during the last
episode. The approach discovers current data mining rules by using updates that have
occurred during the current episode along with the data mining rules that have been
discovered in the previous episode.
SUPPORT for an itemset S is calculated as
SUPPORT(S) = F(S) / F
where F(S) is the number of transactions having S, and F is the total number of transactions.
For a minimum SUPPORT value MINSUP, S is a large (or frequent) itemset if SUPPORT(S) ≥ MINSUP, or equivalently F(S) ≥ F * MINSUP.
Suppose we have divided the transaction set T into two subsets T1 and T2, corresponding to two consecutive time intervals, where F1 is the number of transactions in T1 and F2 is the number of transactions in T2 (F = F1 + F2), and F1(S) is the number of transactions having S in T1 and F2(S) is the number of transactions having S in T2 (F(S) = F1(S) + F2(S)). By calculating the SUPPORT of S in each of the two subsets, we get
SUPPORT1(S) = F1(S) / F1 and SUPPORT2(S) = F2(S) / F2
S is a large itemset if
(F1(S) + F2(S)) / (F1 + F2) ≥ MINSUP, or F1(S) + F2(S) ≥ (F1 + F2) * MINSUP
In order to find out whether S is a large itemset or not, we consider four cases:
• S is a large itemset in T1 and also a large itemset in T2, i.e., F1(S) ≥ F1 * MINSUP and F2(S) ≥ F2 * MINSUP.
• S is a large itemset in T1 but a small itemset in T2, i.e., F1(S) ≥ F1 * MINSUP and F2(S) < F2 * MINSUP.
• S is a small itemset in T1 but a large itemset in T2, i.e., F1(S) < F1 * MINSUP and F2(S) ≥ F2 * MINSUP.
• S is a small itemset in T1 and also a small itemset in T2, i.e., F1(S) < F1 * MINSUP and F2(S) < F2 * MINSUP.
In the first and fourth cases, S is a large itemset and a small itemset in the transaction set T, respectively, while in the second and third cases it is not clear whether S is a small or a large itemset. Formally speaking, let SUPPORT(S) = MINSUP + ε, where ε ≥ 0 if S is a large itemset and ε < 0 if S is a small itemset. The above four cases have the following characteristics:
• ε1 ≥ 0 and ε2 ≥ 0
• ε1 ≥ 0 and ε2 < 0
• ε1 < 0 and ε2 ≥ 0
• ε1 < 0 and ε2 < 0
S is a large itemset if
(F1 * (MINSUP + ε1) + F2 * (MINSUP + ε2)) / (F1 + F2) ≥ MINSUP, or
F1 * (MINSUP + ε1) + F2 * (MINSUP + ε2) ≥ MINSUP * (F1 + F2),
which can be written as F1 * ε1 + F2 * ε2 ≥ 0.
Generally, let the transaction set T be divided into n transaction subsets Ti, 1 ≤ i ≤ n. S is a large itemset if Σi=1..n Fi * εi ≥ 0, where Fi is the number of transactions in Ti and εi = SUPPORTi(S) - MINSUP, 1 ≤ i ≤ n, with -MINSUP ≤ εi ≤ 1 - MINSUP.
For those cases where Σi=1..n Fi * εi < 0, there are two options: either
• discard S as a large itemset (a small itemset with no history record maintained), or
• keep it for future calculations (a small itemset with a history record maintained). In this case, we are not going to report it as a large itemset, but its formula Σi=1..n Fi * εi will be maintained and checked through the future intervals.
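As a small illustration of this bookkeeping (assuming the per-episode counts Fi(S) and Fi have been retained for the itemset), the test Σ Fi * εi ≥ 0 reduces to comparing the accumulated count of S against MINSUP times the accumulated number of transactions:

```python
def is_large(episode_counts, episode_sizes, minsup):
    """Decide whether itemset S is large over all episodes seen so far.

    episode_counts[i] = F_i(S), transactions containing S in episode i
    episode_sizes[i]  = F_i, number of transactions in episode i
    S is large iff sum_i F_i * eps_i >= 0 with eps_i = F_i(S)/F_i - MINSUP,
    which is equivalent to sum_i F_i(S) >= MINSUP * sum_i F_i.
    """
    weighted_eps = sum(
        fs - minsup * f for fs, f in zip(episode_counts, episode_sizes)
    )
    return weighted_eps >= 0

# S is large in episode 1 (60/100) but small in episode 2 (20/100); with
# MINSUP = 0.3 the combined support 80/200 = 0.4 still exceeds the threshold.
print(is_large([60, 20], [100, 100], 0.3))  # True
```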
3.4.2. Parallel Association Mining algorithms
Association mining is computationally and I/O intensive [2, 18]. Since the data is huge in terms of the number of items and the number of transactions, one of the main features needed in association mining is scalability. For large databases, sequential algorithms cannot provide scalability in terms of data size or runtime performance. Therefore, we must rely on high-performance parallel computing.
In the literature, most parallel association mining algorithms [2, 18, 21] are based on their
sequential counterparts. Researchers expect parallelism to relieve current association
mining methods from the sequential bottleneck, providing scalability to massive data sets
and improving response time. The main challenges include synchronization and
communication minimization, workload balancing, finding good data layout and data
decomposition, and disk I/O minimization. The parallel design space spans three main
components:
• Memory Systems
• Type of Parallelism
• Load Balancing
3.4.2.1 Memory Systems
Three techniques for using multiple processors have been considered [2, 18, 21, 27, 28].
• Shared memory (all processors access a common memory). In a shared memory architecture, many desirable properties can be achieved. Each processor has direct and equal access to all the system's memory, and parallel programs are easy to implement on such a system. Although the shared memory architecture offers programming simplicity, a common bus's finite bandwidth can limit scalability.
• Distributed memory (each processor has a private memory). In a distributed memory architecture, each processor has its own local memory, which only that processor can access directly. For a processor to access data in the local memory of another processor, message passing must send a copy of the desired data elements from one processor to the other. A distributed-memory, message-passing architecture cures the scalability problem by eliminating the bus, but at the expense of programming simplicity.
• Hybrid (combining the best of the distributed and shared memory approaches). The physical memory is distributed among the nodes, but a shared global address space is provided on each processor. Locally cached data always reflects any processor's latest modification.
3.4.2.2 Type of Parallelism
Task and data parallelism are the two main paradigms for exploiting algorithm
parallelism.
• Data parallelism corresponds to the case where the database is partitioned among P processors (logically partitioned for a shared memory architecture, physically for a distributed memory architecture). Each processor works on its local partition of the database but performs the same computation of counting support for the global candidate itemsets.
• Task parallelism corresponds to the case where the processors perform different computations independently, such as counting a disjoint set of candidates, but have or need access to the entire database. In a shared memory architecture, processors have access to the entire data, but for a distributed memory architecture, the process of accessing the database can involve selective replication or explicit communication of the local portions.
Hybrid parallelism, which combines both task and data parallelism, is also possible and perhaps desirable for exploiting all available parallelism in association mining methods.
3.4.2.3 Load Balancing
Two approaches are used in load balancing:
• static load balancing, and
• dynamic load balancing.
In static load balancing, a heuristic cost function initially partitions work among the processors. Subsequent data or computation movement is not available to correct load imbalances. Dynamic load balancing takes work from heavily loaded processors and reassigns it to lightly loaded ones. Dynamic load balancing requires additional costs for work and data movement, and also for the mechanism used to detect whether there is an imbalance. However, dynamic load balancing is essential if there is a large load imbalance or if the load changes with time.
Dynamic load balancing is especially important in multi-user environments with transient
loads and in heterogeneous platforms, which have different processor and network
speeds. These kinds of environments include parallel servers and heterogeneous clusters,
meta-clusters, and super-clusters. All extant association mining algorithms use only static
load balancing that is inherent in the initial partitioning of the database among available
nodes. This is because they assume a dedicated, homogeneous environment.
4. Classification
Classification is an important problem in the rapidly emerging field of data mining [10,
19, 30, 32]. The problem can be stated as follows. We are given a training dataset
consisting of records. Each record is identified by a unique record id and consists of
fields corresponding to the attributes. An attribute with a continuous domain is called a
continuous attribute. An attribute with finite domain of discrete values is called a
categorical attribute. One of the categorical attributes is the classifying attribute or class, and the values in its domain are called class labels. Classification is the process of
discovering a model for the class in terms of the remaining attributes.
4.1 Serial Algorithms for Classification
Serial Algorithms for classification [10, 12, 19, 29, 30] are categorized as decision tree
based methods and non-decision tree based methods. Non-decision tree based methods
include neural networks, genetic algorithms, and Bayesian networks.
4.1.1. Decision Tree Based Classification
The decision tree models [5] are found to be most useful in the domain of data mining.
They yield comparable or better accuracy as compared to other models such as neural
networks, statistical models or genetic models [30]. Many advantages could be
considered, such as:
o Inexpensive to construct
o Easy to interpret
o Easy to integrate with database systems
o Comparable or better accuracy in many applications
Example 4.1:
ID   Income   Marital Status   Refund   Cheat
1    95K      Married          Yes      No
2    120K     Single           Yes      No
3    140K     Single           No       Yes
4    80K      Married          No       No
5    160K     Divorced         Yes      No
6    100K     Married          Yes      No
7    90K      Single           Yes      No
8    75K      Divorced         No       No
9    170K     Divorced         No       Yes
10   125K     Single           No       Yes

[Figure: the decision tree learned from this data. The root tests Refund: if Yes, predict No. Otherwise, test Marital Status: if Married, predict No; if Single or Divorced, test Income: if Income <= 80K, predict No, else predict Yes.]
Many algorithms have been used in constructing decision trees. Hunt’s algorithm is one
of the earliest versions that have been used in building decision trees. Many other
algorithms such as, CART [10], ID3, C4.5 [32], SLIQ [29], and SPRINT [35], have
followed. Building the classification tree involves two phases:
o Tree induction
o Tree pruning
4.1.1.1 Decision Tree Induction
For tree induction, records are split based on an attribute that optimizes the splitting
criterion. Splitting could be either on a categorical attribute or on a continuous attribute.
For categorical attributes, each partition has a subset of values signifying it. Two methods are used to form partitions: the simple method uses as many partitions as there are distinct values, while the complex method uses two partitions, dividing the values into two subsets, one for each partition.
[Figure: splitting on Marital Status. The simple method creates one partition per value (Single, Divorced, Married); the complex method creates two partitions, {Married} and {Single, Divorced}.]
For continuous attributes, splitting is done either by using the static approach, where a priori discretization is used to form a categorical attribute, or by using the dynamic approach, where decisions are made as the algorithm proceeds. The dynamic approach is complex but more powerful and flexible in approximating the true dependency. Dynamic decisions are made as follows:
• Form binary decisions based on one value, giving two partitions: A < v and A >= v.
• Find ranges of values and use each range as a partition.
  o Ranges can be found by simple-minded bucketing or by more intelligent clustering techniques.
  o Example: Salary in [0, 15K), [15K, 60K), [60K, 100K), [100K, ...).
• Find a linear combination of multiple variables, and make binary decisions or range decisions based on it.
Different splitting criteria are used. CART, SLIQ and SPRINT use the GINI index as a splitting value. For n classes, the value of the GINI index at node t is calculated as
GINI(t) = 1 - Σj=1..n (p(j|t))²
where p(j|t) is the relative frequency of class j at node t.
For node t,
• the GINI index value is maximum (i.e., 1 - 1/n) when records are equally distributed among all classes, i.e., least interesting information;
• the GINI index value is minimum (i.e., 0) when all records belong to one class, i.e., most interesting information.
Example 4.2:
Count(Class1)   Count(Class2)   GINI
0               8               0
1               7               0.2187
2               6               0.375
3               5               0.4687
4               4               0.5
Usually, when the GINI index is used, the splitting criterion is to minimize the GINI index of the split. When a node e is split into k partitions, the quality of the split is computed as
GINIsplit = Σi=1..k (ni / n) GINI(i)
where n is the number of records at node e, and ni is the number of records at node (partition) i.
To compute the GINI index, the following steps are used:
• Use binary decisions based on one value.
• Choose the splitting values; e.g., the number of possible splitting values equals the number of distinct values.
• Each splitting value v has a count matrix associated with it, i.e., the class counts in each of the partitions A < v and A >= v.
• To choose the best v, for each v scan the database to gather the count matrix and compute its GINI index. This can be computationally inefficient (repetition of work).
• For efficient computation, for each attribute:
  o Sort the attribute on its values.
  o Linearly scan these values, each time updating the count matrix and computing the GINI index.
  o Choose the split position that has the least GINI index.
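A short Python sketch of this GINI-based split selection follows; for clarity it recomputes the GINI index at every candidate split value rather than incrementally updating the count matrix during the linear scan.

```python
from collections import Counter

def gini(labels):
    """GINI(t) = 1 - sum_j p(j|t)^2 for the class labels at a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Scan the sorted (value, label) pairs and return the split value v
    (partitions A < v and A >= v) that minimizes the weighted GINI index."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_v, best_gini = None, float("inf")
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between equal attribute values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        g = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        if g < best_gini:
            best_v, best_gini = pairs[i][0], g
    return best_v, best_gini

# Income (in thousands) and Cheat columns from Example 4.1.
income = [95, 120, 140, 80, 160, 100, 90, 75, 170, 125]
cheat = ["No", "No", "Yes", "No", "No", "No", "No", "No", "Yes", "Yes"]
print(best_binary_split(income, cheat))  # (125, 0.15): split Income < 125K
```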
In C4.5 and ID3, another splitting criterion is used, based on an information or entropy measure (INFO). INFO-based computations are similar to GINI index computations. For n classes, the value of INFO at node t is calculated as
INFO(t) = - Σj=1..n p(j|t) log(p(j|t))
where p(j|t) is the relative frequency of class j at node t.
For node t,
• the INFO value is maximum (i.e., log(n)) when records are equally distributed among all classes, i.e., least interesting information;
• the INFO value is minimum (i.e., 0) when all records belong to one class, i.e., most interesting information.
Example 4.3:
Count(Class1)   Count(Class2)   INFO
0               8               0
1               7               0.16363
2               6               0.24422
3               5               0.28731
4               4               0.30103
When a node e is split into k partitions, the information gain is computed as
GAINsplit = INFO(e) - Σi=1..k (ni / n) INFO(i)
where n is the number of records at node e, and ni is the number of records at node (partition) i. The splitting criterion used with INFO is to choose the split that achieves the most reduction (i.e., maximizes GAIN).
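A matching sketch for the entropy-based criterion (using log base 10, which is what the INFO values in Example 4.3 correspond to):

```python
import math
from collections import Counter

def info(labels):
    """INFO(t) = -sum_j p(j|t) * log10 p(j|t) for the class labels at a node."""
    n = len(labels)
    return -sum((c / n) * math.log10(c / n) for c in Counter(labels).values())

def gain(parent_labels, partitions):
    """GAIN_split = INFO(e) - sum_i (n_i / n) * INFO(i) over the partitions."""
    n = len(parent_labels)
    return info(parent_labels) - sum((len(p) / n) * info(p) for p in partitions)

# Splitting the evenly mixed 4-vs-4 node of Example 4.3 into two pure halves
# removes all uncertainty, so the gain equals the INFO of the parent (0.30103).
parent = ["A"] * 4 + ["B"] * 4
print(round(gain(parent, [["A"] * 4, ["B"] * 4]), 5))  # 0.30103
```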
Decision tree based classifiers that handle large datasets improve classification accuracy. The proposed classifiers SLIQ and SPRINT use the entire dataset for classification and are shown to be more accurate than classifiers that use a sampled dataset or multiple partitions of the dataset. The decision tree model is built by
recursively splitting the training set based on a locally optimal criterion until all or most
of the records belonging to each of the partitions bear the same class label. Briefly, there
are two phases to this process at each node of the decision tree: the first phase determines the splitting decision and the second phase splits the data. The difference in the nature of continuous and categorical attributes requires them to be handled in different manners.
The handling of categorical attributes in both phases is straightforward. Handling the
continuous attributes is challenging. An efficient determination of the splitting decision
used in most of the existing classifiers requires these attributes to be sorted on values.
The classifiers such as CART and C4.5 perform sorting at every node of the decision
tree, which makes them very expensive for large datasets, since this sorting has to be
done out-of-core. The approach taken by SLIQ and SPRINT sorts the continuous
attributes only once in the beginning. The splitting phase maintains this sorted order
without having to sort the records again. The attribute lists are split in a consistent
manner using a mapping between a record identifier and the node to which it belongs
after splitting. SPRINT implements this mapping as a hash table, which is built on-the-fly
for every node of the decision tree. The size of this hash table is proportional to the
number of records at the node. For the upper levels of the tree, this number is O(N), where
N is the number of records in the training set. If the hash table does not fit in the main
memory, then SPRINT has to divide the splitting phase into several stages such that the
hash table for each of the phases fits in the memory. This requires multiple passes over
each of the attribute lists causing expensive disk I/O.
In the following, we summarize the features and disadvantages of some of the decision tree based classifiers.

C4.5
Features:
• Simple depth-first construction.
• Sorts continuous attributes at each node.
• Needs the entire data to fit in memory.
Disadvantages:
• Unsuitable for large datasets; needs out-of-core sorting.
• Classification accuracy is shown to improve when entire datasets are used.

SLIQ
Features:
• The arrays of the continuous attributes are pre-sorted.
• The classification tree is grown in a breadth-first fashion.
• A Class List structure maintains the record-id to node mapping.
• Split determining phase: the Class List is referred to for computing the best split for each individual attribute. Computations for all nodes are clubbed together for efficiency (hence the breadth-first strategy).
• Splitting phase: the list of the splitting attribute is used to update the leaf labels in the class list (no physical splitting of attribute lists among nodes).
Disadvantages:
• The Class List is frequently and randomly accessed in both phases of tree induction.
• So it is required to be in memory all the time for efficient performance, which limits the size of the largest training set.

SPRINT
Features:
• The arrays of the continuous attributes are pre-sorted, and the sorted order is maintained during each split.
• The classification tree is grown in a breadth-first fashion.
• Class information is clubbed with each attribute list.
• Attribute lists are physically split among nodes.
• The split determining phase is just a linear scan of the lists at each node.
• A hashing scheme is used in the splitting phase: tids of the splitting attribute are hashed with the tree node as the key, and the remaining attribute arrays are split by querying this hash structure (lookup table).
Disadvantages:
• The size of the hash table is O(N) for the top levels of the tree.
• If the hash table does not fit in memory (mostly true for large datasets), it is built in parts so that each part fits, requiring multiple expensive I/O passes over the entire dataset.
4.1.1.2 Decision Tree Pruning
• Generalize the tree by removing statistical dependence on a particular training set.
• The subtree with the least estimated error rate is chosen, leading to a compact and accurate representation. Two options:
  o Separate sets for training and pruning.
  o The same dataset for tree building and tree pruning:
    - Cross-validation (build a tree on each subset and prune it using the remaining data); not suitable for large datasets.
    - All training samples used for pruning: Minimum Description Length (MDL) based pruning.
MDL Based Tree Pruning
• Cost(Model, Data) = Cost(Data | Model) + Cost(Model)
  o Cost is the number of bits needed for encoding.
  o Search for a least-cost model.
• Encode the data in terms of the number of classification errors.
• Encode the tree (model) using node encoding (number of children) plus splitting condition encoding.
• A node is pruned to have fewer children if removing some or all of its children gives a smaller cost.
4.1.2 Non-Decision Tree Based Classification
4.1.2.1 Neural Networks
A neural network [30] is an analytic technique that mimics the working of the human brain. Such a network consists of nodes (the neurons) and connections, each with a certain "weight". By feeding the network a training data set, it adjusts its weights to achieve sufficiently accurate predictions for expected future data sets.
The advantage of neural networks over decision trees is the speed at which they can operate. This increase in speed can be obtained by using the inherent parallel computations of the network. A major disadvantage of the neural network is the inability to backtrack the decision-making process; a decision is calculated from the state of the network as initialised on the training set. This is opposed to decision tree methods, where the path leading to a decision is easily backtracked.
[Figure: a neural network with five inputs, a hidden layer, and one output (the class).]
4.1.2.2 Bayesian Classifiers
The Bayesian approach is a graphical model that uses directed arcs exclusively to form a directed acyclic graph [13]. Although the Bayesian approach uses probabilities and a graphical means of representation, it is also considered a type of classification. Bayesian networks are typically used when the uncertainty associated with an outcome can be expressed in terms of a probability. This approach relies on encoded domain knowledge and has been used for diagnostic systems. The basic features of Bayesian networks are:
• Each attribute and the class label are random variables.
• The objective is to classify a given record of attributes (A1, A2, …, An) to the class C such that P(C | A1, A2, …, An) is maximal.
• Naive Bayesian approach:
  o Assume independence among the attributes Ai.
  o Estimate P(Ai | Cj) for all Ai and Cj.
  o A new point is classified to Cj if P(Cj) Πi P(Ai | Cj) is maximal.
• Generic approach based on Bayesian networks:
  o Represent dependencies using a directed acyclic graph (each child is conditioned on all its parents). The class variable is a child of all the attributes.
  o The goal is to get a compact and accurate representation of the joint probability distribution of all variables. Learning Bayesian networks is an active research area.
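A minimal sketch of the naive Bayesian approach, using the Refund and Marital Status attributes of Example 4.1 as categorical inputs and Cheat as the class (illustrative only; zero counts are not smoothed):

```python
from collections import Counter, defaultdict

def train_naive_bayes(records, labels):
    """Estimate P(Cj) and P(Ai = v | Cj) from categorical training records."""
    class_counts = Counter(labels)
    cond = defaultdict(int)  # cond[(attribute index, value, class)] = count
    for rec, c in zip(records, labels):
        for i, v in enumerate(rec):
            cond[(i, v, c)] += 1
    return class_counts, cond, len(labels)

def classify(rec, class_counts, cond, n):
    """Pick the class maximizing P(Cj) * prod_i P(Ai | Cj), assuming independence."""
    best_class, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n
        for i, v in enumerate(rec):
            score *= cond[(i, v, c)] / cc
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Refund and Marital Status columns from Example 4.1; Cheat is the class.
records = [("Yes", "Married"), ("Yes", "Single"), ("No", "Single"),
           ("No", "Married"), ("Yes", "Divorced"), ("Yes", "Married"),
           ("Yes", "Single"), ("No", "Divorced"), ("No", "Divorced"),
           ("No", "Single")]
labels = ["No", "No", "Yes", "No", "No", "No", "No", "No", "Yes", "Yes"]
model = train_naive_bayes(records, labels)
print(classify(("No", "Single"), *model))  # "Yes"
```

For the record (Refund = No, Marital Status = Single), P(Yes) * P(No | Yes) * P(Single | Yes) = 0.3 * 1 * 2/3 = 0.2 exceeds the corresponding product for class No (about 0.057), so the record is classified as Yes.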
4.2 Parallel Formulations of Classification Algorithms
The memory limitations faced by serial classifiers and the need to classify much larger datasets in shorter times make the classification algorithm an ideal candidate for
parallelization. The parallel formulation, however, must address the issues of efficiency
and scalability in both memory requirements and parallel runtime [18, 26, 28, 36].
In constructing a parallel classifier, two partitioning philosophies should be considered:
• Partitioning of the data only, where a large number of classification tree nodes gives a high communication cost.
• Partitioning of the classification tree nodes, which exploits the natural concurrency, but
  o leads to load imbalance, as the amount of work associated with each node varies, and
  o child nodes use the same data as the parent node, which leads to loss of locality and a high data movement cost.
For categorical attributes, the algorithms are classified into three different classes. The features of these three classes are summarized below.
Synchronous Tree Construction Approach:
• No data movement is required.
• High communication cost as the tree becomes bushy.
Partitioned Tree Construction Approach:
• Processors work independently once partitioned completely.
• Load imbalance and high cost of data movement.
Hybrid Algorithm:
• Combines the good features of the two approaches.
• Adapts dynamically according to the size and shape of the trees.
For continuous attributes, three different approaches are used:
• Sort continuous attributes at each node of the tree, as in C4.5.
• Discretize continuous attributes, as in SPEC [36] (Srivastava, Han, Kumar, and Singh, 1997).
• Use a pre-sorted list for each continuous attribute, as in SPRINT [35] (Shafer, Agrawal, and Mehta, VLDB'96) and ScalParC [26] (Joshi, Karypis, and Kumar, IPPS'98).
Many parallel formulations of decision tree based classifiers have been developed [28, 35, 36]. Among these, the most relevant one is the parallel formulation of SPRINT, as it requires sorting of continuous attributes only once. SPRINT's design allows it to parallelize the split determining phase effectively. The parallel formulation proposed for the splitting phase, however, is inherently unscalable in both memory requirements and runtime. It builds the required hash table on all processors by gathering the record_id-to-node mapping from all the processors. For this phase, the communication overhead per processor is O(N), where N is the number of records in the training set. Apart from the initial sorting phase, the serial runtime of a classifier is O(N); hence, SPRINT is unscalable in runtime. It is unscalable in memory requirements also, because the memory requirement per processor is O(N): the size of the hash table is of the same order as the size of the training dataset for the upper levels of the decision tree, and it resides on every processor.
5. Clustering
Clustering is grouping some points in some space into a small number of clusters [18, 24, 38, 41], each cluster consisting of points that are "near" in some sense.
Example 5.1: Documents may be thought of as points in a high-dimensional space,
where each dimension corresponds to one possible word. The position of a document in a
dimension is the number of times the word occurs in the document (or just 1 if it occurs,
0 if not). Clusters of documents in this space often correspond to groups of documents on
the same topic.
Example 5.2: Many years ago, during a cholera outbreak in London, a physician plotted
the location of cases on a map. The data indicated that cases clustered around certain
intersections, where there were polluted wells, not only exposing the cause of cholera, but
indicating what to do about the problem.
5.1 Distance Measures
To consider whether a set of points is close enough to be considered a cluster, we need a distance measure D(x, y) that tells us how far points x and y are from each other. The usual axioms for a distance measure D are:
• D(x, x) = 0; a point is at distance 0 from itself.
• D(x, y) = D(y, x); D is symmetric.
• D(x, y) ≤ D(x, z) + D(z, y); the triangle inequality.
Often, our points may be thought of as living in a k-dimensional Euclidean space, and the distance between any two points, say x = [x1, x2, ..., xk] and y = [y1, y2, ..., yk], is given in one of the usual manners:
• Common distance ("L2 norm"): √( Σi=1..k (xi - yi)² )
• Manhattan distance ("L1 norm"): Σi=1..k |xi - yi|
• Max of dimensions ("L∞ norm"): max i=1..k |xi - yi|
In some cases, when there is no Euclidean space to place the points, we need to have
other forms of distance measures.
Example 5.3: In DNA sequences, two sequences may be similar even though there are some insertions and deletions as well as changes in some characters. For the two sequences abcde and bcdxye (they don't have any positions in common, and don't even have the same length), we can define the distance function D(x, y) = |x| + |y| - 2|LCS(x, y)|, where LCS stands for the longest common subsequence of x and y. In our example, LCS(abcde, bcdxye) is bcde, of length 4, so D(abcde, bcdxye) = 5 + 6 - 2×4 = 3; i.e., the strings are fairly close.
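A brief sketch of this measure (a straightforward recursive LCS with memoization, not optimized for long sequences):

```python
from functools import lru_cache

def lcs_distance(x, y):
    """D(x, y) = |x| + |y| - 2 * |LCS(x, y)|, as in Example 5.3."""
    @lru_cache(maxsize=None)
    def lcs(i, j):
        # Length of the longest common subsequence of x[:i] and y[:j].
        if i == 0 or j == 0:
            return 0
        if x[i - 1] == y[j - 1]:
            return lcs(i - 1, j - 1) + 1
        return max(lcs(i - 1, j), lcs(i, j - 1))

    return len(x) + len(y) - 2 * lcs(len(x), len(y))

print(lcs_distance("abcde", "bcdxye"))  # 5 + 6 - 2*4 = 3
```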
5.2 Approaches to Clustering
Clustering algorithms [24, 25, 38] are divided into two broad classes:
• Centroid approaches. The centroid or central point of each cluster is estimated, and points are assigned to the cluster of their nearest centroid.
• Hierarchical approaches. Starting with each point as a cluster by itself, nearby clusters are repeatedly merged.
The clustering algorithms are classified according to:
• whether or not they assume a Euclidean distance, and
• whether they use a centroid or a hierarchical approach.
The following three algorithms are examples of the above classification:
• BFR: centroid based; assumes a Euclidean measure, with clusters formed by a Gaussian process in each dimension around the centroid.
• GRGPF: centroid based, but uses only a distance measure, not a Euclidean space.
• CURE: hierarchical and Euclidean; this algorithm deals with odd-shaped clusters.
In the following subsections, we give some of the popular clustering algorithms.
5.3 Centroid Clustering
5.3.1 The k-Means Algorithm
The k-Means algorithm [24, 25, 38] is a popular main-memory algorithm. k cluster
centroids are picked and points are assigned to the clusters by picking the closest centroid
to the point in question. As points are assigned to clusters, the centroid of the cluster may
migrate.
Example 5.4: In the following figure, we have five points in the 2-dimensional space.
For k=2, points 1 and 2 are assigned to the two clusters, and become their centroids for
the moment.
[Figure: five points (1-5) in the plane; as points 3, 4, and 5 are assigned, the cluster centroids migrate to the positions marked a, b, and c.]
For point 3, suppose it is closer to 2, so 3 joins the cluster of 2, whose centroid moves to
the point indicated as a. Suppose that point 4 is closer to 1 than to a, so 4 joins 1 in its
cluster, whose center moves to b. Finally, 5 is closer to a than b, so it joins the cluster { 2,
3}, whose centroid moves to c.
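A minimal Python sketch of k-means for points in the Euclidean plane; unlike the incremental assignment walked through in Example 5.4, this batch version reassigns all points and then recomputes the centroids on each iteration.

```python
import math
import random

def kmeans(points, k, iterations=100):
    """Assign each point to its nearest centroid, then recompute each centroid
    as the mean of its cluster; repeat for a fixed number of iterations."""
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (8.5, 7.5), (9.0, 8.5)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```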
5.3.2 The BFR Algorithm
Based on k-means, this algorithm [38] reads its data once, consuming a main-memory-full at a time. The algorithm works best if the clusters are normally distributed around a central point, perhaps with a different standard deviation in each dimension. The figure below suggests what the data belonging to a typical cluster in two dimensions might look like. A centroid, marked by +, has points scattered around it, with the standard deviation σ in the horizontal dimension being twice what it is in the vertical dimension. About 70% of the points will lie within the 1σ ellipse; 95% will lie within 2σ, 99.9% within 3σ, and 99.9999% within 4σ.
[Figure: a centroid (+) surrounded by points, with the 1σ and 2σ ellipses drawn around it.]
A cluster consists of:
• A central core, the Discard Set (DS). This set of points is considered certain to belong to the cluster. All the points in this set are replaced by some simple statistics, described below. Although called "discarded" points, these points in truth have a significant effect throughout the running of the algorithm, since they determine collectively where the centroid is and what the standard deviation of the cluster is in each dimension.
• Surrounding sub-clusters, the Compression Set (CS). Each sub-cluster in the CS consists of a group of points that are sufficiently close to each other that they can be replaced by their statistics, just like the DS for a cluster is. However, they are sufficiently far away from any cluster's centroid that we are not yet sure which cluster they belong to.
• Individual points that are not part of a cluster or sub-cluster, the Retained Set (RS). These points can neither be assigned to any cluster nor grouped into a sub-cluster of the CS. They are stored in main memory, as individual points, along with the statistics of the DS and CS.
The statistics used to represent each cluster of the DS and each sub-cluster of the CS are:
• The count of the number of points, N.
• The vector of sums of the coordinates of the points in each dimension. The vector is called SUM, and the component in the ith dimension is SUMi.
• The vector of sums of squares of the coordinates of the points in each dimension, called SUMSQ. The component in dimension i is SUMSQi.
For k dimensions, 2k + 1 values are needed to compute the important statistics of a cluster or sub-cluster. The mean and variance in each dimension are:
• The coordinate of the centroid of the cluster in dimension i is SUMi / N.
• The variance in dimension i is SUMSQi / N - (SUMi / N)².
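A small sketch of these summary statistics (the 2k + 1 values kept per cluster or sub-cluster), assuming numeric point coordinates:

```python
class ClusterSummary:
    """BFR-style summary of a set of points: N, SUM, and SUMSQ per dimension."""

    def __init__(self, dims):
        self.n = 0
        self.sum = [0.0] * dims
        self.sumsq = [0.0] * dims

    def add(self, point):
        """Fold a point into the summary; the point itself can then be discarded."""
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.sum, self.sumsq)]

summary = ClusterSummary(dims=2)
for p in [(1.0, 2.0), (3.0, 2.0), (2.0, 5.0)]:
    summary.add(p)
print(summary.centroid(), summary.variance())  # [2.0, 3.0] [0.666..., 2.0]
```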
5.3.3 Fastmap
Fastmap [25] picks k pairs of points (ai, bi), each of which serves as the "ends" of one of the k axes of the k-dimensional space. Using the law of cosines, we can calculate the "projection" x of any point c onto the line ab, using only the distances between points, not any assumed coordinates of these points in a plane.
[Figure: point c projected onto the line from a to b; the projection lies at distance x from a, and the triangle has sides D(a, c), D(b, c), and D(a, b).]
The formula is
x = (D²(a, c) + D²(a, b) - D²(b, c)) / (2 D(a, b))
Having picked a pair of points (a, b) as an axis, part of the distance between any two points c and d is accounted for by the projections of c and d onto the line ab, and the remainder of the distance is in other dimensions. If the projections of c and d are x and y, respectively, then in the future (as we select other axes) the distance Dcurrent(c, d) should be related to the given distance function D by
D²current(c, d) = D²(c, d) - (x - y)²
[Figure: points c and d with projections x and y on the axis through a and b; Dcurrent(c, d) is the part of D(c, d) not accounted for by that axis.]
The Fastmap algorithm computes, for each point c, k projections c1, c2, …, ck onto the k axes, which are determined by the pairs of points (a1, b1), (a2, b2), …, (ak, bk). For i = 1, 2, …, k, do the following:
• Using the current distance Dcurrent, pick ai and bi as follows:
  o Pick a random point c.
  o Pick ai to be the point as far as possible from c, using distance Dcurrent.
  o Pick bi to be the point as far as possible from ai.
• For each point x, compute xi using the law-of-cosines formula described above.
• Change the definition of Dcurrent to subtract the distance in the ith dimension as well as the previous dimensions; that is,
D²current(x, y) = D²(x, y) - Σj≤i (x(j) - y(j))²
5.4 Hierarchical Clustering
Hierarchical clustering [25, 38] is a general technique that can take, in the worst case, O(n²) time to cluster n points. The general outline of this approach is as follows:
• Start with each point in a cluster by itself.
• Repeatedly select two clusters to merge. In general, we want to pick the two clusters that are closest, but there are various ways we could measure "closeness." Some possibilities:
  o Distance between their centroids (or, if the space is not Euclidean, between their clustroids).
  o Minimum distance between nodes in the clusters.
  o Maximum distance between nodes in the clusters.
  o Average distance between nodes of the clusters.
• End the merger process when we have "few enough" clusters. Possibilities:
  o Use a k-means approach: merge until only k clusters remain.
  o Stop merging clusters when the only clusters that can result from merging fail to meet some criterion of compactness, e.g., the average distance of nodes to their clustroid or centroid is too high.
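A compact sketch of the centroid-distance variant of this procedure for Euclidean points, merging until k clusters remain:

```python
import math

def hierarchical_cluster(points, k):
    """Agglomerative clustering: start with singleton clusters and repeatedly
    merge the two clusters whose centroids are closest, until k clusters remain."""
    clusters = [[p] for p in points]

    def centroid(cluster):
        return tuple(sum(c) / len(cluster) for c in zip(*cluster))

    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
print(hierarchical_cluster(points, k=3))
```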
5.4.1 The GRGPF Algorithm
This algorithm [25, 38] assumes there is a distance measure D, but no Euclidean space. It
also assumes that there is too much data to fit in main memory. The data structure it uses
to store clusters is like an R-tree. Nodes of the tree are disk blocks, and we store different
things at leaf and interior nodes:
 In leaf blocks, we store cluster features that summarize a cluster in a manner
similar to BFR. However, since there is no Euclidean space, the \features are
somewhat different, as follows:
o The number of points in the cluster, N.
o The clustroid: that point in the cluster that minimizes the rowsum, i.e., the
sum of the squares of the distances to the other points of the cluster.
 If C is a cluster, C' will denote its clustroid. Thus, the rowsum of the clustroid is
$$\mathrm{rowsum}(C') = \sum_{x \in C} D^2(C', x)$$
 Notice that the rowsum of the clustroid is analogous to the statistic SUMSQ that was used in BFR. However, SUMSQ is relative to the origin of the Euclidean space, while GRGPF assumes no such space. The rowsum can be used to compute a statistic, the radius of the cluster, that is analogous to the standard deviation of a cluster in BFR. The formula is $\mathrm{radius} = \sqrt{\mathrm{rowsum}/N}$.
o The p points in the cluster that are closest to the clustroid and their rowsums,
for some chosen constant p.
o The p points in the cluster that are farthest from the clustroid.
 In interior nodes, we keep samples of the clustroids of the clusters represented by the descendants of that tree node. An effort is made to keep the clusters in each subtree close together. As in an R-tree, the interior nodes thus indicate the approximate region in which the clusters of their descendants are found. When we need to insert a point into some cluster, we start at the root and proceed down the tree, choosing only those paths along which a reasonably close cluster might be found, judging from the samples at each interior node.
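To illustrate the leaf-level cluster features listed above, the following Python sketch computes the clustroid, the radius derived from its rowsum, and the p closest and farthest points for a single cluster; the names are mine, and unlike GRGPF itself the sketch assumes the whole cluster fits in main memory.

def cluster_features(cluster, dist, p=3):
    """Summarize one cluster the way a GRGPF leaf entry does (illustrative sketch).

    cluster : list of points (any objects)
    dist    : function dist(x, y) returning the distance D(x, y)
    p       : number of nearest/farthest points kept alongside the clustroid
    """
    n = len(cluster)

    def rowsum(x):
        # Sum of the squares of the distances from x to the other points.
        return sum(dist(x, y) ** 2 for y in cluster if y is not x)

    # The clustroid is the point that minimizes the rowsum.
    rs_min, clustroid = min(((rowsum(x), x) for x in cluster), key=lambda t: t[0])

    # Radius: square root of the mean squared distance to the clustroid,
    # playing the role the standard deviation plays in BFR.
    radius = (rs_min / n) ** 0.5

    others = sorted((x for x in cluster if x is not clustroid),
                    key=lambda x: dist(clustroid, x))
    return {"N": n,
            "clustroid": clustroid,
            "radius": radius,
            "closest_to_clustroid": others[:p],
            "farthest_from_clustroid": others[-p:]}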
5.4.2 CURE
The outline of the CURE algorithm [38] is:
 Start with a main memory full of random points. Cluster these points using the
hierarchical approach.
 For each cluster, choose c "sample" points for some constant c. These points are picked to be as dispersed as possible, then moved slightly closer to the mean, as follows:
o Pick the first sample point to be the point of the cluster farthest from the
centroid.
o Repeatedly pick additional sample points by choosing that point of the
cluster whose minimum distance to an already chosen sample point is as
great as possible.
o When c sample points are chosen, move all the samples toward the
centroid by some fractional distance, e.g., 20% of the way toward the
centroid. As a result, the sample points need not be real points of the
cluster, but that fact is unimportant. The net effect is that the samples are
"typical" points, well dispersed around the cluster, no matter what the
cluster's shape is.
 Assign all points, including those involved in the first two steps, to the nearest cluster, where "nearest" means shortest distance to some sample point (see the sketch below).
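The selection and shrinking of sample points is the distinctive step of CURE; the following Python sketch (my own, with hypothetical names) picks c well-dispersed representatives from one cluster and then moves each one a fraction alpha of the way toward the centroid, matching the description above.

def cure_representatives(cluster, c=6, alpha=0.2):
    """Pick c dispersed sample points for a cluster and move each one
    alpha of the way toward the centroid (illustrative sketch only).

    cluster : list of points, each a tuple of coordinates
    """
    dims = len(cluster[0])
    centroid = tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(dims))

    def dist_sq(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # First sample: the point of the cluster farthest from the centroid.
    samples = [max(cluster, key=lambda p: dist_sq(p, centroid))]

    # Each further sample maximizes its minimum distance to the samples chosen so far.
    while len(samples) < min(c, len(cluster)):
        candidates = [p for p in cluster if p not in samples]
        if not candidates:          # fewer than c distinct points in the cluster
            break
        samples.append(max(candidates,
                           key=lambda p: min(dist_sq(p, s) for s in samples)))

    # Move every sample alpha of the way toward the centroid (e.g., 20%).
    return [tuple(x + alpha * (m - x) for x, m in zip(p, centroid))
            for p in samples]

With c = 6 and alpha = 0.2 this corresponds to the setting of Example 5.5 below.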
Example 5.5: For c=6, 6 sample points are picked from an elongated cluster, and then
moved 20% of the way toward the centroid.
[Figure: an elongated cluster with six numbered sample points dispersed around it; + marks the centroid toward which the samples are moved.]
6. Conclusions
In this paper, we have reviewed various data mining techniques. Most data mining techniques share the same basic requirements: they must handle large amounts of data, they must be efficient because of that volume, and they must be accurate, since accuracy is an essential element of the data mining process.
Many effective data mining techniques for association mining, classification and clustering have been discussed. Parallelism is an important trend in all data mining tasks: it promises to accelerate the data mining step, and hence the whole KDD process, and to improve the quality of the results. Most parallel data mining techniques are derived from sequential ones, and their methodologies can lead to fewer and shorter iterations within the knowledge discovery loop. However, handling the data mining process in parallel and distributed environments ultimately calls for a new generation of techniques designed specifically for those environments.
Another issue that should be considered is the nature of the data itself. The choice of data mining technique should depend on parameters such as the distribution and homogeneity of the data: a technique that performs excellently at one data volume may perform poorly at another.
References
[1] R. Agrawal, T. Imielinski and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," Proc. ACM-SIGMOD Int. Conf. on Management of Data, Washington, D.C., 1993.
[2] R. Agrawal and J.C. Shafer, "Parallel Mining of Association Rules," IEEE Trans. on Knowledge and Data Eng., 8(6):962-969, December 1996.
[3] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. of 20th VLDB Conference, 1994.
[4] R. Agrawal and R. Srikant, "Mining Sequential Patterns," Proc. 11th Intl. Conf. on Data Engineering, Taipei, Taiwan, March 1995.
[5] M. Ankerst, M. Ester and H.P. Kriegel, "Visual Classification: An Interactive Approach to Decision Tree Construction," Proc. Int. Conf. on Knowledge Discovery and Data Mining, pp. 392-397, 1999.
[6] M. Ankerst, M. Ester and H.P. Kriegel, "Towards an Effective Cooperation of the User and the Computer for Classification," Proc. Int. Conf. on Knowledge Discovery and Data Mining, pp. 178-188, 2000.
[7] M. Berry and G. Linoff, Data Mining Techniques (For Marketing, Sales, and Customer Support), John Wiley & Sons, 1997.
[8] P. Bollmann-Sdorra, A.M. Hafez and V.V. Raghavan, "A Theoretical Framework for Association Mining Based on the Boolean Retrieval Model," DaWaK 2001, September 2001.
[9] R. Brachmann and T. Anand, "The Process of Knowledge Discovery in Databases: A Human-Centered Approach," Advances in Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, CA, pp. 37-58.
[10] L. Breiman, J.H. Friedman, R.A. Olshen and C.J. Stone, Classification and Regression Trees, Wadsworth, Belmont, 1984.
[11] S. Brin et al., "Dynamic Itemset Counting and Implication Rules for Market Basket Data," Proc. ACM SIGMOD Conf. on Management of Data, ACM Press, New York, pp. 255-264, 1997.
[12] W. Buntine, "A Guide to the Literature on Learning Probabilistic Networks from Data," IEEE Transactions on Knowledge and Data Engineering, 8(2):195-210, April 1996.
[13] W. Buntine, "Graphical Models for Discovering Knowledge," in Advances in Knowledge Discovery and Data Mining, eds.
[14] V. Cherkassky and F. Mulier, Learning from Data, John Wiley & Sons, 1998.
[15] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT Press, 1996.
[16] U. Fayyad, G. Piatetsky-Shapiro and P. Smyth, "From Data Mining to Knowledge Discovery: An Overview," Advances in Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, CA, pp. 1-30.
[17] U. Fayyad, G. Piatetsky-Shapiro and P. Smyth, "The KDD Process for Extracting Useful Knowledge from Volumes of Data," Communications of the ACM, 39(11).
[18] A. Freitas and S. Lavington, Mining Very Large Databases with Parallel Processing, Kluwer Academic Publishers, 1998.
[19] N. Friedman, D. Geiger and M. Goldszmidt, "Bayesian Network Classifiers," Machine Learning, 29:131-163, 1997.
[20] A.M. Hafez, "A Dynamic Approach for Knowledge Discovery of Web Access Patterns," ISMIS 2000, pp. 130-138, 2000.
[21] A.M. Hafez, "Association Mining of Dependency between Time Series," Proceedings of SPIE Vol. 4384, SPIE AeroSense, April 2001.
[22] A.M. Hafez and V.V. Raghavan, "A Matrix Approach for Association Mining," ISCA 10th International Conference on Intelligent Systems, June 13-15, 2001.
[23] E.H. Han, G. Karypis and V. Kumar, "Scalable Parallel Data Mining for Association Rules," Proc. 1997 ACM-SIGMOD Int. Conf. on Management of Data, Tucson, Arizona, 1997.
[24] E.H. Han, G. Karypis, V. Kumar and B. Mobasher, "Clustering Based on Association Rule Hypergraphs," SIGMOD'97 Workshop on Research Issues on Data Mining and Knowledge Discovery.
[25] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
[26] M. Joshi, G. Karypis and V. Kumar, "ScalParC: A New Scalable and Efficient Parallel Classification Algorithm for Mining Large Datasets," Proc. 12th International Parallel Processing Symposium (IPPS), Orlando, 1998.
[27] M. Joshi, G. Karypis and V. Kumar, "Parallel Algorithms for Sequential Associations: Issues and Challenges," Minisymposium Talk at Ninth SIAM International Conference on Parallel Processing (PP'99), San Antonio, 1999.
[28] V. Kumar, A. Grama, A. Gupta and G. Karypis, Introduction to Parallel Computing: Algorithm Design and Analysis, Benjamin-Cummings/Addison Wesley, Redwood City, CA, 1994.
[29] M. Mehta, R. Agrawal and J. Rissanen, "SLIQ: A Fast Scalable Classifier for Data Mining," Proc. of 5th International Conference on Extending Database Technology (EDBT), Avignon, France, March 1996.
[30] D. Michie, D.J. Spiegelhalter and C.C. Taylor, Machine Learning, Neural and Statistical Classification, Ellis Horwood, 1994.
[31] D. Pyle, Data Preparation for Data Mining, Morgan Kaufmann, 1999.
[32] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[33] V.V. Raghavan and A.M. Hafez, "Dynamic Data Mining," IEA/AIE 2000, pp. 220-229, 2000.
[34] A. Savasere, E. Omiecinski and S. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases," Proc. 21st Int'l Conf. on Very Large Data Bases, 1995.
[35] J. Shafer, R. Agrawal and M. Mehta, "SPRINT: A Scalable Parallel Classifier for Data Mining," Proc. of 22nd International Conference on Very Large Databases, Mumbai, India, September 1996.
[36] A. Srivastava, E.H. Han, V. Kumar and V. Singh, "Parallel Formulations of Decision-Tree Classification Algorithms," Proc. 12th International Parallel Processing Symposium (IPPS), Orlando, 1998.
[37] R. Srikant and R. Agrawal, "Mining Sequential Patterns: Generalizations and Performance Improvements," Proc. of 5th Int. Conf. on Extending Database Technology (EDBT), Avignon, France, 1996.
[38] A.K. Tung, J. Han, L.V. Lakshmanan and R.T. Ng, "Constraint-Based Clustering in Large Databases," Proc. Int. Conf. on Database Theory, pp. 405-419, 2001.
[39] K. Wang, Y. He and J. Han, "Mining Frequent Itemsets Using Support Constraints," Proc. 26th Int. Conf. on Very Large Data Bases, pp. 43-52, 2000.
[40] M. Ware, E. Frank, G. Holmes, M. Hall and I.H. Witten, "Interactive Machine Learning: Letting Users Build Classifiers," http://www.cs.waikato.ac.nz/ml/publications.html
[41] S.M. Weiss and N. Indurkhya, Predictive Data Mining (A Practical Guide), Morgan Kaufmann Publishers, 1998.
[42] P.C. Wong, "Visual Data Mining," IEEE Computer Graphics and Applications, 19(5), pp. 20-21, 1999.