Knowledge Discovery in Databases

Alaaeldin M. Hafez
Department of Computer Science and Automatic Control
Faculty of Engineering, Alexandria University

Abstract. Business information obtained through advanced data analysis and data mining is a critical success factor for companies wishing to maximize competitive advantage. Traditional tools and techniques for discovering knowledge are slow and laborious and do not give the right information at the right time. Knowledge discovery is defined as "the non-trivial extraction of implicit, unknown, and potentially useful information from data". There are many knowledge discovery methodologies in use and under development; some of these techniques are generic, while others are domain-specific. In this paper, we present a review outlining the state-of-the-art techniques in knowledge discovery in database systems.

1. Introduction

Recent years have seen an enormous increase in the amount of information stored in electronic format. It has been estimated that the amount of collected information in the world doubles every 20 months, that the size and number of databases are increasing even faster, and that the ability to rapidly collect data has outpaced the ability to analyze it. Information is crucial for decision making, especially in business operations. As a response to those trends, the term 'Data Mining' (or 'Knowledge Discovery') has been coined to describe a variety of techniques that identify nuggets of information or decision-making knowledge in bodies of data, and extract them in such a way that they can be put to use in areas such as decision support, prediction, forecasting and estimation.

Automated tools must be developed to help extract meaningful information from a flood of information. Moreover, these tools must be sophisticated enough to search for correlations among the data unspecified by the user, as the potential for unforeseen relationships to exist among the data is very high. A successful tool set will locate useful nuggets of information in the otherwise chaotic data space, and present them to the user in a contextual format. A new generation of techniques is urgently needed for automating data mining and knowledge discovery in databases (KDD). KDD is a broad area that integrates methods from several fields including statistics, databases, AI, machine learning, pattern recognition, machine discovery, uncertainty modeling, data visualization, high performance computing, optimization, management information systems (MIS), and knowledge-based systems.

The term "knowledge discovery in databases" is defined as the process of identifying useful and novel structure (models) in data [1, 3, 4, 17, 31]. It can be viewed as a multistage process, whose stages are summarized as follows:
- Data gathering, e.g., from databases, data warehouses, or Web crawling.
- Data cleansing: eliminating errors, e.g., GPA = 7.3.
- Feature extraction: obtaining only the interesting attributes of the data.
- Data mining: discovering and extracting meaningful patterns.
- Visualization of the data.
- Verification and evaluation of the results; drawing conclusions.

Data mining is considered the main step in the knowledge discovery process; it is concerned with the algorithms used to extract potentially valuable patterns, associations, trends, sequences and dependencies in data [1, 3, 4, 7, 22, 33, 34, 37, 39].
Key business examples include web site access analysis for improvements in e-commerce advertising, fraud detection, screening and investigation, retail site or product analysis, and customer segmentation. Data mining techniques can discover information that many traditional business analysis and statistical techniques fail to deliver. Additionally, the application of data mining techniques further exploits the value of the data warehouse by converting expensive volumes of data into valuable assets for future tactical and strategic business development. Management information systems should provide advanced capabilities that give the user the power to ask more sophisticated and pertinent questions; this empowers the right people by providing the specific information they need.

Data mining techniques can be categorized either by tasks or by methods.

1.1 Data Mining Tasks
- Association Rule Discovery.
- Classification.
- Clustering.
- Sequential Pattern Discovery.
- Regression.
- Deviation Detection.

Most researchers refer to the first three data mining tasks as the main data mining tasks. The rest are either related more to other fields, such as regression, or considered a sub-field of one of the main tasks, such as sequential pattern discovery and deviation detection.

1.2 Data Mining Methods
- Algorithms for mining spatial, textual, and other complex data.
- Incremental discovery methods and re-use of discovered knowledge.
- Integration of discovery methods.
- Data structures and query evaluation methods for data mining.
- Parallel and distributed data mining techniques.
- Issues and challenges for dealing with massive or small data sets.
- Fundamental issues from statistics, databases, optimization, and information processing in general as they relate to problems of extracting patterns and models from data.

Because of the limited space, we limit our discussion to the main techniques in data mining tasks. In section 2, we briefly discuss and give examples of the various data mining tasks. In sections 3, 4 and 5, we discuss the main features and techniques in association mining, classification and clustering, respectively. The paper is concluded in section 6.

2 Data Mining Tasks

2.1 Association Rule Discovery

Given a set of records, each of which contains some set of items from a given collection of items, produce dependency rules that predict occurrences of one item based on the occurrences of some other items [1, 3, 27, 28].

Transaction ID   Items
1                Milk, Bread, Diaper
2                Milk, Bread
3                Coke, Milk, Bread, Diaper
4                Coke, Coffee, Bread
5                Bread, Coke
6                Milk, Coke, Coffee

Some of the discovered rules are:
  Bread ⇒ Milk
  Milk, Bread ⇒ Diaper

Example 2.1: (Supermarket shelf management) Identify items that are bought together by sufficiently many customers.
Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items. Stack those items most likely to be bought together next to each other.

Example 2.2: (Inventory Management) A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households.
Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.

2.2 Classification
Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class [10, 12, 30, 32, 35], find a model for the class attribute as a function of the values of the other attributes, such that previously unseen records are assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Example 2.3: (Marketing) Reduce the cost of mailing by targeting a set of consumers likely to buy a new car.
Approach: Use the data for a similar product introduced before. Define the class attribute: which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute. For all such customers, collect various lifestyle, demographic, and business related information such as salary, type of business, city, etc. Use this information as input attributes to learn a classifier model.

Example 2.4: (Fraud Detection) Predict fraudulent cases in credit card transactions.
Approach: Use credit card transactions and the information on the account-holder as attributes, such as when a customer buys, what he buys, how often he pays on time, etc. Label past transactions as fraudulent or fair; this forms the class attribute. Learn a model for the class of the transactions, and use this model to detect fraud by observing credit card transactions on an account.

2.3 Clustering

Given a set of data points, each having a set of attributes, and a similarity measure among them [24, 25, 28, 41], find clusters such that
- data points in one cluster are more similar to one another, and
- data points in separate clusters are less similar to one another,
where the similarity measure is the Euclidean distance if the attributes are continuous, or some other problem-specific measure.

Example 2.5: (Market Segmentation) Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
Approach: Collect different attributes of customers based on their geographical and lifestyle related information. Find clusters of similar customers. Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.

Example 2.6: (Document Clustering) Find groups of documents that are similar to each other based on the important terms appearing in them.
Approach: Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms, and use it to cluster. Information retrieval can utilize the clusters to relate a new document or search term to clustered documents.

2.4 Sequential Pattern Discovery

Given a set of event sequences, find rules that predict strong sequential dependencies among different events [4, 20, 37]. Rules are formed by first discovering patterns; event occurrences in the patterns are governed by timing constraints.

Example 2.7: (Telecommunications Alarms) In telecommunication alarm logs, a rule such as Rectifier_Alarm ⇒ Fire_Alarm may be discovered.

2.5 Regression

Regression techniques are important in the statistics and neural network fields [10]. Assuming a linear or nonlinear model of dependency, they are used to predict the value of a given continuous-valued variable based on the values of other variables.
Example 2.8:
- Predicting the sales amount of a new product based on advertising expenditure.
- Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
- Time series prediction of stock market indices.

2.6 Deviation Detection

Deviation detection deals with discovering the most significant changes, which are often infrequent, in data from previously measured values.

Example 2.9: Outlier detection in statistics.

3. Association Rule Discovery

Association rule discovery, or association mining, which discovers dependencies among values of an attribute, was introduced by Agrawal et al. [1] and has emerged as a prominent research area. The association mining problem, also referred to as the market basket problem, can be formally defined as follows. Let I = {i1, i2, ..., in} be a set of items and S = {s1, s2, ..., sm} be a set of transactions, where each transaction si ∈ S is a set of items, that is si ⊆ I. An association rule, denoted by X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅, describes the existence of a relationship between the two itemsets X and Y.

Several measures have been introduced to define the strength of the relationship between itemsets X and Y, such as support, confidence, and interest. The definitions of these measures, from a probabilistic model, are given below.
- Support(X ⇒ Y) = P(X, Y): the percentage of transactions in the database that contain both X and Y.
- Confidence(X ⇒ Y) = P(X, Y) / P(X): the percentage of transactions containing Y among those transactions that contain X.
- Interest(X ⇒ Y) = P(X, Y) / (P(X) P(Y)): represents a test of statistical independence.

The problem of finding all association rules that have support and confidence greater than some user-specified minimum support and confidence in a database D can be viewed as two subproblems:
- Find all sets of items (itemsets) that have transaction support above the minimum support. The support of an itemset is the number of transactions that contain the itemset. Itemsets with minimum support are called large (or frequent) itemsets, and an itemset of size k is a k-itemset.
- Use the large itemsets to generate the desired association rules with minimum confidence.

Most association mining algorithms concentrate on finding large itemsets, and consider generating association rules as a straightforward procedure.

3.1 Spectrum of Association Mining Techniques

Many algorithms [1, 2, 3, 5, 7, 11, 22, 23, 33, 34, 37, 39] have been proposed to generate association rules that satisfy certain measures. A close examination of those algorithms reveals that the spectrum of techniques that generate association rules has two extremes:
- A transaction data file is repeatedly scanned to generate large itemsets. The scanning process stops when there are no more itemsets to be generated.
- A transaction data file is scanned only once to build a complete transaction lattice. Each node in the lattice represents a possible large itemset, and a count is attached to each node to reflect the frequency of the itemset represented by that node.

In the first case, since the transaction data file is traversed many times, the cost of generating large itemsets is high. In the latter case, while the transaction data file is traversed only once, the maximum number of nodes in the transaction lattice is 2^n, where n is the cardinality of I, the set of items; maintaining such a structure is expensive.

In association mining, the itemsets form a lattice spanned by the subset relation, and frequency is downward closed under this relation; in other words, if B is a frequent itemset, then all subsets A ⊆ B are also frequent.
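To make the support and confidence measures concrete, the following is a minimal Python sketch (an illustration, not code from the cited literature) that computes them over a small in-memory transaction list such as the one in Section 2.1; the candidate rules and the data layout are assumptions made for the example.

# Illustrative sketch: support and confidence over the Section 2.1 transactions.
transactions = [
    {"Milk", "Bread", "Diaper"},
    {"Milk", "Bread"},
    {"Coke", "Milk", "Bread", "Diaper"},
    {"Coke", "Coffee", "Bread"},
    {"Bread", "Coke"},
    {"Milk", "Coke", "Coffee"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """Support of X union Y divided by the support of X."""
    return support(x | y) / support(x)

# The two rules reported in Section 2.1:
print(support({"Bread", "Milk"}), confidence({"Bread"}, {"Milk"}))      # Bread => Milk
print(confidence({"Milk", "Bread"}, {"Diaper"}))                        # Milk, Bread => Diaper

On this toy data, Bread ⇒ Milk has support 0.5 and confidence 0.6, which is how a minimum-support/minimum-confidence threshold would be applied in practice.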
Association mining algorithms differ in the way they search the itemset lattice spanned by the subset relation. Most approaches use a level-wise or bottom-up search of the lattice to enumerate the frequent itemsets. If long frequent itemsets are expected, a pure top-down approach might be preferred. Some have proposed a hybrid search, which combines the top-down and bottom-up approaches.

3.2 New Candidate Generation

Association mining algorithms can also differ in the way they generate new candidates. A complete search, the dominant approach, guarantees that we can generate and test all frequent subsets. Here, complete does not mean exhaustive; we can use pruning to eliminate useless branches in the search space. Heuristic generation sacrifices completeness for the sake of speed: at each step, it only examines a limited number of "good" branches. Random search to locate the maximal frequent itemsets is also possible; methods that can be used here include genetic algorithms and simulated annealing. Because of a strong emphasis on completeness, the association mining literature has not given much attention to the last two methods.

Association mining algorithms also differ depending on whether they generate all frequent subsets or only the maximal ones. Identifying the maximal itemsets is the core task, because an additional database scan can generate all other subsets. Nevertheless, the majority of algorithms list all frequent itemsets.

3.3 Association Mining Algorithms

3.3.1 The Apriori Algorithm

The Apriori algorithm [1] is considered the most famous algorithm in the area of association mining. Apriori starts with a seed set of itemsets found to be large in the previous pass, and uses it to generate new potentially large itemsets (called candidate itemsets). The actual support for these candidates is counted during the pass over the data, and non-large candidates are thrown out. The main outline of the Apriori algorithm is described below.

Apriori algorithm:
  // First pass: count item occurrences to determine the large 1-itemsets.
  L1 = {large 1-itemsets};
  // Second and subsequent passes:
  for (k = 2; Lk-1 is not empty; k++) {
      Ck = apriori-gen(Lk-1);        // new candidates
      forall transactions t in D {
          Ct = subset(Ck, t);        // candidates contained in t
          forall candidates c in Ct
              c.count++;
      }
      Lk = { c in Ck | c.count >= minsupport };
  }
  Answer = Union over k of Lk;

where
- Lk is the set of large k-itemsets (i.e., those with minimum support); each member has an itemset and a support count.
- Ck is the set of candidate k-itemsets (potentially large); each member also has an itemset and a support count.

The function apriori-gen accepts the set of all large (k-1)-itemsets Lk-1 and returns a superset of the set of all large k-itemsets. First, it performs a join of Lk-1 with itself to generate Ck, and then it prunes from Ck all itemsets having some (k-1)-subset that is not in Lk-1.

In the Apriori algorithm, the transaction data file is repeatedly scanned to generate large itemsets; the scanning process stops when there are no more itemsets to be generated. For large databases, the Apriori algorithm requires extensive access to secondary storage, which can become a bottleneck for efficient processing. Several Apriori-based algorithms have been introduced to overcome this excessive use of I/O devices.

3.3.2 Apriori-Based Algorithms

The AprioriTid algorithm [4] is a variation of the Apriori algorithm. The AprioriTid algorithm also uses the apriori-gen function to determine the candidate itemsets before the pass begins.
The main difference from the Apriori algorithm is that the AprioriTid algorithm does not use the database for counting support after the first pass. Instead, a set of records of the form <TID, {Xk}> is used for counting, where each Xk is a potentially large k-itemset in the transaction with identifier TID. The benefit of using this scheme for counting support is that, at each pass other than the first, scanning the entire database is avoided. The downside is that the set of <TID, {Xk}> records generated at each pass may be huge.

Another algorithm, called AprioriHybrid, is introduced in [3]. The basic idea of the AprioriHybrid algorithm is to run the Apriori algorithm initially, and then switch to the AprioriTid algorithm when the generated database (i.e., the <TID, {Xk}> records) would fit in memory.

3.3.3 The Partition Algorithm

In [34], Savasere et al. introduced the Partition algorithm. The Partition algorithm logically partitions the database D into n partitions, and reads the entire database at most twice to generate the association rules. The reason for using the partition scheme is that any potential large itemset must appear as a large itemset in at least one of the partitions. The algorithm consists of two phases. In the first phase, the algorithm iterates n times, and during each iteration only one partition is considered. At any given iteration, the function gen_large_itemsets takes a single partition and generates local large itemsets of all lengths from this partition. The local large itemsets of the same length from all n partitions are then merged to generate the global candidate itemsets. In the second phase, the algorithm counts the support of each global candidate itemset and generates the global large itemsets. Note that the database is read twice during the process: once in the first phase and once in the second phase, where the support counting requires a scan of the entire database. Taking a minimal number of passes through the entire database drastically reduces the time spent on I/O.

3.3.4 The Dynamic Itemset Counting Algorithm

In the Dynamic Itemset Counting (DIC) algorithm [11], the database is divided into p equal-sized partitions so that each partition fits in memory. For partition 1, DIC gathers the supports of single items. Items found to be locally frequent (only in this partition) generate candidate 2-itemsets. Then DIC reads partition 2 and obtains supports for all current candidates, that is, the single items and the candidate 2-itemsets. This process repeats for the remaining partitions: DIC starts counting candidate k-itemsets while processing partition k in the first database scan. After the last partition p has been processed, the processing wraps around to partition 1 again. A candidate's global support is known once the processing wraps around the database and reaches the partition where the candidate was first generated.

DIC is effective in reducing the number of database scans if most partitions are homogeneous (have similar frequent itemset distributions). If the data is not homogeneous, DIC might generate many false positives (itemsets that are locally frequent but not globally frequent) and scan the database more times than Apriori does. DIC proposes a random partitioning technique to reduce the data partition skew.
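To make the level-wise, generate-and-test strategy shared by Apriori and its descendants concrete, the following is a minimal Python sketch (an illustration under simplifying assumptions, not the published implementations); the in-memory transaction list and the absolute minsup parameter are assumptions made for the example.

from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent itemset mining; returns {frozenset: support count}."""
    # First pass: large 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {s: c for s, c in counts.items() if c >= minsup}
    result = dict(large)
    k = 2
    while large:
        # apriori-gen: join L(k-1) with itself, then prune candidates
        # that have a (k-1)-subset which is not large.
        prev = set(large)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count candidate occurrences in one pass over the data.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        large = {s: c for s, c in counts.items() if c >= minsup}
        result.update(large)
        k += 1
    return result

# Example: the transactions of Section 2.1 with an absolute minimum support of 3.
data = [frozenset(t) for t in (
    {"Milk", "Bread", "Diaper"}, {"Milk", "Bread"},
    {"Coke", "Milk", "Bread", "Diaper"}, {"Coke", "Coffee", "Bread"},
    {"Bread", "Coke"}, {"Milk", "Coke", "Coffee"})]
print(apriori(data, minsup=3))

The sketch re-reads the (in-memory) transaction list once per level, which mirrors the multiple database scans that the Apriori-based, Partition and DIC algorithms above are all trying to reduce.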
3.4 New Trends in Association Mining

Many knowledge discovery applications, such as on-line services and the world wide web, are dynamic and require accurate mining information from data that changes on a regular basis. On the world wide web, hundreds of remote sites are created and removed every day. In such an environment, frequent or occasional updates may change the status of some rules discovered earlier. Performance is the main consideration in data mining. Discovering knowledge is an expensive operation; it requires extensive access to secondary storage, which can become a bottleneck for efficient processing. Two solutions have been proposed to improve the performance of the data mining process:
- Dynamic data mining
- Parallel data mining

3.4.1 Dynamic Data Mining

Using previously discovered knowledge along with new data updates to maintain discovered knowledge could solve many problems that have faced data mining techniques, namely database updates, accuracy of data mining results, gaining more knowledge and interpretation of the results, and performance. In [33], a dynamic approach that dynamically updates knowledge obtained from the previous data mining process was introduced. Transactions over a long duration are divided into a set of consecutive episodes. Information gained during the current episode depends on the current set of transactions and the information discovered during the last episode. The approach discovers the current data mining rules by using the updates that have occurred during the current episode along with the data mining rules that were discovered in the previous episode.

The SUPPORT of an itemset S is calculated as

    SUPPORT(S) = F(S) / F

where F(S) is the number of transactions containing S, and F is the total number of transactions. For a minimum SUPPORT value MINSUP, S is a large (or frequent) itemset if SUPPORT(S) >= MINSUP, or F(S) >= F * MINSUP.

Suppose we have divided the transaction set T into two subsets T1 and T2, corresponding to two consecutive time intervals, where F1 is the number of transactions in T1 and F2 is the number of transactions in T2 (F = F1 + F2), and F1(S) is the number of transactions containing S in T1 and F2(S) is the number of transactions containing S in T2 (F(S) = F1(S) + F2(S)). By calculating the SUPPORT of S in each of the two subsets, we get

    SUPPORT1(S) = F1(S) / F1   and   SUPPORT2(S) = F2(S) / F2

S is a large itemset if

    (F1(S) + F2(S)) / (F1 + F2) >= MINSUP,   or   F1(S) + F2(S) >= (F1 + F2) * MINSUP

In order to find out whether S is a large itemset or not, we consider four cases:
- S is a large itemset in T1 and also a large itemset in T2, i.e., F1(S) >= F1 * MINSUP and F2(S) >= F2 * MINSUP.
- S is a large itemset in T1 but a small itemset in T2, i.e., F1(S) >= F1 * MINSUP and F2(S) < F2 * MINSUP.
- S is a small itemset in T1 but a large itemset in T2, i.e., F1(S) < F1 * MINSUP and F2(S) >= F2 * MINSUP.
- S is a small itemset in T1 and also a small itemset in T2, i.e., F1(S) < F1 * MINSUP and F2(S) < F2 * MINSUP.

In the first and fourth cases, S is a large itemset and a small itemset in the transaction set T, respectively, while in the second and third cases it is not clear whether S is a small or a large itemset. Formally speaking, let SUPPORT(S) = MINSUP + δ, where δ >= 0 if S is a large itemset, and δ < 0 if S is a small itemset. The above four cases then have the following characteristics:
- δ1 >= 0 and δ2 >= 0
- δ1 >= 0 and δ2 < 0
- δ1 < 0 and δ2 >= 0
- δ1 < 0 and δ2 < 0

S is a large itemset if

    (F1 * (MINSUP + δ1) + F2 * (MINSUP + δ2)) / (F1 + F2) >= MINSUP,

or

    F1 * (MINSUP + δ1) + F2 * (MINSUP + δ2) >= MINSUP * (F1 + F2),

which can be written as

    F1 * δ1 + F2 * δ2 >= 0

Generally, let the transaction set T be divided into n transaction subsets Ti, 1 <= i <= n. S is a large itemset if

    Σ(i=1..n) Fi * δi >= 0

where Fi is the number of transactions in Ti and δi = SUPPORTi(S) - MINSUP, with -MINSUP <= δi <= 1 - MINSUP, 1 <= i <= n.

For those cases where Σ(i=1..n) Fi * δi < 0, there are two options: either
- discard S as a large itemset (a small itemset with no history record maintained), or
- keep it for future calculations (a small itemset with a history record maintained). In this case, S is not reported as a large itemset, but its value Σ(i=1..n) Fi * δi is maintained and checked through the future intervals.
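A minimal sketch of this episode-based bookkeeping follows (an illustration of the criterion above, not the algorithm of [33]); the per-episode count dictionaries, the function names and the MINSUP value are assumptions made for the example.

# Illustrative sketch of the episode-based support test described above.
# counts_per_episode: list of (episode_size, {itemset: count}) pairs.

MINSUP = 0.4  # assumed minimum support threshold

def weighted_delta_sum(itemset, counts_per_episode):
    """Sum of Fi * delta_i, where delta_i = SUPPORTi(S) - MINSUP."""
    total = 0.0
    for f_i, counts in counts_per_episode:
        support_i = counts.get(itemset, 0) / f_i
        total += f_i * (support_i - MINSUP)
    return total

def is_large(itemset, counts_per_episode):
    """S is large over all episodes iff the weighted delta sum is non-negative."""
    return weighted_delta_sum(itemset, counts_per_episode) >= 0

# Two consecutive episodes: S = {Bread, Milk} appears 4 times out of 6, then 1 out of 4.
episodes = [(6, {frozenset({"Bread", "Milk"}): 4}),
            (4, {frozenset({"Bread", "Milk"}): 1})]
print(is_large(frozenset({"Bread", "Milk"}), episodes))  # 6*(2/3 - 0.4) + 4*(0.25 - 0.4) = 1.0 -> True

Only the per-episode sizes and counts need to be retained between episodes, which is the point of keeping a history record for borderline itemsets.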
3.4.2 Parallel Association Mining Algorithms

Association mining is computationally and I/O intensive [2, 18]. Since the data is huge in terms of the number of items and the number of transactions, one of the main features needed in association mining is scalability. For large databases, sequential algorithms cannot provide scalability in terms of data size or runtime performance; therefore, we must rely on high-performance parallel computing. In the literature, most parallel association mining algorithms [2, 18, 21] are based on their sequential counterparts. Researchers expect parallelism to relieve current association mining methods from the sequential bottleneck, providing scalability to massive data sets and improving response time. The main challenges include synchronization and communication minimization, workload balancing, finding a good data layout and data decomposition, and disk I/O minimization. The parallel design space spans three main components:
- Memory systems
- Type of parallelism
- Load balancing

3.4.2.1 Memory Systems

Three techniques for using multiple processors have been considered [2, 18, 21, 27, 28].
- Shared memory, where all processors access common memory. In a shared memory architecture, many desirable properties can be achieved: each processor has direct and equal access to all the system's memory, and parallel programs are easy to implement on such a system. Although the shared memory architecture offers programming simplicity, the finite bandwidth of a common bus can limit scalability.
- Distributed memory, where each processor has a private memory. In a distributed memory architecture, each processor has its own local memory, which only that processor can access directly. For a processor to access data in the local memory of another processor, message passing must send a copy of the desired data elements from one processor to the other. A distributed-memory, message-passing architecture cures the scalability problem by eliminating the bus, but at the expense of programming simplicity.
- A combination of the best of the distributed and shared memory approaches. The physical memory is distributed among the nodes, but a shared global address space is provided on each processor. Locally cached data always reflects any processor's latest modification.

3.4.2.2 Type of Parallelism

Task and data parallelism are the two main paradigms for exploiting algorithm parallelism. Data parallelism corresponds to the case where the database is partitioned among P processors: logically partitioned for a shared memory architecture, physically for a distributed memory architecture.
Each processor works on its local partition of the database but performs the same computation of counting support for the global candidate itemsets. Task parallelism corresponds to the case where the processors perform different computations independently, such as counting a disjoint set of candidates, but have or need access to the entire database. In a shared memory architecture, processors have access to the entire data, but in a distributed memory architecture, the process of accessing the database can involve selective replication or explicit communication of the local portions. Hybrid parallelism, which combines both task and data parallelism, is also possible and perhaps desirable for exploiting all available parallelism in association mining methods.

3.4.2.3 Load Balancing

Two approaches are used in load balancing: static load balancing and dynamic load balancing. In static load balancing, a heuristic cost function initially partitions the work among the processors; subsequent data or computation movement is not available to correct load imbalances. Dynamic load balancing takes work from heavily loaded processors and reassigns it to lightly loaded ones. Dynamic load balancing requires additional costs for work and data movement, and also for the mechanism used to detect whether there is an imbalance. However, dynamic load balancing is essential if there is a large load imbalance or if the load changes with time. Dynamic load balancing is especially important in multi-user environments with transient loads and in heterogeneous platforms, which have different processor and network speeds. These kinds of environments include parallel servers and heterogeneous clusters, meta-clusters, and super-clusters. All extant association mining algorithms use only the static load balancing that is inherent in the initial partitioning of the database among the available nodes, because they assume a dedicated, homogeneous environment.

4. Classification

Classification is an important problem in the rapidly emerging field of data mining [10, 19, 30, 32]. The problem can be stated as follows. We are given a training dataset consisting of records. Each record is identified by a unique record id and consists of fields corresponding to the attributes. An attribute with a continuous domain is called a continuous attribute. An attribute with a finite domain of discrete values is called a categorical attribute. One of the categorical attributes is the classifying attribute or class, and the values in its domain are called class labels. Classification is the process of discovering a model for the class in terms of the remaining attributes.

4.1 Serial Algorithms for Classification

Serial algorithms for classification [10, 12, 19, 29, 30] are categorized as decision tree based methods and non-decision tree based methods. Non-decision tree based methods include neural networks, genetic algorithms, and Bayesian networks.

4.1.1 Decision Tree Based Classification

Decision tree models [5] are found to be most useful in the domain of data mining. They yield comparable or better accuracy as compared to other models such as neural networks, statistical models or genetic models [30].
Many advantages of decision trees can be cited:
- Inexpensive to construct.
- Easy to interpret.
- Easy to integrate with database systems.
- Comparable or better accuracy in many applications.

Example 4.1:

ID   Income   Marital Status   Refund   Cheat
1    95K      Married          Yes      NO
2    120K     Single           Yes      NO
3    140K     Single           No       YES
4    80K      Married          No       NO
5    160K     Divorced         Yes      NO
6    100K     Married          Yes      NO
7    90K      Single           Yes      NO
8    75K      Divorced         No       NO
9    170K     Divorced         No       YES
10   125K     Single           No       YES

A decision tree induced from this data first tests Refund: if Refund = Yes, the class is NO; if Refund = No, it tests Marital Status: if Married, the class is NO; if Single or Divorced, it tests Income: if Income <= 80K the class is NO, and if Income > 80K the class is YES.

Many algorithms have been used for constructing decision trees. Hunt's algorithm is one of the earliest versions used in building decision trees; many other algorithms, such as CART [10], ID3, C4.5 [32], SLIQ [29], and SPRINT [35], have followed. Two phases are needed to build the classification tree:
- Tree induction
- Tree pruning

4.1.1.1 Decision Tree Induction

For tree induction, records are split based on the attribute that optimizes the splitting criterion. Splitting can be done either on a categorical attribute or on a continuous attribute.

For categorical attributes, each partition has a subset of values signifying it. Two methods are used to form the partitions. The simple method uses as many partitions as there are distinct values (e.g., Marital Status split into Single, Divorced, and Married), while the complex method uses two partitions and divides the values into two subsets, one subset for each partition (e.g., Marital Status split into {Married} and {Single, Divorced}).

For continuous attributes, splitting is done either by using the static approach, where a priori discretization is used to form a categorical attribute, or by using the dynamic approach, where decisions are made as the algorithm proceeds. The dynamic approach is complex but more powerful and flexible in approximating the true dependency. Dynamic decisions are made as follows:
- Form binary decisions based on one value; two partitions: A < v and A >= v.
- Find ranges of values and use each range to partition. Ranges can be found by simple-minded bucketing, or by more intelligent clustering techniques. Example: Salary in [0, 15K), [15K, 60K), [60K, 100K), [100K, ...).
- Find a linear combination of multiple variables, and make binary decisions or range decisions based on it.

Different splitting criteria are used. CART, SLIQ and SPRINT use the GINI index as a splitting value. For n classes, the value of the GINI index at node t is calculated as

    GINI(t) = 1 - Σ(j=1..n) p(j|t)^2

where p(j|t) is the relative frequency of class j at node t. For node t, the GINI index value is maximum (i.e., 1 - 1/n) when records are equally distributed among all classes, i.e., least interesting information. The GINI index value is minimum (i.e., 0) when all records belong to one class, i.e., most interesting information.

Example 4.2: For a node with 8 records and two classes,

    Count(Class1) = 0, Count(Class2) = 8:  GINI = 0
    Count(Class1) = 1, Count(Class2) = 7:  GINI = 0.2187
    Count(Class1) = 2, Count(Class2) = 6:  GINI = 0.375
    Count(Class1) = 3, Count(Class2) = 5:  GINI = 0.4687
    Count(Class1) = 4, Count(Class2) = 4:  GINI = 0.5

Usually, when the GINI index is used, the splitting criterion is to minimize the GINI index of the split. When a node e is split into k partitions, the quality of the split is computed as

    GINIsplit = Σ(i=1..k) (ni / n) GINI(i)

where n is the number of records at node e, and ni is the number of records at partition i.
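As an illustration of these formulas (a sketch, not code from any of the cited systems), the following Python computes GINI(t) for a node and GINIsplit for a candidate binary split; the class-count representation is an assumption made for the example.

def gini(class_counts):
    """GINI(t) = 1 - sum_j p(j|t)^2 for a node with the given class counts."""
    n = sum(class_counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def gini_split(partitions):
    """Weighted GINI of a split: sum_i (n_i / n) * GINI(i)."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# Example 4.2 values:
print(round(gini([1, 7]), 4), round(gini([4, 4]), 4))   # 0.2188 0.5
# Binary split of Example 4.1 on Refund: Yes -> (5 NO, 0 YES), No -> (2 NO, 3 YES).
print(round(gini_split([(5, 0), (2, 3)]), 4))           # 0.24

The split on Refund in Example 4.1 would be preferred over any split with a larger weighted GINI value, since the criterion is to minimize GINIsplit.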
To compute the GINI index for a continuous attribute, binary decisions based on one value are used, as follows:
- Choose the splitting value; the number of possible splitting values equals the number of distinct values. Each splitting value v has a count matrix associated with it, i.e., the class counts in each of the two partitions, A < v and A >= v.
- Choose the best v: for each v, scan the database to gather the count matrix and compute its GINI index. This can be computationally inefficient (repetition of work).
- For efficient computation, for each attribute:
  o Sort the attribute on its values.
  o Linearly scan these values, each time updating the count matrix and computing the GINI index.
  o Choose the split position that has the least GINI index.

In C4.5 and ID3, another splitting criterion, based on an information or entropy measure (INFO), is used. INFO based computations are similar to GINI index computations. For n classes, the value of INFO at node t is calculated as

    INFO(t) = - Σ(j=1..n) p(j|t) log(p(j|t))

where p(j|t) is the relative frequency of class j at node t. For node t, the INFO value is maximum (i.e., log(n)) when records are equally distributed among all classes, i.e., least interesting information. The INFO value is minimum (i.e., 0) when all records belong to one class, i.e., most interesting information.

Example 4.3: For a node with 8 records and two classes (the values use base-10 logarithms),

    Count(Class1) = 0, Count(Class2) = 8:  INFO = 0
    Count(Class1) = 1, Count(Class2) = 7:  INFO = 0.16363
    Count(Class1) = 2, Count(Class2) = 6:  INFO = 0.24422
    Count(Class1) = 3, Count(Class2) = 5:  INFO = 0.28731
    Count(Class1) = 4, Count(Class2) = 4:  INFO = 0.30103

When a node e is split into k partitions, the information gain is computed as

    GAINsplit = INFO(e) - Σ(i=1..k) (ni / n) INFO(i)

where n is the number of records at node e, and ni is the number of records at partition i. The splitting criterion used with INFO is to choose the split that achieves the largest reduction in INFO (i.e., maximizes GAIN).

Decision tree based classifiers that handle large datasets improve classification accuracy. The proposed classifiers SLIQ and SPRINT use the entire dataset for classification and are shown to be more accurate than classifiers that use a sampled dataset or multiple partitions of the dataset. The decision tree model is built by recursively splitting the training set based on a locally optimal criterion until all or most of the records belonging to each of the partitions bear the same class label. Briefly, there are two phases to this process at each node of the decision tree: the first phase determines the splitting decision and the second phase splits the data. The very different nature of continuous and categorical attributes requires them to be handled in different manners. The handling of categorical attributes in both phases is straightforward; handling the continuous attributes is challenging. An efficient determination of the splitting decision, as used in most existing classifiers, requires these attributes to be sorted on values. Classifiers such as CART and C4.5 perform sorting at every node of the decision tree, which makes them very expensive for large datasets, since this sorting has to be done out-of-core. The approach taken by SLIQ and SPRINT sorts the continuous attributes only once in the beginning, and the splitting phase maintains this sorted order without requiring the records to be sorted again. The attribute lists are split in a consistent manner using a mapping between a record identifier and the node to which it belongs after splitting. SPRINT implements this mapping as a hash table, which is built on-the-fly for every node of the decision tree.
The size of this hash table is proportional to the number of records at the node. For the upper levels of the tree, this number is O(N), where N is the number of records in the training set. If the hash table does not fit in main memory, then SPRINT has to divide the splitting phase into several stages such that the hash table for each stage fits in memory. This requires multiple passes over each of the attribute lists, causing expensive disk I/O.

In the following, we summarize the features and disadvantages of some of the decision tree based classifiers.

C4.5
- Features: simple depth-first construction; sorts continuous attributes at each node.
- Disadvantages: needs the entire data to fit in memory; unsuitable for large datasets, since it needs out-of-core sorting. (Classification accuracy has been shown to improve when entire datasets are used.)

SLIQ
- Features: the arrays of the continuous attributes are pre-sorted; the classification tree is grown in a breadth-first fashion; a Class List structure maintains the record-id to node mapping. Split determining phase: the Class List is referred to for computing the best split for each individual attribute, and computations for all nodes are clubbed together for efficiency purposes (hence the breadth-first strategy). Splitting phase: the list of the splitting attribute is used to update the leaf labels in the Class List (no physical splitting of attribute lists among nodes).
- Disadvantages: the Class List is frequently and randomly accessed in both phases of tree induction, so it must reside in memory all the time for efficient performance; this limits the size of the largest training set.

SPRINT
- Features: the arrays of the continuous attributes are pre-sorted, and the sorted order is maintained during each split; the classification tree is grown in a breadth-first fashion; class information is clubbed with each attribute list; attribute lists are physically split among nodes; the split determining phase is just a linear scan of the lists at each node; a hashing scheme is used in the splitting phase, where the tids of the splitting attribute are hashed with the tree node as the key, and the remaining attribute arrays are split by querying this hash structure (lookup table).
- Disadvantages: the size of the hash table is O(N) for the top levels of the tree; if the hash table does not fit in memory (mostly true for large datasets), it is built in parts so that each part fits, requiring multiple expensive I/O passes over the entire dataset.

4.1.1.2 Decision Tree Pruning

Pruning generalizes the tree by removing its statistical dependence on the particular training set; the subtree with the least estimated error rate is chosen, leading to a compact and accurate representation. Possible strategies are:
- Separate sets for training and pruning.
- The same dataset for tree building and tree pruning.
- Cross-validation (build the tree on each subset and prune it using the remaining data); not suitable for large datasets.
- All training samples used for pruning: Minimum Description Length (MDL) based pruning.

MDL based tree pruning:
- Cost(Model, Data) = Cost(Data | Model) + Cost(Model), where the cost is the number of bits needed for encoding; search for a least cost model.
- Encode the data in terms of the number of classification errors.
- Encode the tree (model) using node encoding (the number of children) plus splitting condition encoding.
- A node is pruned to have fewer children if removing some or all of its children gives a smaller cost.

4.1.2 Non-Decision Tree Based Classification

4.1.2.1 Neural Networks

A neural network [30] is an analytic technique that mimics the working of the human brain.
Such a network consists of nodes (the neurons) and connections with certain "weights". By feeding the network with a training data set, it adjusts its weights to achieve sufficiently accurate predictions for expected future data sets. A typical network consists of an input layer (one node per input attribute), a hidden layer, and an output node for the class. The advantage of neural networks over decision trees is the speed at which they can operate; this increase in speed can be obtained by using the inherent parallel computations of the network. A major disadvantage of a neural network is the inability to backtrack the decision making process: a decision is calculated from the result of the initialisation of the network from the training set. This is opposed to decision tree methods, where the path leading to a decision is easily backtracked.

4.1.2.2 Bayesian Classifiers

The Bayesian approach is a graphical model that uses directed arcs exclusively to form a directed acyclic graph [13]. Although the Bayesian approach uses probabilities and a graphical means of representation, it is also considered a type of classification. Bayesian networks are typically used when the uncertainty associated with an outcome can be expressed in terms of a probability. This approach relies on encoded domain knowledge and has been used for diagnostic systems. The basic features of Bayesian networks are:
- Each attribute and the class label are random variables.
- The objective is to classify a given record of attributes (A1, A2, ..., An) to the class C such that P(C | A1, A2, ..., An) is maximal.
- Naive Bayesian approach:
  o Assume independence among the attributes Ai.
  o Estimate P(Ai | Cj) for all Ai and Cj.
  o A new point is classified to Cj if P(Cj) Πi P(Ai | Cj) is maximal.
- Generic approach based on Bayesian networks:
  o Represent dependencies using a directed acyclic graph (each child is conditioned on all its parents); the class variable is a child of all the attributes.
  o The goal is to get a compact and accurate representation of the joint probability distribution of all variables.

Learning Bayesian networks is an active research area.
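The following is a minimal sketch of the naive Bayesian approach described above (an illustration, not code from the cited literature); the toy records, the add-one smoothing, and the categorical encoding are assumptions made for the example.

from collections import Counter, defaultdict

def train_naive_bayes(records, labels):
    """Estimate P(Cj) and P(Ai = v | Cj) from categorical records."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)          # (attribute index, class) -> value counts
    for rec, c in zip(records, labels):
        for i, v in enumerate(rec):
            value_counts[(i, c)][v] += 1
    return class_counts, value_counts

def classify(rec, class_counts, value_counts):
    """Pick Cj maximizing P(Cj) * prod_i P(Ai | Cj), with add-one smoothing."""
    n = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n
        for i, v in enumerate(rec):
            counts = value_counts[(i, c)]
            score *= (counts[v] + 1) / (cc + len(counts) + 1)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy version of Example 4.1: attributes are (Refund, Marital Status).
records = [("Yes", "Married"), ("Yes", "Single"), ("No", "Single"),
           ("No", "Married"), ("No", "Divorced")]
labels = ["NO", "NO", "YES", "NO", "YES"]
model = train_naive_bayes(records, labels)
print(classify(("No", "Single"), *model))   # expected: YES

The independence assumption keeps the model to a handful of one-dimensional counts, which is why the naive variant is so cheap compared with learning a full Bayesian network.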
4.2 Parallel Formulations of Classification Algorithms

The memory limitations faced by serial classifiers and the need to classify much larger datasets in shorter times make the classification algorithm an ideal candidate for parallelization. The parallel formulation, however, must address the issues of efficiency and scalability in both memory requirements and parallel runtime [18, 26, 28, 36]. In constructing a parallel classifier, two partitioning philosophies should be considered:
- Partitioning of the data only, where a large number of classification tree nodes gives a high communication cost.
- Partitioning of the classification tree nodes, which satisfies the natural concurrency, but
  o leads to load imbalance, as the amount of work associated with each node varies, and
  o child nodes use the same data as the parent node, which leads to loss of locality and a high data movement cost.

For categorical attributes, algorithms are classified into three different classes, whose features are summarized below.
- Synchronous Tree Construction Approach: no data movement is required, but the communication cost becomes high as the tree becomes bushy.
- Partitioned Tree Construction Approach: processors work independently once partitioned completely, but there is load imbalance and a high cost of data movement.
- Hybrid Algorithm: combines the good features of the two approaches and adapts dynamically according to the size and shape of the trees.

For continuous attributes, three different approaches are used:
- Sort continuous attributes at each node of the tree, as in C4.5.
- Discretize continuous attributes, as in SPEC [36] (Srivastava, Han, Kumar, and Singh, 1997).
- Use a pre-sorted list for each continuous attribute, as in SPRINT [35] (Shafer, Agrawal, and Mehta, VLDB'96) and ScalParC [26] (Joshi, Karypis, and Kumar, IPPS'98).

Many parallel formulations of decision tree based classifiers have been developed [28, 35, 36]. Among these, the most relevant one is the parallel formulation of SPRINT, as it requires sorting of the continuous attributes only once. SPRINT's design allows it to parallelize the split determining phase effectively. The parallel formulation proposed for the splitting phase, however, is inherently unscalable in both memory requirements and runtime. It builds the required hash table on all processors by gathering the record_id-to-node mapping from all the processors. For this phase, the communication overhead per processor is O(N), where N is the number of records in the training set. Apart from the initial sorting phase, the serial runtime of a classifier is O(N); hence, SPRINT is unscalable in runtime. It is unscalable in memory requirements also, because the memory requirement per processor is O(N): the size of the hash table is of the same order as the size of the training dataset for the upper levels of the decision tree, and it resides on every processor.

5. Clustering

Clustering is grouping points in some space into a small number of clusters [18, 24, 38, 41], each cluster consisting of points that are "near" in some sense.

Example 5.1: Documents may be thought of as points in a high-dimensional space, where each dimension corresponds to one possible word. The position of a document in a dimension is the number of times the word occurs in the document (or just 1 if it occurs, 0 if not). Clusters of documents in this space often correspond to groups of documents on the same topic.

Example 5.2: Many years ago, during a cholera outbreak in London, a physician plotted the location of cases on a map. The data indicated that cases clustered around certain intersections, where there were polluted wells, not only exposing the cause of cholera, but indicating what to do about the problem.

5.1 Distance Measures

To consider whether a set of points is close enough to be considered a cluster, we need a distance measure D(x, y) that tells us how far apart points x and y are. The usual axioms for a distance measure D are:
- D(x, x) = 0: a point is at distance 0 from itself.
- D(x, y) = D(y, x): D is symmetric.
- D(x, y) <= D(x, z) + D(z, y): the triangle inequality.

Often, our points may be thought of as living in a k-dimensional Euclidean space, and the distance between any two points, say x = [x1, x2, ..., xk] and y = [y1, y2, ..., yk], is given in one of the usual manners:
- Common distance ("L2 norm"): sqrt( Σ(i=1..k) (xi - yi)^2 )
- Manhattan distance ("L1 norm"): Σ(i=1..k) |xi - yi|
- Max of dimensions ("L∞ norm"): max(i=1..k) |xi - yi|

In some cases, when there is no Euclidean space in which to place the points, we need other forms of distance measures.

Example 5.3: In DNA sequences, two sequences may be similar even though there are some insertions and deletions as well as changes in some characters. For two sequences such as abcde and bcdxye (which have no positions in common, and do not even have the same length), we can define the distance function D(x, y) = |x| + |y| - 2 |LCS(x, y)|, where LCS stands for the longest common subsequence of x and y. In our example, LCS(abcde, bcdxye) is bcde, of length 4, so D(abcde, bcdxye) = 5 + 6 - 2 × 4 = 3; i.e., the strings are fairly close.
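As a small illustration of such a non-Euclidean measure (a sketch, with a standard dynamic-programming LCS as an assumed helper), the distance of Example 5.3 can be computed as follows:

def lcs_length(x, y):
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, a in enumerate(x, 1):
        for j, b in enumerate(y, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if a == b else max(table[i - 1][j], table[i][j - 1])
    return table[len(x)][len(y)]

def lcs_distance(x, y):
    """D(x, y) = |x| + |y| - 2 |LCS(x, y)|, as in Example 5.3."""
    return len(x) + len(y) - 2 * lcs_length(x, y)

print(lcs_distance("abcde", "bcdxye"))   # 5 + 6 - 2*4 = 3

Any such measure can be plugged into the clustering algorithms below that do not require a Euclidean space, such as GRGPF.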
5.2 Approaches to Clustering

Clustering algorithms [24, 25, 38] are divided into two broad classes:
- Centroid approaches: the centroid or central point of each cluster is estimated, and points are assigned to the cluster of their nearest centroid.
- Hierarchical approaches: starting with each point as a cluster by itself, nearby clusters are repeatedly merged.

Clustering algorithms can also be classified according to whether or not they assume a Euclidean distance, and whether they use a centroid or a hierarchical approach. The following three algorithms are examples of this classification:
- BFR: centroid based; assumes a Euclidean measure, with clusters formed by a Gaussian process in each dimension around the centroid.
- GRGPF: centroid based, but uses only a distance measure, not a Euclidean space.
- CURE: hierarchical and Euclidean; this algorithm deals with odd-shaped clusters.

In the following subsections, we give some of the popular clustering algorithms.

5.3 Centroid Clustering

5.3.1 The k-Means Algorithm

The k-Means algorithm [24, 25, 38] is a popular main-memory algorithm. k cluster centroids are picked, and points are assigned to the clusters by picking the closest centroid to the point in question. As points are assigned to clusters, the centroid of a cluster may migrate.

Example 5.4: Consider five points (1-5) in two-dimensional space, and let a, b and c denote successive centroid positions. For k = 2, points 1 and 2 are assigned to the two clusters, and become their centroids for the moment. For point 3, suppose it is closer to 2, so 3 joins the cluster of 2, whose centroid moves to the point indicated as a. Suppose that point 4 is closer to 1 than to a, so 4 joins 1 in its cluster, whose center moves to b. Finally, 5 is closer to a than to b, so it joins the cluster {2, 3}, whose centroid moves to c.
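The following is a minimal sketch of the k-Means idea just described (an illustration under simplifying assumptions: two-dimensional points, Euclidean distance, seeding with the first k points, and a fixed number of refinement iterations):

from math import dist  # Euclidean distance between two points (Python 3.8+)

def k_means(points, k, iterations=10):
    """Assign points to the nearest of k centroids, then recompute the centroids."""
    centroids = list(points[:k])                      # seed with the first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:                               # centroid = mean of the cluster
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)

This batch version recomputes centroids after each full pass, whereas the incremental variant of Example 5.4 moves a centroid as soon as a point joins its cluster; both converge to centroids that summarize the clusters.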
5.3.2 The BFR Algorithm

Based on k-means, this algorithm [38] reads its data once, consuming one main-memory-full at a time. The algorithm works best if the clusters are normally distributed around a central point, perhaps with a different standard deviation in each dimension. The data belonging to a typical two-dimensional cluster can be pictured as points scattered around a centroid, marked by +, with the standard deviation in the horizontal dimension being twice what it is in the vertical dimension; about 70% of the points will lie within the 1σ ellipse, 95% within 2σ, 99.9% within 3σ, and 99.9999% within 4σ.

A cluster consists of:
- A central core, the Discard Set (DS). This set of points is considered certain to belong to the cluster. All the points in this set are replaced by some simple statistics, described below. Although called "discarded" points, these points in truth have a significant effect throughout the running of the algorithm, since they determine collectively where the centroid is and what the standard deviation of the cluster is in each dimension.
- Surrounding sub-clusters, the Compression Set (CS). Each sub-cluster in the CS consists of a group of points that are sufficiently close to each other that they can be replaced by their statistics, just like the DS for a cluster is. However, they are sufficiently far away from any cluster's centroid that we are not yet sure which cluster they belong to.
- Individual points that are not part of a cluster or sub-cluster, the Retained Set (RS). These points can neither be assigned to any cluster nor grouped into a sub-cluster of the CS. They are stored in main memory, as individual points, along with the statistics of the DS and CS.

The statistics used to represent each cluster of the DS and each sub-cluster of the CS are:
- The count of the number of points, N.
- The vector of sums of the coordinates of the points in each dimension, called SUM; the component in the ith dimension is SUMi.
- The vector of sums of squares of the coordinates of the points in each dimension, called SUMSQ; the component in dimension i is SUMSQi.

For k dimensions, these 2k + 1 values are enough to compute the important statistics of a cluster or sub-cluster. The mean and variance in each dimension are:
- The coordinate of the centroid of the cluster in dimension i is SUMi / N.
- The variance in dimension i is SUMSQi / N - (SUMi / N)^2.

5.3.3 Fastmap

Fastmap [25] picks k pairs of points (ai, bi), each of which serves as the "ends" of one of the k axes of a k-dimensional space. Using the law of cosines, we can calculate the "projection" x of any point c onto the line ab, using only the distances between points, not any assumed coordinates of these points in a plane. The formula is

    x = (D^2(a, c) + D^2(a, b) - D^2(b, c)) / (2 D(a, b))

Having picked a pair of points (a, b) as an axis, part of the distance between any two points c and d is accounted for by the projections of c and d onto the line ab, and the remainder of the distance is in other dimensions. If the projections of c and d are x and y, respectively, then in the future (as we select other axes) the distance Dcurrent(c, d) should be related to the given distance function D by

    Dcurrent(c, d)^2 = D^2(c, d) - (x - y)^2

The Fastmap algorithm computes for each point c its k projections c1, c2, ..., ck onto the k axes, which are determined by the pairs of points (a1, b1), (a2, b2), ..., (ak, bk). For i = 1, 2, ..., k, do the following:
- Using the current distance Dcurrent, pick ai and bi as follows:
  o Pick a random point c.
  o Pick ai to be the point as far as possible from c, using the distance Dcurrent.
  o Pick bi to be the point as far as possible from ai.
- For each point x, compute xi using the law-of-cosines formula described above.
- Change the definition of Dcurrent to subtract the distance in the ith dimension as well as the previous dimensions; that is,

      Dcurrent(x, y) = sqrt( D^2(x, y) - Σ(j=1..i) (x(j) - y(j))^2 )

5.4 Hierarchical Clustering

Hierarchical clustering [25, 38] is a general technique that can take, in the worst case, O(n^2) time to cluster n points. The general outline of this approach is as follows.
- Start with each point in a cluster by itself.
- Repeatedly select two clusters to merge. In general, we want to pick the two clusters that are closest, but there are various ways we could measure "closeness". Some possibilities:
  o The distance between their centroids (or, if the space is not Euclidean, between their clustroids).
  o The minimum distance between nodes in the clusters.
  o The maximum distance between nodes in the clusters.
  o The average distance between nodes of the clusters.
- End the merger process when we have "few enough" clusters. Possibilities:
  o Use a k-means approach: merge until only k clusters remain.
  o Stop merging clusters when the only clusters that can result from merging fail to meet some criterion of compactness, e.g., the average distance of nodes to their clustroid or centroid is too high.
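A minimal sketch of this bottom-up merging follows (an illustration assuming Euclidean points, centroid distance as the closeness measure, and a target number of clusters k as the stopping rule):

from math import dist
from itertools import combinations

def hierarchical_clustering(points, k):
    """Repeatedly merge the two clusters with the closest centroids until k remain."""
    clusters = [[p] for p in points]              # start: every point is its own cluster
    def centroid(cluster):
        return tuple(sum(c) / len(cluster) for c in zip(*cluster))
    while len(clusters) > k:
        # pick the pair of clusters whose centroids are closest
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: dist(centroid(clusters[ij[0]]), centroid(clusters[ij[1]])))
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9), (9.0, 1.0)]
print(hierarchical_clustering(points, k=3))

Replacing the centroid distance with the minimum, maximum, or average inter-point distance, or with a clustroid-based measure, gives the other variants listed above without changing the overall merge loop.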
5.4.1 The GRGPF Algorithm

This algorithm [25, 38] assumes there is a distance measure D, but no Euclidean space. It also assumes that there is too much data to fit in main memory. The data structure it uses to store clusters is like an R-tree. Nodes of the tree are disk blocks, and we store different things at leaf and interior nodes.

In leaf blocks, we store cluster features that summarize a cluster in a manner similar to BFR. However, since there is no Euclidean space, the features are somewhat different, as follows:
- The number of points in the cluster, N.
- The clustroid: the point in the cluster that minimizes the rowsum, i.e., the sum of the squares of the distances to the other points of the cluster. If C is a cluster, C' will denote its clustroid; thus the rowsum of the clustroid is

      rowsum(C') = Σ(x in C) D(C', x)^2

  Notice that the rowsum of the clustroid is analogous to the statistic SUMSQ that was used in BFR; however, SUMSQ is relative to the origin of the Euclidean space, while GRGPF assumes no such space. The rowsum can be used to compute a statistic, the radius of the cluster, that is analogous to the standard deviation of a cluster in BFR. The formula is

      radius = sqrt( rowsum(C') / N )

- The p points in the cluster that are closest to the clustroid, and their rowsums, for some chosen constant p.
- The p points in the cluster that are farthest from the clustroid.

In interior nodes, we keep samples of the clustroids of the clusters represented by the descendants of the tree node. An effort is made to keep the clusters in each subtree close. As in an R-tree, the interior nodes thus give information about the approximate region in which the clusters at their descendants are found. When we need to insert a point into some cluster, we start at the root and proceed down the tree, choosing only those paths along which a reasonably close cluster might be found, judging from the samples at each interior node.

5.4.2 CURE

The outline of the CURE algorithm [38] is:
1. Start with a main memory full of random points, and cluster these points using the hierarchical approach.
2. For each cluster, choose c "sample" points for some constant c. These points are picked to be as dispersed as possible, then moved slightly closer to the mean, as follows:
   o Pick the first sample point to be the point of the cluster farthest from the centroid.
   o Repeatedly pick additional sample points by choosing the point of the cluster whose minimum distance to an already chosen sample point is as great as possible.
   o When c sample points are chosen, move all the samples toward the centroid by some fractional distance, e.g., 20% of the way toward the centroid. As a result, the sample points need not be real points of the cluster, but that fact is unimportant. The net effect is that the samples are "typical" points, well dispersed around the cluster, no matter what the cluster's shape is.
3. Assign all points, including those involved in steps (1) and (2), to the nearest cluster, where "nearest" means shortest distance to some sample point.

Example 5.5: For c = 6, six sample points are picked from an elongated cluster, and then moved 20% of the way toward the centroid.

6. Conclusions

In this paper, we have reviewed various data mining techniques. The basic features that most data mining techniques should consider are: all approaches deal with large amounts of data, efficiency is required due to the volume of data, and accuracy is an essential element in the data mining process.
6. Conclusions
In this paper, we have reviewed various data mining techniques. The basic requirements that most data mining techniques must address are that all approaches deal with large amounts of data, that efficiency is essential because of that volume, and that accuracy is an indispensable element of the data mining process. Many effective data mining techniques in association mining, classification and clustering have been discussed.
Parallelism is an important trend in all data mining tasks; exploiting it in the data mining step promises to accelerate the whole KDD process and to improve its results. Most parallel data mining techniques are derived from sequential ones, and the methodologies they use could lead to fewer and shorter iterations within the knowledge discovery process loop. Ultimately, however, data mining in parallel and distributed environments should be carried out by a new generation of techniques designed specifically for such environments.
Another issue, related to the nature of the data, should also be considered. The choice of data mining technique should depend on parameters describing the distribution and homogeneity of the data: a technique that gives excellent performance on some data volumes may perform poorly on others.

References
[1] R. Agrawal, T. Imielinski and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," Proc. ACM-SIGMOD Int. Conf. on Management of Data, Washington, D.C., 1993.
[2] R. Agrawal and J.C. Shafer, "Parallel Mining of Association Rules," IEEE Trans. on Knowledge and Data Eng., 8(6):962-969, December 1996.
[3] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. of 20th VLDB Conference, 1994.
[4] R. Agrawal and R. Srikant, "Mining Sequential Patterns," Proc. 11th Intl. Conf. on Data Engineering, Taipei, Taiwan, March 1995.
[5] M. Ankerst, M. Ester and H.P. Kriegel, "Visual Classification: An Interactive Approach to Decision Tree Construction," Proc. Int. Conf. on Knowledge Discovery and Data Mining, pp. 392-397, 1999.
[6] M. Ankerst, M. Ester and H.P. Kriegel, "Towards an Effective Cooperation of the User and the Computer for Classification," Proc. Int. Conf. on Knowledge Discovery and Data Mining, pp. 178-188, 2000.
[7] M. Berry and G. Linoff, Data Mining Techniques (For Marketing, Sales, and Customer Support), John Wiley & Sons, 1997.
[8] P. Bollmann-Sdorra, A.M. Hafez and V.V. Raghavan, "A Theoretical Framework for Association Mining Based on the Boolean Retrieval Model," DaWaK 2001, September 2001.
[9] R. Brachmann and T. Anand, "The Process of Knowledge Discovery in Databases: A Human-Centered Approach," Advances in Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, CA, pp. 37-58.
[10] L. Breiman, J.H. Friedman, R.A. Olshen and C.J. Stone, Classification and Regression Trees, Wadsworth, Belmont, 1984.
[11] S. Brin et al., "Dynamic Itemset Counting and Implication Rules for Market Basket Data," Proc. ACM SIGMOD Conf. on Management of Data, ACM Press, New York, pp. 255-264, 1997.
[12] W. Buntine, "A Guide to the Literature on Learning Probabilistic Networks from Data," IEEE Transactions on Knowledge and Data Engineering, 8(2):195-210, April 1996.
[13] W. Buntine, "Graphical Models for Discovering Knowledge," in U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurasamy (eds.), Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT Press, 1996.
[14] V. Cherkassky and F. Mulier, Learning from Data, John Wiley & Sons, 1998.
[15] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurasamy (eds.), Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT Press, 1996.
Smyth, "From Data Mining to Knowledge Discovery: An Overview," Advances in Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, CA, pp.1-30. U. Fayyad, G. Piatetsky-Shapiro and P. Smyth, "The KDD process for extracting useful knowledge from volumes of data," Communications of the ACM, 39(1). A. Freitas and S. Lavington, Mining Very Large Databases with Parallel Processing, Kluwer Academic Publishers, 1998. N. Friedman, D. Geiger and M. Goldszmidt, ‘Bayesian Network Classifiers,” Machine Learning 29:131-163, 1997. AM. Hafez, “A Dynamic Approach for Knowledge Discovery of Web Access Patterns”, ISMIS 2000, pp. 130-138. A.M. Hafez, “Association mining of dependency between time series,” Proceedings of SPIE Vol. 4384, SPIE AeroSense, April 2001. A.M. Hafez and V.V. Raghavan, "A Matrix Approach for Association Mining," the ISCA 10th International Conference on Intelligent Systems, June 13-15, 2001. E.H. Han, G. Karypis and V. Kumar, “Scalable Parallel Data Mining for Association Rules,” Proc. 1997 ACM-SIGMOD Int. Conf. On Management of Data, Tucson, Arizona, 1997. 31 [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] E.H. Han, G. Karypis, V.Kumar and B. Mobasher, “Clustering Based On Association Rule Hypergraphs,” SIGMOD’97 Workshop on Research Issues on Data Mining and Knowledge Discovery. A.K. Jain and R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988. M. Joshi, G. Karypis and V. Kumar, “ScalParC: A New Scalable and Efficient Parallel Classification Algorithms for Mining Large Datasets,” Proc. 12th International Parallel Processing Symposium (IPPS), Orlando, 1998. M. Joshi, G. Karypis and V. Kumar, “Parallel Algorithms for Sequential Associations: Issues and Challenges,” Minisymposium Talk at Ninth SIAM International Conference on Parallel Processing (PP’99), San Antonio, 1999. V. Kumar, A. Grama, A. Gupta and G.Karypis, Introduc-tion to Parallel Computing: Algorithm Design and Analysis. Benjamin-Cummings/Addison Wesley, Redwood City, CA, 1994. M. Mehta, R. Agarwal and J. Rissanen, “SLIQ: A fast scalable classifier for data mining,” In Proc. of 5th In-ternational Conference on Extending Database Technology (EBDT), Avignon, France, March 1996. D. Michie, D.J. Spiegelhalter and C. C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994. D. Pyle, Data Preparation for Data Mining, Morgan Kaufmann, 1999. J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993. V.V. Raghavan and A.M. Hafez, “Dynamic Data Mining,“ IEA/AIE 2000, pp.220-229, 2000. A. Savasere, E. Omiecinski and S. Navathe, "An Efficient Algorithm for Mining Association Rules in large Databases," Proc. 21st Int'l Conf. of Very Large Data Bases, 1995 J. Shafer, R. Agarwal and M. Mehta, “SPRINT: A scalable parallel classifier for data mining,” In Proc. of 22nd Interna-tional Conference on Very Large Databases, Mumbai, India, September 1996. A. Srivastava, E.H. Han, V. Kumar and V. Singh, “Parallel Formulations of Decision-Tree Classification Algorithms,” Proc. 12th International Parallel Processing Symposium (IPPS), Orlando, 1998. R. Srikant and R. Agrawal, “Mining Sequential Patterns: Generalizations and Performance Improvements,” Proc. of 5th Int. Conf. On Extending Database Technology (EDBT), Avignon, France, 1996. A.K. Tung, J. Han, L.V. Lakshmanan and R.T. Ng, "Constraint-based Clustering in Large Databases," Proc. Int. Conf. on Database Theory, pp. 405-419, 2001. K. Wang, Y. He and J. 
Han, "Mining Frequent Itemsets Using Support Constraints," Proc. 26th Int. Conf. On very Large Data Bases, pp. 43-52, 2000. M. Ware, E. Frank, G. Holmes, M. Hall and I.H. Witten, "Interactive Machine Learning Letting Users Build Classifiers," http://www.cs.waikato.ac.nz/ ml/ publications.html S. M. Weiss and N. Indurkhya, Predictive Data Mining (a practical guide), Morgan Kaufmann Publishers,1998. P.C. Wong, "Visual Data Mining," IEEE Computer Graphics and Applications, Vol. 19(5), pp. 20-12, 1999. 32