An Efficient Architecture for Distributed Data Mining

Abstract: Classification is a central topic in data mining and artificial intelligence, and many methods have been proposed by researchers in machine learning, business research, and consulting systems. This paper first introduces several important existing classification techniques, namely C4.5, SLIQ, and SPRINT; it then reviews serial and parallel approaches; finally, it discusses distributed parallel classification computing techniques. The objective of this paper is to improve the efficiency and accuracy of existing classification techniques by examining the advantages and disadvantages of several of them. The construction of a classifier is the main focus of this paper; other details, such as decision tree pruning, are not covered. Many approaches have been proposed to improve existing classification algorithms. Among them, C4.5 is recognized as one of the most popular classification methods. However, it carries a strong restriction: the training dataset must fit into main memory. Handling large training datasets is therefore the main consideration in improving efficiency and accuracy, and a distributed system environment is the ultimate solution to the problem of huge training datasets. How to partition the training dataset among distributed hosts, and how to deal with the communication overhead among shared-nothing processors, are also keys to improving the efficiency of parallel classification. We try to find an ideal algorithm that takes advantage of all existing parallel algorithms in a distributed environment, in order to reach the highest efficiency and accuracy when building a decision tree. A new parallel distributed system based on agent techniques appears to be such a solution, and the implementation of this system will be presented in a future study.

1) Requirements

Efficiency: As we deal with very large real-world databases, the efficiency and scalability of classification algorithms become major concerns of the data mining community. Many real-world mining applications deal with very large training sets, i.e., millions of samples, yet existing data mining algorithms have a strong limitation: the training samples must reside in main memory. There is a critical tradeoff between efficiency and accuracy. Scaling decision tree construction beyond this limitation leads directly to inefficiency, because training samples must be swapped frequently between main memory and cache. Early strategies for inducing trees from large databases include discretizing continuous attributes and sampling the data at each node; these, however, still assume that the training set can fit in memory. An alternative method first partitions the data into subsets that individually fit into memory, and then builds a decision tree from each subset; the final classifier combines the classifiers obtained from the subsets. Although this method allows the classification of large data sets, its classification accuracy is not as high as that of a single classifier built from all of the data at once.

Accuracy
Influencing power
Distribution
Evolving: The enhancement of mining techniques is a continuous process, with the volume of data growing every day. The measure of a good mining algorithm is its ability to perform 'effectively' in mining useful knowledge and 'efficiently' in computational terms. This is a constantly evolving process (Cooley, 1997).
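As an aside, the partition-then-combine strategy described above can be sketched as follows. This is only an illustration, not the architecture proposed later in this paper; it uses scikit-learn's DecisionTreeClassifier as a stand-in for any in-memory tree builder, and all function names are hypothetical.

    import numpy as np
    from collections import Counter
    from sklearn.tree import DecisionTreeClassifier

    def train_on_partitions(X, y, n_partitions):
        """Build one decision tree per memory-sized data partition."""
        trees = []
        for X_part, y_part in zip(np.array_split(X, n_partitions),
                                  np.array_split(y, n_partitions)):
            trees.append(DecisionTreeClassifier().fit(X_part, y_part))
        return trees

    def predict_by_vote(trees, X):
        """Combine the partition classifiers by majority vote."""
        votes = np.stack([t.predict(X) for t in trees])      # (n_trees, n_samples)
        return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])

    # Tiny usage example with synthetic data (illustrative only).
    X = np.random.rand(300, 4)
    y = (X[:, 0] > 0.5).astype(int)
    trees = train_on_partitions(X, y, n_partitions=3)
    print(predict_by_vote(trees, X[:5]))

As the text notes, such a combined classifier is typically less accurate than a single tree built from all of the data at once.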
2) Background: Decision Tree

A decision tree, one of the classification techniques, has been improved by a series of studies: decision tree generation [Hunt et al.], CART [Breiman et al., 1984], and C4.5 [Quinlan, 1993]. A decision tree is a tree in which each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class distribution. The existing decision tree algorithms, such as ID3 and C4.5, are well established for relatively small data sets. A decision tree is constructed by a recursive divide-and-conquer algorithm. Most decision tree classifiers perform classification in two phases: tree building and tree pruning. Compared with the time taken by the tree-building phase, the time spent on pruning is far smaller, roughly one percent of the tree-building time. For the rest of this paper, the tree-building phase is the main topic of discussion. A summary of the tree-building schema follows (assuming a binary decision tree is being built):

Maketree(training data T)
    Partition(T);

Partition(data S)
    if (all points in S are in the same class) then return;
    evaluate splits for each attribute A;
    use the best split found to partition S into S1 and S2;
    Partition(S1);
    Partition(S2);

Let F be the dataset of interest. If all the outcomes in F are the same, all instances of F belong to the same class and F becomes a leaf. Otherwise, all attributes in F are examined, a test attribute A is selected, and F is split on this test attribute: let A have mutually exclusive outcomes A1, A2, A3, ..., An, and let Fi be the subset of F containing those instances with outcome Ai, 1 <= i <= n. The decision tree for F then has A as its root, with a subtree for each outcome Ai of A. If Fi is empty, the subtree corresponding to outcome Ai is a leaf that nominates the majority class in F (that is, a leaf identifying the most frequent class among the instances); otherwise, the subtree for Ai is obtained by applying the same procedure to the subset Fi of F.

There are several computational models for determining the partition criterion. Information gain, gain ratio, the Gini index, the chi-square contingency test, and the G-statistic are statistical formulas for choosing a test attribute. Information gain [Quinlan, 1986] and the Gini index [Breiman et al., 1984], the two most popular measures, are introduced here, since some of the techniques discussed in this paper use them alternatively. Breiman et al. [1984] determine the impurity of a set of instances from its class distribution as follows:

Gini(S) = 1 - sum_j (p_j)^2    (1)
Gini_split(S) = (N1/N) * Gini(S1) + (N2/N) * Gini(S2)    (2)

In these formulas, S stands for a data set containing N examples from k classes, p_j stands for the relative frequency of class j in S, and Gini_split(S) stands for the index of the divided data when S is divided into two subsets S1 and S2 containing N1 and N2 examples, respectively. The advantage of the Gini index is that its calculation requires only the distribution of the class values in each of the partitions, so we can simply scan each of a node's attribute lists and evaluate split points on that attribute to find the best test attribute for the node. The Gini index of a set of instances assumes its minimum value of zero when all instances belong to a single class. A small numerical sketch of the Gini computation is given below.
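As a minimal, illustrative sketch of formulas (1) and (2) (the function names are not from the paper), the Gini index of a candidate binary split can be computed from class counts alone:

    from collections import Counter

    def gini(labels):
        """Gini(S) = 1 - sum_j p_j^2, computed from class frequencies."""
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def gini_split(left_labels, right_labels):
        """Gini_split(S) = (N1/N)*Gini(S1) + (N2/N)*Gini(S2)."""
        n1, n2 = len(left_labels), len(right_labels)
        n = n1 + n2
        return (n1 / n) * gini(left_labels) + (n2 / n) * gini(right_labels)

    # Example: a pure subset has Gini index 0, an evenly mixed one 0.5.
    assert gini(["A", "A", "A"]) == 0.0
    print(gini_split(["A", "A", "B"], ["B", "B"]))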
The information gain measure [Quinlan, 1993] is the most popular measure used in choosing the test attribute. The attribute with the highest information gain (or greatest entropy reduction) is chosen as the test attribute, as follows:

Info(S) = - sum_{j=1..k} ( freq(Cj, S) / |S| ) * log2( freq(Cj, S) / |S| ) bits    (3)
Info_X(T) = sum_i ( |Ti| / |T| ) * Info(Ti)    (4)
Gain(X) = Info(T) - Info_X(T)    (5)

In formula (3), Info(S) sums over the k classes Cj in proportion to their frequencies in S; this quantity is also known as the entropy of the set S. Now consider a similar measurement after T has been partitioned in accordance with the n outcomes of a test X: the expected information requirement, given in formula (4), is found as the weighted sum over the subsets Ti and measures the average amount of information still needed to identify the class of a case in T. The gain of test X is then given by formula (5).

Many researchers have pointed out the strengths of C4.5 [SPRINT, 19XX] as follows. First, the construction of decision tree classifiers is relatively fast compared to other classification methods, such as neural networks, which require extremely long training times even for small datasets. Second, a decision tree is an intuitive representation: it makes it easy to understand how a decision framework is built from the distribution of the available attributes, and it can be converted into classification rules or even into SQL queries for accessing databases. More interestingly, the accuracy of tree classifiers is still comparable to that of other classification methods. On the other hand, many have pointed out the restriction that the training data must reside in main memory. In data mining applications, very large training sets are common; hence this restriction limits the scalability of such algorithms, and decision tree construction can become inefficient due to swapping of the training samples in and out of main memory and cache [Classification by Decision Tree Induction, p. 293]. Real-world datasets can be very large, even up to trillions of records. Furthermore, large datasets are desirable for improving the accuracy of the classification model. Developing efficient classification methods for very large datasets is therefore an essential and challenging problem. In recent years, several techniques have been proposed to handle large datasets that cannot fit into main memory.

Selecting an appropriate training set: A technique called windowing was introduced to try to remove the main-memory restriction. A subset of the training cases, called a window, is selected randomly and a decision tree is developed from it. The tree produced from this subset is then used to classify the training cases outside the window; the results show that some of them are misclassified. A selection of these exceptions is then added to the initial window, and a second tree is built to classify the remaining cases. This cycle is repeated until a tree built from the current window correctly classifies all of the training cases outside the window. The final window can be thought of as a screened set of training cases that contains all the "interesting" ones, together with sufficient "ordinary" cases to guide the tree building. However, there are some problems with windowing. As Quinlan states, the cycle is repeated until a tree built from the current window correctly classifies all the training cases outside the window; how long will this process of constructing a decision tree last? Certainly it will take much longer than building a single decision tree from the whole training set.
Moreover, the training cases used in C4.5-like experiments were mostly free of noise; for most real-world classification domains, the process is even slower. In addition, it is possible that the cycle will never terminate, and will run out of main memory, because of noisy data.

Data partition/distribution: Another family of approaches partitions the dataset across different hosts; some of them try to retrieve a small set of useful data from a large noisy dataset, build small decision trees from small subsets of the whole training set, and combine them into a final decision tree. Chan and Stolfo [Chan, XX] studied the method of partitioning the input data and then building a classifier for each partition; the outputs of the multiple classifiers are then combined to obtain the final classification. Their results show that classification using multiple classifiers never achieves the accuracy of a single classifier that classifies all of the data [XX]. Quinlan et al. [Quinlan, XX] also introduced tree-partition approaches: first, growing several alternative trees and selecting the tree with the lowest predicted error rate; second, growing several trees, generating production rules from all of them, and then constructing a single production-rule classifier from all the available rules. However, both approaches described here take much longer to produce a final decision tree.

Efficient algorithms and data structures: A third direction improves computational performance through efficient data structures and algorithms. [XX] pointed out that the bottleneck of C4.5 is its inefficient computation: because a linear search is used to find the threshold for continuous attributes over the whole training set, the program is slowed down. This choice is forced, since the cases in the training set may not be ordered with respect to the attribute selected for the test. Two improvements are possible: first, a binary search over thresholds speeds up the linear search used in C4.5; second, a counting sort can be adopted instead of C4.5's Quicksort.

SLIQ is a decision tree classifier that can handle both numeric and categorical attributes. It uses a novel pre-sorting technique in the tree-growth phase; this sorting procedure is integrated with a breadth-first tree-growing strategy to enable classification of disk-resident datasets. The dataset is partitioned vertically, into attribute lists. Each entry in an attribute list carries a RID (record identifier), which can be thought of as a pointer connecting that entry with the corresponding entry in the class list (see Figure 4).
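As a minimal, illustrative sketch of this layout (assuming in-memory Python structures and hypothetical names; SLIQ itself keeps the attribute lists on disk and only the class list memory-resident), the attribute lists and the class list can be linked through RIDs as follows:

    # Each attribute list holds (value, rid) pairs, sorted once by value (SLIQ
    # only requires this pre-sorting for continuous attributes); the class list
    # maps rid -> [class label, current leaf].
    def build_attribute_lists(records, attribute_names, class_name):
        """records: list of dicts, e.g. {"age": 30, "salary": 65, "class": "G"}."""
        attribute_lists = {
            a: sorted((r[a], rid) for rid, r in enumerate(records))
            for a in attribute_names
        }
        class_list = {rid: [r[class_name], "root"] for rid, r in enumerate(records)}
        return attribute_lists, class_list

    records = [{"age": 23, "salary": 15, "class": "B"},
               {"age": 40, "salary": 75, "class": "G"},
               {"age": 32, "salary": 52, "class": "G"}]
    attr_lists, class_list = build_attribute_lists(records, ["age", "salary"], "class")
    # attr_lists["age"] == [(23, 0), (32, 2), (40, 1)]; class_list[2] == ["G", "root"]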
The split-evaluation algorithm of SLIQ is as follows:

EvaluateSplits()
    for each attribute A do
        traverse the attribute list of A
        for each value v in the attribute list do
            find the corresponding entry in the class list, and hence the
                corresponding class and leaf node
            update the class histogram in that leaf
        if A is a categorical attribute then
            for each leaf of the tree do
                find the subset of A with the best split

The class-list update algorithm of SLIQ: after the test attribute and its split points have been found, the next step is to create child nodes for each of the leaf nodes and update the class list:

UpdateLabels()
    for each attribute A used in a split do
        traverse the attribute list of A
        for each value v in the attribute list do
            let e be the corresponding class-list entry
            find the new class c to which v belongs by applying the splitting
                test at the node referenced from e
            update the class label of e to c
            update the node referenced in e to the child corresponding to class c

The strength of this approach is that finding the best attribute and updating the leaf labels in the class list are done in a single scan of the attribute list corresponding to a specific node of the tree. Before the tree is constructed, each continuous attribute list is sorted in its entirety, and the association of each record is maintained through its record identifier. This eliminates the need to sort the data at each node of the decision tree; instead, the training data are sorted just once for each continuous attribute at the beginning of the tree-growth phase. Second, the attribute lists do not have to be in main memory while the decision tree is constructed, so the training dataset can be much larger than the memory-resident datasets required by C4.5-like techniques.

Weakness of SLIQ: Because the class list is updated frequently and accessed randomly, SLIQ assumes that there is enough memory to keep it memory-resident. The size of the class list grows proportionally with the number of tuples in the training set; when the class list cannot fit into memory, the performance of SLIQ degrades.

To overcome this limitation of SLIQ, which requires the class list to reside in main memory, SPRINT (Scalable Parallel Classifier for Data Mining) was introduced. It is a decision-tree-based classification algorithm that tries to remove all of the memory restrictions, and it is also claimed to be fast and scalable. The goal of SPRINT was not to outperform SLIQ on datasets where a class list can fit in memory; instead, its purpose is to develop an accurate classifier for datasets that are simply too large for any other algorithm, and to be able to develop such a classifier efficiently.

In SPRINT, all the information needed to find split points on a particular attribute at a node is stored in histograms. For a continuous attribute, two histograms, Cabove and Cbelow, are associated with each decision-tree node under consideration for splitting: Cbelow maintains the class distribution of the attribute records that have already been processed, and Cabove maintains it for those that have not. For a categorical attribute, in contrast, only one histogram is needed, containing the class distribution for each value of the given attribute. A hash table is also needed so that the non-test attributes can be split according to the split point found on the test attribute; this hash table can become very large as the number of nodes under consideration for splitting increases. SPRINT focuses on two major steps of the tree-growth phase, both of which have critical performance implications:

1. How to find the split points that define node tests.
Continuous attributes are the main concern here. The Gini formula is used to find the test attribute. For a continuous attribute, the candidate split points are the mid-points between every two consecutive attribute values in the subset. Cbelow is initialized to zero and Cabove is initialized with the class distribution of the subset of records belonging to the specific node. While scanning the subset records and updating the histograms, whenever a winning split point is found it is saved, and Cabove and Cbelow are then deallocated. One may wonder how Cabove is initialized; the answer is simple: each time an attribute list is physically split, the class distribution for each specific node, together with the subset of records associated with that node, is saved. For a categorical attribute, the processing is much simpler: a single scan over all the records is made to update the histogram; at the end of the scan, the Gini value is calculated and compared with the other candidates, and the histogram is deallocated. Once again, after an attribute is split, the initial state of the new histogram for each child must be saved for future calculation.

2. Having chosen a split point, how to partition the data.

Once the test attribute and its split point have been found, the split is performed. The test attribute list is scanned one more time and physically partitioned, and the partition information is inserted into a hash table that indicates how the other attribute lists should be partitioned. The hash table provides a mapping between record identifiers and the child node to which each record belongs after the split; this mapping is then probed to split the other attribute lists in a consistent manner. The non-test attribute lists perform their partition based on the information (the RIDs associated with each child node) stored in the hash table.

The strengths of SPRINT can be identified as follows:
1. SPRINT uses the pre-sorting approach for every continuous attribute.
2. SPRINT uses an attribute-list data structure that holds the class and RID information. When a node is split, the attribute lists are partitioned and distributed among the resulting child nodes accordingly; because the order of the records in a list is maintained when it is partitioned, partitioning the lists does not require re-sorting. SPRINT was also designed to be easily parallelized, further contributing to its scalability.

The weaknesses of SPRINT are:
1. SPRINT requires building class histograms for each new leaf, which are used to initialize the Cabove histograms when evaluating continuous split points in the next pass. Due to frequent updates, the Cabove histograms must be kept in main memory; as the number of nodes under consideration for splitting increases, the total space required to store them can exceed main memory.
2. SPRINT requires a hash table proportional in size to the training set, which may become expensive as the training set grows.

A minimal sketch of the continuous split-point evaluation using the Cbelow and Cabove histograms is given below.
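The following is an illustrative sketch only (not the SPRINT implementation; all names are hypothetical): the continuous split-point search over a pre-sorted attribute list is driven entirely by the two class-count histograms, with each candidate split point taken as the mid-point between consecutive distinct values.

    from collections import Counter

    def gini_from_counts(counts):
        """Gini index computed from a class-count histogram."""
        n = sum(counts.values())
        return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

    def best_continuous_split(attribute_list):
        """attribute_list: pre-sorted list of (value, class_label, rid) tuples."""
        c_above = Counter(label for _, label, _ in attribute_list)  # not yet processed
        c_below = Counter()                                         # already processed
        n = len(attribute_list)
        best = (float("inf"), None)                                 # (gini_split, split point)
        for i in range(n - 1):
            value, label, _ = attribute_list[i]
            c_below[label] += 1
            c_above[label] -= 1
            next_value = attribute_list[i + 1][0]
            if next_value == value:            # only split between distinct values
                continue
            n_below, n_above = i + 1, n - i - 1
            g = (n_below / n) * gini_from_counts(c_below) + \
                (n_above / n) * gini_from_counts(c_above)
            if g < best[0]:
                best = (g, (value + next_value) / 2)   # candidate = mid-point
        return best   # (best gini_split value, winning split point)

    # Example: attribute list pre-sorted by value, as in SPRINT.
    print(best_continuous_split([(17, "B", 0), (20, "B", 3), (23, "G", 1), (32, "G", 2)]))

Only the histograms, never the raw records, are consulted when the Gini index of a candidate split is computed.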
Parallelizing classification (performing classification in a shared-nothing, multiprocessor distributed system environment): There are two primary approaches to parallelizing SLIQ.

1. The class list is replicated in the memory of every processor. Performing the splits requires updating the class list for each training example, and since every processor must maintain a consistent copy of the entire class list, every class-list update must be communicated to and applied by every processor. Thus, the time taken by this part of tree growth increases with the size of the training set, even if the amount of data at each node remains fixed. Moreover, the size of the training set is limited by the memory size of a single processor, since each processor holds a full copy of the class list.

2. Each processor's memory holds only a portion of the entire class list: the class list is partitioned in parallel, so each of the N processors contains only 1/N of it. Note that the class label corresponding to an attribute value may reside on a different processor, which incurs high communication costs while evaluating continuous split points. As each attribute list is scanned, the corresponding class label and tree pointer must be looked up for each attribute value, which implies that each processor requires communication for (N-1)/N of its data. Each processor must also service lookup requests from other processors in the middle of scanning its own attribute lists.

Parallelizing SPRINT: SPRINT achieves uniform data placement and workload balancing by distributing the attribute lists evenly over the N processors of a shared-nothing machine, which allows each processor to work on only 1/N of the total data. The parallel partitioning proceeds as follows. Each processor holds a separate, contiguous section of a "global" attribute list. For a categorical attribute, since the count matrix built by each processor is based on "local" information only, these matrices must be exchanged to obtain the "global" counts needed to calculate the Gini value. For continuous attributes, each processor's Cbelow and Cabove histograms must additionally be initialized to reflect the fact that there are sections of the attribute list on other processors. As in the serial version, these statistics are gathered when the attribute lists for new leaves are created; after collection, the information is exchanged among all the processors and stored with each leaf, where it is later used to initialize that leaf's Cabove and Cbelow class histograms.

Weakness: Note that a processor can hold attribute records belonging to any leaf. Before building the probe structure, the RIDs must be collected from all the processors, which is quite expensive. Likewise, computing the set of records associated with a specific leaf requires the processors that hold parts of that dataset to pass their information around in order to update the histograms; this is also very expensive.

3) Our approach

3.1) Distributed computing using multiple agents

Data distribution (distributed database) and attribute partitioning: Typical data classification is a two-step process. In the first step, a model is built describing a predetermined set of data classes or concepts. The model is constructed by analyzing database tuples described by attributes; each tuple is assumed to belong to a predefined class, as determined by one of the attributes. In the context of classification, data tuples are also referred to as samples, examples, or objects. The tuples analyzed to build the model collectively form the training dataset; the individual tuples making up the training set are referred to as training samples and are randomly selected from the sample population.

By repeatedly splitting the data into smaller and smaller partitions, decision tree induction is prone to the problems of fragmentation, repetition, and replication. In fragmentation, the number of samples at a given branch becomes so small as to be statistically insignificant.
One solution to this problem is to allow the grouping of categorical attribute values: a tree node may test whether the value of an attribute belongs to a given set of values, such as Ai ∈ {a1, a2, ..., an}. Another alternative is to create binary decision trees, where each branch holds a Boolean test on an attribute.

A pipeline approach may improve the efficiency of SPRINT: instead of saving extra Cabove and Cbelow class histograms, processor 0 starts first and passes the state of its histograms to the next processor, then continues to work on the next subset of attribute records. After processor 1 receives the state of the histograms from processor 0, it does the same work that processor 0 has done and passes the state on to the next processor. This cycle continues until the last processor finishes its job.

A new distributed classification system: Training data may be distributed across a set of computers for several reasons. For example, several data sets concerning customers might be owned by different insurance companies that have competitive reasons for keeping the data private, while the organizations are still interested in models of the aggregate data. In addition, retrieving information from the data set in parallel improves the efficiency of constructing the decision tree. Our system uses the basic decision-tree algorithm and adopts the pre-sorting, RID-list, and histogram techniques from SPRINT; however, we eliminate SPRINT's processor-communication overhead. In addition, the training data are partitioned vertically among shared-nothing processors, as in parallel SLIQ; however, the class list, which SLIQ requires to reside in main memory, is eliminated. Agent techniques are used in our system, since they give each processor the self-motivation to work more efficiently.

Data structures:

Histograms: as in SPRINT, each node under consideration for splitting has initial histograms, one histogram for categorical attributes and two for continuous attributes, since all the information needed to calculate the information gain is contained in the histograms.

Message queue: used to assign work to each agent. Every time the master agent has a new assignment, it enqueues a message into the message queue; every time a slave agent finishes a task on a remote host, it goes to the message queue, dequeues a new assignment (if the queue is not empty), and starts the new mission. Note that the dequeue operation is mutually exclusive among agents: while one agent is dequeuing, no other agent can dequeue. Since jobs are assigned to agents dynamically, it does not matter which agent gets a job first or last, so starvation is not a concern here. A small sketch of this assignment queue is given below, after the discussion of data placement.

Building the binary tree (decision tree): after collecting the information sent back by the slave agents from the remote hosts, the local master agent makes a decision, constructs a node of the decision tree, and creates further assignments if necessary. An encoding technique based on [lowerbound, upperbound] intervals is used in constructing the binary tree, since it allows a leaf to be accessed in log(n) time.

Data placement and workload balancing: unlike SPRINT, we partition the data vertically. Each shared-nothing processor works independently and in parallel. For example, given a training dataset with N attribute lists in an M-processor distributed environment, the N attribute lists are divided evenly among the M processors. The schema for each attribute list stored on a processor is (attribute list + RID + class list). Although one processor may physically hold more than one attribute list, logically we can treat a single processor as executing on only one attribute list. Since we adopt the histogram technique from SPRINT, the dataset does not have to reside in main memory; only the necessary histograms are kept memory-resident.
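The assignment queue and the exclusive dequeue described above can be sketched as follows. This is an illustrative sketch only, with hypothetical field names; a production system would exchange these messages between hosts rather than between local threads.

    import queue
    import threading

    # Assignment queue shared by the master agent and the slave agents.
    # queue.Queue already makes enqueue/dequeue mutually exclusive internally.
    assignments = queue.Queue()

    def master_agent(attribute_names):
        """Enqueue one assignment per attribute under consideration."""
        for name in attribute_names:
            assignments.put({"host": "remote-host-for-" + name,   # hypothetical address
                             "attribute": name,
                             "node_id": 0,
                             "bounds": (0, None)})                 # [lowerbound, upperbound]

    def slave_agent(agent_id, results):
        """Repeatedly dequeue an assignment, 'travel' to the host, report a result."""
        while True:
            try:
                task = assignments.get_nowait()   # exclusive dequeue
            except queue.Empty:
                return                            # no more assignments
            # ... evaluate splits for task["attribute"] on task["host"] ...
            results.append((agent_id, task["attribute"]))
            assignments.task_done()

    results = []
    master_agent(["age", "salary", "region"])
    workers = [threading.Thread(target=slave_agent, args=(i, results)) for i in range(2)]
    for w in workers: w.start()
    for w in workers: w.join()
    print(results)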
Finding split points: Finding split points in parallel in our system is very simple. After a slave agent is created, it goes to the assignment queue to find a new assignment, in which a subset of the training set and its location are specified. The agent then travels to that location (this can be, say, processor 0) to calculate the information gain based on the histogram, and brings back the best information gain for that dataset. The master agent then collects the information from each agent regarding a leaf and makes the final decision on the test attribute and the splitting point.

Performing the splits: Having determined the winning test attribute along with its split points, the master agent inserts a new node into the decision tree and enqueues new assignments into the assignment queue. Performing the split is a logical splitting rather than a physical one. A new initial histogram for each new subset, if there is one, is saved for each new child leaf for future use.

Algorithm: Generate_decision_tree. Generate a decision tree from the given training data.
Input: the training samples, samples, and the set of candidate attributes, attribute-list (distributed on remote hosts).
Output: a decision tree.

(1) Create a node M (labeled with the number T of attributes under consideration).
(2) Create a manager agent and N slave agents.
(3) Enqueue N assignments (if there are N attributes under consideration) into the assignment queue.
(4) For each agent at the local host (simultaneously):
    4.1 dequeue an assignment package from the assignment queue;
    4.2 based on the information stored in the assignment package, dynamically travel to the remote host to learn a specific attribute;
    4.3 send the information gain and the possible splitting-subset information back to the manager agent at the local host;
    4.4 travel back to the local host to find another assignment in the assignment queue.
(5) While the manager agent collects information from the slave agents, once all T information gains associated with node M have been collected, the manager makes a decision:
    if the samples are all of the same class C, then return M as a leaf node labeled with class C;
    if the attribute-list is empty, then return M as a leaf node labeled with the most common class in the samples; // majority voting
    5.1 select the test attribute, the attribute in the attribute-list with the highest information gain, and label node M with it;
    5.2 partition the test attribute into two subsets, B1 and B2 (based on the information package from a slave agent);
    5.3 grow two branches from node M for the conditions test-attribute = B1 and test-attribute = B2, respectively;
    5.4 for each branch node: let si be the set of samples for which test-attribute = Bi; // a partition
        if si is empty, then attach a leaf labeled with the most common class in the samples;
        otherwise, the manager agent enqueues (n-1) assignment packages associated with this particular node into the assignment queue, and the procedure recurses back to step (4).

A sketch of the slave-agent side of this procedure follows.
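The slave-agent side of steps (4.1)-(4.4) might look like the following sketch. It is illustrative only, with hypothetical names; the entropy-based gain follows formulas (3)-(5) above, and the returned dictionary mirrors the information package described next.

    import math
    from collections import Counter

    def entropy(counts):
        n = sum(counts.values())
        return -sum((c / n) * math.log2(c / n) for c in counts.values() if c)

    def evaluate_assignment(assignment, attribute_list):
        """attribute_list: pre-sorted (value, class_label, rid) tuples on the remote host."""
        total = Counter(label for _, label, _ in attribute_list)
        below = Counter()
        n = len(attribute_list)
        best_gain, best_split = -1.0, n
        for i, (value, label, _) in enumerate(attribute_list[:-1]):
            below[label] += 1
            if attribute_list[i + 1][0] == value:
                continue                        # only split between distinct values
            above = total - below               # counts of records not yet processed
            n1 = i + 1
            info_x = (n1 / n) * entropy(below) + ((n - n1) / n) * entropy(above)
            gain = entropy(total) - info_x
            if gain > best_gain:
                best_gain, best_split = gain, i + 1
        return {  # information package sent back to the manager agent
            "attribute": assignment["attribute"],
            "information_gain": best_gain,
            "left_rids": [rid for _, _, rid in attribute_list[:best_split]],
            "right_rids": [rid for _, _, rid in attribute_list[best_split:]],
        }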
What is stored in an assignment package:
- remote host address
- name of the attribute
- total number of records of the attribute
- node id
- [lowerbound, upperbound]

What is stored in the information package that a slave agent sends back to the local host:
- remote host address
- name of the attribute
- information gain
- left subset
- right subset
- number of records in the left subset
- number of records in the right subset

How a slave agent performs on the remote host:
1. Read the database description in attributeName.dat to learn about the attribute on this host:
   1.1 whether it is continuous or discrete;
   1.2 the outcomes of this attribute;
   1.3 if it is a discrete attribute, the slave agent also needs to know the possible discrete values of the attribute.
2. Initialize a histogram based on the current status of the attribute.
3. Calculate the information gain or Gini value while scanning attributeName.mdb:
   if it is a discrete attribute, keep updating the histogram until the last record has been processed;
   if it is a continuous attribute, update the histogram while the n-1 candidate information gains are continuously updated; eventually, the winning information gain is stored in the return information package.
4. Store the subset record IDs in the return information package.
5. Once there is enough information in the return information package, the slave agent sends the package back to the manager agent and travels back to the local host to seek another assignment.

Computation is thus distributed by employing mobile agents.

3.2) Middleware (XML and metadata)
1. Naming (thesaurus)
2. XML: data integration and exchange
3. Reusable pattern repository (multi-dimensional pattern specification, similarity and difference measures, pattern finder and matching schema, decision rules):
   (1) to avoid repeated processing that generates overlapping information,
   (2) to increase efficiency and accuracy by increasing the patterns,
   (3) to provide user participation and reference to domain knowledge (ontology),
   (4) to obtain sampling data from an available domain library.
   The repository covers: 1. alias (synonym), 2. attribute selection, 3. attribute schema and constraints, 4. information gain computation, 5. decision tree building strategy, 6. classification rules, 7. categorization, 8. pruning schema.

3.3) Distributed data mining algorithms
1. SLIQ
2. SPRINT
3. Pipeline (enhancement of SPRINT)

3.4) Efficient data representation and integrated data mining algorithms
1. CBR: integration of classification and association
2. FP-tree: development of an efficient data structure
3. Cooperation of 2 with 1 (encoding)
4. Dynamic update: efficient, reusable, dynamically updated

3.5) Meta-data and ontology
1. Class hierarchy
2. Integration between the knowledge base and data mining
3. Using the ontology for efficient mining
4. Storing the mining results in the ontology
5. Using the ontology for queries

Reference:
R. Cooley, B. Mobasher, and J. Srivastava, "Web Mining: Information and Pattern Discovery on the World Wide Web," 1997.