Optimized Association Rules with Effective Algorithm for Large Database

Priyanka Mandrai 1, Raju Barskar 2
1 Student, CSE Department, U.I.T. RGPV Bhopal, Madhya Pradesh, India, priyanka.mandrai1988@gmail.com
2 Assistant Professor, CSE Department, U.I.T. RGPV Bhopal, Madhya Pradesh, India, raju_barskar53@rediffmail.com

Abstract
Association rule mining is a very efficient technique for finding correlations among data sets, and the correlation of data makes the extraction process meaningful. A variety of algorithms are used for mining rules, such as the Apriori algorithm and tree-based algorithms. Some of these algorithms perform well but generate redundant association rules and also suffer from the multi-scan problem. In this paper we propose K-Apriori-GA, an association rule mining algorithm based on a genetic algorithm and the Karnaugh map (K-map). In this method we use a K-map binary table to partition the data table into 0 and 1 regions; this division reduces the scanning time of the database. The proposed algorithm is a combination of K-Apriori and a genetic algorithm. The support weight key is a vector value given by the transaction data set. For rule optimization we use a genetic algorithm, and to evaluate the algorithm we use a real-world dataset of The National Rural Employment Guarantee Act (NREGA), Department of Rural Development, Government of India; we also compare the standard Apriori algorithm and the K-Apriori algorithm with our proposed algorithm.

Keywords: Association rule mining, Redundant rules, Multi-pass, K-map, Genetic algorithm, NREGA

1. Introduction
Association rule mining is a technique to detect unknown facts in a large dataset and to describe inferences on how subsets of items influence the presence of other subsets. Association rule mining aims to discover strong relations between attributes. Generating all frequent patterns is not very efficient, because a portion of the frequent patterns is redundant in association rule mining and generating the rules takes more time. This redundant-rule drawback can be overcome with the help of a genetic algorithm. Most data mining approaches use a greedy algorithm instead of a genetic algorithm, but a genetic algorithm is somewhat better than a greedy algorithm because it performs a global search and copes better with attribute interaction. A genetic algorithm simulates the evolution of a population; it is a biologically inspired technique that uses chromosomes as the elements on which solutions (individuals) are manipulated. Another drawback is multi-passing, which can be overcome with the help of the K-map. A K-map binary table partitions the data table into 0 and 1 regions, and this division reduces the scanning time of the database. The proposed algorithm is a combination of K-Apriori and a genetic algorithm, in which the support weight key is a vector value given by the transaction data set. In this paper we compare the Apriori and K-Apriori algorithms with our proposed algorithm. Association rule mining is applied to market basket data, weather prediction, multimedia data, etc.

The rest of the paper is organized as follows. Section 2 discusses association rule mining. Section 3 discusses association rule mining algorithms. Section 4 describes the literature survey. Section 5 discusses previous algorithms. Section 6 discusses the proposed algorithm. Finally, we explain the experimental results in Section 7 and conclude in Section 8.
2. Association Rule Mining

2.1 Problem Definition
The problem of discovering association rules is expressed as follows: given a database of sales transactions [1], it is desirable to discover the important associations among items, such that the presence of some items in a transaction implies the presence of other items in the same transaction. An example of an association rule is: "37% of transactions that contain bread also contain milk; 7% of all transactions contain both items". Here, 37% is called the confidence of the rule and 7% the support of the rule. The problem is to determine all association rules that satisfy user-specified minimum support and minimum confidence constraints. The problem of mining association rules was first introduced in [2], and the following formal definition was proposed in [3] to address the problem.

2.2 Definitions: Association Rule Mining
Let I = {i1, i2, ..., im} be a set of items. Let D be a set of transactions, where each transaction T contains a set of items in I, such that T ⊆ I. Every transaction is associated with a unique identifier, called its TID. Let X be a set of items. A transaction T is said to contain X if and only if X ⊆ T. An association rule is defined as an expression X ⇒ Y, where X and Y are non-empty itemsets (i.e. X ⊆ I, Y ⊆ I) such that X ∩ Y = ∅; X is called the antecedent and Y the consequent. The rule X ⇒ Y holds in the transaction set D with support s, where s% of the transactions in D contain X ∪ Y. The rule X ⇒ Y has confidence c in the transaction set D, where c% of the transactions in D that contain X also contain Y.

Support: The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y. The user-specified support value that rules must meet or exceed is termed the minimum support threshold (min_sup).

support(X ⇒ Y) = support(X ∪ Y) = P(X ∪ Y)

Confidence: The rule X ⇒ Y has confidence c in the transaction set D if c% of the transactions in D that contain X also contain Y. The user-specified confidence value that rules must meet or exceed is termed the minimum confidence threshold (min_conf).

confidence(X ⇒ Y) = support(X ∪ Y) / support(X) = P(Y | X)

Typically, large confidence values and a smaller support are used. Rules that satisfy both minimum support and minimum confidence are known as strong rules. Since the database is large and users are concerned only with frequently purchased items, thresholds of support and confidence are usually predefined by users to drop those rules that are not interesting or useful.

The problem of discovering all association rules can be divided into two sub-problems [3]:
(1) Find all sets of items (itemsets) whose transactional support is higher than the minimum support. These are the frequent itemsets; all other itemsets are referred to as infrequent itemsets.
(2) Use the frequent itemsets to generate the desired rules.
There is broad agreement in the literature that the first sub-problem is the more important of the two. This is because it is more time consuming, owing to the enormous search space, whereas the rule generation phase can be done in main memory in a straightforward manner once the frequent itemsets are found. That is why researchers have paid so much attention to this problem in recent years.
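To make the two measures concrete, here is a minimal Python sketch (not part of the original paper; the toy transaction set D and the helper names support and confidence are illustrative) that computes support(X ⇒ Y) and confidence(X ⇒ Y) exactly as defined above.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y."""
    return support(x | y, transactions) / support(x, transactions)

# toy transaction set D
D = [{"bread", "milk"}, {"bread"}, {"bread", "milk", "butter"}, {"milk"}]

X, Y = {"bread"}, {"milk"}
print("support(X => Y) =", support(X | Y, D))       # 2/4 = 0.5
print("confidence(X => Y) =", confidence(X, Y, D))  # 2/3 ~= 0.67
```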
3. Association Rule Mining Algorithms

3.1 AIS Algorithm
The AIS (Agrawal, Imielinski, Swami) algorithm was the first algorithm, proposed by Agrawal et al. in 1993 [2], for mining association rules. It focuses on enhancing databases with the functionality needed to process decision-support queries. In this algorithm only association rules with a single-item consequent are generated; for example, rules like X ∩ Y ⇒ Z are produced, but not rules like X ⇒ Y ∩ Z. In AIS the database is scanned again and again to obtain the frequent itemsets. To make the algorithm more efficient, an estimation technique was proposed to prune those candidate itemsets that have no hope of being large, so that the unnecessary effort of counting them can be avoided. Since all the candidate itemsets and frequent itemsets are assumed to be stored in main memory, memory management was also proposed for AIS for the case when memory is not sufficient. The key disadvantage of the AIS algorithm is that too many candidate itemsets that finally turn out to be small are generated, which requires more space and wastes effort. At the same time, the algorithm requires too many passes over the whole database.

3.2 Apriori Algorithm
Apriori is a major milestone in the history of association rule mining; the Apriori algorithm was first introduced by Agrawal and Srikant in 1994 [3]. AIS presented a straightforward approach that requires several passes over the database, generating many candidate itemsets and storing a counter for every candidate while most of them turn out not to be frequent. Apriori is more efficient in the candidate generation process for two reasons: it employs a different candidate generation technique and a new pruning technique. The Apriori algorithm still inherits the problem of scanning the whole database several times. Based on the Apriori algorithm, various new algorithms were proposed with some modifications or improvements. Typically there were two approaches: one is to reduce the number of passes over the whole database, or to replace the whole database with only part of it based on the current frequent itemsets; another approach is to explore different kinds of pruning processes to make the number of candidate itemsets much smaller. Apriori-TID and AprioriHybrid [3], DHP [4] and SON [5] are modifications of the Apriori algorithm. Most of the algorithms introduced above are based on the Apriori algorithm and try to improve efficiency by making a few modifications, such as reducing the number of passes over the database, reducing the part of the database to be scanned in each pass, pruning the candidates by different techniques, and using sampling techniques. But there are two drawbacks of the Apriori algorithm. One is the complex candidate generation process, which consumes most of the time, space and memory. The other drawback is the multiple scans of the database.

3.3 Partition Algorithm
This method was exploited in the Partition algorithm, introduced by Savasere et al. in 1995 [5]. The algorithm reduces database activity by computing all frequent sets in two passes over the database. The algorithm also works in a level-wise manner, but the idea is to partition the database into sections small enough to be handled in main memory.
That is, a partition is read once from disk, and level-wise generation and counting of candidates for that partition are performed in main memory with no further database activity. The main achievement of Partition is the reduction of database activity. It was shown that this reduction was not obtained at the expense of more CPU utilization, which is another success.

3.4 Pincer-Search Algorithm
The Apriori algorithm has to go through several iterations and, as a result, performance degrades. One way to overcome this difficulty is to integrate a bi-directional search, which takes advantage of both the bottom-up and the top-down approach. The Pincer-Search algorithm, proposed by Lin et al. in 1997 [6], is based on this principle. It attempts to find the frequent itemsets in a bottom-up manner but, at the same time, maintains a list of maximal frequent itemsets. In most cases, the algorithm not only decreases the number of passes over the database but can also decrease the number of candidates for which support is counted. In such cases, both I/O time and CPU time are decreased by eliminating candidates that are subsets of the maximal frequent itemsets found in the maximum frequent candidate set (MFCS). Pincer-Search improves over the Apriori algorithm when the largest frequent itemset is long.

3.5 Dynamic Itemset Counting Algorithm
Another algorithm, called DIC (Dynamic Itemset Counting), was introduced by Brin et al. in 1997 [7]. It tries to reduce database activity by counting candidate itemsets earlier than Apriori does. In Apriori, candidate (k+1)-itemsets are not counted until the (k+1)-th pass over the database. In DIC, on the other hand, a candidate (k+1)-itemset is counted as soon as the algorithm discovers that all its subsets of size k have exceeded the support threshold and will be frequent. This is accomplished by stopping at several points in the database to consider including other itemsets in the counting process. It has been shown that this technique, with practical settings of the number of transactions read before stopping for re-evaluation, can decrease the number of database passes considerably while keeping the number of candidate sets that must be counted rather low compared to other proposed techniques. The algorithm, though generally more efficient than Apriori, was found to be sensitive to data characteristics. A remedy was suggested for this difficulty with some success. It was also shown that by logically ordering the items according to certain criteria, the performance of the algorithm increases dramatically; however, this item-reordering strategy incurs considerable overhead.

3.6 FP-Tree Algorithm (Frequent Pattern Tree)
To overcome the two drawbacks of the Apriori algorithm, association rule mining using a tree structure has been considered. FP-Tree [8], frequent pattern mining, is another milestone in the development of association rule mining, which overcomes the two drawbacks of Apriori. The frequent itemsets are generated with only two passes over the database and without any candidate generation process. The FP-Tree was proposed by Han et al. in 2000 [8].
By avoiding candidate generation and making fewer passes over the database, FP-Tree is an order of magnitude faster than the Apriori algorithm. The frequent pattern generation process consists of two sub-processes: constructing the FP-Tree, and generating frequent patterns from the FP-Tree. Every algorithm has its limitations; FP-Tree is not easy to use in an interactive mining system. During an interactive mining process, users may change the support threshold according to the rules, but for FP-Tree a change of support may lead to a repetition of the whole mining process. Another limitation of FP-Tree is that it is not suitable for incremental mining. As time goes on, databases keep changing and new records may be inserted; those insertions may also lead to a repetition of the whole process if the FP-Tree algorithm is employed.

3.7 Rapid Association Rule Mining
RARM, introduced by Das et al. in 2001 [9], is another association rule mining method that uses a tree structure to represent the original database and avoids candidate generation. RARM is claimed to be much faster than the FP-Tree algorithm, as shown by the experimental results in the original paper [9]. By using the SOTrieIT structure, RARM can generate large 1-itemsets and 2-itemsets rapidly without scanning the database a second time and without candidate generation. Similar to the FP-Tree, each node of the SOTrieIT contains one item and the corresponding support count. Since generating the large 2-itemsets is the most costly step during the mining process, experiments in [9] showed that the efficient generation of large 1-itemsets and 2-itemsets with the SOTrieIT structure improves performance dramatically; SOTrieIT is much faster than FP-Tree, but it also faces difficulties similar to those of FP-Tree.

3.8 Generalized Association Rule Mining
Generalized association rules [10] use the existence of a hierarchical classification (concept hierarchy) of the data to produce different association rules at different levels in the classification [10]. Association rules, and the support-confidence framework used to mine them, are well suited to the market-basket problem. When association rules are generated, items may appear at any of the hierarchical levels present. As would be expected, when rules are generated for items at a higher level in the classification, both the support and the confidence increase. A generalized association rule X ⇒ Y is defined identically to a regular association rule, except that no item in Y can be an ancestor of any item in X [10]. A supermarket may want to find associations relating to soft drinks in general, or may want to identify those for a specific brand or type of soft drink (such as a cola). Generalized association rules allow this to be accomplished and also ensure that all association rules, even those across levels in different taxonomies, are found.

4. Literature Survey of the Association Rule Mining Problem
The computational cost of association rule mining can be reduced in three ways:
- by reducing the number of passes;
- by removing the redundant association rules;
- by removing the negative association rules.

4.1 Reducing the Number of Passes
Two algorithms, Apriori and AprioriTid, which discover all significant association rules between items in a large database of transactions, were introduced by Agrawal et al. [3]. The best features of the two proposed algorithms can be combined into a hybrid algorithm, called AprioriHybrid. Scale-up experiments showed that AprioriHybrid scales linearly with the number of transactions. In addition, the execution time decreases slightly as the number of items in the database increases. As the average transaction size increases (while keeping the database size constant), the execution time increases only gradually.
AIS and SETM have always been outperformed by the Apriori and AprioriTid algorithms. There was a considerable increase in the performance gap with increasing problem size, ranging from a factor of three for small problems to more than an order of magnitude for large ones.

J.S. Park et al. [4] have proposed a Direct Hashing and Pruning (DHP) algorithm for efficient large-itemset generation. The proposed algorithm has two major features: one is the efficient generation of large itemsets and the other is an effective reduction of the transaction database size. Using hash techniques, DHP is very efficient for generating the candidate set of large 2-itemsets.

Hidber [11] presented a novel algorithm named Continuous Association Rule Mining Algorithm (CARMA), which computes large itemsets online. The algorithm continuously produces large itemsets along with a shrinking support interval for each itemset. CARMA's itemset lattice quickly approximates a superset of all large itemsets while the sizes of the corresponding support intervals shrink rapidly. The memory efficiency of CARMA was an order of magnitude greater than that of Apriori. Apriori and DIC (Dynamic Itemset Counting) [7] fell behind CARMA at low support thresholds. Besides, CARMA was found to be up to sixty times more memory efficient.

Wang et al. [12] proposed a new class of interesting problems called weighted association rules (WAR), which are mined by first ignoring the weights and finding the frequent itemsets, and then introducing the weights during rule generation. This approach shortens the average execution time.

Bodon [13] analyzed, theoretically and experimentally, Apriori [3], the most established algorithm for frequent itemset mining. Implementations of the Apriori algorithm display large differences in running time and memory requirements. Bodon also modified Apriori, naming the result Apriori_Brave, which appears to be faster than the original algorithm.

Li et al. [14] proposed a new single-pass algorithm, called Data Stream Mining for Frequent Itemsets (DSM-FI), which mines all frequent itemsets over the entire history of a data stream. DSM-FI outperforms Lossy Counting [15] in terms of execution time and memory usage on large data sets.

Ye et al. [16] implemented a parallel Apriori algorithm based on Bodon's work [13] and analyzed its performance on a parallel computer. Their implementation is a partition-based Apriori algorithm that partitions the transaction database. Partitioning the transaction database improved the performance of frequent itemset mining by fitting each partition in limited main memory for quick access and by allowing incremental generation of frequent itemsets.

Another algorithm for efficiently generating large frequent candidate sets was proposed by Yuan [17], called the Matrix Algorithm. The algorithm generates a matrix with entries 1 or 0 by passing over the original database only once, and then the frequent candidate sets are obtained from the resulting matrix. Finally, association rules are mined from the frequent candidate sets. This algorithm is more effective than the Apriori algorithm.

Huan Wu et al. [18] proposed the Apriori-based algorithm IAA. IAA adopts a new count-based method to prune candidate itemsets and decreases the amount of scanned data by means of a candidate generation record; this algorithm can reduce the redundant operations while generating frequent itemsets and association rules from the database.
Paul [19] proposed an Optimized Distributed Association Mining algorithm for the mining process in a distributed environment. The response time, together with communication and computation factors, is considered in order to achieve an improved response time. The performance analysis is done by increasing the number of processors in the distributed environment. As the mining process is done in parallel, an optimal solution is obtained.

Gautam et al. [20] proposed a multilevel association rule mining algorithm based on a Boolean matrix (MLBM). It adopts Boolean relational calculus to discover maximum frequent itemsets at lower levels. The algorithm scans the database only once and then generates the association rules; the Apriori property is used to prune the itemsets, so it is not necessary to scan the database again. In addition, it stores all transaction data in bits, so it needs less memory space and can be applied to mining large transaction databases.

Ghosh et al. [21] used a genetic algorithm for the discovery of frequent itemsets; the advantage is that a GA performs a global search and its time complexity is lower compared to other, greedy-approach-based algorithms. This method finds all the frequent itemsets from the given data sets using a genetic algorithm.

Kamal et al. [22] propose a novel method, the transactional pattern base, in which transactions with the same pattern are added once and their frequency is increased. Subsequent scanning then requires only this compact dataset, which increases the efficiency of the respective methods. A two-dimensional matrix is used instead of the FP-Growth method. The matrix, used to generate association rules, is very small in size compared to an FP-Tree, and the rules are generated more quickly. The size of the matrix is not directly proportional to the number of transactions; if the frequency of transactions is high, the size of the matrix will be even smaller.

A drawback of the Matrix Apriori algorithm is that the process must be repeated with every single update. Therefore, Oguz [23] proposed a dynamic frequent itemset mining algorithm, called Dynamic Matrix Apriori. It handles additions as well as deletions, and it also manages the challenges of new items and support changes. The main advantage of the algorithm is that it avoids a full database scan when the database is updated; it scans only the increments.

4.2 Redundant Association Rules
To deal with the problem of rule redundancy, various kinds of research on mining association rules have been performed. Cristofor et al. [24] proposed inference rules or inference systems to prune redundant rules and thus present smaller, and usually more understandable, sets of association rules to the user. Ashrafi et al. [25] presented several methods to eliminate redundant rules and to produce a small number of rules from any given set of frequent or frequent closed itemsets. Ashrafi et al. [26] present additional redundant-rule elimination methods that first identify the rules that have similar meaning and then eliminate those rules. Furthermore, their methods eliminate redundant rules in such a way that they never drop any higher-confidence or interesting rules from the resultant rule set.
Another approach, called MTRFMA (modified transaction reduction based frequent itemset mining algorithm), developed by Thevar et al. [27], maintains its performance even at relatively low supports.

AL-Zawaidah et al. present a novel approach, the Feature Based Association Rule Mining Algorithm (FARMA) [28]. This algorithm uses the leverage measure with minimum leverage thresholds, which at the same time incorporates an implicit frequency constraint; it finds all itemsets with minimum support and then filters the found itemsets using the leverage constraint. The algorithm reduces the generation of candidate itemsets and thus reduces the memory required to store a huge number of useless candidates.

Another approach based on a genetic algorithm is the Mining Optimized Association Rules algorithm (MOAR) [29] proposed by Wakabi et al. In this algorithm, using the standard SsTs-dominance relations causes a large number of rules, some of them interesting, to be found. When dominance is determined solely through support and confidence, there is a high chance of eliminating interesting rules, so a mechanism is needed for managing their large numbers and for significantly improving the response time of the algorithm. To overcome the problems of negative rules and rule superiority, a genetic algorithm is used, which gives optimized association rules on top of the Apriori algorithm.

Jain [30] proposed a new algorithm that combines the support weight value, the near distance of the superior candidate key, and parity-based selection of rules based on the group value of the rule. This technique, applied to a synthetic database, generated the desired rules.

Rangaswamy et al. [31] implemented association rule mining of data using genetic algorithms. It improves the performance of accessing information from databases (log files) maintained on a server machine. The system finds all the possible optimized rules from a given data set, using a genetic algorithm to minimize the time required for scanning huge databases.

Indira et al. [32] analyzed the performance of the GA in mining association rules effectively, based on variations and modifications of the GA parameters. The fitness function, crossover rate and mutation rate are shown to be the primary parameters in an implementation of genetic algorithms.

5. Previous Algorithms

5.1 Apriori Algorithm
The Apriori algorithm was proposed by Agrawal et al. [3] and is considered one of the most important contributions to association rule mining. Its main algorithm, Apriori, has influenced not only the association rule mining community but other data mining fields as well. The Apriori algorithm for finding all large itemsets makes multiple passes over the database. In the first pass, the algorithm counts item occurrences to determine the large 1-itemsets. A subsequent pass, say pass k, consists of two steps. First, the large itemsets Lk-1 found in the (k-1)-th pass are used to generate the candidate itemsets Ck. Then, all those itemsets that have some (k-1)-subset that is not in Lk-1 are deleted, yielding Ck.

Algorithm: Apriori
1) Begin
2) L1 = {frequent 1-itemsets};
3) for (k = 2; Lk-1 ≠ ∅; k++) do begin
4)   Ck = Apriori-Gen(Lk-1);
5)   for all transactions t ∈ D do begin
6)     for all candidates c ∈ Ck contained in t do
7)       c.count++;
8)   end
9)   Lk = {c ∈ Ck | c.count ≥ minsup}
10) end
11) end
12) Answer = ∪k Lk;

The main drawback of the Apriori algorithm is the repeated scans of the database. Further research evolved a new concept based on the Karnaugh map.
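For reference, the following is a minimal Python rendering of the Apriori pseudocode above. It is a sketch, not the authors' implementation: minsup is treated here as an absolute count, and the helper apriori_gen combines the join and prune steps described in [3]. The toy transactions are illustrative.

```python
from itertools import combinations

def apriori_gen(prev_frequent, k):
    """Generate k-item candidates from L_{k-1} and prune those with an infrequent (k-1)-subset."""
    candidates = set()
    for a in prev_frequent:
        for b in prev_frequent:
            union = a | b
            if len(union) == k:
                candidates.add(union)
    return {c for c in candidates
            if all(frozenset(s) in prev_frequent for s in combinations(c, k - 1))}

def apriori(transactions, minsup):
    """Return all frequent itemsets with absolute support count >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    current = {c for c in items if sum(c <= t for t in transactions) >= minsup}
    k = 2
    while current:
        frequent.update({c: sum(c <= t for t in transactions) for c in current})
        candidates = apriori_gen(current, k)
        current = {c for c in candidates if sum(c <= t for t in transactions) >= minsup}
        k += 1
    return frequent

# toy run: prints each frequent itemset with its support count
D = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
for itemset, count in sorted(apriori(D, minsup=2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), count)
```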
A Karnaugh map provides a pictorial method of grouping together expressions with common factors and therefore eliminating unwanted variables.

5.2 K-Apriori Algorithm
Using the concept of the K-map, a new algorithm was developed to reduce the number of database scans to one. In the K-Apriori algorithm, a Karnaugh map is used to store the database transactions in reduced form, which needs only one scan of the database, and then the Apriori algorithm is used to identify frequent sets. The support count can be calculated directly from the K-map, so no further scanning of the database is required. The K-map binary table partitions the data in the table at 0 and 1, and this division reduces the scanning time of the database.

Algorithm: K-Apriori (the variable names below are a reconstruction of the original notation)
1) Begin
2) Initialize: n = number of partitions required, m = number of transactions to be analyzed, w = m/n // number of transactions in each partition
// Phase I
3) for i = 1 to n do begin
4)   for j = 1 to w do begin
5)     Pi = read_partition(tj in Pi)
6)   end
7)   Kmap_i = generate_Kmap(Pi)
8)   Li = Apriori(Kmap_i)
9) end
// Merge phase
10) for (k = 2; Lk_i ≠ ∅, i = 1, 2, 3, ..., n; k++) do begin
11)   CkG = ∪ i=1..n Lk_i
12) end
// Phase II
13) for i = 1 to n do begin
14)   Kmap = Kmap + Kmap_i
15) end
16) for all candidates c ∈ CG compute s(c) using Kmap
17) LG = {c ∈ CG | s(c) ≥ minsup}
18) Answer = LG
19) End

Here, in this paper, we attempt to implement the concept of a genetic algorithm along with the K-Apriori algorithm, so that the number of database scans is reduced to one and the redundant rules are also optimized.
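The single-scan idea behind K-Apriori can be illustrated with a short sketch. This is not the authors' implementation; it only assumes, for illustration, that the K-map is stored as a 0/1 item-by-transaction table built in one pass over the data, after which support values are read from the table instead of rescanning the raw database. The helper names build_binary_table and support_from_table, and the toy data, are hypothetical.

```python
from itertools import combinations

def build_binary_table(transactions, items):
    """Single pass over the transactions: build a 0/1 table with one row per item."""
    table = {item: [0] * len(transactions) for item in items}
    for col, t in enumerate(transactions):
        for item in t:
            table[item][col] = 1   # mapped cell -> 1, unmapped cell stays 0
    return table

def support_from_table(table, itemset, n_transactions):
    """Support of an itemset read directly from the binary table (no rescan of the raw data)."""
    count = sum(
        all(table[item][col] for item in itemset)
        for col in range(n_transactions)
    )
    return count / n_transactions

# toy data, purely illustrative
transactions = [{"bread", "milk"}, {"bread"}, {"milk", "butter"}, {"bread", "milk", "butter"}]
items = {"bread", "milk", "butter"}

table = build_binary_table(transactions, items)
for size in (1, 2):
    for itemset in combinations(sorted(items), size):
        print(itemset, support_from_table(table, itemset, len(transactions)))
```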
6. Proposed Methodology
We propose a novel algorithm for the optimization of association rule mining. The proposed algorithm resolves the problem of redundant association rules and also optimizes the multi-pass rule generation process; multiple passes over the rules are a great challenge for large datasets. In this paper we propose K-Apriori-GA. In this algorithm the K-map is used to logically partition the dataset: the database is divided into two sections, mapped data and unmapped data. The mapped data is logically assigned 1 and the unmapped data is logically assigned 0 for the scanning process, and this division reduces the scanning time of the database. The algorithm combines K-Apriori with a genetic algorithm. The support weight key is a vector value given by the transaction data set; the support value is passed as a vector for finding the near distance between K-map candidate keys. After finding a K-map candidate key, the nearest distances are divided into two classes: one class takes the higher values and the other class the lower values for the rule generation process. This class selection also reduces the passes over the data set. After finding the lower and higher classes for the given support value, the values of the distance weight vector are compared; this distance weight vector works as the fitness function for the selection process of the genetic algorithm. Here we present the steps of the algorithm.

Steps of the algorithm (K-Apriori-GA):
1. Select the data set.
2. Set the values of support and confidence.
3. Logically divide the dataset into two parts, 0 and 1.
4. Assign 1 to the mapped part and 0 to the unmapped part.
5. Start scanning the transaction table.
6. Count the frequent items.
7. Generate the frequent itemsets.
8. Check whether the transaction set of data is null.
9. Use the value of support as the weight.
10. Compute the distance with the Euclidean distance formula.
11. Generate the distance vector value for the selection process.
12. Initialize a population set (t = 1).
13. Compare the value of the distance vector with the population set.
14. If the value of support is greater than the vector value,
15. proceed to encoding of the data.
16. The encoding format is binary.
17. After encoding, offspring are produced.
18. Set the probability of mutation; the value of the probability is 0.006.
19. A set of rules is generated.
20. Check the K-map value of the table.
21. If the rule is not in the K-map, it goes into the selection process again;
22. else the optimized rule is generated.
23. Exit.
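Because the distance-weight fitness in steps 10-14 is described only at a high level, the sketch below fills the gaps with stated assumptions: each chromosome is a binary vector marking which candidate rules are kept (binary encoding, step 16), the fitness rewards selected rules whose (support, confidence) pair lies close, in Euclidean distance, to the ideal point (1, 1), and the mutation probability is the 0.006 given in step 18. The rule list, thresholds and function names are illustrative only, not the authors' code.

```python
import math
import random

# Candidate rules as produced by the frequent-itemset phase.
# Each entry: (antecedent, consequent, support, confidence) -- toy values, purely illustrative.
rules = [
    (("bread",), ("milk",), 0.50, 0.67),
    (("milk",), ("butter",), 0.50, 0.67),
    (("bread", "milk"), ("butter",), 0.25, 0.50),
    (("butter",), ("milk",), 0.50, 1.00),
]

MIN_SUP, MIN_CONF = 0.35, 0.40   # user-specified thresholds (assumed values)
P_MUTATION = 0.006               # mutation probability taken from step 18
POP_SIZE, GENERATIONS = 8, 30

def fitness(chromosome):
    """Placeholder 'distance weight' fitness: reward selected rules whose
    (support, confidence) point lies close to the ideal point (1, 1)."""
    score = 0.0
    for gene, (_, _, sup, conf) in zip(chromosome, rules):
        if gene and sup >= MIN_SUP and conf >= MIN_CONF:
            score += 1.0 - math.dist((sup, conf), (1.0, 1.0)) / math.sqrt(2)
    return score

def select(population):
    """Binary tournament selection on the distance-based fitness."""
    a, b = random.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(a, b):
    """Single-point crossover over the binary encoding."""
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(chromosome):
    """Flip each bit with the small probability P_MUTATION."""
    return [1 - g if random.random() < P_MUTATION else g for g in chromosome]

population = [[random.randint(0, 1) for _ in rules] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]

best = max(population, key=fitness)
print("Optimized rule set:",
      [(x, y) for keep, (x, y, _, _) in zip(best, rules) if keep])
```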
7. Experiments and Results
The experiments were conducted on a real-world dataset of The National Rural Employment Guarantee Act (NREGA), Department of Rural Development, Government of India. The objective of the act is to enhance livelihood security in rural areas by providing at least 100 days of guaranteed wage employment in a financial year to every household whose adult members volunteer to do unskilled manual work. Proper maintenance of records is one of the critical success factors in the implementation of NREGA. Information on critical inputs, processes, outputs and outcomes has to be meticulously recorded in prescribed registers at the levels of the district programme coordinator, programme officer, gram panchayat and other implementing agencies. The computer-based management information system also captures the same information electronically. To evaluate the algorithm we used Janpad Panchayat records of March 2012 from Sehore district (M.P.). Details of the monthly squaring of accounts are made publicly available on the internet at all levels of aggregation through the website (http://nrega.nic.in). The database contains data pertaining to the name of sub-engineers, name of gram panchayat, type of works, technical sanction, administrative sanction, cost of work, completion date, labour expenditure, material expenditure, total expenditure of work and physical report of work.

To evaluate the efficiency of the proposed method we extensively studied our algorithm's performance by comparing it with the standard Apriori algorithm as well as the K-Apriori algorithm. NREGA has many records, but we randomly selected 731 transactions; this selection has 11 attributes and 731 instances. For these experiments we used 4 attributes and 20 instances. The experiments were executed on an Intel(R) Celeron(R) M CPU 1.73 GHz machine, and the software was MATLAB 7.8.0 (2009). The tables below present the rule generation and execution time for particular support and confidence values. Table 1 gives the execution time of all three algorithms.

Table 1 Execution time of the three algorithms

Attributes  Min. support  Min. confidence  Apriori  K-Apriori  K-Apriori-GA
A           0.35          0.40             15.1     13.3       12.1
B           0.40          0.47             11.8     13.4       13.5
C           0.45          0.55             14.5     13.4       12.7
D           0.50          0.60             14.1     13.7       12.3
E           0.55          0.65             13.1     13.4       12.3

Fig. 1 Comparative analysis of execution time

Table 2 shows the number of rules generated by all three algorithms after optimization.

Table 2 Number of rules generated by the three algorithms

Attributes  Min. support  Min. confidence  Apriori  K-Apriori  K-Apriori-GA
A           0.35          0.40             226      163        154
B           0.40          0.47             177      154        140
C           0.45          0.55             154      140        126
D           0.50          0.60             140      140        123
E           0.55          0.65             123      126        122

Fig. 2 Comparative analysis of the rule optimization process

8. Conclusion
The aim of this paper is to enhance the performance of the standard Apriori algorithm that mines association rules by presenting a fast and scalable algorithm for discovering association rules in massive databases. The approach taken to achieve the required improvement is to build a more efficient algorithm out of the standard one by adding new features to the Apriori approach. The proposed mining algorithm efficiently discovers the association rules between the data items in massive databases; in particular, redundant rules are eliminated and the scanning time of the algorithm is reduced considerably. We compared our proposed algorithm with the previously proposed algorithms. The findings from different experiments have confirmed that our proposed approach is the most effective among those compared. It can speed up the data mining process considerably, as demonstrated in the performance comparison. We demonstrated the effectiveness of our algorithm using real-world data sets, and we developed a visualization module to provide users with helpful information about the database to be mined and to assist them in managing and understanding the association rules.

References
[1] M.S. Chen, J. Han, and P.S. Yu, "Data Mining: An Overview from a Database Perspective," IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, December 1996, pp. 866-883.
[2] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," Proceedings of the ACM SIGMOD Conference, Washington DC, USA, May 1993.
[3] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994, pp. 487-499.
[4] J.S. Park, M.S. Chen, and P.S. Yu, "An Effective Hash Based Algorithm for Mining Association Rules," Proceedings of the ACM SIGMOD International Conference on Management of Data, 1995, pp. 175-186.
[5] A. Savasere, E. Omiecinski, and S. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases," Proceedings of the 21st International Conference on VLDB, 1995.
[6] D. Lin and Z.M. Kedem, "Pincer Search: A New Algorithm for Discovering the Maximum Frequent Set," September 1997.
[7] S. Brin, R. Motwani, J. Ullman, and S. Tsur, "Dynamic Itemset Counting and Implication Rules for Market Basket Data," Proceedings of the ACM SIGMOD Conference on Management of Data, 1997, pp. 255-264.
[8] J. Han and J. Pei, "Mining Frequent Patterns by Pattern-Growth: Methodology and Implications," ACM SIGKDD Explorations Newsletter, 2000, pp. 14-20.
[9] A. Das, W.K. Ng, and Y.K. Woon, "Rapid Association Rule Mining," Proceedings of the 10th International Conference on Information and Knowledge Management, ACM Press, 2001, pp. 474-481.
[10] R. Srikant and R. Agrawal, "Mining Generalized Association Rules," Proceedings of the 21st VLDB Conference, Zurich, Switzerland, 1995, pp. 407-419.
[11] C. Hidber, "Online Association Rule Mining," in A. Delis, C. Faloutsos, and S. Ghandeharizadeh, editors, Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, SIGMOD Record 28(2), ACM Press, 1999, pp. 145-156.
[12] W. Wang, J. Yang, and P.S. Yu, "Efficient Mining of Weighted Association Rules (WAR)," Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000, pp. 270-274.
[13] F. Bodon, "A Fast Apriori Implementation," Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, Vol. 90 of CEUR Workshop Proceedings, 2003.
[14] H.F. Li, S.Y. Lee, and M.K. Shan, "An Efficient Algorithm for Mining Frequent Itemsets over the Entire History of Data Streams," Proceedings of the 1st International Workshop on Knowledge Discovery in Data Streams, 2004, pp. 20-24.
[15] G.S. Manku and R. Motwani, "Approximate Frequency Counts over Data Streams," Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.
[16] Y. Ye and C.C. Chiang, "A Parallel Apriori Algorithm for Frequent Itemsets Mining," 4th International Conference on Software Engineering Research, Management and Applications, 2006, pp. 87-94.
[17] Y. Yuan and T. Huang, "A Matrix Algorithm for Mining Association Rules," Lecture Notes in Computer Science, Springer-Verlag Berlin Heidelberg, Vol. 3644, September 2005, pp. 370-379.
[18] H. Wu, Z. Lu, L. Pan, R. Xu, and W. Jiang, "An Improved Apriori-based Algorithm for Association Rules Mining," 6th International Conference on Fuzzy Systems and Knowledge Discovery, IEEE, 2009, pp. 51-55.
[19] S. Paul, "An Optimized Distributed Association Rule Mining Algorithm in Parallel and Distributed Data Mining with XML Data for Improved Response Time," International Journal of Computer Science and Information Technology, Vol. 2, No. 2, April 2010, pp. 88-101.
[20] P. Gautam and K.R. Pardasani, "A Fast Algorithm for Mining Multilevel Association Rule Based on Boolean Matrix," International Journal on Computer Science and Engineering (IJCSE), Vol. 2, No. 3, 2010, pp. 746-752.
[21] S. Ghosh, S. Biswas, D. Sarkar, and P.P. Sarkar, "Mining Frequent Itemsets Using Genetic Algorithm," International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 1, No. 4, October 2010, pp. 133-143.
[22] S. Kamal, R. Ibrahim, and Zia-ud-Din, "Association Rule Mining through Matrix Manipulation using Transaction Patternbase," Journal of Basic & Applied Sciences, Vol. 8, 2012, pp. 187-195.
[23] D. Oguz, "Dynamic Frequent Itemset Mining Based on Matrix Apriori Algorithm," Master Thesis, The Graduate School of Engineering and Sciences of Izmir Institute of Technology, June 2012.
[24] L. Cristofor and D. Simovici, "Generating an Informative Cover for Association Rules," Proceedings of the IEEE International Conference on Data Mining, 2002.
[25] M. Ashrafi, D. Taniar, and K. Smith, "A New Approach of Eliminating Redundant Association Rules," Lecture Notes in Computer Science, Vol. 3180, 2004, pp. 465-474.
[26] M. Ashrafi, D. Taniar, and K. Smith, "Redundant Association Rules Reduction Techniques," Lecture Notes in Computer Science, Vol. 3809, 2005, pp. 254-263.
[27] R.E. Thevar and R. Krishnamoorthy, "A New Approach of Modified Transaction Reduction Algorithm for Mining Frequent Itemset," ICCIT 2008, 11th Conference on Computer and Information Technology, 2008.
[28] F.H. AL-Zawaidah and Y.H. Jbara, "An Improved Algorithm for Mining Association Rules in Large Databases," World of Computer Science and Information Technology Journal (WCSIT), Vol. 1, No. 7, 2011, pp. 311-316.
[29] P.P. Wakabi-Waiswa, V. Baryamureeba, and K. Sarukesi, "Optimized Association Rule Mining with Genetic Algorithms," 7th International Conference on Natural Computation, IEEE, 2011, pp. 1116-1120.
[30] N. Jain and V. Sharma, "Distance Weight Parity Optimization of Association Rule Mining With Genetic Algorithm," International Journal of Engineering Research & Technology (IJERT), Vol.
1, Issue 8, October 2012, pp. 14.
[31] S. Rangaswamy and G. Shobha, "Optimized Association Rule Mining Using Genetic Algorithm," Journal of Computer Science Engineering and Information Technology Research (JCSEITR), Vol. 2, Issue 1, September 2012, pp. 1-9.
[32] K. Indira and S. Kanmani, "Performance Analysis of Genetic Algorithm for Mining Association Rules," International Journal of Computer Science Issues (IJCSI), Vol. 9, Issue 2, No. 1, March 2012, pp. 368-376.