Pattern Finding Techniques: A Review

Aditi Gupta
Department of Computer Science and Engineering
Krishna Institute of Engineering and Technology, Ghaziabad
Guptaaditi91@gmail.com

Seema Maitrey
Department of Computer Science and Engineering
Krishna Institute of Engineering and Technology, Ghaziabad
Seema.maitrey@gmail.com

ABSTRACT

Pattern mining is one of the most important steps of the data mining task and the knowledge discovery process. It consists of finding patterns in data, where a pattern can be a subsequence, a substructure, or a set of items. Finding the frequent patterns in data is especially important. The first research on this problem was done by Agrawal et al. (1993) [1] in the form of association rules. In this paper, we discuss the available methods and techniques for finding maximal frequent patterns. We start from the original work on association rules and then cover the many extensions that followed it.

INTRODUCTION

Many algorithms and techniques for finding patterns have been proposed by various researchers. Pattern finding is an important and crucial task in data mining and in the knowledge discovery in databases (KDD) process; pattern mining is one of the steps of that process. Several algorithms can find frequent patterns in large datasets efficiently, and a large body of research describes these methods. Finding frequent patterns is the most important of these tasks. Frequent patterns are patterns that occur frequently in a database; they can be subsequences, substructures, or sets of items. The first method for finding frequent patterns was given by Agrawal et al. (1993) [1] for market basket analysis. Frequent pattern mining is used in many tasks, such as classification, correlation analysis, and cluster analysis, and in various other areas of data mining. Agrawal et al. (1993) [1] introduced association rules for finding such patterns.
Their work analyzed transactions in market basket analysis. The analysis determines which items are frequently purchased together, so that a shopkeeper can arrange the shop accordingly, which leads to an increase in sales.

TECHNIQUES FOR PATTERN MINING

Association rules were first given by Agrawal et al. (1993) [1]. They rest on the concepts of support and confidence, and all of the mining is driven by a minimum support threshold: a pattern is frequent only if its support is at least the minimum support threshold. The basic Apriori algorithm for finding such patterns followed from this work, and various researchers later extended it.

BASIC ALGORITHMS

The Apriori algorithm and its extensions are the basic algorithms; they build on the association rule framework of Agrawal et al. (1993) [1]. Apriori first scans the database to determine the frequent 1-itemsets. It then repeatedly generates candidate k-itemsets from the frequent (k-1)-itemsets, checks the support of every candidate, and removes the candidates whose support is below the threshold, until no new frequent itemsets are found. This candidate generation is very costly and time consuming for large databases. The frequent pattern tree (FP-tree) was therefore proposed by Han et al. (2000) [2]. This approach achieves efficiency in three ways:

i. It uses a pattern-fragment growth approach, which avoids the cost of generating a large number of candidate sets on large databases.
ii. It uses a divide-and-conquer approach, which splits the mining task into smaller tasks.
iii. It compresses a large database into a compact data structure (the FP-tree), which avoids costly and time-consuming repeated database scans.

Han et al. report that FP-growth is about an order of magnitude faster than Apriori. The FP-tree construction process needs exactly two scans of the transaction database DB: the first collects the set of frequent items, and the second constructs the FP-tree.
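
The two-scan construction described above can be sketched in Python. This is a minimal illustration and not the implementation of Han et al.: the names `FPNode` and `build_fp_tree`, the `header` table of node-links, the alphabetical tie-breaking between equally frequent items, and the toy transactions are all assumptions made for the example.

```python
from collections import Counter


class FPNode:
    """One node of the FP-tree: an item, its count, and its children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_support):
    # Scan 1: count every item and keep only the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_support}

    root = FPNode(None)
    header = {item: [] for item in frequent}  # item -> list of its nodes

    # Scan 2: insert the frequent items of each transaction,
    # most frequent first (ties broken alphabetically, an assumption).
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:  # one step per frequent item of the transaction
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header


# Hypothetical market-basket data, used only for illustration.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "jam"],
]
root, header = build_fp_tree(transactions, min_support=2)
```

In this toy example the infrequent item `jam` never enters the tree, and the three transactions that contain `bread` share a single `bread` node under the root, which is where the compression comes from.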
The cost of inserting a transaction Trans into the FP-tree is O(|Trans|), where |Trans| is the number of frequent items in Trans. Han et al. show that the FP-tree contains the complete information needed for frequent pattern mining. After the FP-tree is constructed, the patterns are found by using the node-link property and the prefix-path property, which the FP-growth algorithm applies repeatedly.

FINDING MAXIMAL FREQUENT ITEMSETS

Association rules are used for finding relations between items in large databases. To find the itemsets, most algorithms use a levelwise bottom-up approach, or a combination of bottom-up and top-down search, and they work on the subset lattice (itemset lattice). For finding the frequent closed itemsets, a new approach, A-Close, was given by Pasquier et al. (1999) [3]. They introduced the closed itemset lattice, which is smaller than the itemset lattice and is also known as the concept lattice because it is closely related to the Galois lattice. Once the frequent closed itemsets are determined, all frequent itemsets can be derived from them easily, which reduces the computational cost. With A-Close we obtain a reduced set of frequent itemsets and a valid set of association rules.

Apriori finds the frequent itemsets with a bottom-up search that enumerates every frequent itemset one by one. Its complexity is therefore exponential in the length of the longest pattern, and it is practical only for short patterns. To solve this problem, the Max-Miner algorithm was proposed by Bayardo (1998) [4]; it finds the maximal frequent itemsets. Max-Miner is very successful because it uses "look-ahead" instead of a purely bottom-up search. It performs a breadth-first search of a set-enumeration tree, which reduces the number of passes over the data. Like Apriori, Max-Miner prunes on subset infrequency, and in addition it prunes on superset frequency.
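
The relation between frequent and maximal frequent itemsets, on which superset-frequency pruning relies, can be illustrated with a small brute-force sketch. It is not Max-Miner: it enumerates every itemset, which is feasible only for toy data, and the function names and transactions are assumptions made for the example.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration; only suitable for tiny examples."""
    items = sorted(set().union(*transactions))
    result = []
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            if support(candidate, transactions) >= min_support:
                result.append(candidate)
    return result


def maximal_itemsets(frequent):
    """Keep the frequent itemsets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]


# Hypothetical transactions, used only for illustration.
transactions = [frozenset(t) for t in (
    {"a", "b", "c"},
    {"a", "b", "c"},
    {"a", "b"},
    {"d"},
)]
frequent = frequent_itemsets(transactions, min_support=2)
maximal = maximal_itemsets(frequent)
```

Here seven itemsets are frequent but only `{a, b, c}` is maximal. Once a long itemset such as `{a, b, c}` is known to be frequent, all of its subsets are frequent as well and need not be counted separately, which is the saving that look-ahead aims for.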
Max-Miner uses item-ordering policies to increase the effectiveness of superset-frequency pruning. It uses the same data structure as Apriori, the hash tree, which is also used for finding the subsets of frequent itemsets. In the second pass, a 2-D array is used for fast computation of the support of itemsets.

PATTERNS FROM HIGH-DIMENSIONAL DATASETS

The growth of bioinformatics poses a great challenge for frequent itemset mining, because the running time of most pattern discovery algorithms depends exponentially on the average row length. Compared with transactional data, microarray data has a small number of rows (samples) but a large number of columns (genes). Algorithms whose running time grows exponentially with the average row length are therefore impractical for high-dimensional data. A new algorithm named CARPENTER was given by Pan et al. (2003) [5] for data with very many attributes and relatively few rows. CARPENTER discovers the frequent closed patterns by depth-first, row-wise enumeration. Unlike other algorithms, it searches by enumerating row sets. By imposing a total order, such as lexicographic order, on the row sets, it can search systematically for closed patterns.

The whole dataset is kept in a table, and this table is transposed: each tuple of the transposed table corresponds to an item (a column of the original table) and lists the rows that contain it. Infrequent items are removed while the table is transposed, which makes the transposed table smaller and easier to fit in main memory. CARPENTER recursively generates conditional transposed tables by using the subroutine MINEPATTERN. Pruning is then applied to stop useless traversal of the row enumeration tree.
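
The transposed table and the idea of enumerating row sets can be illustrated with a short sketch. It is a brute-force illustration under stated assumptions, not the CARPENTER algorithm: it tries every row set instead of doing a depth-first search with conditional transposed tables and pruning, and the names `transpose` and `closed_patterns_by_rows` and the toy rows are hypothetical.

```python
from itertools import combinations


def transpose(rows, min_support):
    """Map each item to the set of row ids that contain it,
    dropping items that occur in fewer than `min_support` rows."""
    table = {}
    for rid, items in rows.items():
        for item in items:
            table.setdefault(item, set()).add(rid)
    return {item: rids for item, rids in table.items()
            if len(rids) >= min_support}


def closed_patterns_by_rows(rows, min_support):
    """Enumerate row sets and collect the itemset shared by each one.
    The items common to a set of rows always form a closed pattern."""
    tt = transpose(rows, min_support)
    row_ids = sorted(rows)
    closed = {}
    for k in range(min_support, len(row_ids) + 1):
        for row_set in combinations(row_ids, k):
            common = frozenset(i for i, rids in tt.items()
                               if set(row_set) <= rids)
            if common:
                # Support = number of rows that contain every common item.
                supporting = set.intersection(*(tt[i] for i in common))
                closed[common] = len(supporting)
    return closed


# Hypothetical samples (rows) and genes (items), for illustration only.
rows = {1: {"g1", "g2", "g3"}, 2: {"g1", "g2"}, 3: {"g2", "g3"}}
closed = closed_patterns_by_rows(rows, min_support=2)
```

With three rows there are only four row sets of size two or more to examine, regardless of how many items each row holds. This is why row enumeration suits data with few rows and very many columns.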
Next came a row enumeration method with top-down search, which takes full advantage of the pruning power of the minimum support threshold to reduce the search space. Using this strategy, the TD-Close algorithm was given by Liu et al. (2006) [6]; it finds the frequent closed patterns. A closeness-checking method is also used to avoid scanning the data multiple times. Row enumeration is used to handle high-dimensional data, where it performs faster than column enumeration. Bottom-up search examines row combinations from the smallest to the largest, so it cannot make full use of the minimum support threshold; the top-down strategy is therefore combined with row enumeration. TD-Close takes less memory than other methods, whereas bottom-up search is costly in memory. The TD-Close algorithm first builds the transposed table from the original table. It then initializes the set of frequent closed patterns and sets ExcludedSize to 0. The TopDownMine method is then used to find the frequent itemsets; it applies different pruning strategies under different conditions and is called recursively.

FINDING STRUCTURAL PATTERNS

Many databases represent their data with structural information for better understanding. Finding interesting substructures that help interpret the data is an important task. A structure can be represented as a labeled graph. Different researchers have proposed many systems for this task, and researchers have also extended clustering systems to structured objects and data. For finding substructures there is the SUBDUE system, given by Holder et al. (1994) [8]. This system uses the minimum description length (MDL) principle. MDL is used in various fields of computer science, such as image processing and decision tree induction. The SUBDUE system takes a labeled graph as input.
Its discovery algorithm finds substructures while accounting for various costs, such as a distortion cost, which may vary up to a threshold value. The SUBDUE discovery algorithm starts with substructures that consist of a single vertex and iteratively discovers all possible larger substructures. The cost is checked at each step, and the MDL principle is used to find the best-suited substructure. Although the MDL principle is used, the SUBDUE system also uses background knowledge, expressed in the form of rules, to find substructures. Compactness and coverage are among the rules used in SUBDUE. After a substructure is found, each of its instances is replaced by a single vertex. The data is thereby compressed while keeping the relevant information, and repeating the process forms a hierarchy of concepts, which improves the performance of the SUBDUE system.

Finding patterns in chemical compounds is a very challenging task for medical science, and determining the carcinogenicity of a chemical compound is complex. This research is a significant application of data mining, because cancer is a leading cause of death. The toxicity of chemical compounds can be studied by finding patterns with the association rules of data mining. Dehaspe et al. (1998) [9] contributed to this problem. They used the concept of DATALOG and the algorithm WARMR, which performs a levelwise search.

FINDING PATTERNS BASED ON SOME CONSTRAINTS

Although frequent patterns can be found with association rules, users often want only specific patterns that satisfy some constraints. Mining under such constraints is known as constraint-based mining. Association rule mining suffers from problems such as lack of user exploration and control and lack of focus. An architecture for human-centered exploratory mining was given by Ng et al. (1998) [10].
This architecture has a set of constraint constructs, including domain and class constraints, and it introduces constrained association queries. The authors categorized the constraints according to the properties of anti-monotonicity and succinctness and then developed the mining algorithm CAP, which maximizes the degree of pruning for all categories. The architecture performs exploratory mining in two phases and is downward compatible. Algorithms such as APRIORI+ and HYBRID(m) can also be used for finding the frequent sets, but they do not fully exploit anti-monotonicity and succinctness, which is why the algorithm CAP was given.

Frequent itemsets or patterns are used for classification, so association rules are useful for finding patterns for classification purposes. In this context, Dong and Li (1999) [11] gave a useful pattern-finding method based on emerging patterns (EPs). EPs are itemsets whose support changes significantly from one dataset to another, and they are useful for building classifiers. The Apriori algorithm is not suitable for finding EPs, and finding them is too costly for large or high-dimensional databases. To solve this problem, Dong and Li (1999) [11] describe large collections of itemsets by their concise borders. Their EP mining algorithm works on these borders with a border-differential algorithm, and the borders themselves are found with the MAX-MINER algorithm. EPs are useful for finding patterns in business and demographic data, and they give significant results when applied to data with classes. An EP is defined by its growth rate, which is the ratio of the support of the itemset in one dataset D2 to its support in the other dataset D1.

A CART-based method [12] for finding emerging patterns is used on medical data such as microarray data. This CART-based method uses decision trees for finding the EPs, and Fisher's exact test is used to check the EPs. Maximum-likelihood linear discriminant analysis is also used.
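
The growth rate of an itemset can be computed directly from the two supports. The sketch below is only an illustration: the function names and toy datasets are assumptions, and the handling of zero supports follows the usual convention (0 when the itemset occurs in neither dataset, infinity when it occurs only in D2), which should be checked against [11].

```python
def support(itemset, dataset):
    """Fraction of the transactions in `dataset` that contain `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in dataset if itemset <= set(t)) / len(dataset)


def growth_rate(itemset, d1, d2):
    """Growth rate from d1 to d2: support in d2 divided by support in d1."""
    s1, s2 = support(itemset, d1), support(itemset, d2)
    if s1 == 0:
        # Assumed convention for zero support in d1.
        return 0.0 if s2 == 0 else float("inf")
    return s2 / s1


# Hypothetical datasets, used only for illustration.
d1 = [{"a", "b"}, {"a"}, {"b"}, {"c"}]
d2 = [{"a", "b"}, {"a", "b"}, {"a", "b"}, {"c", "e"}]

rate_ab = growth_rate({"a", "b"}, d1, d2)  # 0.75 / 0.25
rate_e = growth_rate({"e"}, d1, d2)        # occurs only in d2
```

An itemset is then treated as an emerging pattern from D1 to D2 when its growth rate is at least a chosen threshold; in this toy data `{a, b}` would qualify for any threshold up to 3.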
SEQUENTIAL PATTERN MINING

Sequential pattern mining was introduced by Agrawal and Srikant (1995) [13] and is an important challenge among data mining tasks. It is the mining of ordered events, not necessarily with a concrete notion of time, as patterns; customer shopping sequences are an example. Sequential pattern mining is done on a sequence database, where a sequence is a list of transactions ordered by time and each transaction contains a set of items. The problem is to find the sequential patterns that satisfy a minimum support threshold. These patterns can be used, for example, in medical science for detecting diseases. The approach of Agrawal and Srikant (1995) had some limitations:

i. Users want to specify the maximum and minimum time gaps between events, but this is not possible.
ii. There is a restriction that the items of an element must come from the same transaction.
iii. Item hierarchies are not supported, although users want patterns that include items from all levels of a hierarchy.

These problems are solved in Srikant and Agrawal (1996) [14] by using time constraints, sliding time windows, and hierarchies (taxonomies). For this purpose they gave the GSP (Generalized Sequential Patterns) algorithm. The earlier work [13] gave three algorithms, two of which are less useful because they find only maximal sequential patterns; the third, AprioriAll, is more useful than the other two. It is computationally expensive, and although it can be used with time constraints and taxonomies, it cannot handle sliding time windows. GSP is an extension of AprioriAll that handles all of these problems and is reported to scale linearly with the number of data sequences. In GSP, candidate sequences are generated in the same way as in the simple Apriori algorithm, but subject to the sliding window, time constraints, and taxonomies. It also uses a hash tree to reduce the work of counting candidates. Pei et al. (2001) [16] gave the PrefixSpan method for finding sequential patterns.

CONCLUSION

In this paper, we discussed the work of various researchers in the field of data mining on finding patterns.
However, not all available methods and techniques may be covered here, because so much research has been done since the first work on the topic. We hope this paper helps the reader understand pattern mining and its techniques. Much work remains to be done in the field of data mining.

REFERENCES

[1] Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM-SIGMOD international conference on management of data (SIGMOD'93), Washington, DC, pp 207–216
[2] Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: Proceedings of the 2000 ACM-SIGMOD international conference on management of data (SIGMOD'00), Dallas, TX, pp 1–12
[3] Pasquier N, Bastide Y, Taouil R, Lakhal L (1999) Discovering frequent closed itemsets for association rules. In: Proceedings of the 7th international conference on database theory (ICDT'99), Jerusalem, Israel, pp 398–416
[4] Bayardo RJ (1998) Efficiently mining long patterns from databases. In: Proceedings of the 1998 ACM-SIGMOD international conference on management of data (SIGMOD'98), Seattle, WA, pp 85–93
[5] Pan F, Cong G, Tung AKH, Yang J, Zaki M (2003) CARPENTER: finding closed patterns in long biological datasets. In: Proceedings of the 2003 ACM-SIGKDD international conference on knowledge discovery and data mining (KDD'03), Washington, DC, pp 637–642
[6] Liu H, Han J, Xin D, Shao Z (2006) Mining frequent patterns on very high dimensional data: a top-down row enumeration approach. In: Proceedings of the 2006 SIAM international conference on data mining (SDM'06), Bethesda, MD, pp 280–291
[7] Pei J, Han J, Mao R (2000) CLOSET: an efficient algorithm for mining frequent closed itemsets. In: Proceedings of the 2000 ACM-SIGMOD international workshop on data mining and knowledge discovery (DMKD'00), Dallas, TX, pp 11–20
[8] Holder LB, Cook DJ, Djoko S (1994) Substructure discovery in the SUBDUE system. In: Proceedings of the AAAI'94 workshop on knowledge discovery in databases (KDD'94), Seattle, WA, pp 169–180
[9] Dehaspe L, Toivonen H, King R (1998) Finding frequent substructures in chemical compounds. In: Proceedings of the 1998 international conference on knowledge discovery and data mining (KDD'98), New York, NY, pp 30–36
[10] Ng R, Lakshmanan LVS, Han J, Pang A (1998) Exploratory mining and pruning optimizations of constrained associations rules. In: Proceedings of the 1998 ACM-SIGMOD international conference on management of data (SIGMOD'98), Seattle, WA, pp 13–24
[11] Dong G, Li J (1999) Efficient mining of emerging patterns: discovering trends and differences. In: Proceedings of the 1999 international conference on knowledge discovery and data mining (KDD'99), San Diego, CA, pp 43–52
[12] Boulesteix A-L, Tutz G, Strimmer K (2003) A CART-based approach to discover emerging patterns in microarray data.
[13] Agrawal R, Srikant R (1995) Mining sequential patterns. In: Proceedings of the 1995 international conference on data engineering (ICDE'95), Taipei, Taiwan, pp 3–14
[14] Srikant R, Agrawal R (1996) Mining sequential patterns: generalizations and performance improvements. In: Proceedings of the 5th international conference on extending database technology (EDBT'96), Avignon, France, pp 3–17
[15] Han J, Cheng H, Xin D, Yan X. Frequent pattern mining: current status and future directions.
[16] Pei J, Han J, Mortazavi-Asl B, Pinto H, Chen Q, Dayal U, Hsu M-C (2001) PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth. In: Proceedings of the 2001 international conference on data engineering (ICDE'01), Heidelberg, Germany, pp 215–224