Frequent Pattern Mining Algorithms: A Comparative Study

Mr. Sahil Modak

Dept. of Computer Engineering,

Dwarkadas J. Sanghvi COE,

Vile Parle (W), Mumbai, India

sahilmodak1@gmail.com

Mr. Sagar Vikmani

Dept. of Computer Engineering,

Dwarkadas J. Sanghvi COE,

Vile Parle (W), Mumbai, India

sagar.vikmani@gmail.com

Prof. (Mrs.) Lynette D'mello

Dept. of Computer Engineering,

Dwarkadas J. Sanghvi COE,

Vile Parle (W), Mumbai, India

lopeslynn@gmail.com

ABSTRACT

Mining frequent patterns has been studied extensively in time-series, transaction, and many other kinds of databases, making it a central theme in data mining research. The objective of frequent pattern mining is to find frequently occurring subsets in a given collection of sets. Frequent pattern mining also arises as a sub-problem in various other fields of data mining, such as association rule discovery, classification, market analysis, clustering, and web mining. Various methods and algorithms have been proposed for mining frequent patterns.

The efficiency of these algorithms has long been a major issue and has captured the interest of a large research community. This paper presents a comparative study of three frequent pattern mining techniques: Apriori, FP-Growth and H-Mine.

A brief description of each algorithm is presented, and the techniques are compared on several important parameters, accompanied by results from a real implementation.

Keywords

Frequent Pattern Mining, Data Mining, Algorithm Comparison, Support, Itemsets.

1. INTRODUCTION


Data mining refers to the process of discovering meaningful, new and interesting correlations, trends and patterns by sifting through large amounts of data using pattern recognition as well as statistical and mathematical techniques. Finding association rules between items in databases of sales transactions is a core data mining task and one of its most widely studied techniques. Association rule mining is a popular method for so-called market basket analysis, which aims at discovering regularities in the shopping behaviour of customers of supermarkets, mail-order companies, on-line shops, etc. In particular, it tries to identify sets of products that are frequently bought together. A common strategy adopted by many association rule mining algorithms is to decompose the problem into two subtasks [8]:

1. Frequent itemset generation: the objective is to find all the itemsets that satisfy a minimum support threshold.

2. Rule generation: the objective is to extract, from the frequent itemsets found, all the rules that have a high confidence value.

Frequent itemsets: an itemset I is frequent if its support in the transactional database is greater than the specified 'support' threshold.

Support: A transaction T supports an itemset I if I is contained in transaction T. Support for an itemset I is defined as the ratio of the number of transactions containing I to the total number of transactions [9].


Confidence: confidence measures the reliability of the inference made by a rule [8]. For a given rule X → Y, the higher the confidence, the more likely it is for Y to be present in transactions that contain X.
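In symbols, for a transaction database D, these two measures can be written as:

    supp(I) = |{ T ∈ D : I ⊆ T }| / |D|

    conf(X ⇒ Y) = supp(X ∪ Y) / supp(X)

For example, if {beer, chips} appears in 3 of 10 transactions and {beer} in 5, then supp({beer, chips}) = 0.3 and conf(beer ⇒ chips) = 0.3 / 0.5 = 0.6.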

The association rule mining problem [8] can be formally stated as: "Given a set of transactions T, find all the rules having support ≥ minsup and confidence ≥ minconf," where minsup and minconf are the respective thresholds for support and confidence.

Of the two decomposed subtasks, the second step is straightforward (a short code sketch follows the steps):

1. For each frequent itemset p, generate all nonempty proper subsets of p.

2. For every nonempty subset s, output the rule s ⇒ (p − s) if confidence = support(p)/support(s) ≥ minconf.
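The following is a minimal Python sketch of this rule-generation step; the dictionary-based interface is an illustrative assumption, not a specific published implementation:

    from itertools import combinations

    def generate_rules(support, minconf):
        """Emit rules s => (p - s) with confidence >= minconf.

        `support` maps each frequent itemset (a frozenset) to its support,
        as produced by a frequent itemset miner. By the Apriori property,
        every subset of a frequent itemset is itself frequent, so its
        support is guaranteed to be present in the map.
        """
        rules = []
        for p, sup_p in support.items():
            if len(p) < 2:
                continue
            # Every nonempty proper subset s of p is a candidate antecedent.
            for r in range(1, len(p)):
                for s in map(frozenset, combinations(p, r)):
                    conf = sup_p / support[s]  # confidence = support(p)/support(s)
                    if conf >= minconf:
                        rules.append((set(s), set(p - s), conf))
        return rules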

However, the first step is much more difficult, so we focus on frequent pattern mining. The algorithmic aspects of Frequent Pattern Mining (FPM) have accordingly been explored very widely.

Frequent Pattern mining aims to solve the problem of finding relationship among items in a database. The problem [9] can be stated as: “Given a database D with transactions T1 . . . TN, determine all patterns P that are present in at least a fraction S of the transactions.”

This paper thus puts forth a review of three distinct approaches to solve this problem.

2. VARIOUS FREQUENT PATTERN MINING TECHNIQUES

2.1. Apriori algorithm

One of the earliest algorithms to emerge as a frequent itemset mining technique was Apriori, proposed by R. Agrawal and R. Srikant in 1994 [1]. It works on a horizontal-layout database and is one of the most basic join-based algorithms. The algorithm employs an iterative approach known as level-wise search and relies on an important property, the Apriori property (Apriori principle), to reduce the search space. The principle [8] states that if an itemset is frequent, then all of its subsets must also be frequent.

Conversely, if an itemset is infrequent, then all of its supersets must be infrequent too. For instance, if the pattern {beer, chips, nuts} is frequent, so is the subset {beer, chips}: every transaction containing {beer, chips, nuts} also contains {beer, chips}.

The Apriori pruning principle prevents the algorithm from generating or testing the supersets of any itemset that is found to be infrequent. Such a support-based pruning strategy for trimming the exponential search space is made possible by the anti-monotone property of the support measure, which states that the support of an itemset can never exceed the support of any of its subsets.
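Formally, for any two itemsets X and Y:

    X ⊆ Y  ⇒  supp(Y) ≤ supp(X)

so once supp(X) < minsup for some X, every superset Y of X can be pruned without counting its support.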

The pseudocode [8] for the part of the Apriori algorithm that generates the frequent itemsets is shown below. Let C_k and F_k denote the set of candidate k-itemsets and the set of frequent k-itemsets, respectively; σ(X) denotes the support count of itemset X and N the total number of transactions.

Initially, a single pass over the entire data set is made to determine the support of each item. After this step, the set of all frequent 1-itemsets, F_1, is known (steps 1-2). The algorithm then iteratively generates new candidate k-itemsets from the frequent (k − 1)-itemsets found in the previous iteration (step 5); candidate generation is implemented by the apriori-gen function.

1: k = 1.
2: F_k = { i | i ∈ I ∧ σ({i}) ≥ N × minsup }.
3: repeat
4:     k = k + 1.
5:     C_k = apriori-gen(F_{k-1}).
6:     for each transaction t ∈ T do
7:         C_t = subset(C_k, t).
8:         for each candidate itemset c ∈ C_t do
9:             σ(c) = σ(c) + 1.
10:        end for
11:    end for
12:    F_k = { c | c ∈ C_k ∧ σ(c) ≥ N × minsup }.
13: until F_k = ∅
14: Result = ∪_k F_k.

The algorithm then makes an additional pass over the data set to count the support of the candidates (steps 6-10). The subset function determines all the candidate itemsets in C_k that are contained in each transaction t. After counting their supports, the algorithm eliminates all candidate itemsets whose support count falls below the minsup threshold (step 12). The algorithm terminates when no new frequent itemsets are generated (step 13).

By iteratively pruning the candidate itemsets, the Apriori algorithm achieves good performance. Its drawback, however, is that it requires k scans of the data to find every frequent k-itemset. The algorithm therefore hits a bottleneck when the longest frequent itemset is relatively long, since it must then generate huge candidate sets, resulting in a dramatic performance decrease [6].

2.2. FP-growth algorithm

Frequent pattern growth, or FP-growth, is a tree-based algorithm for mining frequent patterns in a database, first presented by Han, Pei and Yin in 2000 [2]. FP-growth takes a radically different approach to discovering frequent itemsets: it does not subscribe to the generate-and-test paradigm of Apriori. Instead, it encodes the data set in a compact data structure called an FP-tree and extracts frequent itemsets directly from this structure by divide and conquer, without generating candidate itemsets. The FP-tree, an extension of the prefix-tree structure, stores only the frequent items. Every node in the tree holds the label and the frequency of an item. The paths from the root to the leaves are arranged according to the support values of the items, so that the frequency of a parent is greater than or equal to the sum of the frequencies of its children.

Two data scans are necessary to construct an FP-tree [5]. Each item's support is found in the initial scan. During the second scan, the previously calculated support values are used to sort the items within each transaction in descending order. If two transactions share a common prefix, the shared portion is merged and the frequencies of its nodes are incremented accordingly. Item links, connecting nodes with the same label, facilitate frequent pattern mining. Additionally, each FP-tree has a header table containing all frequent items and pointers to the beginning of their respective item links. FP-growth then partitions the FP-tree based on the prefixes.

Traversing the paths of the FP-tree recursively, FP-growth generates the frequent itemsets, concatenating pattern fragments to ensure that all frequent itemsets are generated.

Thus, apart from being faster, FP-growth sidesteps the expensive operations of generating and testing candidate itemsets, thereby overcoming the major disadvantages of the Apriori algorithm. The main cost of FP-growth lies in the recursive construction of the FP-trees [6]. Its advantages, however, are constrained: FP-growth is more effective on dense databases than on sparse ones. On a sparse database, the FP-tree achieves only a small compression and becomes bushy; consequently, FP-growth expends a lot of effort joining fragmented patterns only to find no frequent itemsets [7].
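As an illustration of the two-scan construction and the recursive mining over conditional pattern bases described above, here is a compact Python sketch; the class layout and function names are illustrative assumptions, and minsup is taken as an absolute count for simplicity:

    from collections import defaultdict

    class Node:
        """FP-tree node: item label, count, parent link, children by item."""
        def __init__(self, item, parent):
            self.item, self.count = item, 0
            self.parent, self.children = parent, {}

    def build_tree(transactions, minsup_count):
        # First scan: item supports; keep only frequent items.
        support = defaultdict(int)
        for t in transactions:
            for i in t:
                support[i] += 1
        freq = {i: s for i, s in support.items() if s >= minsup_count}
        root, header = Node(None, None), defaultdict(list)  # header: item links
        # Second scan: insert items in descending-support order;
        # shared prefixes are merged and their counts incremented.
        for t in transactions:
            items = sorted((i for i in t if i in freq),
                           key=lambda i: (-freq[i], i))
            node = root
            for i in items:
                if i not in node.children:
                    node.children[i] = Node(i, node)
                    header[i].append(node.children[i])
                node = node.children[i]
                node.count += 1
        return root, header, freq

    def fp_growth(transactions, minsup_count, suffix=frozenset()):
        """Recursively mine frequent itemsets from conditional pattern bases."""
        _, header, freq = build_tree(transactions, minsup_count)
        result = {}
        for item in freq:
            itemset = suffix | {item}
            result[itemset] = freq[item]
            # Conditional pattern base: the prefix paths ending at `item`,
            # each replicated by the count of the corresponding node.
            base = []
            for node in header[item]:
                path, p = [], node.parent
                while p is not None and p.item is not None:
                    path.append(p.item)
                    p = p.parent
                base.extend([path] * node.count)
            result.update(fp_growth(base, minsup_count, itemset))
        return result

Calling fp_growth(db, 2) on a list of transactions returns every itemset whose absolute support is at least 2, together with its count.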

2.3. H-Mine algorithm

The H-Mine algorithm was designed as an improvement over the FP-tree algorithm. It creates a projected database using in-memory pointers, employing a new data structure, the H-struct or Hyperlinked structure, for mining. The H-struct contains the frequent-item projections of the transaction database (TDB) [3]. Each item in a frequent-item projection is represented by two fields: an item identification number and a hyperlink. The H-struct also contains a header table, which holds the frequent items in an array arranged in F-list order; each entry in the header table carries a support count and a hyperlink.

Figure 1: Hyperlinked structure example

The hyper-structure shown in Figure 1 is an example of a Hyperlinked structure. For every transaction, the H-struct stores a frequent-item projection, together with a header table whose size is bounded by the number of frequent items. This is all the space the H-struct requires; its space requirement is therefore Σ_{trans ∈ TDB} |freq(trans)|, where TDB is the transaction database and freq(trans) is the frequent-item projection of a transaction trans. Moreover, only two scans of the transaction database are required to build a Hyperlinked structure.
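The following is a minimal sketch of how the frequent-item projections and the hyperlinked header table could be laid out; the names and the (row, position) representation of the hyperlinks are illustrative assumptions, not the authors' data layout:

    from collections import defaultdict

    def build_hstruct(transactions, minsup_count):
        """Sketch of an H-struct: frequent-item projections plus a header
        table whose entries hold a support count and 'hyperlinks' (here,
        (row, position) references into the projections)."""
        support = defaultdict(int)
        for t in transactions:
            for i in t:
                support[i] += 1
        # F-list: frequent items in descending support order.
        flist = sorted((i for i in support if support[i] >= minsup_count),
                       key=lambda i: (-support[i], i))
        rank = {i: r for r, i in enumerate(flist)}
        # Frequent-item projections: each transaction reduced to its
        # frequent items, arranged in F-list order.
        projections = [sorted((i for i in t if i in rank), key=rank.get)
                       for t in transactions]
        header = {i: {"support": support[i], "links": []} for i in flist}
        for row, proj in enumerate(projections):
            for pos, item in enumerate(proj):
                header[item]["links"].append((row, pos))
        return projections, header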

The steps of the algorithmic implementation of H-Mine can be summarized as follows; a simplified code sketch of this partition-and-validate strategy follows the list:

1. The transaction database is scanned once to find the complete set of frequent items L.

2. The transaction database is partitioned into k parts, TDB_1, ..., TDB_k, in such a way that, for each TDB_i (1 ≤ i ≤ k), the frequent-item projections of TDB_i fit in main memory.

3. Using H-Mine, frequent patterns are mined in each TDB_i with respect to the threshold minsup_i = ⌈minsup × n_i / n⌉, for i = 1 to k, where n is the number of transactions in TDB and n_i the number in TDB_i. Let F_i denote the set of frequent patterns found in TDB_i.

4. Let F = F_1 ∪ ... ∪ F_k. TDB is scanned one more time to collect the support of every pattern in F, and the patterns whose support meets the minsup threshold are output.
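Steps 2-4 can be sketched as follows; the per-partition miner is abstracted away (any in-memory frequent itemset miner, such as the Apriori sketch above wrapped to accept an absolute count, could stand in for H-struct mining), and minsup is taken as a fraction so that the local threshold ⌈minsup × n_i⌉ matches step 3:

    import math

    def hmine_partitioned(transactions, minsup, k, mine_partition):
        """Sketch of H-Mine steps 2-4 (minsup is a fraction).
        `mine_partition(db, minsup_count)` -> {frozenset: count} stands in
        for mining one in-memory partition with the H-struct."""
        n = len(transactions)
        size = math.ceil(n / k)
        parts = [transactions[j:j + size] for j in range(0, n, size)]
        # Steps 2-3: mine each partition with its scaled local threshold,
        # minsup_i = ceil(minsup * n_i); collect the union F of all F_i.
        candidates = set()
        for part in parts:
            local = mine_partition(part, math.ceil(minsup * len(part)))
            candidates |= set(local)
        # Step 4: one more scan of the whole database to collect the
        # global support of every pattern in F; keep the truly frequent.
        counts = {c: 0 for c in candidates}
        for t in map(frozenset, transactions):
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: cnt for c, cnt in counts.items() if cnt >= n * minsup}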

The H-Mine algorithm has polynomial space complexity and is therefore more space-efficient than the FP-Growth algorithm. It is also designed for fast mining. For large databases, H-Mine first partitions the database, then mines each partition in main memory using the Hyperlinked structure, and finally consolidates the globally frequent patterns (GFP). If the data set is dense, H-Mine integrates with the FP-Growth algorithm, detecting the swapping condition dynamically and constructing the FP-tree.

This integration makes H-Mine scalable for both large and medium-sized databases, as well as for both sparse and dense data sets. The major advantage of using in-memory pointers is that the projected databases themselves occupy no additional memory; space is needed only for storing the in-memory pointers. However, since the H-struct, unlike the FP-tree, performs no compression, H-Mine is not as competitive as FP-growth on dense databases [7].

3. COMPARISON OF FREQUENT PATTERN MINING TECHNIQUES

The three FPM algorithms are compared against several parameters. The experimental results show how sensitive each algorithm is to changes in the user parameter (support) with respect to time and memory consumption.

Mushroom and Chess are the datasets [7] used in the run-time and memory experiments. The datasets are described below:

Table 1: Description of the datasets

Dataset     Number of Instances    Number of Attributes
Chess       67557                  42
Mushroom    8124                   22

Chess - this database contains all legal 8-ply positions in the game of connect-4 in which neither player has won yet and the next move is not forced. The outcome is an abstract value assigned to the first player. Apriori, H-Mine and FP-growth run out of time when the minsup value falls below 60%, 40% and 20%, respectively.

Mushroom - this data set contains descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota families, each categorized as either edible or poisonous based on physical characteristics.

Table 6: Summary of frequent pattern mining algorithms

Apriori
    Description: Uses the Apriori property with the join-and-prune method.
    Data structure: Tree
    Advantages: Easy to implement; suitable for both sparse and dense databases when mining frequent patterns.
    Disadvantages: Large number of database scans; space and time complexity are high.

FP-Growth
    Description: Constructs a conditional frequent-pattern tree and conditional pattern base from the database, which satisfy the minimum support.
    Data structure: Tree
    Advantages: Uses only two scans of the database; suitable for large and medium datasets.
    Disadvantages: Recursive construction of the FP-trees and a complex data structure are required.

H-mine
    Description: Partitions and projects the database; uses hyperlink pointers to store the projected database in main memory.
    Data structure: Array
    Advantages: Better memory utilization; suitable for both sparse and dense datasets.
    Disadvantages: Time required is larger than for the others because of the partitioning of the database.

4. EXPERIMENTAL RESULTS

1. Execution time analysis for the Mushroom dataset:

Figure 2: Graph of comparison for Mushroom dataset (execution time in milliseconds vs. support in %, for Apriori, FP-Growth and H-Mine)

Table 2: Result of comparison for Mushroom dataset

Support (%)    Execution Time (in milliseconds)
               Apriori    FP-Growth    H-Mine
10             141710     349          13076
20             6793       31           2254
30             547        24           328
40             125        15           140
50             63         62           47
60             39         47           16
70             32         32           32
80             32         31           15
90             31         15           16
100            25         15           0

2. Memory usage analysis for the Mushroom dataset:

Figure 3: Graph of comparison for Mushroom dataset (memory usage in megabytes vs. support in %, for Apriori, FP-Growth and H-Mine)

Table 3: Result of comparison for Mushroom dataset

Support (%)    Memory usage (in megabytes)
               Apriori    FP-Growth    H-Mine
10             273.43     151.41       355.75
20             143.65     198.46       328.19
30             164        235.43       297.16
40             182.58     268.54       251.27
50             200.35     270.09       288.12
60             218.11     78.35        208.70
70             235.86     166.19       164.89
80             16.13      197.24       141.63
90             33.71      132.85       116.05
100            0          139.07       153.02

3. Execution time analysis for the Chess dataset:

Figure 4: Graph of comparison for Chess dataset (execution time in milliseconds vs. support in %, for Apriori, FP-Growth and H-Mine)

Table 4: Result of comparison for Chess dataset

Support (%)    Execution Time (in milliseconds)
               Apriori    FP-Growth    H-Mine
10             -          -            -
20             -          283663       -
30             -          36328        -
40             -          6039         325988
50             -          1347         77544
60             81881      313          18595
70             10381      78           3953
80             1640       31           766
90             78         16           93
100            32         0            15

4. Memory usage analysis for the Chess dataset:

Figure 5: Graph of comparison for Chess dataset (memory usage in megabytes vs. support in %, for Apriori, FP-Growth and H-Mine)

Table 5: Result of comparison for Chess dataset

Support (%)    Memory usage (in megabytes)
               Apriori    FP-Growth    H-Mine
10             -          -            -
20             -          115.08       -
30             -          212.79       -
40             -          166.33       375.15
50             -          140.13       375.39
60             321.44     204.92       342.05
70             194.68     271.58       334.68
80             220.95     326.45       333.38
90             137.33     162.59       249.07
100            0          112.63       90.69

5. CONCLUSION

Association rule mining and frequent pattern mining are currently wide-open fields for researchers owing to their broad applicability. Association rule mining has a wide range of applications, such as cross marketing, market basket analysis, medical diagnosis and research, website navigation analysis, fraud detection, and so on. Numerous techniques have been put forward for mining frequent patterns. This paper presented an overview of three diverse frequent pattern mining techniques that can be used in different ways to generate frequent itemsets.

Every method has its own pros and cons, and the performance of a specific technique depends on the available resources and the input data. The analysis of the graphs shows that the FP-growth algorithm performs exceedingly well provided the dataset is dense; as soon as the dataset becomes sparse, Apriori outperforms FP-growth. Execution time decreases as the support threshold increases. The time performance of FP-growth and H-Mine is approximately the same for higher support values. The memory figures, however, are erratic and vary with the number of itemsets in a transaction and the support value. The algorithms were compared by implementing them on a Java-based platform; the results could vary depending on the tools, the programming language used and the base machine's architectural specifications.

REFERENCES

[1] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases", Research Report RJ 9839, IBM Almaden Research Center, San Jose, California, June 1994.

[2] J. Han, J. Pei, Y. Yin and R. Mao, "Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach", Data Mining and Knowledge Discovery, 8(1): 53-87, 2004.

[3] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang and D. Yang, "H-Mine: Fast and Space-Preserving Frequent Pattern Mining in Large Databases", IIE Transactions, 39(6), June 2007, Taylor & Francis.

[4] S. Singh and D. Kumar, "Frequent Pattern Mining Algorithms: A Review".

[5] D. Garg and H. Sharma, "Comparative Analysis of Various Approaches Used in Frequent Pattern Mining", (IJACSA) International Journal of Advanced Computer Science and Applications, Special Issue on Artificial Intelligence.

[6] C. Aggarwal and J. Han, "Frequent Pattern Mining".

[7] Frequent Itemset Mining Implementations Repository, http://fimi.ua.ac.be/data

[8] P.-N. Tan, M. Steinbach and V. Kumar, "Introduction to Data Mining".

[9] A. Mittal, A. Nagar, K. Gupta and R. Nahar, "Comparative Study of Various Frequent Pattern Mining Algorithms", International Journal of Advanced Research in Computer and Communication Engineering, 4(4), April 2015.
