
International Journal of Engineering Trends and Technology (IJETT) – Volume 12 Number 3 - Jun 2014

High Utility Item Set Mining

Mr. Hemant Narorottam Chaudhari #1, Mr. Gajendra Singh Chandel *2, Mr. Kailas Patidar #3

1 M.Tech (Software Engineering) Student, Sri Satya Sai Institute of Science and Technology, Sehore (M.P.), India.

2 HOD (CS/IT), Sri Satya Sai Institute of Science and Technology, Sehore (M.P.), India.

3 Assistant Professor, Sri Satya Sai Institute of Science and Technology, Sehore (M.P.), India.

Abstract— Data mining can be defined as an activity that extracts new, nontrivial information contained in large databases. Traditional data mining techniques have focused largely on detecting statistical correlations between the items that are more frequent in transaction databases. Also termed frequent itemset mining, these techniques are based on the rationale that itemsets which appear more frequently must be of more importance to the user from a business perspective. In this paper we throw light upon an emerging area called utility mining, which considers not only the frequency of the itemsets but also the utility associated with them. The term utility refers to the importance or usefulness of the appearance of the itemset in transactions, quantified in terms such as profit, sales or any other user preference.

Keywords— Data mining, High Utility Item Set Mining, Frequent Itemset Mining, Utility Associated, KDD.

I. INTRODUCTION

Data mining is an important part of knowledge discovery in databases (KDD). It refers to discovering knowledge in huge amounts of data, and it generally involves four classes of task: classification, clustering, regression, and association rule learning. It is a scientific discipline concerned with analysing observational data sets in order to find unsuspected relationships and to summarise the data in novel ways that the owner can understand and use. As a field of study, data mining merges ideas from many domains rather than forming a pure discipline. The four main disciplines [7] that contribute to data mining are:

• Statistics: it provides tools for measuring the significance of the given data, estimating probabilities and many other tasks (e.g. linear regression).

• Machine learning: it provides algorithms for inducing knowledge from given data (e.g. SVM).

• Data management and databases: since data mining deals with very large volumes of data, an efficient way of accessing and maintaining data is necessary.

• Artificial intelligence: it contributes to tasks involving knowledge encoding or search techniques (e.g. neural networks).

Figure 1: Data Mining

Utility Mining

The limitations of frequent or rare itemset mining motivated researchers to conceive a utility based mining approach, which allows a user to conveniently express his or her perspectives concerning the usefulness of itemsets as utility values and then find itemsets with utility values higher than a threshold. In utility based mining the term utility refers to the quantitative representation of user preference, i.e. the utility value of an itemset is the measurement of the importance of that itemset from the user's perspective. For example, if a sales analyst involved in some retail research needs to find out which itemsets earn the maximum sales revenue for the stores, he or she will define the utility of any itemset as the monetary profit that the store earns by selling each unit of that itemset.

Note that the sales analyst is not interested in the number of transactions that contain the itemset; he or she is only concerned with the revenue generated collectively by all the transactions containing the itemset. In practice the utility value of an itemset can be profit, popularity, page rank, a measure of some aesthetic aspect such as beauty or design, or some other measure of the user’s preference.

Formally, an itemset S is useful to a user if it satisfies a utility constraint, i.e. any constraint of the form u(S) >= minutil, where u(S) is the utility value of the itemset and minutil is a utility threshold defined by the user. In our example, if we take the utility of an itemset to be the unit profit associated with the sale of that itemset and set the utility threshold minutil = 500, then the itemset ABC, with a utility value of 555, is of interest to the user even though its support value is just 20%. When computing the total utility of an itemset S, we multiply the utility value of each item in S by the corresponding frequency of that item in the transactions that contain S. The utility based mining approach can therefore be said to measure the significance of an itemset along two dimensions. The first dimension is the support value of the itemset, i.e. its frequency, and the second is the semantic significance of the itemset as measured by the user. Recent work has highlighted the importance of constraint based itemset mining, in which the user has the privilege to specify his or her preferences by defining constraints that capture the semantic significance of the itemset in the intended application domain.

ISSN: 2231-5381 http://www.ijettjournal.org

II. EXISTING TECHNIQUES

Extensive studies have addressed the mining of frequent itemsets. One of the best-known algorithms is the Apriori algorithm [1], the pioneering method for efficiently mining association rules from large databases. Tree-based approaches such as FP-Growth [3] were proposed afterward. It is widely recognized that FP-Growth achieves better performance than Apriori-based approaches, since it finds frequent itemsets without generating any candidate itemsets and scans the database only twice.

However, in the framework of frequent itemset mining [1, 3], the importance of items to users is not considered. The unit profits and purchased quantities of the items are not taken into consideration. Thus, some methods were proposed for mining high utility itemsets from databases, such as UMining [6], Two-Phase [5], IIDS [4] and IHUP [2].

The UMining algorithm [6], proposed by Yao et al., uses an estimation method to prune the search space. Although it is shown to have good performance, it cannot capture the complete set of high utility itemsets, since some high utility patterns may be pruned during the process.

The Two-Phase algorithm [5], proposed by Liu et al., consists of two phases. In phase I, it employs a breadth-first search strategy to enumerate high transaction-weighted utilization itemsets (HTWUIs). It generates candidate itemsets of length k from the HTWUIs of length (k-1) and prunes candidates using the transaction-weighted downward closure (TWDC) property. In each pass, the HTWUIs and their estimated utility values, i.e. their transaction-weighted utilizations (TWUs), are computed by scanning the database. At the end of phase I, the complete set of HTWUIs has been collected. In phase II, the high utility itemsets and their utilities are identified from the HTWUIs by scanning the original database once.
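The level-wise search of phase I can be illustrated with a minimal Python sketch. It assumes that each transaction maps items to purchased quantities and that the TWU of an itemset is the sum of the transaction utilities of all transactions containing it. The names `db`, `profit`, `twu` and `phase_one`, and the data itself, are hypothetical and not taken from [5].

```python
from itertools import combinations

# Hypothetical data: unit profit per item, and transactions as item -> quantity.
profit = {"A": 5, "B": 2, "C": 1, "D": 10}
db = [
    {"A": 1, "B": 2},
    {"A": 2, "C": 3},
    {"B": 1, "C": 1, "D": 1},
    {"A": 1, "B": 1, "C": 2},
]


def transaction_utility(t):
    """Sum of profit * quantity over all items of one transaction."""
    return sum(profit[i] * q for i, q in t.items())


def twu(itemset):
    """TWU: total utility of the transactions that contain the itemset."""
    return sum(transaction_utility(t) for t in db if itemset.issubset(t))


def phase_one(min_util):
    """Level-wise enumeration of all itemsets whose TWU is at least min_util."""
    items = sorted({i for t in db for i in t})
    level = [frozenset([i]) for i in items if twu(frozenset([i])) >= min_util]
    htwuis = list(level)
    k = 2
    while level:
        previous = set(level)
        # Join step: unions of two (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # TWDC pruning: every (k-1)-subset of a candidate must be an HTWUI.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, k - 1))
        }
        # For brevity twu() rescans db per candidate; a real implementation
        # would count all candidates of one level in a single database scan.
        level = [c for c in candidates if twu(c) >= min_util]
        htwuis.extend(level)
        k += 1
    return htwuis


htwuis = phase_one(min_util=15)
```

With a threshold of 15, item D is dropped at the first level (its TWU is 13), so no superset of D is ever generated. The candidate ABC survives the subset check but is rejected because its TWU is only 9.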

Although the Two-Phase algorithm effectively reduces the search space through the TWDC property and captures the complete set of high utility itemsets, it still generates too many candidates for HTWUIs and requires multiple database scans.

To overcome this problem, Li et al. [4] proposed an isolated items discarding strategy, abbreviated as IIDS, to reduce the number of candidates. By pruning isolated items during the level-wise search, the number of candidate itemsets for HTWUIs in phase I can be reduced effectively. However, this approach still scans the database multiple times and uses a candidate generation-and-test scheme to find high utility itemsets.

To generate HTWUIs efficiently in phase I and avoid scanning the database multiple times, Ahmed et al. [2] proposed a tree-based algorithm, called IHUP, for mining high utility itemsets. They use an IHUP-Tree to maintain the information of high utility itemsets and transactions. Every node in the IHUP-Tree consists of an item name, a support count, and a TWU value. The framework of the algorithm consists of three steps: (1) the construction of the IHUP-Tree, (2) the generation of HTWUIs, and (3) the identification of high utility itemsets. In step 1, the items in each transaction are rearranged in a fixed order such as lexicographic order, support descending order or TWU descending order. Then, the rearranged transactions are inserted into the IHUP-Tree. In step 2, HTWUIs are generated from the IHUP-Tree by applying the FP-Growth algorithm [3]. Thus, HTWUIs in phase I can be found more efficiently, without generating candidates for HTWUIs. In step 3, high utility itemsets and their utilities are identified from the set of HTWUIs by scanning the original database once.
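Step 1, the tree construction, can be sketched as follows. This is an illustrative simplification rather than the data structure of [2]: it stores only the three node fields named above plus child links, uses TWU descending order with ties broken lexicographically, and omits the header table and node links that a pattern-growth step would need. The names `Node`, `build_tree`, `db` and `profit`, and the data, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """One tree node: item name, support count, TWU value, and children."""
    item: str
    support: int = 0
    twu: int = 0
    children: Dict[str, "Node"] = field(default_factory=dict)


def build_tree(db: List[Dict[str, int]], profit: Dict[str, int]) -> Node:
    # First scan: transaction utility of each transaction, then TWU per item.
    tus = [sum(profit[i] * q for i, q in t.items()) for t in db]
    item_twu: Dict[str, int] = {}
    for t, tu in zip(db, tus):
        for i in t:
            item_twu[i] = item_twu.get(i, 0) + tu

    # Second scan: reorder each transaction and insert it as a path.
    root = Node(item="")
    for t, tu in zip(db, tus):
        ordered = sorted(t, key=lambda i: (-item_twu[i], i))
        node = root
        for i in ordered:
            node = node.children.setdefault(i, Node(item=i))
            node.support += 1   # one more transaction shares this prefix
            node.twu += tu      # add the whole transaction's utility
    return root


# Hypothetical data: unit profits and transactions as item -> quantity.
profit = {"A": 5, "B": 2, "C": 1, "D": 10}
db = [
    {"A": 1, "B": 2},
    {"A": 2, "C": 3},
    {"B": 1, "C": 1, "D": 1},
    {"A": 1, "B": 1, "C": 2},
]
root = build_tree(db, profit)
```

In this data the item TWUs are C = 35, A = 31, B = 31 and D = 13, so three of the four transactions share the prefix C and the root ends up with only two children, C and A.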

Although IHUP finds HTWUIs without generating any candidates for HTWUIs and achieves better performance than IIDS and Two-Phase, it still produces too many HTWUIs in phase I. Note that IHUP and Two-Phase produce the same number of HTWUIs in phase I, since both use the transaction-weighted utilization mining model [5] to overestimate the utilities of the itemsets. However, this model may overestimate the utility of too many low utility itemsets, classifying them as HTWUIs and producing too many candidate itemsets in phase I. Such a large number of HTWUIs degrades the mining performance in phase I in terms of execution time and memory consumption. Besides, the number of HTWUIs in phase I also affects the performance of the algorithms in phase II, since the more HTWUIs are generated in phase I, the more execution time is required to identify high utility itemsets in phase II.

As stated above, the number of HTWUIs generated in phase I is a crucial problem for the performance of the algorithms. In view of this, we propose four strategies to reduce the estimated utility values of the itemsets. By applying the proposed strategies, the number of candidates generated in phase I can be reduced effectively and the high utility itemsets can be identified more efficiently, since the number of itemsets that need to be checked in phase II is greatly reduced in phase I.

III. PROPOSED APPROACH

We propose a novel technique for high utility item set mining. The new algorithm is intended to outperform the previous algorithms in terms of execution time.

The terminology used is as follows.

Definition 1. A frequent itemset is a set of items that appears in at least a pre-specified number of transactions. Formally, let I = {i1, i2, ..., im} be a set of items and DB = {T1, T2, ..., Tn} a set of transactions, where every transaction is also a set of items (i.e. an itemset).

Definition 2. The utility of an item ip is a numerical value yp defined by the user. It is transaction independent and reflects the importance (usually the profit) of the item. These external utilities are stored in a utility table.

Definition 3. The utility of an item set X in a transaction Ti is denoted by U(X, Ti). Following the standard formulation of [5], with q(ip, Ti) denoting the purchased quantity of item ip in Ti, it is calculated for X ⊆ Ti as

$$U(X, T_i) = \sum_{i_p \in X} y_p \cdot q(i_p, T_i)$$

Definition 4. The utility of an item set X in DB is denoted by U(X) and is calculated as

$$U(X) = \sum_{T_i \in DB,\; X \subseteq T_i} U(X, T_i)$$

Definition 5. An itemset is called a high utility itemset if its utility is no less than a user-specified minimum utility threshold, denoted min_util. Otherwise, it is called a low utility itemset.

Definition 6. The transaction utility of a transaction Td is denoted by TU(Td) and defined as U(Td, Td).
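Definitions 3, 4 and 6 can be made concrete with a short Python sketch. It assumes, in line with the description in the introduction, that the utility of an item in a transaction is its external utility multiplied by the quantity purchased in that transaction. The function names, the dictionary-based transaction format and the data are hypothetical choices made for illustration.

```python
from typing import Dict, Iterable, List

Transaction = Dict[str, int]  # item -> purchased quantity (assumed format)


def utility_in_transaction(itemset: Iterable[str], t: Transaction,
                           profit: Dict[str, int]) -> int:
    """U(X, T): utility of item set X in T, or 0 if T does not contain X."""
    items = list(itemset)
    if not all(i in t for i in items):
        return 0
    return sum(profit[i] * t[i] for i in items)


def utility(itemset: Iterable[str], db: List[Transaction],
            profit: Dict[str, int]) -> int:
    """U(X): sum of U(X, T) over all transactions T in the database."""
    items = list(itemset)
    return sum(utility_in_transaction(items, t, profit) for t in db)


def transaction_utility(t: Transaction, profit: Dict[str, int]) -> int:
    """TU(T) = U(T, T): utility of all items of the transaction."""
    return utility_in_transaction(t.keys(), t, profit)


# Hypothetical utility table and transaction database.
profit = {"A": 5, "B": 2, "C": 1, "D": 10}
db = [
    {"A": 1, "B": 2},
    {"A": 2, "C": 3},
    {"B": 1, "C": 1, "D": 1},
    {"A": 1, "B": 1, "C": 2},
]

u_ab = utility({"A", "B"}, db, profit)    # (5*1 + 2*2) + (5*1 + 2*1)
tu_t3 = transaction_utility(db[2], profit)  # 2*1 + 1*1 + 10*1
```

Under Definition 5 with min_util = 15, the item set {A, B} in this data would be a high utility itemset, whereas {B} alone would not.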

The outline of the proposed algorithm is as follows:

Step 1: An initial list of candidate high utility item sets is generated as follows:

• Transaction utility (TU): the transaction utility of a transaction is the sum of the utilities of all items in that transaction.

• Weighted transaction utility of an item set: the weighted transaction utility of an item set is obtained by adding up the transaction utilities of all transactions containing that item set.

• Only those item sets whose weighted transaction utility is no less than the minimum utility are included in the initial list.

Step 2: In this step, the final list of high utility item sets is generated by eliminating the low utility item sets from the list of step 1. It is performed as follows:

• An item set is chosen from the list of step 1.

• If the utility of the item set is less than the minimum utility (min_util), the item set is discarded. Otherwise, the item set is added to the final list of high utility item sets.

The process continues until the list of step 1 is empty.
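The two steps can be sketched in Python as follows. The outline does not say how item sets are enumerated, so this sketch assumes the simplest option, enumerating every non-empty subset of each transaction. That is exponential in transaction length and is meant only as an illustration, not as the authors' implementation. The function name `mine_high_utility`, the transaction format (item to quantity) and the data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, FrozenSet, List


def mine_high_utility(db: List[Dict[str, int]], profit: Dict[str, int],
                      min_util: int) -> Dict[FrozenSet[str], int]:
    # Step 1: accumulate the weighted transaction utility of every item set
    # that occurs in at least one transaction.
    wtu: Dict[FrozenSet[str], int] = defaultdict(int)
    for t in db:
        tu = sum(profit[i] * q for i, q in t.items())  # transaction utility
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                wtu[frozenset(combo)] += tu
    candidates = [x for x, w in wtu.items() if w >= min_util]

    # Step 2: take item sets off the list one by one and keep only those
    # whose exact utility reaches the threshold.
    final: Dict[FrozenSet[str], int] = {}
    while candidates:
        x = candidates.pop()
        u = sum(sum(profit[i] * t[i] for i in x)
                for t in db if x.issubset(t))
        if u >= min_util:
            final[x] = u
    return final


# Hypothetical utility table and transaction database.
profit = {"A": 5, "B": 2, "C": 1, "D": 10}
db = [
    {"A": 1, "B": 2},
    {"A": 2, "C": 3},
    {"B": 1, "C": 1, "D": 1},
    {"A": 1, "B": 1, "C": 2},
]
result = mine_high_utility(db, profit, min_util=15)
```

On this data, step 1 keeps six candidates (A, B, C, AB, AC, BC) and step 2 discards B, C and BC, whose exact utilities are 8, 6 and 7. Because the weighted transaction utility of an item set is never smaller than its utility, filtering on it in step 1 cannot remove a true high utility item set.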

IV. CONCLUSIONS

In high utility itemset mining the objective is to identify itemsets that have utility values above a given utility threshold. In this paper we presented a literature review of the present state of research and the various algorithms for high utility itemset mining. We also proposed a new algorithm for high utility item set mining, which is intended to outperform previous algorithms in terms of execution time.

REFERENCES

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th Int'l Conf. on Very Large Data Bases, pp. 487-499, 1994.

[2] C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong, and Y.-K. Lee. Efficient tree structures for high utility pattern mining in incremental databases. In IEEE Transactions on Knowledge and Data Engineering, Vol. 21, Issue 12, pp. 1708-1721, 2009.

[3] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. of the ACM-SIGMOD Int'l Conf. on Management of Data, pp. 1-12, 2000.

[4] Y.-C. Li, J.-S. Yeh, and C.-C. Chang. Isolated items discarding strategy for discovering high utility itemsets. In Data & Knowledge Engineering, Vol. 64, Issue 1, pp. 198-217, Jan. 2008.

[5] Y. Liu, W. Liao, and A. Choudhary. A fast high utility itemsets mining algorithm. In Proc. of the Utility-Based Data Mining Workshop, 2005.

[6] H. Yao, H. J. Hamilton, and L. Geng. A unified framework for utility-based measures for mining itemsets. In Proc. of ACM SIGKDD 2nd Workshop on Utility-Based Data Mining, pp. 28-37, USA, Aug. 2006.

[7] S.-J. Yen and Y.-S. Lee. Mining high utility quantitative association rules. In Proc. of 9th Int'l Conf. on Data Warehousing and Knowledge Discovery, Lecture Notes in Computer Science 4654, pp. 283-292, Sep. 2007.