International Journal of Engineering Trends and Technology - Volume 3 Issue 3 - 2012

Survey Methods on Measuring Minimum Threshold for Association Rules

Pinnaboyina Tejaswi #1, Dhulipalla Navya #1 and D. Radharani #2
#1 Student, M.Tech (CSE), Green Fields, K.L. University, Vaddeswaram, India.
#2 Assistant Professor (CSE), Green Fields, K.L. University, Vaddeswaram, India.

Abstract: Association rules are used to extract information from large databases by discovering frequently occurring items. Rules are discovered with respect to a minimum threshold value, and this threshold can be calculated using statistical measures such as support, confidence, lift, and leverage. These measures are discussed in this paper.

Keywords: Minimum threshold value, Analyst, Support, Confidence, Lift, Leverage, Correlation, Conditional Probability

I. INTRODUCTION

Retrieving the most useful information from a database helps market analysts learn which items occur most frequently, and with this analysis they can make decisions about the items in their market database. Rules are extracted using a minimum threshold value, which can be set using measures such as support, confidence, lift, leverage, conditional probability, and correlation.
The minimum threshold must be set accurately, and setting it raises two problems. If the threshold is set too low, inconsistencies in the database may surface in the results. If it is set too high, rare items that are nonetheless significant can be missed, because all rare items are discarded. Either case may lead the decision maker to wrong decisions. To overcome these problems, we discuss some measures for mining association rules.

The analyst is the person who makes decisions to improve productivity. As e-commerce applications grow rapidly, analysts are interested in knowing which items are purchased frequently in their market, so that they can act on it, for example by increasing the production of frequently purchased items or by arranging frequently sold items according to their occurrence. Consider an association rule such as bread -> butter whose probability of occurrence is high. The minimum threshold is set according to that occurrence, so that the items satisfying the threshold are mined as frequent items; knowing this, the analyst may place the items side by side to increase sales. The minimum threshold value is thus a cutoff point, and it can be fixed by observing different patterns.

II. METHODS

As discussed, there are different methods for discovering the items. In the first stage, all the items present in the database are represented. Then, by forming combinations, pairs of frequent items are represented by rules, and the items below the minimum threshold are discarded. The same step is applied to all further subsets and repeated until the specified iterations are completed, that is, until the empty set (Ø) is reached. In this way all the items or transactions in the database are searched. Rules are then formed from an antecedent and a consequent, all the items that satisfy those rules are extracted, and with the threshold values the item sets that occur most frequently in the database are finally extracted. These threshold values can be measured with different statistical measures such as support, confidence, lift, leverage, conditional probability, and correlation [1].

ISSN: 2231-5381 http://www.internationaljournalssrg.org

III. SUPPORT

In data mining, association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. Piatetsky-Shapiro [4] describes analyzing and presenting strong rules discovered in databases using different measures of interestingness. Based on the concept of strong rules, Agrawal et al. [2] introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale systems in supermarkets. For example, the rule {onions, potatoes} => {burger} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, he or she is likely to also buy a burger. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placements.
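The level-wise search outlined in Section II (represent all items, form combinations, discard those below the minimum threshold, and repeat until the empty set is reached) can be sketched as follows. This is a minimal illustration in the style of the well-known Apriori procedure, not the method of any cited work; the function name `frequent_itemsets`, the toy transactions, and the threshold of 0.4 are assumptions made for the example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {item set: support} for every item set meeting min_support."""
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the item set.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: every single item that appears in the database.
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([item]) for item in items]

    result = {}
    k = 1
    while level:  # stop once a level is empty (the "Ø" condition)
        # Discard candidates whose support is below the minimum threshold.
        scored = {c: support(c) for c in level}
        survivors = {c: s for c, s in scored.items() if s >= min_support}
        result.update(survivors)

        # Join surviving k-item sets into (k+1)-item candidates and keep
        # only those whose k-item subsets all survived.
        k += 1
        candidates = {a | b for a in survivors for b in survivors
                      if len(a | b) == k}
        level = [c for c in candidates
                 if all(frozenset(sub) in survivors
                        for sub in combinations(c, k - 1))]
    return result


# Hypothetical toy database of five transactions.
transactions = [
    frozenset({"milk", "bread"}),
    frozenset({"butter"}),
    frozenset({"beer"}),
    frozenset({"milk", "bread", "butter"}),
    frozenset({"bread"}),
]
frequent = frequent_itemsets(transactions, min_support=0.4)
for itemset, supp in sorted(frequent.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), supp)
```

With this threshold, beer is discarded at the first level and only one pair survives the second level, which shows how a higher cutoff removes rare items from every later level.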
In addition to the above example from market basket analysis, association rules are employed today in many application areas, including web usage mining, intrusion detection, and bioinformatics.

To illustrate the concepts, we use a small example from the supermarket domain. The set of items is I = {milk, bread, butter, beer}, and a small database containing the items (1 indicates presence and 0 indicates absence of an item in a transaction) is shown in Table 1. An example rule for the supermarket could be {butter, bread} => {milk}, meaning that if butter and bread are bought, customers also buy milk. In practical applications, a rule needs a support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.

To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support and confidence. The support supp(X) of an item set X is defined as the proportion of transactions in the data set which contain the item set. In the example database, the item set {milk, bread, butter} has a support of 1 / 5 = 0.2, since it occurs in 20% of all transactions (1 out of 5 transactions).

TABLE 1
Database with Five Transactions and Four Items [2]

Tid      1  2  3  4  5
Milk     1  0  0  1  0
Bread    1  0  0  1  1
Butter   0  1  0  1  0
Beer     0  0  1  0  0

IV. CONFIDENCE

The confidence of a rule is defined as conf(X => Y) = supp(X ∪ Y) / supp(X). For example, the rule {milk, bread} => {butter} has a confidence of 0.2 / 0.4 = 0.5 in the database, which means that the rule is correct for 50% of the transactions containing milk and bread. Confidence can be interpreted as an estimate of the probability P(Y|X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS [5].

A. LIFT

The lift of a rule is defined as lift(X => Y) = supp(X ∪ Y) / (supp(X) × supp(Y)), that is, the ratio of the observed support to the support expected if X and Y were independent. The rule {milk, bread} => {butter} has a lift of 0.2 / (0.4 × 0.4) = 1.25.

B. CONVICTION

The conviction of a rule is defined as conv(X => Y) = (1 - supp(Y)) / (1 - conf(X => Y)). The rule {milk, bread} => {butter} has a conviction of (1 - 0.4) / (1 - 0.5) = 1.2. Conviction can be interpreted as the ratio of the expected frequency that X occurs without Y (that is, the frequency that the rule makes an incorrect prediction) if X and Y were independent, divided by the observed frequency of incorrect predictions. In this example, the conviction value of 1.2 shows that the rule {milk, bread} => {butter} would be incorrect 20% more often (1.2 times as often) if the association between X and Y were purely random chance.

As a further example, consider lift as an interestingness measure of correlation, that is, of how strongly one item depends on another. For the data in Table 2, the rule play basketball => eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%. The rule play basketball => not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence.

TABLE 2
The items of a database

              Basketball   Not basketball   Sum (row)
Cereal           2000           1750           3750
Not cereal       1000            250           1250
Sum (col.)       3000           2000           5000

The measure of dependent or correlated events is lift:

lift(A, B) = P(A ∪ B) / (P(A) × P(B))

where P(A ∪ B) is the probability that a transaction contains both A and B. For Table 2:

lift(basketball, cereal) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(basketball, not cereal) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33

In this way, the correlated items of the database are mined using conditional probability: given a condition, we calculate the occurrence and non-occurrence of the event.

Are lift and χ² good measures of correlation? The rule buy walnuts => buy milk [1%, 80%] is misleading if 85% of customers buy milk, and from such results we can say that support and confidence are not good at representing correlations. Two alternative measures are all-conf and coherence:

all_conf(X) = sup(X) / max_item_sup(X)
coh(X) = sup(X) / |universe(X)|

where max_item_sup(X) is the largest support of any single item in X and universe(X) is the set of transactions containing at least one item of X.

V. CONCLUSION

All the statistical measures that are used for association rules have been discussed, and according to these measures the items that are more frequent are discovered from the database. Measures such as support, confidence, lift, and coherence are computed, and their results lead to the extraction of items from the database. Which measures should be used has also been discussed: lift and χ² are not good measures for correlations in large transactional databases, where all-conf or coherence could be good measures. Both all-conf and coherence have the downward closure property.

REFERENCES

[1] Alex Tze Hiang Sim, Maria Indrawan, Samar Zutshi, Member, IEEE, and Bala Srinivasan, "Logic-Based Pattern Discovery," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 6, 2010.
[2] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," SIGMOD Record, vol. 22, pp. 207-216, 1993.
[3] C. Longbing, "Introduction to Domain Driven Data Mining," Data Mining for Business Applications, L. Cao, P.S. Yu, C. Zhang, and H. Zhang, eds., pp. 3-10, Springer, 2008.
[4] G. Piatetsky-Shapiro, "Discovery, analysis and presentation of strong rules," in G. Piatetsky-Shapiro and W. J. Frawley, eds., Knowledge Discovery in Databases, AAAI/MIT Press, Cambridge, MA, 1991.
[5] Jochen Hipp, Ulrich Güntzer, and Gholamreza Nakhaeizadeh, "Algorithms for association rule mining - A general survey and comparison," SIGKDD Explorations, 2(2):1-58, 2000.
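As a worked check of the measures defined in Sections III and IV, the following sketch recomputes support, confidence, lift, and conviction for the rule {milk, bread} => {butter} on the five transactions of Table 1, and the two lift values derived from the counts in Table 2. The function names and the data layout are illustrative assumptions and are not taken from the cited works.

```python
def support(itemset, transactions):
    """Proportion of transactions that contain every item of the item set."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions):
    # conf(X => Y) = supp(X u Y) / supp(X)
    return support(x | y, transactions) / support(x, transactions)


def lift(x, y, transactions):
    # lift(X => Y) = supp(X u Y) / (supp(X) * supp(Y))
    return support(x | y, transactions) / (
        support(x, transactions) * support(y, transactions))


def conviction(x, y, transactions):
    # conv(X => Y) = (1 - supp(Y)) / (1 - conf(X => Y));
    # undefined (division by zero) when the confidence equals 1.
    return (1 - support(y, transactions)) / (1 - confidence(x, y, transactions))


# Transactions 1 to 5 of Table 1.
table1 = [
    frozenset({"milk", "bread"}),
    frozenset({"butter"}),
    frozenset({"beer"}),
    frozenset({"milk", "bread", "butter"}),
    frozenset({"bread"}),
]
x, y = frozenset({"milk", "bread"}), frozenset({"butter"})

supp_xy = support(x | y, table1)
conf_xy = confidence(x, y, table1)
lift_xy = lift(x, y, table1)
conv_xy = conviction(x, y, table1)

# Lift from the raw counts of Table 2 (5000 students in total).
total, basketball, cereal, not_cereal = 5000, 3000, 3750, 1250
lift_cereal = (2000 / total) / ((basketball / total) * (cereal / total))
lift_not_cereal = (1000 / total) / ((basketball / total) * (not_cereal / total))

print(round(supp_xy, 2), round(conf_xy, 2), round(lift_xy, 2), round(conv_xy, 2))
print(round(lift_cereal, 2), round(lift_not_cereal, 2))
```

A lift below 1 for basketball and cereal, and above 1 for basketball and not cereal, is what makes the high-confidence rule play basketball => eat cereal misleading.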