Dynamic Data Mining*

Vijay Raghavan and Alaaeldin Hafez1
(raghavan, ahafez)@cacs.louisiana.edu
The Center for Advanced Computer Studies
University of Louisiana at Lafayette
Lafayette, LA 70504, USA

Abstract. Business information obtained from advanced data analysis and data mining is a critical success factor for companies wishing to maximize competitive advantage. Using traditional tools and techniques to discover knowledge is laborious and does not give the right information at the right time. Data mining should provide tactical insights to support strategic directions. In this paper, we introduce a dynamic approach that uses knowledge discovered in previous episodes. The proposed approach is shown to be effective for solving problems related to the efficiency of handling database updates, the accuracy of data mining results, gaining more knowledge and interpretation of the results, and performance. Our results do not depend on the approach used to generate itemsets. In our analysis, we have used an Apriori-like approach as a local procedure to generate large itemsets. We prove that the Dynamic Data Mining algorithm is correct and complete.

1 Introduction

Data mining is the process of discovering potentially valuable patterns, associations, trends, sequences and dependencies in data [1-4,10,14,17,20,21]. Key business examples include web site access analysis for improvements in e-commerce advertising, fraud detection, screening and investigation, retail site or product analysis, and customer segmentation. Data mining techniques can discover information that many traditional business analysis and statistical techniques fail to deliver. Additionally, the application of data mining techniques further exploits the value of a data warehouse by converting expensive volumes of data into valuable assets for future tactical and strategic business development.
* This research was supported in part by the U.S. Department of Energy, Grant No. DE-FG0297ER1220.
1 On leave from The Department of Computer Science and Automatic Control, Faculty of Engineering, Alexandria University, Alexandria, Egypt.

Management information systems should provide advanced capabilities that give the user the power to ask more sophisticated and pertinent questions. They empower the right people by providing the specific information they need. Many knowledge discovery applications [6,8,9,12,13,15,16,18,19], such as online services and world wide web applications, require accurate mining information from data that changes on a regular basis. In such an environment, frequent or occasional updates may change the status of some rules discovered earlier. More information should be collected during the data mining process to allow users to gain more complete knowledge of the significance or the importance of the generated data mining rules.

Discovering knowledge is an expensive operation [2,4,5,6,9,10,11]. It requires extensive access to secondary storage, which can become a bottleneck for efficient processing. Running data mining algorithms from scratch each time the data changes is obviously not an efficient strategy. Using previously discovered knowledge along with new data updates to maintain discovered knowledge could solve many problems that data mining techniques have faced, namely database updates, accuracy of data mining results, gaining more knowledge and interpretation of the results, and performance.

In this paper, we propose an approach that dynamically updates knowledge obtained from the previous data mining process. Transactions over a long duration are divided into a set of consecutive episodes. In our approach, information gained during the current episode depends on the current set of transactions and the information discovered during the last episode.
Our approach discovers current data mining rules by using updates that have occurred during the current episode along with the data mining rules that have been discovered in the previous episode.

In section 2, a formal definition of the problem is given. The dynamic data mining approach is introduced in section 3. In section 4, the dynamic data mining approach is evaluated. The paper is summarized and concluded in section 5.

2 Problem Definition

Association mining, which discovers dependencies among values of an attribute, was introduced by Agrawal et al. [1] and has emerged as an important research area. The problem of association mining, also referred to as the market basket problem, is formally defined as follows. Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of items and $T = \{t_1, t_2, \ldots, t_m\}$ be a set of transactions, where each transaction $t_i \in T$ is a set of items, that is, $t_i \subseteq I$. An association rule, denoted by $X \Rightarrow Y$, with $X, Y \subset I$ and $X \cap Y = \emptyset$, describes the existence of a relationship between the two itemsets X and Y.

Several measures have been introduced to define the strength of the relationship between itemsets X and Y, such as SUPPORT, CONFIDENCE, and INTEREST [1,2,5,7]. The definitions of these measures, from a probabilistic viewpoint, are given below.

I. $\mathrm{SUPPORT}(X \Rightarrow Y) = P(X, Y)$, or the percentage of transactions in the database that contain both X and Y.
II. $\mathrm{CONFIDENCE}(X \Rightarrow Y) = P(X, Y) / P(X)$, or the percentage of transactions containing Y among those transactions containing X.
III. $\mathrm{INTEREST}(X \Rightarrow Y) = P(X, Y) / (P(X) P(Y))$, which represents a test of statistical independence.

SUPPORT for an itemset S is calculated as

$$\mathrm{SUPPORT}(S) = \frac{F(S)}{F}$$

where F(S) is the number of transactions having S, and F is the total number of transactions. For a minimum SUPPORT value MINSUP, S is a large (or frequent) itemset if $\mathrm{SUPPORT}(S) \ge \mathrm{MINSUP}$, or $F(S) \ge F \cdot \mathrm{MINSUP}$.
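The three measures and the large-itemset test can be illustrated with a minimal Python sketch. The function names and the four toy transactions are hypothetical and are not taken from the paper; probabilities are estimated as relative frequencies, F(S)/F.

```python
from typing import Iterable, List, Set


def support(transactions: List[Set[str]], itemset: Iterable[str]) -> float:
    """SUPPORT(S) = F(S) / F: fraction of transactions containing all of S."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= t) / len(transactions)


def confidence(transactions: List[Set[str]], x: Iterable[str], y: Iterable[str]) -> float:
    """CONFIDENCE(X => Y) = P(X, Y) / P(X)."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


def interest(transactions: List[Set[str]], x: Iterable[str], y: Iterable[str]) -> float:
    """INTEREST(X => Y) = P(X, Y) / (P(X) * P(Y)); 1.0 indicates independence."""
    joint = support(transactions, set(x) | set(y))
    return joint / (support(transactions, x) * support(transactions, y))


def is_large(transactions: List[Set[str]], itemset: Iterable[str], minsup: float) -> bool:
    """S is large (frequent) if SUPPORT(S) >= MINSUP."""
    return support(transactions, itemset) >= minsup


# Hypothetical toy data, for illustration only.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]

print(support(transactions, {"a", "b"}))
print(confidence(transactions, {"a"}, {"b"}))
print(interest(transactions, {"a"}, {"b"}))
print(is_large(transactions, {"a", "b"}, minsup=0.5))
```

An INTEREST value below 1 (as for the toy rule a => b) suggests the two itemsets co-occur less often than independence would predict.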
Suppose we have divided the transaction set T into two subsets $T_1$ and $T_2$, corresponding to two consecutive time intervals, where $F_1$ is the number of transactions in $T_1$ and $F_2$ is the number of transactions in $T_2$ ($F = F_1 + F_2$), and $F_1(S)$ is the number of transactions having S in $T_1$ and $F_2(S)$ is the number of transactions having S in $T_2$ ($F(S) = F_1(S) + F_2(S)$). By calculating the SUPPORT of S in each of the two subsets, we get

$$\mathrm{SUPPORT}_1(S) = \frac{F_1(S)}{F_1} \quad \text{and} \quad \mathrm{SUPPORT}_2(S) = \frac{F_2(S)}{F_2}$$

S is a large itemset if

$$\frac{F_1(S) + F_2(S)}{F_1 + F_2} \ge \mathrm{MINSUP}, \quad \text{or} \quad F_1(S) + F_2(S) \ge (F_1 + F_2) \cdot \mathrm{MINSUP}$$

In order to find out whether S is a large itemset or not, we consider four cases:

1. S is a large itemset in $T_1$ and also a large itemset in $T_2$, i.e., $F_1(S) \ge F_1 \cdot \mathrm{MINSUP}$ and $F_2(S) \ge F_2 \cdot \mathrm{MINSUP}$.
2. S is a large itemset in $T_1$ but a small itemset in $T_2$, i.e., $F_1(S) \ge F_1 \cdot \mathrm{MINSUP}$ and $F_2(S) < F_2 \cdot \mathrm{MINSUP}$.
3. S is a small itemset in $T_1$ but a large itemset in $T_2$, i.e., $F_1(S) < F_1 \cdot \mathrm{MINSUP}$ and $F_2(S) \ge F_2 \cdot \mathrm{MINSUP}$.
4. S is a small itemset in $T_1$ and also a small itemset in $T_2$, i.e., $F_1(S) < F_1 \cdot \mathrm{MINSUP}$ and $F_2(S) < F_2 \cdot \mathrm{MINSUP}$.

In the first and fourth cases, S is a large itemset and a small itemset in transaction set T, respectively, while in the second and third cases it is not clear whether S is a small itemset or a large itemset. Formally speaking, let $\mathrm{SUPPORT}_i(S) = \mathrm{MINSUP} + \delta_i$, where $\delta_i \ge 0$ if S is a large itemset in $T_i$, and $\delta_i < 0$ if S is a small itemset in $T_i$. The above four cases have the following characteristics:

1. $\delta_1 \ge 0$ and $\delta_2 \ge 0$
2. $\delta_1 \ge 0$ and $\delta_2 < 0$
3. $\delta_1 < 0$ and $\delta_2 \ge 0$
4. $\delta_1 < 0$ and $\delta_2 < 0$

S is a large itemset if

$$\frac{F_1 (\mathrm{MINSUP} + \delta_1) + F_2 (\mathrm{MINSUP} + \delta_2)}{F_1 + F_2} \ge \mathrm{MINSUP}, \quad \text{or} \quad F_1 (\mathrm{MINSUP} + \delta_1) + F_2 (\mathrm{MINSUP} + \delta_2) \ge \mathrm{MINSUP} \cdot (F_1 + F_2)$$

which can be written as

$$F_1 \delta_1 + F_2 \delta_2 \ge 0$$

Generally, let the transaction set T be divided into n transaction subsets $T_i$, $1 \le i \le n$. S is a large itemset if

$$\sum_{i=1}^{n} F_i \delta_i \ge 0,$$

where $F_i$ is the number of transactions in $T_i$ and $\delta_i = \mathrm{SUPPORT}_i(S) - \mathrm{MINSUP}$, $1 \le i \le n$, with $-\mathrm{MINSUP} \le \delta_i \le 1 - \mathrm{MINSUP}$, $1 \le i \le n$.
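The partitioned test can be checked numerically: the weighted sum of per-subset deviations from MINSUP has the same sign as the pooled test on the whole transaction set. The following Python sketch assumes hypothetical subset sizes and counts; the function names and the small tolerance used to absorb floating-point rounding are illustrative choices, not part of the paper.

```python
from typing import List


def weighted_delta_sum(counts: List[int], sizes: List[int], minsup: float) -> float:
    """Sum of F_i * delta_i, where delta_i = SUPPORT_i(S) - MINSUP."""
    return sum(f * (c / f - minsup) for c, f in zip(counts, sizes))


def is_large_partitioned(counts: List[int], sizes: List[int], minsup: float) -> bool:
    """Partitioned test: S is large if the weighted delta sum is non-negative."""
    return weighted_delta_sum(counts, sizes, minsup) >= -1e-9  # rounding tolerance


def is_large_pooled(counts: List[int], sizes: List[int], minsup: float) -> bool:
    """Reference test on the whole set: F(S) / F >= MINSUP."""
    return sum(counts) / sum(sizes) >= minsup


# Hypothetical data: S is large in T1 (40/100) but small in T2 (case 2).
sizes = [100, 50]
small_overall = [40, 10]   # pooled support 50/150, below 0.35
large_overall = [40, 15]   # pooled support 55/150, above 0.35

for counts in (small_overall, large_overall):
    print(counts,
          weighted_delta_sum(counts, sizes, 0.35),
          is_large_partitioned(counts, sizes, 0.35),
          is_large_pooled(counts, sizes, 0.35))
```

Only the running sum needs to be stored per itemset between intervals, which is what makes the test usable without rescanning earlier subsets.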
For those cases where $\sum_{i=1}^{n} F_i \delta_i < 0$, there are two options: either discard S as a large itemset (a small itemset with no history record maintained), or keep it for future calculations (a small itemset with history record maintained). In the latter case, we are not going to report it as a large itemset, but its $\sum_{i=1}^{n} F_i \delta_i$ formula will be maintained and checked through the future intervals.

3 The Dynamic Data Mining Approach

For $\sum_{i=1}^{n} F_i \delta_i < 0$, the two options described above can be combined into a single decision rule that says: discard S if

$$\frac{\sum_{i=k}^{n} F_i (\mathrm{MINSUP} + \delta_i)}{\sum_{i=k}^{n} F_i} < \frac{\mathrm{MINSUP}}{\alpha},$$

where $\alpha \ge 1$ and $k \ge 1$.

$\alpha = 1$: Discard S from the set of large itemsets (it becomes a small itemset with no history record).
$\alpha > 1$: Keep it for future calculations, as long as the discard condition does not hold (it becomes a small itemset with a history record).

The value of $\alpha$ determines how much history information is carried. This history information, along with the calculated values of locality, can be used to

- determine the significance or the importance of the generated emerged-large itemsets,
- determine the significance or the importance of the generated declined-large itemsets,
- generate large itemsets with lower SUPPORT values without having to rerun the mining procedure.

The choice of the value of $\alpha$ is the essence of our approach. If $\alpha$ is chosen near 1, we will have fewer declined-large itemsets and more emerged-large itemsets, and those emerged-large itemsets are more likely to have occurred near the latest episodes. If $\alpha$ is chosen far from 1, we will have more declined-large itemsets and fewer emerged-large itemsets, and those emerged-large itemsets are more likely to be large itemsets in the Apriori-like approach.

In this section, we introduce the notions of declined-large itemset, emerged-large itemset, and locality.

Definition 3.1: Let S be a large itemset (or an emerged-large itemset, see Definition 3.2) in a transaction subset $T_l$, $l \ge 1$.
S is called a declined-large itemset in transaction subset $T_n$, $n > l$, if

$$\frac{\mathrm{MINSUP}}{\alpha} \le \frac{\sum_{i=k}^{m} F_i (\mathrm{MINSUP} + \delta_i)}{\sum_{i=k}^{m} F_i} < \mathrm{MINSUP}$$

for all $l < m \le n$, where $1 \le k \le m$ and $\alpha \ge 1$.

Definition 3.2: S is called an emerged-large itemset in transaction subset $T_n$, $n > 1$, if S was a small itemset in transaction subset $T_{n-1}$ and $F_n \delta_n \ge 0$, or S was a declined-large itemset in transaction subset $T_{n-1}$, $n > 1$, and $\sum_{i=k}^{n} F_i \delta_i \ge 0$, $k \ge 1$.

Definition 3.3: For an itemset S and a transaction subset $T_n$, locality(S) is defined as the ratio of the total size of those transaction subsets where S is either a large itemset or an emerged-large itemset to the total size of the transaction subsets $T_i$, $1 \le i \le n$:

$$\mathrm{locality}(S) = \frac{\sum_{i \,:\, S \text{ is a large or emerged-large itemset in } T_i} F_i}{\sum_{i=1}^{n} F_i}$$

Clearly, locality(S) = 1 for all large itemsets S.

The dynamic data mining approach generates three sets of itemsets:

- large itemsets, which satisfy the rule $\sum_{i=1}^{n} F_i \delta_i \ge 0$, where n is the number of intervals carried out by the dynamic data mining approach;
- declined-large itemsets, which were large at previous intervals and still maintain the rule $\frac{\mathrm{MINSUP}}{\alpha} \le \frac{\sum_{i=k}^{m} F_i (\mathrm{MINSUP} + \delta_i)}{\sum_{i=k}^{m} F_i} < \mathrm{MINSUP}$, for some value m;
- emerged-large itemsets, which were either small itemsets that at a transaction subset $T_k$ satisfied the rule $F_k \delta_k \ge 0$ and still satisfy the rule $\sum_{i=k}^{n} F_i \delta_i \ge 0$, or declined-large itemsets that at a transaction subset $T_m$ satisfied the rule $\sum_{i=k}^{m} F_i \delta_i \ge 0$ and still satisfy the rule $\sum_{i=k}^{n} F_i \delta_i \ge 0$.

Example: Let I = {a,b,c,d,e,f,g,h} be a set of items, MINSUP = 0.35, and T be a set of transactions.
The transactions in the three transaction subsets are:

Transaction Subset T1
  Transactions   count
  {a,b,g,h}      3
  {b,c,d}        10
  {a,c}          2
  {c,g}          4
  {d,e,f}        1
  {e,g,h}        4
  {a,b,d}        2
  {b,d,f}        1
  {d,f,h}        5
  {c,h}          5

Transaction Subset T2
  Transactions   count
  {c,h}          12
  {b,d,g}        8
  {a,c}          9
  {b,c}          1
  {g,h}          4

Transaction Subset T3
  Transactions   count
  {a}            10
  {a,b,g,h}      5
  {b,c,d}        10
  {a,c}          2
  {c,g}          4
  {d,e,f}        1
  {e,g,h}        4
  {a,b,d}        2
  {b,d,f}        1
  {d,f,h}        5
  {c,h}          5

For $\alpha$ = 1, the large or emerged-large itemsets are:

  Subset   Itemset   count   SUPPORT   status                  locality
  T1       {b}       16      0.43      large itemset           1
  T1       {c}       21      0.57      large itemset           1
  T1       {d}       14      0.38      large itemset           1
  T1       {h}       17      0.46      large itemset           1
  T1       {bd}      13      0.35      large itemset           1
  T2       {b}       25      0.35      large itemset           1
  T2       {c}       43      0.60      large itemset           1
  T2       {h}       33      0.46      large itemset           1
  T2       {ch}      12      0.35      emerged-large itemset
  T3       {a}       19      0.39      emerged-large itemset   0.41
  T3       {b}       43      0.36      large itemset           1
  T3       {c}       64      0.53      large itemset           1
  T3       {h}       52      0.43      large itemset           1

For $\alpha$ = 2, the large, emerged-large and declined-large itemsets are:

  Subset   Itemset   count   SUPPORT   status                   locality
  T1       {b}       16      0.43      large itemset            1
  T1       {c}       21      0.57      large itemset            1
  T1       {d}       14      0.38      large itemset            1
  T1       {h}       17      0.46      large itemset            1
  T1       {bd}      13      0.35      large itemset            1
  T2       {b}       25      0.35      large itemset            1
  T2       {c}       43      0.60      large itemset            1
  T2       {d}       22      0.31      declined-large itemset   0.52
  T2       {g}       12      0.35      emerged-large itemset    0.48
  T2       {h}       33      0.46      large itemset            1
  T2       {bd}      18      0.25      declined-large itemset   0.52
  T2       {ch}      12      0.35      emerged-large itemset    0.48
  T3       {a}       19      0.39      emerged-large itemset    0.41
  T3       {b}       43      0.36      large itemset            1
  T3       {c}       64      0.53      large itemset            1
  T3       {d}       36      0.3       declined-large itemset   0.31
  T3       {g}       25      0.30      declined-large itemset   0.28
  T3       {h}       52      0.43      large itemset            1
  T3       {bd}      31      0.26      declined-large itemset   0.31
  T3       {ch}      17      0.20      declined-large itemset   0.28

When applying an Apriori-like algorithm on the whole file, the resulting large itemsets are:

  Itemset   count   SUPPORT
  {b}       43      0.39
  {c}       64      0.58
  {h}       52      0.47

By comparing the results in the previous example, we can draw some intuitions about the proposed approach, which can be summarized as follows:

- The set of large itemsets and emerged-large itemsets generated by our Dynamic approach is a superset of the set of large itemsets generated by the Apriori-like approach.
- If there is an itemset generated by our Dynamic approach but not generated by the Apriori-like approach as a large itemset, then this itemset should be large at the latest consecutive time intervals, i.e., an emerged-large itemset.

In lemmas 3.1 and 3.2, we prove the above intuitions.

lemma 3.1: For a transaction set T, the set of large itemsets and emerged-large itemsets generated by our Dynamic approach is a superset of the set of large itemsets generated by the Apriori-like approach.

proof: Let $\bigcup_i T_i = T$, $1 \le i \le n$, $F_i = |T_i|$, and let S be a large itemset that is generated by the Apriori-like approach, i.e., $\sum_{i=1}^{n} F_i \delta_i \ge 0$, and not by our Dynamic approach. There are two cases to consider.

Case 1 ($\alpha$ = 1): For a transaction subset $T_k$, $1 \le k \le n$, S is discarded from the set of large itemsets if it becomes a small itemset, i.e., $\sum_{i=m}^{k} F_i \delta_i < 0$, $1 \le m \le k$, and no history is recorded. Since no history is recorded before m, that means $\sum_{i=1}^{m-1} F_i \delta_i < 0$. That leads to $\sum_{i=1}^{k} F_i \delta_i < 0$. For k = n, we have $\sum_{i=1}^{n} F_i \delta_i < 0$, which contradicts our assumption.

Case 2 ($\alpha$ > 1): For a transaction subset $T_k$, $1 \le k \le n$, S is discarded from the set of large itemsets if it becomes a small itemset, i.e., $\sum_{i=m}^{k} F_i \delta_i < 0$, $1 \le m \le k$, and, depending on the value of $\alpha$, its history starts to be recorded in transaction subset $T_m$. Since no history is recorded before m, that means $\sum_{i=1}^{m-1} F_i \delta_i < 0$. That leads to $\sum_{i=1}^{k} F_i \delta_i < 0$. For k = n, we have $\sum_{i=1}^{n} F_i \delta_i < 0$, which contradicts our assumption.
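Some entries of the example tables can be re-derived with a short Python sketch. It rests on several assumptions: the history parameter of the decision rule is written `alpha`; the status labels are simplified (no distinction is drawn between large and emerged-large); and the subset sizes 37, 34 and 49 are obtained by summing the counts listed for T1, T2 and T3. The function names are hypothetical.

```python
from typing import List


def classify(count: int, size: int, minsup: float, alpha: float) -> str:
    """Status of an itemset from its support over the recorded history window.

    `count` and `size` are accumulated over the subsets for which history
    is kept. Simplification: "large" also covers emerged-large itemsets.
    """
    support = count / size
    if support >= minsup:
        return "large"
    if support >= minsup / alpha:
        return "declined-large"   # kept, with a history record
    return "discarded"            # small itemset, no history record


def locality(sizes: List[int], large_in_subset: List[bool]) -> float:
    """Total size of subsets where S is large or emerged-large, over total size."""
    large_size = sum(f for f, flag in zip(sizes, large_in_subset) if flag)
    return large_size / sum(sizes)


MINSUP = 0.35
T1, T2, T3 = 37, 34, 49  # subset sizes, summed from the transaction counts

# {d} after T2: count 22 over T1 and T2.
print(classify(22, T1 + T2, MINSUP, alpha=2))   # kept as declined-large
print(classify(22, T1 + T2, MINSUP, alpha=1))   # dropped when alpha = 1

# Locality of {d} after T2 (large in T1 only), {ch} after T2, {a} after T3.
print(round(locality([T1, T2], [True, False]), 2))
print(round(locality([T1, T2], [False, True]), 2))
print(round(locality([T1, T2, T3], [False, False, True]), 2))
```

With these inputs the sketch agrees with the tabulated status of {d} for both values of the history parameter and with the locality values 0.52, 0.48 and 0.41.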
lemma 3.2: If there is an itemset generated by our Dynamic approach but not generated by the Apriori-like approach as a large itemset, then this itemset should be large at the latest consecutive time intervals, i.e., an emerged-large itemset.

proof: By following the proof of lemma 3.1, the proof is straightforward.

Algorithm DynamicMining($T_n$)
  $f_1(T_n)$ is the set of large and emerged-large itemsets.
  $f_1^*(T_n)$ is the set of declined-large itemsets.
  $\delta^x$ is the accumulated value of $F_i \delta_i^x$.
  $F$ is the accumulated value of $F_i$.
  $Cl_x$ is the accumulated value of $F_i$ where itemset x is large.
begin
  $f_1(T_n) = \{(x, Cl_x),\ Cl_x = F_n \mid x \notin f_1(T_{n-1}) \wedge x \notin f_1^*(T_{n-1}) \wedge F_n \delta_n^x \ge 0\}$
           $\cup\ \{(x, Cl_x),\ Cl_x = Cl_x + F_n \mid \delta^x + F_n \delta_n^x \ge 0\}$    // large or emerged-large itemset
  $f_1^*(T_n) = \{(x, Cl_x) \mid \mathrm{MINSUP}/\alpha \le \mathrm{MINSUP} + \delta^x / F < \mathrm{MINSUP}\}$    // was large itemset
  for (k = 2; $f_{k-1}(T_n) \ne \emptyset$; k++) do
  begin
    $C_k$ = AprioriGen($f_{k-1}(T_n) \cup f_{k-1}^*(T_n)$)
    forall transactions $t \in T_n$ do
      forall candidates $c \in C_k$ do
        if $c \subseteq t$ then c.count++
    $f_k(T_n) = \{(x, Cl_x),\ x = c,\ Cl_x = F_n \mid x \notin f_k(T_{n-1}) \wedge x \notin f_k^*(T_{n-1}) \wedge F_n \delta_n^x \ge 0\}$
             $\cup\ \{(x, Cl_x),\ x = c,\ Cl_x = Cl_x + F_n \mid \delta^x + F_n \delta_n^x \ge 0\}$
    $f_k^*(T_n) = \{(x, Cl_x) \mid \mathrm{MINSUP}/\alpha \le \mathrm{MINSUP} + \delta^x / F < \mathrm{MINSUP}\}$
  end
  return $f_k(T_n)$ and $f_k^*(T_n)$
end

function AprioriGen($f_{k-1}$)
  insert into $C_k$
    select $l_1, l_2, \ldots, l_{k-1}, c_{k-1}$
    from $f_{k-1}$ l, $f_{k-1}$ c
    where $l_1 = c_1 \wedge l_2 = c_2 \wedge \ldots \wedge l_{k-2} = c_{k-2} \wedge l_{k-1} < c_{k-1}$
  delete all itemsets $c \in C_k$ such that some (k-1)-subset of c is not in $f_{k-1}(T_n)$
  return $C_k$

lemma 3.3: The Dynamic Data Mining approach is correct.

proof: (See lemmas 3.1 and 3.2)

4 Analysis and Performance Study

In the DynamicMining algorithm, we used an Apriori-like approach as a local procedure to generate large or emerged-large itemsets. We would like to emphasize that our approach does not depend on the approach used to generate itemsets. The main contribution of our approach is to dynamically generate large itemsets using only the transaction updates and the information collected in the previous data mining episode.
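As an illustration of the local procedure, the candidate-generation step AprioriGen (join on a shared (k-2)-prefix, then prune by (k-1)-subsets) can be sketched in Python. This is a minimal sketch and not the authors' implementation: itemsets are represented as sorted tuples, the name `apriori_gen` is hypothetical, and the caller is assumed to pass whichever set of (k-1)-itemsets it wants extended (in the algorithm above, the union of the large, emerged-large and declined-large itemsets).

```python
from itertools import combinations
from typing import Iterable, Set, Tuple

Itemset = Tuple[str, ...]


def apriori_gen(prev_itemsets: Iterable[Iterable[str]]) -> Set[Itemset]:
    """Generate candidate k-itemsets from a collection of (k-1)-itemsets."""
    prev: Set[Itemset] = {tuple(sorted(s)) for s in prev_itemsets}
    if not prev:
        return set()
    size = len(next(iter(prev)))  # this is k-1

    # Join step: combine two itemsets that agree on their first k-2 items.
    candidates: Set[Itemset] = set()
    for left in prev:
        for right in prev:
            if left[:-1] == right[:-1] and left[-1] < right[-1]:
                candidates.add(left + (right[-1],))

    # Prune step: drop candidates having a (k-1)-subset outside `prev`.
    return {
        cand for cand in candidates
        if all(sub in prev for sub in combinations(cand, size))
    }


# Hypothetical input: four 2-itemsets.
pairs = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]
print(apriori_gen(pairs))
```

In the example input, the join produces (a, b, c) and (b, c, d); the latter is pruned because its subset (c, d) is not among the input itemsets.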
Assuming that an Apriori-like procedure is used as a local procedure, the total number of disk accesses needed for performing the DynamicMining algorithm is

$$\sum_{i=1}^{n} K_i N_i$$

where $N_i$ is the size (number of disk blocks) of the transaction subset $T_i$, $1 \le i \le n$, and $K_i$ is the length of the longest large itemset. On the other hand, the total number of disk accesses needed for performing an Apriori-like algorithm, which is carried out each time on the whole transaction file, is

$$\sum_{i=1}^{n} K_i \sum_{j=1}^{i} N_j, \quad \text{or} \quad \sum_{i=1}^{n} K_i \cdot (\text{number of disk blocks of } T)$$

In our preliminary experimental results, the Dynamic Mining algorithm has shown significant potential. Four main factors have been considered in our study, namely:

- The performance of the Dynamic Mining algorithm in terms of disk access and CPU time.
- The knowledge gained by using different values of $\alpha$.
- The effect of the locality values on the knowledge discovered through the data mining process.
- The generation of the emerged-large itemsets and declined-large itemsets and the significance of having this information.

5 Conclusions and Future Work

In this paper, we have introduced a Dynamic Data Mining approach. The proposed approach periodically performs the data mining process on the data updates of the current episode and uses the knowledge captured in the previous episode to produce data mining rules. We have introduced the concept of locality along with the definitions of emerged-large itemsets and declined-large itemsets. The new approach solves some of the problems that current data mining techniques suffer from, such as database updates, accuracy of data mining results, gaining more knowledge and interpretation of the results, and performance.

We have discussed the Dynamic Data Mining approach. In our approach, we dynamically update knowledge obtained from the previous data mining process. The transaction domain is treated as a set of consecutive episodes.
In our approach, information gained during the current episode depends on the current set of transactions and the information discovered during the previous episode. In our preliminary experimental results, the Dynamic Mining algorithm has shown significant potential. We have discussed the efficiency of the Dynamic Mining algorithm in terms of disk accesses. Also, we have shown the significance of the knowledge discovered by using different values of $\alpha$, and the effect of the locality values, along with the generation of the emerged-large itemsets and declined-large itemsets, on that knowledge. Finally, we have proved that the Dynamic Data Mining algorithm is correct.

As future work, the Dynamic approach will be tested with different datasets that cover a large spectrum of data mining applications, such as web site access analysis for improvements in e-commerce advertising, fraud detection, screening and investigation, retail site or product analysis, and customer segmentation.

References

[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, May 1993.
[2] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. of the 20th VLDB Conference, Santiago, Chile, 1994.
[3] R. Agrawal and J. Shafer, "Parallel Mining of Association Rules," IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, Dec. 1996.
[4] C. Aggarwal and P. Yu, "Mining Large Itemsets for Association Rules," Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 1997.
[5] S. Brin, R. Motwani, et al., "Dynamic Itemset Counting and Implication Rules for Market Basket Data," SIGMOD Record (ACM Special Interest Group on Management of Data), 26, 2, 1997.
[6] S. Chaudhuri, "Data Mining and Database Systems: Where is the Intersection," Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 1997.
[7] M. Chen, J. Han, and P. Yu, "Data Mining: An Overview from a Database Perspective," IEEE Trans. Knowledge and Data Engineering, 8, 1996.
[8] M. Chen, J. Park, and P. Yu, "Data Mining for Path Traversal Patterns in a Web Environment," Proc. 16th Intl. Conf. Distributed Computing Systems, May 1996.
[9] D. Cheung, J. Han, et al., "Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique," Proc. 12th Intl. Conf. on Data Engineering, New Orleans, Louisiana, 1996.
[10] U. Fayyad, G. Piatetsky-Shapiro, et al., "Advances in Knowledge Discovery and Data Mining," AAAI/MIT Press, 1996.
[11] A. Hafez, J. Deogun, and V. Raghavan, "The Item-Set Tree: A Data Structure for Data Mining," DaWaK'99 Conference, Firenze, Italy, Aug. 1999.
[12] C. Kurzke, M. Galle, and M. Bathelt, "WebAssist: a user profile specific information retrieval assistant," Seventh International World Wide Web Conference, Brisbane, Australia, April 1998.
[13] M. Langheinrich, A. Nakamura, et al., "Un-intrusive Customization Techniques for Web Advertising," The Eighth International World Wide Web Conference, Toronto, Canada, May 1999.
[14] H. Mannila, H. Toivonen, and A. Verkamo, "Efficient Algorithms for Discovering Association Rules," AAAI Workshop on Knowledge Discovery in Databases (KDD-94), July 1994.
[15] M. Perkowitz and O. Etzioni, "Adaptive Sites: Automatically Learning from User Access Patterns," Proc. 6th Int. World Wide Web Conf., Santa Clara, California, April 1997.
[16] J. Pitkow, "In Search of Reliable Usage Data on the WWW," Proc. 6th Int. World Wide Web Conf., Santa Clara, California, April 1997.
[17] G. Rossi, D. Schwabe, and F. Lyardet, "Improving Web Information Systems with Navigational Patterns," The Eighth International World Wide Web Conference, Toronto, Canada, May 1999.
[18] N. Serbedzija, "The Web Supercomputing Environment," Seventh International World Wide Web Conference, Brisbane, Australia, April 1998.
[19] T. Sullivan, "Reading Reader Reaction: A Proposal for Inferential Analysis of Web Server Log Files," Proc. 3rd Conf. Human Factors & The Web, Denver, Colorado, June 1997.
[20] C. Wills and M. Mikhailov, "Towards a Better Understanding of Web Resources and Server Responses for Improved Caching," The Eighth International World Wide Web Conference, Toronto, Canada, May 1999.
[21] M. Zaki, S. Parthasarathy, et al., "New Algorithms for Fast Discovery of Association Rules," Proc. of the 3rd Int'l Conf. on Knowledge Discovery and Data Mining (KDD-97), AAAI Press, 1997.