Synthetic Databases: Realistic Data for Testing Rule Mining Algorithms

Colin Cooper
Department of Computer Science, King's College London, WC2R 2LS, United Kingdom
voice: +44 20-7848-2002
email: colin.cooper@kcl.ac.uk

Michele Zito*
Department of Computer Science, University of Liverpool, Ashton Street, Liverpool, L69 3BX, United Kingdom
voice: +44 151-795-4263
email: michele@liverpool.ac.uk
(* Corresponding author)

INTRODUCTION

The Association Rule Mining (ARM) problem is a well-established topic in the field of Knowledge Discovery in Databases. The problem addressed by ARM is to identify a set of relations (associations) in a binary-valued attribute set which describe the likely coexistence of groups of attributes. To this end it is first necessary to identify the sets of items that occur frequently, i.e. those subsets F of the available set of attributes I for which the support (the number of times F occurs in the dataset under consideration) exceeds some threshold value. Other criteria are then applied to these item-sets to generate a set of association rules, i.e. relations of the form A → B, where A and B are disjoint subsets of a frequent item-set F such that A ∪ B = F. A vast array of algorithms and techniques has been developed to solve the ARM problem. The algorithms of Agrawal & Srikant (1994), Bayardo (1998), Brin et al. (1997), Han et al. (2000), and Toivonen (1996) are only some of the best-known heuristics.

There has been growing recent interest, in various areas of Computer Science, in the class of so-called heavy-tailed statistical distributions. Distributions of this kind have been used in the past to describe word frequencies in text (Zipf, 1949), the distribution of animal species (Yule, 1925) and of income (Mandelbrot, 1960), scientific citation counts (Redner, 1998), and many other phenomena.
They have recently been used to model various statistics of the web and other complex networks (Barabasi & Albert, 1999; Faloutsos et al., 1999; Steyvers & Tenenbaum, 2005).

BACKGROUND

Although the ARM problem is well studied, several fundamental issues remain unsolved. In particular, the evaluation and comparison of ARM algorithms is a very difficult task (Zaiane et al., 2005), and it is often tackled by resorting to experiments carried out on data generated by the well-established QUEST program from the IBM Quest Research Group (Agrawal & Srikant, 1994). The intricacy of this program makes it difficult to draw theoretical predictions about the behaviour of the various algorithms on its output. Empirical comparisons made in this way are also difficult to generalize because of the wide range of possible variation, both in the characteristics of the data (the structural characteristics of the synthetic databases generated by QUEST are governed by a dozen interacting parameters) and in the environment in which the algorithms are applied. It has also been noted (Brin et al., 1997) that data sets produced using the QUEST generator might not be inherently hard to deal with. In fact, there is evidence suggesting that the performance of some algorithms on real data is much worse than that observed on synthetic data generated using QUEST (Zheng et al., 2001).

MAIN FOCUS

The purpose of this short contribution is two-fold. First, additional arguments are provided supporting the view that real-life databases show structural properties that are very different from those of the data generated by QUEST. Secondly, a proposal is described for an alternative data generator that is simpler and more realistic than QUEST. The arguments are based on results described in Cooper & Zito (2007).
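To make the notions of support and frequent item-set from the introduction concrete, here is a minimal Python sketch. The function names and the toy data are illustrative assumptions, not part of QUEST or of the material discussed below; the search is the naive level-wise strategy underlying algorithms such as Agrawal & Srikant's.

```python
def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    s = frozenset(itemset)
    return sum(1 for t in transactions if s <= t)

def frequent_itemsets(transactions, threshold):
    """Naive level-wise search: a set can only be frequent if all of its
    subsets are, so candidates are grown one item at a time from the
    frequent sets of the previous level."""
    items = set().union(*transactions)
    frequent = {}
    level = [frozenset([i]) for i in items]
    while level:
        level = [c for c in level if support(c, transactions) >= threshold]
        for c in level:
            frequent[c] = support(c, transactions)
        if not level:
            break
        size = len(level[0]) + 1
        # candidates for the next level: unions of frequent sets, one item larger
        level = list({a | b for a in level for b in level if len(a | b) == size})
    return frequent

transactions = [frozenset(t) for t in
                [{"bread", "milk"}, {"bread", "butter"},
                 {"bread", "milk", "butter"}, {"milk"}]]
freq = frequent_itemsets(transactions, threshold=2)
```

Every frequent item-set F recorded this way can then be split into disjoint parts A and B with A ∪ B = F to produce candidate rules A → B.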
Heavy-tail distributions in Market Basket Databases

To support the claim that real market-basket databases show structural properties that are quite different from those of the data generated by QUEST, Cooper and Zito analyzed empirically the distribution of item occurrences in four real-world retail databases widely used as test cases and publicly available from http://fimi.cs.helsinki.fi/data/. Figure 1 shows such a distribution (on a log-log scale) for two of these databases.

Figure 1. Log-log plots of the real-life data sets along with the best fitting lines

The authors suggest that in each case the empirical distribution may fit (over a wide range of values) a heavy-tailed distribution. Furthermore, they argue that the data generated by QUEST shows quite different properties, even though it has similar size and density. When the empirical analysis mentioned above is performed on data generated by QUEST (available from the same source), the results are quite different from those obtained for real-life retail databases (see Figure 2).

Figure 2. Log-log plots of the QUEST data sets along with the best fitting line

Differences between real-life and QUEST-generated databases have been found before (Zheng et al., 2001), in the transaction sizes. However, some of these differences may be ironed out by a careful choice of the numerous parameters that control the output of the QUEST generator. The results of Cooper and Zito may point to differences at a much deeper level.

A closer look at QUEST

Cooper and Zito also started a deeper theoretical investigation of the structural properties of the QUEST databases, proposing a simplified version of QUEST whose mathematical properties can be analyzed effectively. Like the original program, this simplified version returns two related structures: the actual database D and a collection T of potentially large item-sets (or patterns) that is used to populate D.
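The empirical analysis described above (tabulating how many items occur in exactly r transactions and fitting a straight line on a log-log scale) can be reproduced with a short script. This is only a sketch with illustrative names: for data following a power law N_r ≈ C r^(−z), the fitted slope estimates −z, while for QUEST-like data the points bend sharply away from any straight line.

```python
import math
from collections import Counter

def occurrence_distribution(transactions):
    """N_r: for each r, the number of items occurring in exactly r transactions."""
    item_deg = Counter(item for t in transactions for item in set(t))
    return Counter(item_deg.values())

def loglog_slope(dist):
    """Least-squares slope of log N_r against log r.
    For a power-law distribution N_r ~ C r^(-z) this estimates -z."""
    xs = [math.log(r) for r in dist]
    ys = [math.log(dist[r]) for r in dist]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

On a heavy-tailed data set the slope stays meaningful over a wide range of r, which is exactly the behaviour the authors report for the real-life retail databases.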
However, in the simplified model it is convenient to assume that each transaction is formed by the union of k elements of T, chosen independently and uniformly at random (with replacement). The patterns in T are generated by first selecting a random set of s items and then, for each of the other patterns, choosing (with replacement) some elements uniformly at random from those belonging to the last generated pattern and the remaining elements uniformly at random (with replacement) from the whole set of items, so that each pattern contains s items.

Let deg_D(v) (resp. deg_T(v)) denote the number of transactions in D (resp. patterns in T) containing item v, let l denote the number of patterns in T, and let N_r denote the number of items occurring in exactly r transactions. Assume that h, the total number of transactions, is a polynomial in n. It follows directly from the definition of the generation process given above that, for each item v, deg_D(v) has a binomial distribution with parameters h and

p_{k,l} = Σ_{i=1}^{k} C(k,i) (−1)^{i+1} E(deg_T(v)^i) / l^i,    (1)

where C(k,i) denotes a binomial coefficient, and the expected value of N_r is n C(h,r) (p_{k,l})^r (1 − p_{k,l})^{h−r}. Moreover, at least in the restricted case when s = 2, by studying the asymptotic distribution of deg_T(v) it is possible to prove that, for constant values of k and large values of n, p_{k,l} is approximately 2n^{−1} and N_r is very close to its expected value. Hence for large r, the proportion of items occurring in r transactions decays much faster than r^{−z} for any fixed z > 0. For instance, if k = 1, then

N_r / n ≈ C(h,r) (p_{k,l})^r (1 − p_{k,l})^{h−r}.

An Alternative Proposal

Cooper and Zito's study of synthetic database generators also points to possible alternatives to the IBM generator. In fact, a much more effective way of generating realistic databases is to build the database sequentially, adding the transactions one at a time and choosing the items in each transaction based on their (current) popularity (a mechanism known as preferential attachment).
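For concreteness, the simplified QUEST process analysed above might be sketched as follows. This is a Python sketch under stated assumptions: the fraction of each pattern inherited from the previously generated one is left as a free parameter (fixed at one half for illustration), since the exact mixing rule is a modelling choice.

```python
import random

def make_patterns(l, s, n, rng, inherit=0.5):
    """Pattern pool T of the simplified model: the first pattern is a random
    s-set; each later pattern draws its s items (with replacement) either
    from the previous pattern or uniformly from the whole universe of n items."""
    patterns = [set(rng.sample(range(n), s))]
    for _ in range(l - 1):
        prev = list(patterns[-1])
        p = set()
        for _ in range(s):
            if rng.random() < inherit:
                p.add(rng.choice(prev))       # reuse the last generated pattern
            else:
                p.add(rng.randrange(n))       # fresh uniform item
        patterns.append(p)
    return patterns

def make_database(h, k, patterns, rng):
    """Each of the h transactions is the union of k patterns chosen
    uniformly at random, with replacement, from the pool."""
    return [set().union(*(rng.choice(patterns) for _ in range(k)))
            for _ in range(h)]

rng = random.Random(0)
T = make_patterns(l=50, s=2, n=100, rng=rng)
D = make_database(h=1000, k=3, patterns=T, rng=rng)
```

With s = 2 and constant k, each item lands in a given transaction with probability close to 2/n, which is why the resulting occurrence counts are binomial rather than heavy-tailed.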
The database model proposed by Cooper and Zito (referred to as CoZi from now on) is in line with the proposal of Barabasi & Albert (1999), introduced to describe structures like the scientific author citation network or the world-wide web. Instead of assuming an underlying set of patterns T from which the transactions are built up, the elements of D are generated sequentially. At the start there is an initial set of e_0 transactions on n_0 existing items. CoZi can generate transactions based entirely on the n_0 initial items, but in general new items can also be added to newly defined transactions, so that at the end of the simulation the total number of items is n ≥ n_0. The simulation proceeds for a number of steps, generating a group of transactions at each step. For each group in the sequence there are four choices made by the simulation at step t:

1. The type of transaction. An OLD transaction (chosen with probability 1 − α) consists of items occurring in previous transactions. A NEW transaction (chosen with probability α) consists of a mix of new items and items occurring in previous transactions.

2. The number of transactions in a group, m_O(t) (resp. m_N(t)) for OLD (resp. NEW) transactions. This can be a fixed value, or given by any discrete distribution with mean m_O (resp. m_N). Grouping corresponds, e.g., to the persistence of a particular item in a group of transactions in the QUEST model.

3. The transaction size. This can again be a constant, or given by a probability distribution with a fixed mean.

4. The method of choosing the items in a transaction. If transactions of type OLD (resp. NEW) are generated in a step, we assume that each item is selected using preferential attachment with probability P_O (resp. P_N) and uniformly at random otherwise.

The authors provide: 1. a proof that the CoZi model is fit for its purposes, and 2. details of a simple implementation in Java, available from the authors' web sites.
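The four choices above can be turned into a small generator. The sketch below is Python rather than the authors' Java implementation, with group sizes fixed at 1 and illustrative names and defaults; it maintains a list containing one entry per item occurrence, so that sampling uniformly from that list selects items with probability proportional to their current popularity, i.e. by preferential attachment.

```python
import random

def cozi(steps, alpha, size, p_pref, n0=5, e0=5, seed=0):
    """Sketch of a CoZi-style generator with group sizes fixed at 1.
    With probability alpha a NEW transaction introduces one fresh item;
    every other item is chosen preferentially (by current popularity)
    with probability p_pref and uniformly at random otherwise."""
    rng = random.Random(seed)
    db = [{rng.randrange(n0)} for _ in range(e0)]   # initial transactions
    occurrences = [i for t in db for i in t]        # one entry per occurrence
    n_items = n0
    for _ in range(steps):
        tr = set()
        if rng.random() < alpha:        # NEW step: add a brand-new item
            tr.add(n_items)
            n_items += 1
        while len(tr) < size:           # fill to the fixed transaction size
            if rng.random() < p_pref:
                tr.add(rng.choice(occurrences))   # preferential attachment
            else:
                tr.add(rng.randrange(n_items))    # uniform choice
        db.append(tr)
        occurrences.extend(tr)
    return db

db = cozi(steps=500, alpha=0.5, size=3, p_pref=0.5)
```

Popular items keep attracting further occurrences, the rich-get-richer effect responsible for the heavy-tailed occurrence distributions observed in real retail data.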
More specifically, following Cooper (2006), they prove that, provided the number of transactions is large, with probability approaching one the distribution of item occurrences in D follows a power-law distribution with parameter z = 1 + 1/η, where α denotes the probability of generating NEW transactions and

η = (α m_N P_N + (1 − α) m_O P_O) / (α m_N + (1 − α) m_O).

In other words, the number of items occurring r times after t steps of the generation process (i.e. after roughly (α m_N + (1 − α) m_O) t transactions) is approximately C t r^{−z} for large r and some constant C > 0.

Turning to examples, in the simplest case the group sizes are fixed (say m_N(t) = m_O(t) = 1 always) and the preferential attachment behaviour is the same (P_N = P_O = P). Thus η = P, and z = 1 + 1/P. A simple implementation in Java of the CoZi generator, based on these settings (and on the additional assumption that the transaction sizes are given by the absolute value of a normal distribution), is available from http://www.csc.liv.ac.uk/~michele/soft.html. Figure 3 displays item distribution plots obtained by running the program with P = 50% for different values of h and of the mean transaction size.

Figure 3. Log-log plots of two CoZi data sets along with the best fitting line

While there are many alternative models for generating heavy-tailed data (Mitzenmacher, 2004; Watts, 2004) and different communities may prefer to use alternative processes, we contend that synthetic data generators of this type should be a natural choice for the testing of ARM algorithms.

FUTURE TRENDS

Given the ever-increasing need to store and analyse large amounts of data, we anticipate that tasks like ARM, classification, or pattern analysis will acquire increasing importance. In this context it will be desirable to have sound mathematical models that can be used to generate synthetic data or test cases. As a step in this direction, in our recent work we looked at models for market-basket databases. Another interesting area one could look at is that of generators of classifier test cases.
We believe that the Theory of Random Processes and, more generally, Probability Theory could provide a range of techniques to devise realistic models and prove interesting structural properties of the resulting data sets.

CONCLUSION

The association rule mining problem is a very important topic within the Data Mining research field. We provided additional evidence supporting the claim that, although a large array of algorithms and techniques exists to solve this problem, the testing of such algorithms is often done by resorting to unrealistic synthetic data generators. We also put forward an alternative synthetic data generator that is simple to use, mathematically sound, and generates data that is more realistic than that obtained from other generators.

REFERENCES

Agrawal, R., & Srikant, R. (1994) Fast algorithms for mining association rules in large databases. In VLDB '94: Proceedings of the 20th International Conference on Very Large Data Bases, San Francisco, USA, pp. 487-499.

Barabasi, A., & Albert, R. (1999) Emergence of scaling in random networks. Science, 286:509-512.

Bayardo, R. J. (1998) Efficiently mining long patterns from databases. In SIGMOD '98: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, Seattle, USA, pp. 85-93.

Brin, S., et al. (1997) Dynamic itemset counting and implication rules for market basket data. In SIGMOD '97: Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, USA, pp. 255-264.

Cooper, C. (2006) The age specific degree distribution of web-graphs. Combinatorics, Probability and Computing, 15(5):637-661.

Cooper, C., & Zito, M. (2007) Realistic synthetic data for testing association rule mining algorithms for market basket databases. In PKDD.

Faloutsos, M., Faloutsos, P., & Faloutsos, C. (1999) On power-law relationships of the internet topology. ACM SIGCOMM Computer Communication Review, 29(4):251-262.

Han, J., et al. (2000)
Mining frequent patterns without candidate generation. In SIGMOD '00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, USA, pp. 1-12.

Mandelbrot, B. (1960) The Pareto-Levy law and the distribution of income. International Economic Review, 1:79-106.

Mitzenmacher, M. (2004) A brief history of generative models for power-law and log-normal distributions. Internet Mathematics, 1(2):226-251.

Redner, S. (1998) How popular is your paper? An empirical study of the citation distribution. European Physical Journal B, 4:401-404.

Steyvers, M., & Tenenbaum, J. B. (2005) The large-scale structure of semantic networks: statistical analysis and a model of semantic growth. Cognitive Science, 29:41-78.

Toivonen, H. (1996) Sampling large databases for association rules. In VLDB '96: Proceedings of the 22nd International Conference on Very Large Data Bases, San Francisco, USA: Morgan Kaufmann Publishers Inc., pp. 134-145.

Yule, U. (1925) A mathematical theory of evolution based on the conclusions of Dr. J. C. Willis, F.R.S. Philosophical Transactions of the Royal Society of London, 213 B:21-87.

Watts, D. J. (2004) The "new" science of networks. Annual Review of Sociology, 30:243-270.

Zaiane, O., El-Hajj, M., Li, Y., & Luk, S. (2005) Scrutinizing frequent pattern discovery performance. In ICDE '05: Proceedings of the 21st International Conference on Data Engineering, Washington DC, USA: IEEE Computer Society, pp. 1109-1110.

Zaki, M. J., & Ogihara, M. (1998) Theoretical foundations of association rules. In Proceedings of the 3rd SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, Seattle, USA, pp. 1-8.

Zheng, Z., et al. (2001) Real world performance of association rule algorithms. In KDD '01: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, USA, pp. 401-406.

Zipf, G. K. (1949) Human Behaviour and the Principle of Least Effort. Addison-Wesley.
KEY TERMS AND THEIR DEFINITIONS

Association Rule Mining: The problem of identifying a set of relations in a binary-valued attribute set which describe the likely coexistence of groups of attributes.

Best Fitting Line: On a data diagram, the line drawn as near as possible to the various points so as to best represent the trend being graphed. The sums of the displacements of the points on either side of the line should be equal.

Heavy-Tailed Distribution: A statistical distribution is said to have a heavy tail if the fraction of the population exceeding a given value x decays more slowly than e^{−cx}, for every c > 0, as x tends to infinity.

Power-Law Distribution: A statistical distribution is said to follow a power-law decay if the fraction of the population exceeding a given value x decays like x^{−c}, for some c > 0, as x tends to infinity.

Preferential Attachment: In a system of numerical quantities randomly modified as time goes by, a positive feedback mechanism by which larger increases tend to accumulate on already large quantities.

Random Process: A sequence of random variables.