Synthetic Databases
Realistic Data for Testing Rule Mining Algorithms
Colin Cooper
Department of Computer Science
King's College
London, WC2R 2LS
United Kingdom
voice: +44 20-7848-2002
Michele Zito*
Department of Computer Science
University of Liverpool
Ashton Street
Liverpool, L69 3BX
United Kingdom
voice: +44 151-795-4263
(* Corresponding author)
The Association Rule Mining (ARM) problem is a well-established topic in the field of
Knowledge Discovery in Databases. The problem addressed by ARM is to identify a set of
relations (associations) in a binary valued attribute set which describe the likely coexistence of
groups of attributes. To this end it is first necessary to identify sets of items that occur frequently,
i.e. those subsets F of the available set of attributes I for which the support (the number of
times F occurs in the dataset under consideration), exceeds some threshold value. Other criteria
are then applied to these item-sets to generate a set of association rules, i.e. relations of the form
$A \rightarrow B$, where A and B represent disjoint subsets of a frequent item-set F such that $A \cup B = F$.
A vast array of algorithms and techniques has been developed to solve the ARM problem. The
algorithms of Agrawal & Srikant (1994), Bayardo (1998), Brin et al. (1997), Han et al. (2000),
and Toivonen (1996), are only some of the best-known heuristics.
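The two stages described above (finding frequent item-sets, then deriving rules from them) can be illustrated with a minimal brute-force sketch. It is not one of the cited algorithms. The function names are hypothetical, and rule confidence is assumed here as the "other criterion", since the text does not name one.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)

def frequent_itemsets(transactions, threshold, max_size=3):
    """Brute force: all item-sets (up to max_size) whose support exceeds threshold."""
    universe = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(universe, size):
            s = support(candidate, transactions)
            if s > threshold:
                result[frozenset(candidate)] = s
    return result

def rules(frequent, transactions, min_confidence):
    """Rules A -> B with A, B disjoint and non-empty, A | B = F, F frequent.

    Assumption: a rule is kept when support(F) / support(A) >= min_confidence.
    """
    out = []
    for f, sup_f in frequent.items():
        for size in range(1, len(f)):
            for a in combinations(sorted(f), size):
                a = frozenset(a)
                conf = sup_f / support(a, transactions)
                if conf >= min_confidence:
                    out.append((a, f - a, conf))
    return out

# Toy data set: five transactions over the items a, b, c.
toy = [frozenset("abc"), frozenset("ab"), frozenset("ac"),
       frozenset("b"), frozenset("abc")]
frequent = frequent_itemsets(toy, threshold=2)
for a, b, conf in rules(frequent, toy, min_confidence=0.9):
    print(sorted(a), "->", sorted(b), conf)
```

The exhaustive enumeration is exponential in the number of items, which is why the heuristics cited above exist; the sketch only fixes the definitions.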
There has been growing recent interest in various areas of Computer Science in the class of so-called
heavy tail statistical distributions. Distributions of this kind had been used in the past to
describe word frequencies in text (Zipf, 1949), the distribution of animal species (Yule, 1925)
and of income (Mandelbrot, 1960), scientific citation counts (Redner, 1998) and many other
phenomena. They have been used recently to model various statistics of the web and other
complex networks (Barabasi & Albert, 1999; Faloutsos et al., 1999; Steyvers &
Tenenbaum, 2005).
Although the ARM problem is well studied, several fundamental issues are still unsolved.
In particular the evaluation and comparison of ARM algorithms is a very difficult task (Zaiane et
al., 2005), and it is often tackled by resorting to experiments carried out using data generated by
the well established QUEST program from the IBM Quest Research Group (Agrawal & Srikant,
1994). The intricacy of this program makes it difficult to draw theoretical predictions on the
behaviour of the various algorithms on input produced by this program. Empirical comparisons
made in this way are also difficult to generalize because of the wide range of possible variation,
both in the characteristics of the data (the structural characteristics of the synthetic databases
generated by QUEST are governed by a dozen interacting parameters), and in the environment
in which the algorithms are being applied. It has also been noted (Brin et al., 1997) that data sets
produced using the QUEST generator might be inherently not the hardest to deal with. In fact
there is evidence suggesting that the performance of some algorithms on real data is much
worse than that found on synthetic data generated using QUEST (Zheng et al., 2001).
The purpose of this short contribution is two-fold. First of all additional arguments are
provided supporting the view that real-life databases show structural properties that are very
different from those of the data generated by QUEST. Secondly, a proposal is described for an
alternative data generator that is simpler and more realistic than QUEST. The arguments are
based on results described in Cooper & Zito (2007).
Heavy-tail distributions in Market Basket Databases
To support the claim that real market-basket databases show structural properties that are
quite different from those of the data generated by QUEST, Cooper and Zito analyzed
empirically the distribution of item occurrences in four real-world retail databases widely used as
test cases and publicly available. Figure 1 shows
this distribution (on a log-log scale) for two of these databases.
Figure 1. Log-log plots of the real-life data sets along with the best fitting lines
The authors suggest that in each case the empirical distribution may fit (over a wide range of
values) a heavy-tailed distribution. Furthermore they argue that the data generated by QUEST
shows quite different properties (even though it has similar size and density). When the empirical
analysis mentioned above is performed on data generated by QUEST (available from the same
source) the results are quite different from those obtained for real-life retail databases (see Figure 2).
Figure 2. Log-log plots of the QUEST data sets along with the best fitting line
Differences have been found before (Zheng et al., 2001) in the transaction sizes of the real-life
vs. QUEST generated databases. However some of these differences may be ironed out by a
careful choice of the numerous parameters that control the output of the QUEST generator. The
results of Cooper and Zito may point to possible differences at a much deeper level.
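The kind of empirical analysis discussed in this section can be sketched as follows: count how many items occur in exactly r transactions, then fit a straight line to the points on a log-log scale. This is a minimal illustration, not the authors' procedure. The function names are hypothetical, and an ordinary least-squares fit is assumed for the best fitting line.

```python
import math
from collections import Counter

def occurrence_distribution(transactions):
    """Map r -> number of items that occur in exactly r transactions."""
    item_counts = Counter(item for t in transactions for item in set(t))
    return Counter(item_counts.values())

def loglog_fit(distribution):
    """Least-squares line through the points (log r, log N_r).

    Returns (slope, intercept). Needs at least two distinct values of r.
    For data following a power law N_r ~ C * r**(-z) the slope is close to -z.
    """
    xs = [math.log(r) for r in distribution]
    ys = [math.log(count) for count in distribution.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Usage on a tiny hand-made database (far too small for a meaningful fit).
db = [{"a", "b"}, {"a"}, {"a", "c"}, {"b", "d"}]
print(dict(occurrence_distribution(db)))
```

A slope that stays roughly constant over a wide range of r is what a heavy-tailed fit looks like on such a plot; a curve that bends sharply downwards indicates a much faster decay.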
A closer look at QUEST
Cooper and Zito also start a deeper theoretical investigation of the structural properties of
the QUEST databases, proposing a simplified version of QUEST whose mathematical properties
can be analyzed effectively. Like the original program, this simplified version returns two
related structures: the actual database D and a collection T of potentially large item-sets (or
patterns) that is used to populate D. However, in the simplified model it is convenient to assume
that each transaction is formed by the union of k elements of T, chosen independently and uniformly
at random (with replacement). The patterns in T are generated by first selecting a random set of
s items, and then, for each of the other patterns, by choosing (with replacement) $\rho$ elements
uniformly at random from those belonging to the last generated pattern and the $s - \rho$ remaining
elements uniformly at random (with replacement) from the whole set of items.
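The simplified generation process just described can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code. The function name and the parameter names `n` (number of items), `l` (number of patterns), `rho` (number of elements inherited from the previous pattern) and `seed` are hypothetical labels.

```python
import random

def simplified_quest(n, l, s, rho, k, h, seed=0):
    """Return (database, patterns) for the simplified QUEST model.

    n: number of items, l: number of patterns in T, s: pattern size,
    rho: draws taken from the previous pattern, k: patterns per transaction,
    h: number of transactions. Requires s <= n and 0 <= rho <= s.
    """
    rng = random.Random(seed)
    # First pattern: a random set of s distinct items.
    patterns = [set(rng.sample(range(n), s))]
    for _ in range(l - 1):
        last = sorted(patterns[-1])
        # rho draws (with replacement) from the last generated pattern ...
        new = {rng.choice(last) for _ in range(rho)}
        # ... and s - rho draws (with replacement) from the whole item set.
        new |= {rng.randrange(n) for _ in range(s - rho)}
        patterns.append(new)
    # Each transaction is the union of k patterns drawn with replacement.
    database = [
        set().union(*(rng.choice(patterns) for _ in range(k)))
        for _ in range(h)
    ]
    return database, patterns

database, patterns = simplified_quest(n=50, l=10, s=4, rho=2, k=3, h=20)
print(len(database), len(patterns))
```

Because the draws are made with replacement, a pattern may contain fewer than s distinct items and a transaction fewer than k distinct patterns.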
Let $\deg_D(v)$ (resp. $\deg_T(v)$) denote the number of transactions in D (resp. patterns in T)
containing item $v$, and let $N_r$ be the number of items occurring in exactly $r$ transactions.
Assume that $h$, the total number of transactions, is a polynomial in $n$, the number of items, and
that $l$, the number of patterns in T, satisfies $l \propto n$. It follows directly from the definition
of the generation process given above that, for each item $v$, $\deg_D(v)$ has a binomial
distribution with parameters $h$ and

$$p_{k,l} = \sum_{i=1}^{k} \binom{k}{i} (-1)^{i+1} \frac{E(\deg_T(v)^i)}{l^i},$$

and the expected value of $N_r$ is $n \binom{h}{r} (p_{k,l})^r (1 - p_{k,l})^{h-r}$.
Moreover, at least in the restricted case when $s = 2$, by studying the asymptotic distribution of
$\deg_T(v)$ one can show that, for fixed $k$ and large values of $n$, $p_{k,l}$ is
approximately $2 k n^{-1}$ and $N_r$ is very close to its expected value. Hence for large $r$, the proportion
of items occurring in $r$ transactions decays much faster than $r^{-z}$ for any fixed $z > 0$. For instance,
if $k = 1$, then

$$\frac{N_r}{n} \approx \binom{h}{r} (p_{k,l})^r (1 - p_{k,l})^{h-r}.$$
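The alternating sum that defines $p_{k,l}$ can be checked with a short derivation, given here as a sketch under the assumption that $l$ denotes the number of patterns in T. Conditional on T, a pattern drawn uniformly from T contains $v$ with probability $\deg_T(v)/l$, and a transaction is the union of $k$ independent draws, so the binomial theorem gives:

```latex
\begin{align*}
\Pr[v \in \text{transaction} \mid T]
  &= 1 - \left(1 - \frac{\deg_T(v)}{l}\right)^{k} \\
  &= \sum_{i=1}^{k} \binom{k}{i} (-1)^{i+1} \frac{\deg_T(v)^{i}}{l^{i}} .
\end{align*}
```

Taking the expectation over T term by term yields the sum of the terms $\binom{k}{i} (-1)^{i+1} E(\deg_T(v)^i) / l^i$. For $k = 1$ only the first term survives and the probability reduces to $E(\deg_T(v))/l$.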
An Alternative Proposal
Cooper and Zito's study of synthetic database generators also points to possible alternatives
to the IBM generator. In fact, a much more effective way of generating realistic databases is
based on building the database sequentially, adding the transactions one at a time, choosing the
items in each transaction based on their (current) popularity (a mechanism known as preferential
attachment). The database model proposed by Cooper and Zito (which will be referred to as
CoZi from now on) is in line with the proposal of Barabasi & Albert (1999), introduced to
describe structures like the scientific author citation network or the world-wide web. Instead of
assuming an underlying set of patterns T from which the transactions are built up, the elements
of D are generated sequentially. At the start there is an initial set of $e_0$ transactions on $n_0$
existing items. CoZi can generate transactions based entirely on the $n_0$ initial items, but in
general new items can also be added to newly defined transactions, so that at the end of the
simulation the total number of items is $n \ge n_0$. The simulation proceeds for a number of steps,
generating groups of transactions at each step. For each group in the sequence there are four
choices made by the simulation at step $t$:
1. The type of transaction. An OLD transaction (chosen with probability $1 - \alpha$) consists of items
occurring in previous transactions. A NEW transaction (chosen with probability $\alpha$) consists of a
mix of new items and items occurring in previous transactions.
2. The number of transactions in a group, $m_O(t)$ (resp. $m_N(t)$) for OLD (resp. NEW) transactions.
This can be a fixed value, or given by any discrete distribution with mean $m_O$ (resp. $m_N$). Grouping
corresponds to e.g. the persistence of a particular item in a group of transactions in QUEST.
3. The transaction size. This can again be a constant, or given by a probability distribution with
mean $\mu$.
4. The method of choosing the items in the transaction. If transactions of type OLD (resp. NEW)
are chosen in a step we assume that each of them is selected using preferential attachment with
probability $P_O$ (resp. $P_N$) and randomly otherwise.
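The four choices above can be turned into a small simulation. The following is a minimal sketch, not the authors' Java implementation, and it makes several assumptions: group sizes and transaction size are fixed, each NEW group introduces one new item shared by all its transactions, the preferential/uniform choice is made independently for each item, and the initial transactions are uniform random subsets. All function and parameter names are hypothetical.

```python
import random

def cozi(steps, alpha, mu, p_new, p_old, m_new=1, m_old=1, n0=10, e0=5, seed=0):
    """Return (database, number_of_items) after `steps` groups of transactions.

    alpha: probability of a NEW group, mu: (fixed) transaction size,
    p_new / p_old: probability of a preferential choice in NEW / OLD groups.
    """
    if n0 < mu or e0 < 1:
        raise ValueError("need n0 >= mu and e0 >= 1")
    rng = random.Random(seed)
    n_items = n0
    # Assumption: initial transactions are uniform random mu-subsets of n0 items.
    database = [set(rng.sample(range(n0), mu)) for _ in range(e0)]
    # One entry per item occurrence: a uniform draw from this list picks an
    # item with probability proportional to its current popularity.
    occurrences = [item for t in database for item in t]

    def pick_existing(count, p_pref, n_existing):
        chosen = set()
        while len(chosen) < count:
            if rng.random() < p_pref:
                chosen.add(rng.choice(occurrences))    # preferential attachment
            else:
                chosen.add(rng.randrange(n_existing))  # uniform choice
        return chosen

    for _ in range(steps):
        if rng.random() < alpha:
            # NEW group: one new item plus mu - 1 existing items per transaction.
            new_item = n_items
            group = [pick_existing(mu - 1, p_new, n_items) | {new_item}
                     for _ in range(m_new)]
            n_items += 1
        else:
            # OLD group: mu existing items per transaction.
            group = [pick_existing(mu, p_old, n_items) for _ in range(m_old)]
        for t in group:
            database.append(t)
            occurrences.extend(t)
    return database, n_items

db, n = cozi(steps=200, alpha=0.3, mu=4, p_new=0.5, p_old=0.5)
print(len(db), n)
```

Keeping one list entry per item occurrence makes a preferential choice a single uniform draw, at the cost of memory proportional to the total size of the database.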
The authors provide:
1. a proof that the CoZi model is fit for its purposes, and
2. details of a simple implementation in Java, available from the authors' web sites.
More specifically, following Cooper (2006) they prove that, provided that the number of
transactions is large, with probability approaching one, the distribution of item occurrence in D
follows a power law with parameter $z = 1 + 1/\eta$, where

$$\eta = \frac{\alpha m_N (\mu - 1) P_N + (1 - \alpha) m_O \mu P_O}{\mu (\alpha m_N + (1 - \alpha) m_O)}.$$

Here $\alpha$ is the probability of a NEW group and $\mu$ the mean transaction size. In other words,
the number of items occurring $r$ times after $t$ steps of the generation process is approximately
$C t r^{-z}$ for large $r$ and some constant $C > 0$.
Turning to examples, in the simplest case, the group sizes are fixed (say
$m_N(t) = m_O(t) = 1$ always) and the preferential attachment behaviour is the same
($P_N = P_O = P$). Thus $\eta = P (1 - \alpha/\mu)$, and $z = 1 + \frac{\mu}{(\mu - \alpha) P}$.
A simple implementation in Java of the CoZi generator is based on these settings, together with
an additional assumption on the distribution of the transaction sizes.
Figure 3 displays item distribution plots
obtained from running the program with $P = 50\%$ for different values of $h$ and $\alpha$.
Figure 3. Log-log plots of two CoZi data sets along with the best fitting line
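To get a feeling for the exponents such a generator produces, the power-law parameter can be evaluated numerically. The sketch below assumes the exponent has the form $z = 1 + 1/\eta$, with $\eta$ the ratio between the expected number of item occurrences added per step by preferential attachment and the expected total number of item occurrences added per step. The function and argument names are hypothetical.

```python
def power_law_exponent(alpha, mu, m_new, m_old, p_new, p_old):
    """Exponent z = 1 + 1/eta for the CoZi model (assumed form, see lead-in).

    alpha: probability of a NEW group, mu: mean transaction size,
    m_new / m_old: mean group sizes, p_new / p_old: preferential probabilities.
    """
    # Expected occurrences added per step through preferential attachment.
    preferential = (alpha * m_new * (mu - 1) * p_new
                    + (1 - alpha) * m_old * mu * p_old)
    # Expected total occurrences added per step.
    total = mu * (alpha * m_new + (1 - alpha) * m_old)
    eta = preferential / total
    return 1 + 1 / eta

# Simplest case: unit group sizes and a common preferential probability P.
alpha, mu, P = 0.5, 5, 0.5
z = power_law_exponent(alpha, mu, 1, 1, P, P)
closed_form = 1 + mu / ((mu - alpha) * P)
print(z, closed_form)
```

Lowering P or raising the rate of new items makes $\eta$ smaller and hence z larger, that is, a thinner tail.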
While there are many alternative models for generating heavy-tailed data (Mitzenmacher,
2004; Watts, 2004) and different communities may prefer to use alternative processes, we
contend that synthetic data generators of this type should be a natural choice for the testing of
ARM algorithms.
Given the ever-increasing need to store and analyse large amounts of data, we anticipate
that tasks like ARM or classification or pattern analysis will acquire increasing importance. In
this context it will be desirable to have sound mathematical models that could be used to
generate synthetic data or test cases. As a step in this direction, in our recent work, we looked at
models for market-basket databases. Another interesting area one could look at is that of
generators of classifier test cases. We believe that the Theory of Random Processes, and, more
generally, Probability Theory, could provide a range of techniques to devise realistic models and
prove interesting structural properties of the resulting data sets.
The association rule mining problem is a very important topic within the Data Mining
research field. We provided additional evidence supporting the claim that, although a large array
of algorithms and techniques exists to solve this problem, the testing of such algorithms is often
done by resorting to unrealistic synthetic data generators. We also put forward an alternative
synthetic data generator that is simple to use, mathematically sound, and generates data that is
more realistic than that obtained from other generators.
Association Rule Mining: The problem addressed is to identify a set of relations in a binary
valued attribute set which describe the likely coexistence of groups of attributes.
Best fitting line: On a data diagram, this is the line drawn as near as possible to the
various points so as to best represent the trend being graphed. The sums of the
displacements of the points on either side of the line should be equal.
Heavy-tailed distribution: A statistical distribution is said to have a heavy tail if the fraction of
the population up to a certain value x decays more slowly than $e^{-cx}$, for some $c > 0$, as x tends
to infinity.
Power-law distribution: A statistical distribution is said to follow a power law decay if the
fraction of the population up to a certain value x decays like $x^{-c}$, for some $c > 0$, as x tends to
infinity.
Preferential attachment: In a system of numerical quantities, randomly modified as time goes
by, it is a positive feedback mechanism by which larger increases tend to accumulate on already
large quantities.
Random Process: A random process is a sequence of random variables.