Discovering Association Patterns based on Mutual Information

Bon K. Sy
Queens College/CUNY, Computer Science Department, Flushing NY 11367, U.S.A.
bon@bunny.cs.qc.edu

Abstract. Identifying and expressing data patterns in the form of association rules is a commonly used technique in data mining. Typically, association rule discovery is based on two criteria: support and confidence. In this paper we briefly discuss the insufficiency of these two criteria, and argue for the importance of including interestingness/dependency as a criterion for (association) pattern discovery. From a practical computational perspective, we show how the proposed criterion, grounded on interestingness, can be used to improve the efficiency of the pattern discovery mechanism. Furthermore, we present a probabilistic inference mechanism that provides an alternative to pattern discovery. An illustrative example and a preliminary study evaluating the proposed approach are presented.

1 Introduction

In data mining, an association rule is typically expressed in the form A -> B, but the definition of an association rule may vary slightly among disciplines and applications. For example, in philosophical logic [1] an association rule A -> B over two binary-valued logic variables with 80% certainty could mean that 20% of the instances in its "frame of discernment" bear the relationship (A: True, B: False); in other words, the certainty factor is a measure of the "truthfulness" of a rule in the world. In uncertain reasoning, A -> B with 80% certainty means there is an 80% chance that B happens if A happens; i.e., Pr(B|A) = 0.8. In data mining, an association rule A -> B is associated with two measures: support and confidence. Support is a measure of the significance of the presence of (A B) in the sample population of interest, while confidence is a measure of the antecedence/consequence relationship, much as in uncertain reasoning.
An example of such an association rule in data mining could be: 80% of the moviegoers for "The Lord of the Rings" went on to buy the book, and such a population accounts for 20% of the entire sample population. Support and confidence are two measures widely used in data mining with the objective of detecting data patterns that exhibit antecedence/consequence relationships. However, these two measures also present conceptual and computational challenges. Consider the example above. Let A=1 denote the moviegoers watching "The Lord of the Rings", and B=1 denote the buyers of the book. Ideally, from the perspective of the utility of an association rule, we want both Pr(A=1 ∩ B=1) and Pr(B=1|A=1) to be high. Now suppose Pr(A=1) = Pr(B=1) = 0.8 and Pr(A=1 ∩ B=1) = 0.64. The antecedence/consequence relationship Pr(B=1|A=1) = 0.8 is quite misleading, since A and B are independent of each other at the event level (because Pr(B=1|A=1) = Pr(B=1) = 0.8). Even subtler, an association rule A -> B manifests an antecedence/consequence relationship that suggests a time precedence relationship; i.e., B happens after A. But suppose the population is English literature students who have an assignment to write critiques of the story, and let C=1 represent the English literature students with such an assignment. It is then no surprise that the antecedence/consequence relationships are in fact C -> A and C -> B. And since watching the movie before reading the book saves time on getting an idea of the story, it is natural for students to watch the movie first! From the observed data, if we do not know about C=1, we may end up concluding A -> B, which is a fallacy. This situation is referred to as spurious association [2] and has been known for a long time in the philosophy community.
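The independence check in this example can be sketched in a few lines of code; the numbers are the ones used above, and the lift function (support divided by the product of the marginals) is simply the ratio inside the mutual information measure discussed later.

```python
# Sketch: confidence alone can be misleading when A and B are independent.
# Numbers from the running example: Pr(A)=Pr(B)=0.8, Pr(A and B)=0.64.

def confidence(p_ab, p_a):
    """Confidence of the rule A -> B, i.e., Pr(B|A)."""
    return p_ab / p_a

def lift(p_ab, p_a, p_b):
    """Lift = Pr(A and B) / (Pr(A) Pr(B)); 1.0 means independence."""
    return p_ab / (p_a * p_b)

p_a, p_b, p_ab = 0.8, 0.8, 0.64

print(round(confidence(p_ab, p_a), 3))  # 0.8: looks like a strong rule
print(round(lift(p_ab, p_a, p_b), 3))   # 1.0: but A and B are independent
```

A confidence of 0.8 suggests a strong rule, yet the lift of 1.0 shows the joint probability is exactly what independence predicts.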
It is well known that a fallacy due to spurious association can only be disproved; we may never be able to prove the truthfulness of an association rule that manifests an antecedence/consequence relationship. Nevertheless, it is possible to examine the "interestingness" of an association: whether the events in a data pattern are independent of each other or not [3], [4]. The objective of this paper is to investigate information-statistical criteria for discovering data patterns that exhibit interesting associations. Our primary goal is to introduce an information-statistical measure with an elegant statistical convergence property for discovering association patterns. The proposed approach is more than just adding another constraint; we will show how it can reduce the computational cost through probabilistic inference of high order patterns from low order patterns. In section 2 we formulate the problem and analyze the complexity of discovering association rules/patterns. In section 3 the state-of-the-art approach for discovering association rules/patterns using the a priori property is discussed. In section 4 we present a novel probabilistic inference approach based on model abstraction for reasoning about high order association patterns from low order association patterns. In section 5 a modified a priori algorithm that integrates the probabilistic inference approach is detailed. To evaluate the effectiveness of the proposed approach, a preliminary study using a data set about cereals is reported in section 6. In section 7 we summarize the contributions of this research.

2 Problem formulation and analysis

Let X = {x1, x2, ..., xn} be a set of n categories, and D = {D1, D2, ..., Dn} be the domain set of the corresponding categories.
A domain Di is a mutually exclusive set of items of category xi, including a null value, if necessary, to indicate no item selection from the category. For the sake of discussion, we assume each domain carries m items; i.e., |D1| = |D2| = ... = |Dn| = m. An item set transaction is represented by Di x Dj x ... x Dk, where {Di, Dj, ..., Dk} is a subset of D. Let T = {t1, ..., tn} be the set of all possible transactions. An association pattern is a transaction with at least two items. Let A = {a1, ..., av} be the set of all possible association patterns. It is not difficult to see that the number of all possible association patterns is

v = ∑k=2..n m^k C(n,k) = (m+1)^n - mn - 1, where C(n,k) = n!/(k!(n-k)!).

Consider a case of 11 categories (i.e., n = 11) and m = 4: the number of possible association patterns is 5^11 - 45. In other words, the number of association patterns grows exponentially with the number of categories [5]. A k-tuple association pattern (k > 1) is an item set over k categories; such a pattern will also be referred to as a pattern of kth order. For a given k-tuple association pattern, there are ∑i=1..k-1 C(k,i) possibilities for deriving an association rule. Since we have already mentioned the issue of spurious association, this paper focuses only on discovering significant association patterns rather than association rules. Even so, we need to answer a fundamental question: what properties are desirable for a significant association pattern? In other words, which association patterns should be considered significant? In this research, an association pattern ai consisting of items {i1, i2, ..., ip} is considered α-significant if it satisfies the following conditions:

1. The support for ai, defined as Pr(ai), is at least α; i.e., Pr(ai) ≥ α. (C1)
2. The interdependency of {i1, i2, ..., ip}, as measured by the mutual information measure MI(ai) = Log2 [Pr(i1, i2, ..., ip)/(Pr(i1)Pr(i2)...Pr(ip))], is significant.
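The counting argument above can be checked directly; a minimal sketch for the paper's example (n = 11 categories, m = 4 items per category):

```python
# Sketch: verify v = sum_{k=2}^{n} m^k C(n,k) = (m+1)^n - m*n - 1
# for the paper's example (n = 11, m = 4).

from math import comb

def count_patterns_sum(n, m):
    """Direct enumeration over pattern orders k = 2..n."""
    return sum(m**k * comb(n, k) for k in range(2, n + 1))

def count_patterns_closed(n, m):
    """Closed form from the binomial theorem: (m+1)^n - m*n - 1."""
    return (m + 1)**n - m*n - 1

n, m = 11, 4
assert count_patterns_sum(n, m) == count_patterns_closed(n, m) == 5**11 - 45
print(count_patterns_closed(n, m))  # 48828080
```

The closed form follows from the binomial expansion of (1+m)^n after dropping the k = 0 term (1) and the k = 1 term (mn).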
(C2)

As reported elsewhere [6], [7], the mutual information measure asymptotically converges to χ2. A convenient way to determine whether MI(ai) is significant is to compare the mutual information measure against a χ2-based threshold; i.e., MI(ai) is significant if MI(ai) ≥ β(χ2)γ, where β and γ are scaling factors and, due to Pearson, χ2 = ∑i (oi - ei)^2/ei. In other words, to determine whether any one of the (m+1)^n - mn - 1 association patterns is significant, we test it against the above two conditions. Clearly this is computationally prohibitive if we have to test all the patterns. Fortunately the famous a priori property [8], [9] allows us to prune away patterns in a lattice hierarchy that are extensions of a pattern that did not survive the test against the first condition (C1).

3 State-of-the-art: A priori and Mutual Information Measure

An association pattern is basically a collection of items. Suppose there is a 2-tuple association pattern a1 = {d1, d2}, where d1 is an item of the set D1 and d2 is an item of the set D2. We can consider an association pattern as an event in a probability space with random variable x1 assuming the value d1 and x2 assuming the value d2; i.e., Pr(a1) = Pr(x1:d1 ∩ x2:d2). An extension ea1 of a pattern a1 is a pattern consisting of an item set D' that is a proper superset of {d1, d2}; i.e., {d1, d2} ⊂ D'. It is not difficult to observe the property Pr(a1) ≥ Pr(ea1), since Pr(a1) = ∑D'-{d1,d2} Pr(ea1). Therefore, if a1 is not α-significant because Pr(a1) < α, ea1 cannot be α-significant, thus providing a pruning criterion during the process of identifying significant association patterns --- the essence of the a priori property. On the other hand, if the mutual information measure of a1 is not significant, this does not guarantee that an extension of a1 is not significant.
Consider ea1 = {x1 x2 x3}. If Pr(x1:d1 ∩ x2:d2 ∩ x3:d3)/Pr(x3:d3) > Pr(x1:d1 ∩ x2:d2), Pr(x1:d1 ∩ x2:d2 ∩ x3:d3) > Pr(x1:d1)Pr(x2:d2)Pr(x3:d3), and Pr(x1:d1 ∩ x2:d2) > Pr(x1:d1)Pr(x2:d2), then MI(ea1) > MI(a1). Furthermore, it is possible for an association pattern to satisfy (C1) but fail (C2) (the mutual information measure). Therefore, (C2) provides a complementary pruning criterion for discovering significant association patterns. In the process of deriving significant association patterns, we need one pass over all the transaction records to obtain the marginal probabilities required for the mutual information measure. To identify second order (2-tuple) association patterns, we permute every pair of items in a transaction record and keep track of the frequency information in the same first pass [10]. The frequency information is then used to derive the joint probability information needed for the mutual information measure and for determining α-significance. At the end of the pass, we can determine which association patterns --- as well as their extensions --- to discard, before commencing the next pass for identifying third-order patterns. In each pass, the complexity is proportional to the number of transaction records. In many applications such as on-line shopping, the number of transaction records tends to be very large. In such a case, the computational cost of deriving significant association patterns can be high even though the complexity is linear in the number of transaction records. A fundamental question is whether we can deduce high order association patterns from low order patterns without repeatedly scanning the transaction records, particularly when the number of transaction records is large. To answer this question, we explore a novel model abstraction process that permits probabilistic inference of high order association patterns.
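The single-pass derivation of second-order patterns described above can be sketched as follows. The toy transactions, the α value, and the simple MI > 0 cut-off are illustrative assumptions; the paper's actual significance test for (C2) uses the calibrated chi-square comparison.

```python
# Sketch of the single scanning pass: count item and item-pair frequencies,
# then keep the pairs that satisfy (C1) support >= alpha and a crude
# positive-dependence check via mutual information (stand-in for (C2)).

from itertools import combinations
from collections import Counter
from math import log2

def significant_pairs(transactions, alpha):
    n = len(transactions)
    item_count, pair_count = Counter(), Counter()
    for t in transactions:                      # one pass over the records
        items = sorted(t)
        item_count.update(items)
        pair_count.update(combinations(items, 2))
    result = {}
    for (a, b), c in pair_count.items():
        support = c / n
        if support < alpha:                     # (C1) support test
            continue
        mi = log2(support / ((item_count[a]/n) * (item_count[b]/n)))
        if mi > 0:                              # crude interestingness cut-off
            result[(a, b)] = (support, mi)
    return result

transactions = [{"movie", "book"}, {"movie", "book"}, {"dvd"}, {"book", "dvd"}]
print(significant_pairs(transactions, alpha=0.5))
```

Only ("book", "movie") survives: ("book", "dvd") fails (C1), and no other pair occurs.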
4 Model abstraction for probabilistic inference

Let's consider a case of 11 discrete random variables (categories) {x1, ..., x11}, where the domain of each variable consists of 4 states; i.e., xi can assume a value from the set {1, 2, 3, 4} for i = 1..11. Let's further assume (x1:1 x2:1), (x1:1 x3:1), and (x2:1 x3:1) have been identified as significant association patterns. We want to know whether the extension (x1:1 x2:1 x3:1) is a significant association pattern. A naive approach is to conduct another scanning pass to obtain the frequency information for the α-significance test and the mutual information measure. By the time (x1:1 x2:1), (x1:1 x3:1), and (x2:1 x3:1) were determined to be significant association patterns, we would already have obtained all the marginal probabilities Pr(xi) (for i = 1..11) and the joint probabilities Pr(x1:1 ∩ x2:1), Pr(x1:1 ∩ x3:1), and Pr(x2:1 ∩ x3:1). Let's assume Pr(x1:1) = 0.818, Pr(x2:1) = 0.909, Pr(x3:1) = 0.42, Pr(x1:1 ∩ x2:1) = 0.779, Pr(x1:1 ∩ x3:1) = 0.364, and Pr(x2:1 ∩ x3:1) = 0.403. Pr(x1:1 ∩ x2:1 ∩ x3:1) is the only missing information needed to determine whether (x1:1 x2:1 x3:1) is a significant association pattern. Suppose the value of α used for the α-significance test is 0.2. If (x1:1 x2:1 x3:1) is a significant association pattern, it must satisfy the following conditions:

Pr(x1:1) = 0.818:  ∑x2,x3 Pr(x1:1 ∩ x2 ∩ x3) = 0.818
Pr(x2:1) = 0.909:  ∑x1,x3 Pr(x1 ∩ x2:1 ∩ x3) = 0.909
Pr(x3:1) = 0.42:   ∑x1,x2 Pr(x1 ∩ x2 ∩ x3:1) = 0.42
Pr(x1:1 ∩ x2:1) = 0.779:  ∑x3 Pr(x1:1 ∩ x2:1 ∩ x3) = 0.779
Pr(x1:1 ∩ x3:1) = 0.364:  ∑x2 Pr(x1:1 ∩ x2 ∩ x3:1) = 0.364
Pr(x2:1 ∩ x3:1) = 0.403:  ∑x1 Pr(x1 ∩ x2:1 ∩ x3:1) = 0.403
Pr(x1:1 ∩ x2:1 ∩ x3:1) ≥ 0.2:  Pr(x1:1 ∩ x2:1 ∩ x3:1) - S = 0.2, where S is a non-negative slack variable
∑x1,x2,x3 Pr(x1 ∩ x2 ∩ x3) = 1

Although the domain of each variable x1, x2, and x3 consists of 4 states, we are interested in only one particular state of each variable; namely, x1 = 1, x2 = 1, and x3 = 1.
We can define a new state 0 to represent the irrelevant states {2, 3, 4}. In other words, the above example involves only 2^3 = 8 joint probability terms rather than 4^3 = 64, thus reducing the number of dimensions. In the above example, there are eight equality constraints and nine unknowns (one for each joint probability term, plus a slack variable). It is an underdetermined algebraic system with multiple solutions, where a solution is a vector of size 9. Among all the solutions, one corresponds to the true distribution that we are interested in. As discussed in our previous research [11], the underdetermined algebraic system provides a basis for formulating an optimization problem that maximizes the likelihood estimate of the statistical distribution of the data. Although the probabilistic inference approach just demonstrated offers an alternative to scanning the transaction records, there are three related questions about its utility. First, under what circumstances is the probabilistic inference approach more attractive than a straightforward scan? Second, how feasible and how expensive is it, computationally, to solve the optimization problem? Third, how accurate is the estimate of the joint probability information (for example, Pr(x1:1 ∩ x2:1 ∩ x3:1) in the above case)? To answer the first question, we first note that probabilistic inference is applied only to the high order association patterns that we are interested in. Unless the order of the association patterns is relatively low, the probabilistic inference process has to be applied one at a time to each association pattern of interest. Therefore, the probabilistic inference approach has a distinct advantage over a straightforward scan when (1) the number of transaction records is large, (2) each transaction record consists of a large number of categories, and (3) only a few high order association patterns are of interest.
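A minimal sketch of the structure of this underdetermined system: with the states collapsed to {0, 1}, all eight joint probability cells are linear in the single unknown t = Pr(x1:1 ∩ x2:1 ∩ x3:1), so the feasible set can be computed directly. This illustrates the constraint structure only; it is not the optimization solver of [11].

```python
# Sketch: the 8-cell joint distribution over the collapsed states {0,1} is
# determined up to one free parameter t = Pr(x1:1, x2:1, x3:1). Each cell is
# linear in t; the feasible interval is where all cells stay non-negative.
# Marginal and pairwise values are taken from the paper's example.

p1, p2, p3 = 0.818, 0.909, 0.42          # Pr(x1:1), Pr(x2:1), Pr(x3:1)
p12, p13, p23 = 0.779, 0.364, 0.403      # pairwise joint probabilities

def cells(t):
    """All 8 joint probabilities expressed via inclusion-exclusion in t."""
    return {
        (1, 1, 1): t,
        (1, 1, 0): p12 - t,
        (1, 0, 1): p13 - t,
        (0, 1, 1): p23 - t,
        (1, 0, 0): p1 - p12 - p13 + t,
        (0, 1, 0): p2 - p12 - p23 + t,
        (0, 0, 1): p3 - p13 - p23 + t,
        (0, 0, 0): 1 - p1 - p2 - p3 + p12 + p13 + p23 - t,
    }

# Cells with coefficient +t give lower bounds; cells with -t give upper bounds.
lo = max(0, p12 + p13 - p1, p12 + p23 - p2, p13 + p23 - p3)
hi = min(p12, p13, p23, 1 - p1 - p2 - p3 + p12 + p13 + p23)
print(round(lo, 3), round(hi, 3))  # feasible range for Pr(x1:1, x2:1, x3:1)
```

The feasible interval is [0.347, 0.364]; at t = 0.364 the cells reproduce the maximum-likelihood solution reported later in this section, and the whole interval satisfies the support constraint t ≥ 0.2.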
As we reported elsewhere [11], the problem of probabilistic inference, formulated as an optimization problem under the principle of minimum biased information, can be solved quite efficiently. In practice, we can solve an optimization problem with some 300 variables within a minute on a 450 MHz personal computer. For data mining problems, some 300 variables translates to 8th-order association patterns (i.e., trunc(Log2 300)). In practice, it is highly unlikely to have significant association patterns of order seven or above. The third question is perhaps the most challenging one. From the perspective of computational geometry, probabilistic inference is a search process in a high dimensional probability sub-space defined by the (in)equality constraints [12]. The error percentage, defined by the normalized distance between the estimated optimal joint probability and the true joint probability, increases as the order of the association patterns increases. This is because the joint probability (support) of an association pattern decreases as the order increases, thus increasing the error sensitivity. As a result, when the estimated joint probability of an association pattern is used in the mutual information measure to determine its significance, the asymptotic convergence of the mutual information measure towards the chi-square distribution needs to be calibrated. As reported elsewhere [6], [7], the mutual information measure of two random variables (x1 x2) has the following asymptotic convergence property: I(x1; x2) -> χ2(K-1)(J-1)(1-α)/2N, where K and J are the number of states of x1 and x2 respectively (so (K-1)(J-1) is the degrees of freedom), N is the sample population size, and α is the significance level. The calibration for adjusting the error sensitivity of the joint probability as it is used in calculating the mutual information measure of a high order association pattern MI(x1 x2 ... xn) at the event level is shown below:

(Ê/E')^(O/2) MI(x1, x2, ..., xn) Pr(x1, x2, ..., xn) -> χ2/2N    (1)

where
MI(x1, x2, ..., xn) = Log2 [Pr(x1, x2, ..., xn)/(Pr(x1)Pr(x2)...Pr(xn))]
N = sample population size
χ2 = Pearson chi-square test statistic defined as ∑i (oi - ei)^2/ei, with
oi = observed count = N Pr(x1, x2, ..., xn)
ei = expected count under the assumption of independence = N Pr(x1)Pr(x2)...Pr(xn)
Ê = expected entropy measure of the estimated probability model
E' = maximum possible entropy of the estimated probability model
O = order of the association pattern (i.e., n in this case)

Referring to the previous example, the optimal solution that maximizes the likelihood estimate under the assumption of minimum biased information is [Pr(x1:0 ∩ x2:0 ∩ x3:0) = 0.035, Pr(x1:0 ∩ x2:0 ∩ x3:1) = 0.017, Pr(x1:0 ∩ x2:1 ∩ x3:0) = 0.091, Pr(x1:0 ∩ x2:1 ∩ x3:1) = 0.039, Pr(x1:1 ∩ x2:0 ∩ x3:0) = 0.039, Pr(x1:1 ∩ x2:0 ∩ x3:1) = 0, Pr(x1:1 ∩ x2:1 ∩ x3:0) = 0.415, Pr(x1:1 ∩ x2:1 ∩ x3:1) = 0.364]. The expected entropy measure of the estimated probability model is Ê = -∑x1,x2,x3 Pr(x1 ∩ x2 ∩ x3) Log2 Pr(x1 ∩ x2 ∩ x3) = 2.006223053. The maximum possible entropy E' of the estimated probability model corresponds to the uniform distribution over the 2^3 = 8 cells; i.e., E' = Log2 8 = 3. There is an interesting observation about the heuristics of the above equation. Consider the case of second-order association patterns, i.e., O = 2. When the expected entropy measure of the estimated probability model is identical to that of the maximum likelihood estimate, Pr(x1, x2) Log2 [Pr(x1, x2)/(Pr(x1)Pr(x2))] -> χ2/2N. If we now sum over all possible association patterns defined by (x1, x2) to examine the mutual information measure at the variable level (as opposed to the event level), we obtain the asymptotic convergence property I(x1; x2) -> χ2/2N discussed earlier.
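The entropy terms of equation (1) can be evaluated for the example's estimated model; the sketch below computes Ê, the event-level mutual information of (x1:1 x2:1 x3:1), and the adjusted left-hand side of the calibration, which would then be compared against the chi-square term.

```python
# Sketch: compute the entropy terms and the adjusted mutual information of
# equation (1) for the estimated model of (x1:1, x2:1, x3:1) in the example.

from math import log2

# Estimated joint distribution over the 8 collapsed-state cells (from the text).
p = [0.035, 0.017, 0.091, 0.039, 0.039, 0.0, 0.415, 0.364]

E_hat = -sum(q * log2(q) for q in p if q > 0)   # expected entropy, ~2.006
E_max = log2(len(p))                            # uniform-distribution entropy = 3
order = 3                                       # O = pattern order

p111, p1, p2, p3 = 0.364, 0.818, 0.909, 0.42
mi = log2(p111 / (p1 * p2 * p3))                # event-level MI of the pattern

# Left-hand side of equation (1); compared against chi-square/2N in the paper.
adjusted = (E_hat / E_max) ** (order / 2) * mi * p111
print(round(E_hat, 3), round(mi, 3), round(adjusted, 4))
```

With the paper's numbers, Ê comes out near 2.006 as reported; the entropy ratio Ê/E' < 1 shrinks the adjusted value relative to the raw MI.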
5 Modified a priori algorithm

Based on the methods discussed in the previous sections, below is an algorithm that combines the a priori property with the mutual information measure for identifying significant association patterns:

Step 1: Conduct a scanning pass to derive the marginal probabilities Pr(xi = dk) (i = 1..n) for all possible dk, and the joint probabilities Pr(xi = dl, xj = dm) (i < j, i = 1..n-1, j = 2..n) for all possible dl and dm, by checking each transaction record one at a time. Remark: This can easily be achieved by creating a bin as a placeholder of the frequency count for each unique xi and (xi xj) [12], and discarding the bin (xi xj) when its frequency count at the time of k% completion of the transaction record scan is less than N(α - 1 + k/100) --- a condition that guarantees the final support will fall below the threshold α defined for α-significance.

Step 2: Rank all w (≤ n(n-1)/2) association patterns (xi, xj) that survived (i) step 1 and (ii) the test due to (C2) (the mutual information measure), in descending order of the corresponding joint probabilities, and put them in a collection set AS.

Step 3: Select w' (≤ w) association patterns from the top of AS, and enumerate each association pattern (referred to as a source pattern) with a new item Ij from a category/attribute variable not already in the association pattern, subject to the following condition: every second-order association pattern formed by Ij and an item in its source pattern is a significant association pattern in AS. For example, suppose the source pattern is (x1:d1, x2:d2); it can be enumerated to (x1:d1, x2:d2, xj:Ij) if both (x1:d1, xj:Ij) and (x2:d2, xj:Ij) are significant association patterns.
Step 4: Based on the number of newly enumerated patterns and the order of the patterns, determine, according to the scenario discussed in the previous section, whether the joint probabilities for the newly enumerated patterns should be derived from a new pass of transaction record scanning or from the probabilistic inference described earlier. In either case, proceed to derive the joint probabilities for the newly enumerated patterns and test them against condition (C1) in section 2. If a pattern does not pass the test, discard it from the list for further processing.

Step 5: For each newly enumerated association pattern that survived step 4, test against condition (C2) (the mutual information measure) in section 2. If a pattern passes the test, insert it into a temporary bin TB in such a way that the descending order of the joint probabilities of the patterns in TB is preserved.

Step 6: Insert the items in TB at the top of AS. If computational resources are still available, empty TB and go to step 3. Otherwise stop and return AS.

6 Preliminary study and result discussion

In order to better understand the computational behavior of the proposed approach, a preliminary study was conducted using a dataset about different brands of cereals. This dataset was originally published on the anonymous ftp site unix.hensa.ac.uk, and re-distributed as cereal.tar.gz/cereal.zip by [13]. This dataset was chosen because it is small enough to allow an exhaustive data analysis to establish a "ground truth" for the purpose of evaluation. The dataset consists of 77 records, each with 11 categories/attributes. The number of possible second-order association patterns, therefore, is 4^2 (11x10)/2 = 880. In this preliminary study, we set α = 0.2 for the α-significance test. 57 of the 880 association patterns survived the test due to condition (C1).
Among the 57 association patterns, 15 failed the test due to (C2) (the mutual information measure). Based on the extension of the 42 second-order significant association patterns, third-order association patterns were derived, and 25 of the third-order patterns survived the test due to (C1). Among the 25 association patterns, 19 passed the test due to (C2). Based on the 19 third-order significant association patterns, three significant association patterns of fourth order were found. This completes the construction of the "ground truth" for evaluation. To evaluate the effectiveness of the proposed algorithm presented in section 5, applying step 1 of the algorithm produced the same set of second-order association patterns. In step 2, we chose w' = 1; i.e., only the most probable significant association pattern (Pr(x9:2, x10:3) = 0.779) was used for the enumeration of third-order association pattern candidates. Following the condition stipulated in step 3, eight candidate third-order association patterns were found. Among the eight candidates were five of the 19 actual third-order significant association patterns. In other words, we were able to find 26% (5/19) of the third-order significant association patterns using only 2% (1/42) of the candidate set for enumeration. To understand better the behavior of probabilistic inference, we repeated step 4 except that probabilistic inference was applied to the same eight candidates rather than scanning the dataset. The following results were found.

Table 1. Comparison between using probabilistic inference vs exhaustive scan

Case  Association pattern    Mutual information MI  Adjusted chi-square C  MI > C?  Ground truth
1     x1:3 x9:2 x10:3        0.315                  0.208                  Yes      No
2     x3:1 x9:2 x10:3        -0.005484              0.003804               No       No
3     x3:2 x9:2 x10:3        0.135                  0.085                  Yes      Yes
4     x4:3 x9:2 x10:3        0.221                  0.135                  Yes      Yes
5     x6:2 x9:2 x10:3        0.391                  0.311                  Yes      Yes
6     x7:2 x9:2 x10:3        0.178                  0.143                  Yes      No
7     x7:3 x9:2 x10:3        0.218                  0.194                  Yes      Yes
8     x9:2 x10:3 x11:3       0.211                  0.139                  Yes      Yes

When the seven cases where MI > C in table 1 are used to enumerate fourth-order patterns, 11 such patterns are obtained. Among the 11 fourth-order patterns, two of the three true significant association patterns are covered. When the probabilistic inference was applied again, only one of the two true fourth-order significant association patterns was found. This leads to a 50% false-negative error rate. Among the nine cases that were not significant association patterns, the probabilistic inference process drew the same conclusion in six cases, yielding a 33% false-positive error rate. This results in a weighted error rate of (2/11)(50%) + (9/11)(33.3%) = 36%, or a 64% accuracy rate. As also noted in the study, the condition stipulated in step 3 plays an essential role in keeping the enumeration space small. 42 second-order significant association patterns were found. An exhaustive enumeration of the 42 second-order patterns would yield at least 42x(11-2)x4 - 42 = 1470 third-order association pattern candidates. In our study we used only one of the 42 patterns for enumeration. This gives 1x(11-2)x4 = 36 possible third-order pattern candidates, while the condition stipulated in step 3 restricted the enumeration to only eight third-order pattern candidates.

7 Conclusion

This paper discussed new criteria based on the mutual information measure for defining significant association patterns, and a novel probabilistic inference approach utilizing model abstraction for discovering significant association patterns.
The new criteria are proposed to address the interestingness, defined by the interdependency among the attributes, of an association pattern. The novel probabilistic inference approach is introduced to offer an alternative way to deduce the essential information needed for discovering significant patterns without an exhaustive scan of the entire database. The preliminary study has shown interesting results. Our follow-up study will focus on applying the proposed approach to real world data sets.

Acknowledgement: This work is supported in part by a PSC CUNY Research Award and NSF DUE CCLI #0088778.

References

1. Genesereth M., Nilsson N.: Logical Foundations of Artificial Intelligence. Morgan Kaufmann (1987)
2. Freedman D.: From association to causation: Some remarks on the history of statistics. Statistical Science, Vol. 14, No. 3 (1999) 243-258
3. Cover T.M., Thomas J.A.: Elements of Information Theory. New York: John Wiley & Sons (1991)
4. Rish I., Hellerstein J., Jayram T.: An Analysis of Data Characteristics that Affect Naive Bayes Performance. Technical Report RC21993, IBM T.J. Watson Research Center (2001)
5. Yang J., Wang W., Yu P.S., Han J.: Mining Long Sequential Patterns in a Noisy Environment. ACM SIGMOD, June 4-6, Madison, Wisconsin (2002) 406-417
6. Kullback S.: Information Theory and Statistics. John Wiley & Sons Inc (1959)
7. Basharin G.: Theory of Probability and its Applications. Vol. 4 (1959) 333-336
8. Agrawal R., Imielinski T., Swami A.: Mining Association Rules between Sets of Items in Large Databases. Proc. ACM SIGMOD Conf., Washington DC, May (1993)
9. Agrawal R., Srikant R.: Fast Algorithms for Mining Association Rules. VLDB (1994) 487-499
10. Toivonen H.: Sampling Large Databases for Association Rules. Proc. 22nd VLDB (1996) 134-145
11. Sy B.K.: Probability Model Selection Using Information-Theoretic Optimization Criterion. J. of Statistical Computing & Simulation, Gordon & Breach, Vol. 69-3 (2001)
12. Hoeffding W.: Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association, Vol. 58 (1963) 13-30
13. Zaki M.: SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning Journal, Vol. 42-1/2 (2001) 31-60
14. http://davis.wpi.edu/~xmdv/datasets.html