Angel algorithm: A novel global estimation level algorithm for discovering extended association rules between any variable types

Chatzigiannakis-Kokkidis Angelos - angel@economist.net
Panteion University, 136 Syngrou Ave, Athens 17671, Greece

Summary. Association rule analysis is a key part of the data mining methodology. Association rules represent interesting patterns between itemsets in a database. Research on this topic has so far produced many algorithms, most of which are better suited, or only suited, to analyzing either categorical or quantitative data. In this paper we present a new algorithm that can analyze itemsets containing quantitative data, categorical data, or a mix of both on either side of an association. Moreover, an association may have any number of independent and dependent variables, of any type. Our algorithm can therefore be used to analyze both market-basket data and financial databases, taking into account dimensional attributes such as time or distance.

1. Introduction

Association rules have been discussed thoroughly in the past. They were formally introduced in [AIS93] and refined by the Apriori algorithm [AS94]. Both algorithms addressed the market basket analysis problem, trying to find all possible associations between sets of items, where each item is a Boolean variable indicating the presence or absence of a product in a consumer's basket. In [SA96] the authors of Apriori published the quantitative Apriori algorithm, which extended the former to analyze categorized quantitative data. An algorithm proposed specifically for finding associations in quantitative data is the Window algorithm of Aumann and Lindell [AUL03]. Aumann and Lindell provided a framework for categorizing the independent quantitative variables according to their impact on the quantitative dependent variables.
Moreover, the quantitative dependent variables are not categorized, because the authors use a statistical interestingness measure, such as the mean or the variance of the dependent variable. The algorithm described in this paper uses the Window algorithm to categorize the quantitative independent variables and extends the algorithm of Aumann and Lindell [AUL03] in the following ways:

1. Analysis of rules with multiple variables on both sides of the rule.
2. Introduction of the homogeneity measures.
3. Ability to analyze categorical and quantitative data on both sides, i.e. Categorical or Quantitative ⇒ Categorical or Quantitative rules.
4. Ability to analyze intertransactional associations.

2. Definitions

2.1 Association Rules

Let E = {e_1,…,e_m} be the set of variables (or fields) of a database table D of transactions T. Let I = {i_1,…,i_n} be a set of literals, called items, of the following form:

    i_k = {e_j, v_L, v_U}    (1)

with e_j ∈ E, i_k ∈ I, and v_L, v_U two values of this variable, where v_L ≤ v_U if e_j is a quantitative variable and v_L = v_U if e_j is a categorical variable. For example, e_j can be the amount of a bank transaction and i_k = {amount, 10000, 20000} a categorized expression of this variable, called an item. Furthermore, if e_j were a categorical variable such as the city of the account owner, then i_k = {city, 'Athens', 'Athens'} would be a valid item of this variable. For simplicity we also write this item as i_k = {city, 'Athens'}.

Let X_Q be a set of quantitative-variable items in I, and X_C a set of categorical-variable items in I. Let Y_Q be a set of quantitative variables in E, and Y_C a set of categorical variables in E. An association rule is an implication of the following form:

    X ⇒ Y, measure(Y)    (2)

where X = {x_1,…,x_{n1}}, x_r ∈ X_Q ∪ X_C, and Y = {y_1,…,y_{n2}}, y_s ∈ Y_Q ∪ Y_C.

The left-hand side of a rule defines a subset of the database table D and is called the profile.
The profile can be a set of r items (an r-itemset), each of which is either categorical (found in X_C) or quantitative (found in X_Q). For a transaction t = {〈e_1,v_1〉,…,〈e_m,v_m〉}, we say that t has profile X_r if t coincides with X_r wherever X_r is defined. We denote this set of transactions as T_{X_r}.

The right-hand side of a rule consists of a vector of measure values for a set of quantitative or categorical variables in Y_Q or Y_C respectively, with the measure taken over T_X, the transactions that match the profile on the left-hand side of the rule. In this paper we use the mean value as the measure that defines the significance of a rule; however, other measures may be used as well. We denote by ȳ_{T_X} the mean value of y ∈ Y over the transactions T_X.

An r-itemset profile may contain other (r-1)-itemset profiles. We say that the (r-1)-itemset profile X_{r-1} is a superset of X_r if r > 1 and T_{X_r} ⊆ T_{X_{r-1}}. For 1-itemset profiles X_1 the superset is the database table D, denoted T_{X_1} ⊆ T_D. Likewise, we say that the profile X_r is a subset of the profile X_{r-1}. Note that these terms refer to the transaction sets: X_{r-1} has fewer items than X_r but matches at least as many transactions.

An example is the following pair of itemsets:

    X_2 = {age, 20, 29}, {smoker, yes} defines transaction set T_{X_2}
    X_1 = {age, 20, 29} defines transaction set T_{X_1}
    with T_{X_2} ⊆ T_{X_1}

A rule points to extraordinary behavior of the variables in Y if the measures (in our case, the means) of those variables over the transactions T_{X_r} defined by the profile are significantly different from the same measures of the same variables over the transactions of a superset of the profile, T_{X_{r-1}}. Note that although the mean values may differ in the specific database, this may not hold for the entire population, so a statistical test has to measure the significance level of the difference. In our case we assume that all variables follow the normal distribution, and we therefore use the Z-test. However, the statistical test used may vary.
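The comparison between a profile and its superset can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not give the exact form of its Z-test, so the sketch uses the textbook two-sample Z statistic with sample variances, which treats the two transaction sets as independent samples (an approximation, since T_{X_r} ⊆ T_{X_{r-1}}). The names `Item`, `select` and `z_statistic`, the data, and the 1.96 threshold are all hypothetical.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import mean, variance


@dataclass(frozen=True)
class Item:
    """An item i_k = {e_j, v_L, v_U}; for a categorical variable v_L == v_U."""
    variable: str
    low: object
    high: object

    def matches(self, transaction: dict) -> bool:
        value = transaction.get(self.variable)
        return value is not None and self.low <= value <= self.high


def select(transactions, profile):
    """Return T_X: the transactions that match every item of the profile."""
    return [t for t in transactions if all(i.matches(t) for i in profile)]


def z_statistic(sub, sup):
    """Two-sample Z statistic for the difference between two means.

    Assumption: textbook form with sample variances; the exact test used
    by the paper is not specified.
    """
    standard_error = sqrt(variance(sub) / len(sub) + variance(sup) / len(sup))
    return (mean(sub) - mean(sup)) / standard_error


# Hypothetical data, for illustration only.
transactions = [
    {"age": 25, "smoker": "yes", "amount": 10.0},
    {"age": 22, "smoker": "yes", "amount": 12.0},
    {"age": 28, "smoker": "no", "amount": 30.0},
    {"age": 21, "smoker": "no", "amount": 34.0},
    {"age": 45, "smoker": "yes", "amount": 50.0},
]

x1 = [Item("age", 20, 29)]                   # superset profile X_1
x2 = x1 + [Item("smoker", "yes", "yes")]     # profile X_2
t_x1 = select(transactions, x1)
t_x2 = select(transactions, x2)

z = z_statistic([t["amount"] for t in t_x2], [t["amount"] for t in t_x1])
significant = abs(z) > 1.96  # assumed two-sided threshold at the 5% level
print(len(t_x1), len(t_x2), round(z, 3), significant)
```

Representing a categorical item with `low == high` lets one `matches` method serve both variable types, mirroring Eq. 1.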
Thus we define the significance of a rule as:

    Mean_Y(T_{X_r}) ≠ Mean_Y(T_{X_{r-1}})    (3)

2.2 Definitions for Intertransaction Association Rules

Let E = C ∪ D ∪ E' = {c_1,…,c_{nc}} ∪ {d} ∪ {e_1,…,e_{ne}} be the set of variables for a database table D of transactions t = {〈c_1,v_{c_1}〉,…,〈c_{nc},v_{c_{nc}}〉, 〈d,v_d〉, 〈e_1,v_{e_1}〉,…,〈e_{ne},v_{e_{ne}}〉}. We define the following types of variables (fields):

Dimensional Variable d: This is a field in the database table D that defines the location in the dimensional space for each transaction t.

Context Identifiers C: These are the fields in the database table D that define a unique entity value for the current analysis, such that each value set {〈c_1,v_{c_1}〉,…,〈c_{nc},v_{c_{nc}}〉} appears once and only once in the transaction set T for each value of the dimensional variable.

General Variables E': This is the set of all other variables in D that are neither dimensional nor context identifiers.

3. Algorithm

The Angel algorithm described in this paper performs the following steps:

(1) Create the extended table.
(2) Create Boolean variables for all variables in Y_C.
(3) Generate all items for itemsets X_Q and X_C by categorizing quantitative independent variables and finding all distinct values for categorical independent variables.
(4) Find the candidate itemsets.
(5) Prune itemsets containing non-large itemsets.
(6) Find the large itemsets.
(7) Find the rules between the itemsets in X and the variables in Y.
(8) Perform a homogeneity check to decide which itemsets should be included in the next candidate itemset generation.

The first step is based on the work done in [LFH00], although with many changes; steps 4, 5, and 6 are based on [SA96]; and step 3 is based on [AUL03], with several changes to support multiple variables on the right side of the rule (Y).

3.1 Extend Dimensions

The first step of the algorithm extends the table D if the analyst has chosen to perform an intertransactional analysis.
The algorithm uses a user-defined value maxspan ∈ N, which defines the depth of the dimensional analysis. If maxspan = 0, the table D is not extended. In this step the algorithm creates an extended table D_e (see the example in Table 2), which is based on the initial table D (see the example in Table 1), with a set T_e of transactions, each of which has the following form:

    t_e = {〈c_1,v_{c_1}〉,…,〈c_{nc},v_{c_{nc}}〉, 〈d,v_d〉,
           〈Δ_0(e_1),Δ_0(v_{e_1})〉,…,〈Δ_0(e_{ne}),Δ_0(v_{e_{ne}})〉,
           〈Δ_1(e_1),Δ_1(v_{e_1})〉,…,〈Δ_1(e_{ne}),Δ_1(v_{e_{ne}})〉,
           〈Δ_2(e_1),Δ_2(v_{e_1})〉,…,〈Δ_2(e_{ne}),Δ_2(v_{e_{ne}})〉,
           …
           〈Δ_maxspan(e_1),Δ_maxspan(v_{e_1})〉,…,〈Δ_maxspan(e_{ne}),Δ_maxspan(v_{e_{ne}})〉}    (4)

where

    〈Δ_0(e_i), Δ_0(v_{e_i})〉 ∈ T_e(k) = 〈e_i,v_{e_i}〉 ∈ T(k) and
    〈Δ_j(e_i), Δ_j(v_{e_i})〉 ∈ T_e(k) = 〈e_i,v_{e_i}〉 ∈ T(k+j)

In other words, 〈Δ_0(e_i), Δ_0(v_{e_i})〉 denotes the value of the field e_i in the current transaction and 〈Δ_j(e_i), Δ_j(v_{e_i})〉 its value in the j-th following transaction, where j is the distance (span) of that transaction from the current one. The extended transaction contains all the fields and values of the non-extended transaction plus the extended fields, which come from the next transactions of the table D. Each extended transaction contains maxspan extended copies of each general variable. Also, if the transaction in row k+j, T(k+j), has a different value for one of the context identifier variables c_i than the transaction in row k, then transaction k is not included in the extended table.

Table 1. Example of initial database table D

    Client ID (c_1)   Date (d)    Account balance (e_1)   Transaction amount (e_2)
    1051              10/1/2007   500 EUR                 100 EUR
    1051              11/1/2007   600 EUR                 110 EUR
    1052              5/1/2007    1830 EUR                40 EUR
    1052              6/1/2007    1870 EUR                -100 EUR
    1052              7/1/2007    1770 EUR                -200 EUR
    1053              20/1/2007   1000 EUR                200 EUR

After the extension, each transaction t_e of the extended database table D_e has all the variables needed for an intratransactional algorithm to perform the association rule analysis.

Table 2. Extended table D_e

    Client ID (c_1)   Date (d)    Account bal. Δ_0(e_1)   Transaction amount Δ_0(e_2)   Account bal. Δ_1(e_1)   Transaction amount Δ_1(e_2)
    1051              10/1/2007   500 EUR                 100 EUR                       600 EUR                 110 EUR
    1052              5/1/2007    1830 EUR                40 EUR                        1870 EUR                -100 EUR
    1052              6/1/2007    1870 EUR                -100 EUR                      1770 EUR                -200 EUR

In this example maxspan = 1. The last transaction of each client has no following transaction with the same client ID and is therefore not included in D_e.

3.2 Categorization

The categorization of the quantitative independent variables in E is performed using the Window procedure described in [AUL03]. The Angel algorithm differs in three ways in how it performs the Window procedure:

a) It performs the procedure for each member of Y in order to support multiple dependent variables.
b) It creates categories for window-below, window-above, and window-equal results.
c) It performs the procedure to extract categorical items of the form i_k (see Eq. 1).

Therefore each quantitative independent variable has up to three different categorizations for each dependent variable. All categories extracted from this process are added to X_Q, the set of quantitative independent items. The pseudo-code of this process is shown in Figure 1.

Fig. 1. The usage of the Window procedure to define variables in X

    foreach quantitative e in E do
        foreach y in Y do
            if y is not an extended variable then
                X = X ∪ {x : x in Window-above(D, e, y, mindif)}
                X = X ∪ {x : x in Window-below(D, e, y, mindif)}
                X = X ∪ {x : x in Window-equal(D, e, y, mindif)}
            endif
        endfor
    endfor

Additionally, in order to support Quantitative ⇒ Categorical rules, the algorithm splits each categorical variable with c possible categories into c Boolean variables, with value v_c = 1 if the transaction contains the category c and v_c = 0 if it does not. The new Boolean variables are then treated like quantitative variables.

3.3 Rule Generation

The rule generation process of Angel consists of the candidate itemset generation, the large itemset generation, the pruning phase, the rule generation, and the homogeneity check. The set of candidate itemsets C_n is the set of the X_n itemsets that will participate in the rule generation process of iteration n.
On the first iteration the algorithm creates the set C_1, which is the set of all items in X_C and X_Q, written C_1 = X_C ∪ X_Q. Each item in C_1 is a 1-item itemset. On subsequent iterations the algorithm forms the candidate set as C_n = N_{n-1} × L_1, where N_{n-1} is the set of (n-1)-itemsets that results from the homogeneity check, described later, and L_1 is the set of large 1-itemsets, described below.

The set of large itemsets L_n is the set of the X_n itemsets that are found in C_n and pass the minimum support threshold. On each iteration a table scan is performed to determine which itemsets of C_n are matched by the minimum number of transactions in the database. The process is similar to the one described in [AS94] and [SA96]. Formally:

    X_n ∈ C_n ∧ Count(T_{X_n}) ≥ minsup ⇒ X_n ∈ L_n    (5)

After the candidate itemset generation, the candidate itemsets are pruned. During this phase the algorithm checks whether any n-itemset of C_n contains (n-1)-itemsets that are not contained in L_{n-1}. If it finds such an itemset, it prunes it¹. Formally:

    ∃ X_{n-1} ⊆ X_n ∧ X_{n-1} ∉ L_{n-1} ⇒ X_n ∉ C_n    (6)

The rule generation process finds all rules of the form of Eq. 2 that satisfy the criterion defined by the analyst as "measure". As noted in Section 2, we use the mean criterion, assuming that all variables in D follow the normal distribution. Thus RuleGen statistically compares the mean of each dependent variable in Y for the profile X_n against the mean of the same dependent variable for the profile of each superset of X_n. If the two means are statistically different, the rule X ⇒ y, mean(y) is added to the rule set S.

Finally, in contrast with other algorithms, the current algorithm does not combine all large (n-1)-itemsets to find the candidate n-itemsets. It first performs a homogeneity check, adds the itemsets that pass this check to the set N_n, and combines only those itemsets.
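The candidate generation, pruning and support filtering of Eqs. 5 and 6 can be sketched as follows. This is a minimal illustration, not the paper's implementation: items are plain `(variable, low, high)` tuples, itemsets are frozensets, and the function names are hypothetical. Two assumptions are not stated in the source: a profile holds at most one item per variable, and, in the example, every large 1-itemset is taken to pass the homogeneity check so that N_1 = L_1.

```python
from itertools import combinations

# An item is a tuple (variable, low, high); an itemset is a frozenset of items.


def matches(itemset, transaction):
    """True if the transaction satisfies every item of the itemset."""
    return all(var in transaction and low <= transaction[var] <= high
               for var, low, high in itemset)


def generate_candidates(n_prev, l1):
    """C_n = N_{n-1} x L_1: extend each itemset of N_{n-1} by one large item."""
    candidates = set()
    for itemset in n_prev:
        used = {var for var, _, _ in itemset}
        for single in l1:
            (item,) = single
            if item[0] not in used:  # assumption: one item per variable
                candidates.add(itemset | single)
    return candidates


def prune(candidates, l_prev):
    """Eq. 6: drop candidates that have an (n-1)-subset outside L_{n-1}."""
    return {c for c in candidates
            if all(frozenset(s) in l_prev
                   for s in combinations(c, len(c) - 1))}


def large_itemsets(candidates, transactions, minsup):
    """Eq. 5: keep candidates matched by at least minsup transactions."""
    return {c for c in candidates
            if sum(matches(c, t) for t in transactions) >= minsup}


# Hypothetical data, for illustration only.
transactions = [
    {"age": 25, "smoker": "yes"},
    {"age": 27, "smoker": "yes"},
    {"age": 23, "smoker": "no"},
    {"age": 41, "smoker": "yes"},
    {"age": 44, "smoker": "no"},
]
c1 = {frozenset([item]) for item in [
    ("age", 20, 29), ("age", 40, 49),
    ("smoker", "yes", "yes"), ("smoker", "no", "no"),
]}

l1 = large_itemsets(c1, transactions, minsup=2)
n1 = l1  # assumption: every large 1-itemset passes the homogeneity check
c2 = prune(generate_candidates(n1, l1), l1)
l2 = large_itemsets(c2, transactions, minsup=2)
print(len(c2), l2)
```

Pruning before the table scan, as in the main loop of the algorithm, avoids counting support for candidates that cannot be large.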
To perform the homogeneity check, the algorithm examines each itemset in L_n to determine the maximum statistical difference that any subset of this profile can have, without evaluating any subordinate X_{n+1} profile. In our case we use the max-Z criterion, which is the largest Z-test value at which the mean of the X_n profile and the mean of a subordinate profile are still considered equal. If the maximum Z value that the subordinate profiles can produce is less than the maxZ defined by the analyst, the itemset X_n is not inserted into N_n, although it might be a member of L_n. Formally:

    ¬∃ X_{n+1} : Z-Test(ȳ_{T_{X_{n+1}}}, ȳ_{T_{X_n}}) > maxZ, ∀ y ∈ Y ⇒ X_n ∉ N_n    (7)

¹ The pruning phase was first introduced in [AS94].

Fig. 2. The Angel Algorithm main loop

    Angel(D, X, Y, mindif, minsup, maxspan, maxZ, toleratePercent)
        D = ExtendDimensions(D)
        AdjustCategories(D)
        For each yi in Y do
            μ(yi) ← Average(yi) in D; add μ(yi) to M1(Y)
            σ(yi) ← Variance(yi) in D; add σ(yi) to S1(Y)
        For each quantitative variable e ∈ EX do
            Categorize(e, Y, M1(Y), S1(Y), mindif)
        C1 ← all items in X = XQ ∪ XC
        L1 ← LargeGen(C1)    // find the first large itemsets array
        S ← S ∪ RuleGen(D, L1, X, Y, M1(Y), M0(Y), S0(Y), maxZ, mindif, minsup)
        N1 ← CheckHomogen(L1, Y, M0(Y), S0(Y), maxZ, mindif, minsup)
        n = 2
        While Nn-1 ≠ ∅
        {
            Cn = Nn-1 × L1    // Nn-1: large itemsets checked for homogeneity (maxZ)
            Cn = Prune(Cn, Ln-1)
            Ln ← LargeGen(Cn)    // find the large itemsets of iteration n
            S ← S ∪ RuleGen(D, Ln, X, Y, Mn(Y), M0(Y), S0(Y), maxZ, mindif, minsup)
            Nn ← CheckHomogen(Ln, Y, M0(Y), S0(Y), maxZ, mindif, minsup)
            n = n + 1
        }
        OutputRules(S, toleratePercent)

4. Experiments

The database used for the test is a financial database of bank transactions that was previously used in the PKDD 1999 Challenge.
Our dataset D1 is a profile of the 4,500 accounts in the database and contains account-level information about the properties of the account, the age of the account owner, and the population of the city where the owner lives. It also contains summarized information on the transactions made through the account, such as summaries of insurance payments and interest payments and the average balance, but no transaction-level information such as transaction date or amount.

The rules produced by the algorithm support the theoretical framework presented in this study. The following aspects are of particular interest:

1. The homogeneity check is efficient as a measure for deciding which itemsets to combine.
2. The different categorization of the quantitative independent variables is necessary for multivariate analysis.

Fig. 3. Homogeneity check as a combination measure

    No rule for: sex = 'M' => amount_loan Mean: -11993,3078 equal to 0 = 0 with mean -12278,2896 Z-Test: 0,2422
    Itemset sex = 'M' - Variable amount_loan - Set Size 225 with maxZ: -36,3612 will be researched further
    No rule for: d_region = 'west Bohemia' => amount_loan Mean: -10537,1846 equal to 0 = 0 with mean -12278,2896 Z-Test: 0,8691
    Itemset d_region = 'west Bohemia' - Variable amount_loan - Set Size 225 with maxZ: -2,9408 will be researched further
    Rule: sex = 'M' AND d_region = 'west Bohemia' => amount_loan Mean: 6041,0966 different from sex = 'M' with mean -11993,3078 Z-Test: 2,1902
    Itemset sex = 'M' AND d_region = 'west Bohemia' - Variable amount_loan Set Size 225 with maxZ: -0,0729 will not be researched further

5. Conclusions

The current algorithm provides a framework for a unified multivariate association rule analysis of any type of variable, either categorical or quantitative. However, no work has yet been done on optimizing the algorithm's execution time. In previous studies many methods have been proposed for optimizing the Apriori algorithm².
However, because the nature of the algorithm has changed, the optimizations proposed for the Apriori algorithm cannot be used without changes. Also, much of the execution time is spent on the Window procedure that categorizes the independent quantitative variables, especially if the number of quantitative variables on the right-hand side is large or if the categorical variables of the right-hand side have many distinct values and therefore produce a large number of Boolean dependent variables. We mitigated this inefficiency by splitting one execution with many variables on each side into smaller executions with fewer variables on each side. The algorithm has been tested with datasets of up to 1 million transactions, which is generally considered a small dataset. Performance on larger datasets may be unacceptably poor, especially if we include the categorization step. Future work is needed to improve Angel's performance.

² See [PYC97] for adjusting accuracy to increase performance, and [FMMT96] for using parallel bucketing and sampling.

References

[AIS93] Rakesh Agrawal, Tomasz Imielinski, Arun Swami. Mining association rules between sets of items in large databases. Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207-216, Washington, D.C., United States, 1993.

[AS94] Rakesh Agrawal, Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules in Large Databases. International Conference on Very Large Data Bases (VLDB), pages 487-499, Santiago, Chile, Sept. 1994.

[AUL03] Yonatan Aumann, Yehuda Lindell. A Statistical Theory for Quantitative Association Rules. Journal of Intelligent Information Systems, Volume 20, Issue 3, pages 255-283, May 2003.

[LFH00] Hongjun Lu, Ling Feng, Jiawei Han. Beyond intratransaction association analysis: mining multidimensional intertransaction association rules. ACM Transactions on Information Systems (TOIS), Volume 18, Issue 4, pages 423-454, ACM Press, New York, NY, USA, October 2000.

[SA96] Ramakrishnan Srikant, Rakesh Agrawal. Mining quantitative association rules in large relational tables. Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 1-12, Montreal, Quebec, Canada, 1996.