# Angel algorithm: A novel global estimation level algorithm for discovering extended association rules between any variable types

Chatzigiannakis-Kokkidis Angelos, angel@economist.net
Panteion University, 136 Syngrou Ave, Athens 17671, Greece
Summary. Association rules analysis is a key part of the data mining methodology. Association rules represent interesting patterns between itemsets in a database. Research on this topic has so far produced many different algorithms, most of which are suitable for analyzing only categorical or only quantitative data. In this paper we present a new algorithm that can analyze itemsets containing quantitative data, categorical data, or a mix of both on either side of an association. Moreover, both the independent and the dependent variables of an association can be any number of variables of any type. Our algorithm can therefore be used to analyze both market-basket data and financial databases, taking dimensional attributes such as time and distance into account.
1. Introduction
Association rules have been discussed thoroughly in the past. They were formally introduced in [AIS93] and refined by the Apriori algorithm [AS94]. Both algorithms addressed the market-basket analysis problem, trying to find all possible associations between sets of items, where each item represents a Boolean variable for the presence or absence of a product in the basket of a consumer. In [SA96] the authors of Apriori published the quantitative Apriori algorithm, which extended the capabilities of the former to the analysis of categorized quantitative data.
A very good algorithm proposed in the context of analyzing quantitative data for associations is the Window algorithm of Aumann and Lindell, published in [AUL03]. Aumann and Lindell provided a solid framework for categorizing the independent quantitative variables according to their impact on the quantitative dependent variables. The quantitative dependent variables themselves are not categorized, because the authors use a statistical interestingness measure, such as the mean or the variance of the dependent variable.
The algorithm described in this paper uses the Window algorithm to categorize the quantitative independent variables and extends the algorithm of Aumann and Lindell [AUL03] in the following points:
1. Analysis of rules with multiple variables on both sides of the rule.
2. Introduction of the homogeneity measures.
3. Ability to analyze categorical and quantitative data on both sides, i.e. "Categorical or Quantitative ⇒ Categorical or Quantitative" rules.
4. Ability to analyze intertransactional associations.
2. Definitions
2.1 Association Rules
Let E = {e1,…,em} be the set of variables (or fields) for a database table D of transactions T. Let I = {i1,…,in} be a set of literals, called items, of the following form:

ik = {ej, vL, vU}    (1)
with ej ∈ E, ik ∈ I and vL, vU two values of this variable, where vL ≤ vU if ej is a quantitative variable and vL = vU if ej is a categorical variable. For example, ej can be the amount of a bank transaction and ik = {amount, 10.000, 20.000} a categorized expression of this variable, denoted as an item. Furthermore, if ej were a categorical variable such as the city of the account owner, then ik = {city, 'Athens', 'Athens'} would be a valid item of this variable. For simplicity we will also denote the above item as ik = {city, 'Athens'}.
Let XQ be a set of quantitative-variable items in I and XC a set of categorical-variable items in I. Let YQ be a set of quantitative variables in E and YC a set of categorical variables in E.
An association rule is an implication of the following form:

X ⇒ Y, measure(Y)    (2)

where X = {x1,…,xn1}, xr ∈ XQ ∪ XC, and Y = {y1,…,yn2}, ys ∈ YQ ∪ YC.
The left-hand side of a rule defines a subset of the database table D and is called a profile. The profile can be a set of r items (denoted an r-itemset) which are either categorical, found in XC, or quantitative, found in XQ. For a transaction t = {⟨e1,v1⟩,…,⟨em,vm⟩}, we say that t has profile Xr if t agrees with Xr on every variable for which Xr is defined. We denote this set of transactions as TXr.
The right-hand side of a rule consists of a vector of measure values for a set of quantitative or categorical variables in YQ or YC respectively, with the measure taken over TX, the transactions which match the profile of the left-hand side of the rule. In this paper we use the mean value as the measure defining the significance of a rule; however, other measures may be used as well. We denote as yTX the mean value of y ∈ Y over the transactions TX.
An r-itemset profile may contain other (r-1)-itemset profiles. We say that the (r-1)-itemset profile Xr-1 is a superset of Xr if r > 1 and TXr ⊆ TXr-1. For 1-itemset profiles X1 the superset is the database table D, denoted TX1 ⊆ TD. Likewise, we say that the profile Xr is a subset of the profile Xr-1.
An example could be the following two itemsets:
X2 = {age, 20, 29}, {smoker, yes} defines transaction set TX2
X1 = {age, 20, 29} defines transaction set TX1 with TX2 ⊆ TX1
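The profile-matching notion just defined can be illustrated with a short sketch. This is an assumed representation, not the paper's code: an item is a triple (variable, vL, vU), and a transaction has a profile when every item's range (quantitative) or category (categorical) holds; `has_profile` is a hypothetical helper name.

```python
# Sketch (not from the paper's implementation): an item is a triple
# (variable, v_low, v_high); a transaction has a profile X if every
# item's range or category contains the transaction's value.

def has_profile(transaction, profile):
    """Return True if the transaction matches every item in the profile."""
    for var, v_low, v_high in profile:
        value = transaction.get(var)
        if value is None:
            return False
        if isinstance(value, str):
            # categorical item: v_low == v_high is the required category
            if value != v_low:
                return False
        elif not (v_low <= value <= v_high):
            return False
    return True

X2 = [("age", 20, 29), ("smoker", "yes", "yes")]
t = {"age": 25, "smoker": "yes", "balance": 500}
print(has_profile(t, X2))  # True: age in [20, 29] and smoker == 'yes'
```

TX2 is then simply the set of transactions for which `has_profile` returns True.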
A rule points to an extraordinary behavior of the variables in Y if the measures (in our case: means) of those variables over the transactions TXr defined by the profile are significantly different from the same measures of the same variables over the transactions of a superset of the profile, TXr-1. Note that although the mean values may differ in the specific database, this may not be the case for the entire population; thus a statistical test has to measure the significance level of the difference. In our case we assume that all variables follow the normal distribution and therefore use the Z-test; however, the statistical test used may vary. Thus we define the significance of a rule as:
MeanY(TXr) ≠ MeanY(TXr-1)    (3)
2.2 Definitions for Intertransaction Association Rules
Let E = C ∪ D ∪ E' = {c1,…,cnc} ∪ {d} ∪ {e1,…,ene} be the set of variables for a database table D of transactions t = {⟨c1,vc1⟩,…,⟨cnc,vcnc⟩, ⟨d,vd⟩, ⟨e1,ve1⟩,…,⟨ene,vene⟩}. We will define the following types of variables (fields):
the following types of variables (fields):
Dimensional Variable d: This is a field in the database table D that defines the location
in the dimensional space for each transaction t.
Context Identifiers C: These are the fields in the database table D that define a unique
entity value for the current analysis such that each value set {⟨c1,vc1⟩,…,⟨cnc,vcnc⟩} appears
once and only once in the transaction set T for each value of the dimensional variable.
General Variables E’: This is the set of all other variables in D that are neither dimensional nor context identifiers.
3. Algorithm
The Angel algorithm described in this paper performs the following steps:
1. Create the extended table.
2. Create Boolean variables for all variables in YC.
3. Generate all items for the itemsets XQ and XC by categorizing the quantitative independent variables and finding all distinct values of the categorical independent variables.
4. Find the candidate itemsets.
5. Prune itemsets containing non-large itemsets.
6. Find the large itemsets.
7. Find the rules between the itemsets in X and the variables in Y.
8. Perform a homogeneity check to decide which itemsets should be included in the next candidate itemset generation.
The first step is based on the work done in [LFH00], although with many changes; steps 4, 5, and 6 are based on [SA96]; and step 3 is based on [AUL03], with several changes to support multiple variables on the right side of the rule (Y).
3.1 Extend Dimensions
The first step performed by the algorithm is to extend the table D in case the analyst has chosen to perform an intertransactional analysis. The algorithm uses a user-defined value maxspan ∈ ℕ which defines the depth of the dimensional analysis. If maxspan = 0 then this step of extending the table D is not performed.
In this step the algorithm creates an extended table De (see example in table 2), which is
based on the initial table D (see example in table 1) with a set Te of transactions each of
which has the following form:
te = {⟨c1,vc1⟩,…,⟨cnc,vcnc⟩, ⟨d,vd⟩,
     ⟨Δ0(e1),Δ0(ve1)⟩,…,⟨Δ0(ene),Δ0(vene)⟩,
     ⟨Δ1(e1),Δ1(ve1)⟩,…,⟨Δ1(ene),Δ1(vene)⟩,
     ⟨Δ2(e1),Δ2(ve1)⟩,…,⟨Δ2(ene),Δ2(vene)⟩,
     …,
     ⟨Δmaxspan(e1),Δmaxspan(ve1)⟩,…,⟨Δmaxspan(ene),Δmaxspan(vene)⟩}    (4)

where

⟨Δ0(ei), Δ0(vei)⟩ ∈ Te(k) = ⟨ei,vei⟩ ∈ T(k) and
⟨Δj(ei), Δj(vei)⟩ ∈ Te(k) = ⟨ei,vei⟩ ∈ T(k+j)
In other words, we denote by ⟨Δ0(ei), Δ0(vei)⟩ the value of the field ei in the current transaction and by ⟨Δj(ei), Δj(vei)⟩ its value in the jth following transaction, where j is the distance (span) of that transaction from the current one. The extended transaction contains all the fields and values of the non-extended transaction plus the extended fields, which come from the next transactions of the table D. Each extended transaction contains maxspan additional sets of extended fields. Also, if the transaction in row k+j, T(k+j), has a different value for one of the context identifier variables ci than the transaction in row k, then transaction k is not included in the extended table.
Table 1. Example of initial database table D

| Client ID c1 | Date d | Account balance e1 | Transaction amount e2 |
|---|---|---|---|
| 1051 | 10/1/2007 | 500 EUR | 100 EUR |
| 1051 | 11/1/2007 | 600 EUR | 110 EUR |
| 1052 | 5/1/2007 | 1830 EUR | 40 EUR |
| 1052 | 6/1/2007 | 1870 EUR | -100 EUR |
| 1052 | 7/1/2007 | 1770 EUR | -200 EUR |
| 1053 | 20/1/2007 | 1000 EUR | 200 EUR |
After the extension, each transaction te of the extended database table De has all the variables needed for an intratransactional algorithm to perform the association rule analysis.
Table 2. Extended Table De

| Client ID c1 | Date d | Account bal. Δ0(e1) | Transaction amount Δ0(e2) | Account bal. Δ1(e1) | Transaction amount Δ1(e2) |
|---|---|---|---|---|---|
| 1051 | 10/1/2007 | 500 EUR | 100 EUR | 600 EUR | 110 EUR |
| 1052 | 5/1/2007 | 1830 EUR | 40 EUR | 1870 EUR | -100 EUR |
| 1052 | 6/1/2007 | 1870 EUR | -100 EUR | 1770 EUR | -200 EUR |
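The extension step can be sketched as follows. This is a minimal illustration of the idea (not the published implementation), assuming rows are sorted by context identifier and dimension; the function name `extend_dimensions` and the column naming `dJ(e)` for Δj(e) are assumptions.

```python
# Minimal sketch of the table-extension step, assuming rows are sorted
# by context identifier and dimension. Column names "dJ(e)" stand in
# for Delta_j(e); illustrative, not the paper's implementation.

def extend_dimensions(rows, context_keys, general_vars, maxspan):
    """Join each row with the next `maxspan` rows of the same context."""
    extended = []
    for k, row in enumerate(rows):
        ext, ok = dict(row), True
        for j in range(maxspan + 1):
            if k + j >= len(rows):
                ok = False
                break
            nxt = rows[k + j]
            # row k is dropped when a context identifier changes in the span
            if any(nxt[c] != row[c] for c in context_keys):
                ok = False
                break
            for e in general_vars:
                ext[f"d{j}({e})"] = nxt[e]
        if ok:
            extended.append(ext)
    return extended

# The six rows of Table 1 (balances and amounts in EUR):
rows = [
    {"client": 1051, "date": "10/1/2007", "bal": 500,  "amt": 100},
    {"client": 1051, "date": "11/1/2007", "bal": 600,  "amt": 110},
    {"client": 1052, "date": "5/1/2007",  "bal": 1830, "amt": 40},
    {"client": 1052, "date": "6/1/2007",  "bal": 1870, "amt": -100},
    {"client": 1052, "date": "7/1/2007",  "bal": 1770, "amt": -200},
    {"client": 1053, "date": "20/1/2007", "bal": 1000, "amt": 200},
]
extended = extend_dimensions(rows, ["client"], ["bal", "amt"], maxspan=1)
print(len(extended))  # 3 extended rows, as in Table 2
```

With maxspan = 1, only the three rows that have a successor for the same client survive, matching Table 2.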
3.2 Categorization
The categorization of the quantitative independent variables in E is performed using the
Window procedure described in [AUL03]. The Angel algorithm has three differences in the
way it performs the Window procedure:
a) It performs the procedure for each member of Y in order to support multiple dependent
variables.
b) It creates categories for window-below, window-above, and window-equal results.
c) It performs the procedure to extract categorical items of the form ik (See Eq. 1).
Therefore each quantitative independent variable has up to three different categorizations for each dependent variable. All categories extracted by this process are added to XQ, the set of quantitative independent items. The pseudo-code of this process is shown in Figure 1.
Fig. 1. The usage of the Window procedure to define variables in X

foreach quantitative e in E do
  foreach y in Y do
    if y is not an extended variable then
      X = X ∪ {x : x ∈ Window-above(D, e, y, mindif)}
      X = X ∪ {x : x ∈ Window-below(D, e, y, mindif)}
      X = X ∪ {x : x ∈ Window-equal(D, e, y, mindif)}
    endif
  endfor
endfor
Additionally, in order to support Quantitative ⇒ Categorical rules, the algorithm splits each categorical variable with c possible categories into c Boolean variables, with value vc = 1 if the transaction contains the category c and vc = 0 if it does not. The new Boolean variables are then treated like quantitative variables.
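The categorical-to-Boolean split described above can be sketched as follows; the function name and row representation are illustrative assumptions, not the paper's code.

```python
# Sketch of the categorical-to-Boolean split: each categorical variable
# with c distinct values becomes c Boolean variables (illustrative code).

def to_boolean_vars(rows, var):
    """Replace column `var` with one 0/1 column per distinct category."""
    categories = sorted({row[var] for row in rows})
    out = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != var}
        for c in categories:
            new_row[f"{var}={c}"] = 1 if row[var] == c else 0
        out.append(new_row)
    return out

rows = [{"city": "Athens", "amount": 100},
        {"city": "Patras", "amount": 40}]
result = to_boolean_vars(rows, "city")
print(result)
```

Each resulting 0/1 column can then be aggregated with the same mean-based measure as a quantitative dependent variable.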
3.3 Rule Generation
The rule generation process of Angel consists of the candidate itemsets generation, the
large itemsets generation, the pruning phase, the rule generation, and the homogeneity
check.
The set of candidate itemsets Cn is the set of the Xn itemsets that will participate in the rule generation process of iteration n. On the first iteration the algorithm creates the set C1, which is the set of all items in XC and XQ, denoted C1 = XC ∪ XQ. Each item in C1 is a 1-itemset. On subsequent iterations the algorithm forms the candidate itemsets as Cn = Nn-1 × L1, where Nn-1 is the set of (n-1)-itemsets resulting from the homogeneity check, to be described later, and L1 is the set of large 1-itemsets, also described below.
The set of large itemsets Ln is the set of the Xn itemsets that are found in Cn and pass the minimum support threshold. On each iteration a table scan is performed to determine which itemsets of Cn match the minimum number of transactions in the database. The process is similar to the one described in [AS94] and [SA96]. Formally we denote this as:

Xn ∈ Cn ∧ Count(TXn) ≥ minsup ⇒ Xn ∈ Ln    (5)
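The support-counting step of Eq. 5 can be sketched as follows; `large_gen` and the caller-supplied profile test `matches` are illustrative names, not the paper's implementation.

```python
# Sketch of the large-itemset step (Eq. 5): an itemset of C_n is large
# when its support count reaches minsup. `matches` is a caller-supplied
# profile test; names are illustrative.

def large_gen(candidates, transactions, minsup, matches):
    """Return the candidates whose transaction count is at least minsup."""
    large = []
    for itemset in candidates:
        support = sum(1 for t in transactions if matches(t, itemset))
        if support >= minsup:
            large.append(itemset)
    return large

ts = [{"smoker": "yes"}, {"smoker": "yes"}, {"smoker": "no"}]
match = lambda t, items: all(t.get(k) == v for k, v in items)
print(large_gen([[("smoker", "yes")], [("smoker", "no")]], ts, 2, match))
# [[('smoker', 'yes')]]
```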
After the candidate itemset generation, a pruning of the candidate itemsets is performed. During this phase the algorithm checks whether any n-itemset of Cn contains (n-1)-itemsets that are not contained in the Ln-1 set. If it finds such an itemset it prunes it¹. Formally we can describe this as:

If ∃ Xn-1 ⊆ Xn ∧ Xn-1 ∉ Ln-1 ⇒ Xn ∉ Cn    (6)
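The pruning condition of Eq. 6 amounts to the standard Apriori subset check, which can be sketched as follows (illustrative code, not the paper's implementation).

```python
# Sketch of the pruning phase (Eq. 6): a candidate n-itemset is removed
# if any of its (n-1)-item subsets is not large.
from itertools import combinations

def prune(candidates, large_prev):
    """Keep only candidates whose every (n-1)-subset is in large_prev."""
    large_sets = {frozenset(x) for x in large_prev}
    kept = []
    for itemset in candidates:
        subsets = combinations(itemset, len(itemset) - 1)
        if all(frozenset(s) in large_sets for s in subsets):
            kept.append(itemset)
    return kept

print(prune([["a", "b"], ["a", "c"]], [["a"], ["b"]]))
# [['a', 'b']] -- ['a', 'c'] is pruned because ['c'] is not large
```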
The rule generation process finds all rules of the form of Eq. 2 which satisfy the criterion defined by the analyst as "measure". As noted in Section 2 we use the mean criterion, assuming that all variables in D follow the normal distribution. Thus RuleGen statistically compares the mean of each dependent variable in Y for the profile Xn against the mean of the same dependent variable for the profile of each superset of Xn. If the two means are statistically different, the rule X ⇒ y, mean(y) is added to the rule set S.
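The mean comparison performed by RuleGen can be sketched with a plain two-sample Z statistic, under the paper's normality assumption. The function names and the 1.96 threshold (5% two-sided level) are illustrative choices, not the paper's code.

```python
# Sketch of the significance test behind Eq. 3: compare the mean of y
# over the profile's transactions against the mean over a superset
# profile via a two-sample Z-test (normality assumed).
import math

def z_test(sample_a, sample_b):
    """Z statistic for the difference of the two sample means."""
    n1, n2 = len(sample_a), len(sample_b)
    m1 = sum(sample_a) / n1
    m2 = sum(sample_b) / n2
    v1 = sum((v - m1) ** 2 for v in sample_a) / n1
    v2 = sum((v - m2) ** 2 for v in sample_b) / n2
    # assumes at least one sample has non-zero variance
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

def is_significant(sample_a, sample_b, z_threshold=1.96):
    """True when the means differ significantly, i.e. a rule is reported."""
    return abs(z_test(sample_a, sample_b)) > z_threshold
```

In the algorithm, `sample_a` would hold the values of y over TXn and `sample_b` the values over a superset profile TXn-1.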
Finally, in contrast with other algorithms, the current algorithm does not combine all large (n-1)-itemsets to find the candidate n-itemsets. It first performs a homogeneity check and combines only those itemsets that pass this check, adding them to the set Nn. To do so the algorithm checks each itemset in Ln to determine the maximum statistical difference that any subset of this profile can exhibit, without taking into account any subordinate Xn+1 profile. In our case we use the max-Z criterion, which is the maximum value of the Z-test that can be observed while still concluding equality between the mean of the Xn profile and that of a subordinate profile. If the maximum Z value that can be produced by the subordinate profiles is less than the maxZ defined by the analyst, then the itemset Xn is not inserted into Nn, although it might be a member of Ln. Formally we can describe this as follows:
If ¬∃ Xn+1 : Z-Test(yTXn+1, yTXn) > maxZ ∀ y ∈ Y ⇒ Xn ∉ Nn    (7)
Fig. 2. The Angel Algorithm main loop
¹ The pruning phase was first introduced in [AS94].
Angel (D, X, Y, mindif, minsup, maxspan, maxZ, toleratePercent)
1.  D ← ExtendDimensions(D)
3.  For each yi in Y do
4.      μ(yi) ← Average(yi) in D; add μ(yi) to M1(Y)
5.      σ(yi) ← Variance(yi) in D; add σ(yi) to S1(Y)
6.  For each quantitative variable e ∈ EX do
7.      Categorize(e, Y, M1(Y), S1(Y), mindif)
8.  C1 ← All items in X = XQ ∪ XC
9.  L1 ← LargeGen(C1)          // find the first large itemsets
10. S ← S ∪ RuleGen(D, L1, X, Y, M1(Y), M0(Y), S0(Y), maxZ, mindif, minsup)
11. N1 ← CheckHomogen(L1, Y, M0(Y), S0(Y), maxZ, mindif, minsup)
12. n ← 2
13. While Nn-1 ≠ ∅
14. {
15.     Cn ← Nn-1 × L1         // Nn-1: large itemsets that passed the homogeneity check (maxZ)
16.     Cn ← Prune(Cn, Ln-1)
17.     Ln ← LargeGen(Cn)      // find the large itemsets of iteration n
18.     S ← S ∪ RuleGen(D, Ln, X, Y, Mn(Y), M0(Y), S0(Y), maxZ, mindif, minsup)
19.     Nn ← CheckHomogen(Ln, Y, M0(Y), S0(Y), maxZ, mindif, minsup)
20.     n ← n + 1
21. }
22. OutputRules(S, toleratePercent)
4. Experiments
The database used for the experiments is a financial database of bank transactions previously used in the PKDD 1999 Challenge. Our dataset D1 is a profile of the 4,500 accounts in the database and contains account-level information about properties of the account, the age of the account owner, and the population of the city where the owner lives. It also contains summarized information on the transactions made through each account, such as the totals of insurance and interest payments and the average balance, without containing transaction-level information such as transaction date or amount.
The rules produced by the algorithm support the theoretical framework presented in this study. The following aspects are of particular scientific interest:
1. The homogeneity check is efficient as a combination measure for the itemsets.
2. The separate categorization of the quantitative independent variables is necessary for multivariate analysis.
Fig. 3. Homogeneity check as a combination measure
No rule for: sex = 'M' => amount_loan Mean: -11993,3078 equal to 0 = 0 with mean -12278,2896 Z-Test: 0,2422
Itemset sex = 'M' - Variable amount_loan - Set Size 225 with maxZ: -36,3612 will be researched further
No rule for: d_region = 'west Bohemia' => amount_loan Mean: -10537,1846 equal to 0 = 0 with mean -12278,2896 Z-Test: 0,8691
Itemset d_region = 'west Bohemia' - Variable amount_loan - Set Size 225 with maxZ: -2,9408 will be researched further
Rule: sex = 'M' AND d_region = 'west Bohemia' => amount_loan Mean: 6041,0966 different from sex = 'M' with mean -11993,3078 Z-Test: 2,1902
Itemset sex = 'M' AND d_region = 'west Bohemia' - Variable amount_loan - Set Size 225 with maxZ: -0,0729 will not be researched further
5. Conclusions
The current algorithm provides a framework for a unified multivariate association rules analysis of any type of variable, either categorical or quantitative. However, no work has yet been done on optimizing the algorithm's execution time. In previous studies many methods have been proposed for optimizing the Apriori algorithm². However, due to the change in the nature of the algorithm, it is not possible to use the optimizations proposed for the Apriori algorithm without changes. Also, much of the execution time is spent in the Window procedure of the categorization of the independent quantitative variables, especially if the number of quantitative variables on the right-hand side is large or if the categorical variables of the right-hand side have many distinct values and therefore produce a large number of Boolean dependent variables. We mitigated this inefficiency by splitting one execution with many variables on each side into smaller executions with fewer variables on each side. The algorithm has been tested with datasets of up to 1 million transactions, which is generally considered a small dataset. Performance on larger datasets may be poor, especially if the categorization step is included. Future work has to be done on improving Angel's performance.
References
[AIS93] Rakesh Agrawal, Tomasz Imielinski, Arun Swami. Mining association rules between sets of items in large databases. Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207-216, Washington, D.C., USA, 1993.
² See [PYC97] for adjusting accuracy to increase performance and [FMMT96] for parallel bucketing and sampling.
[AS94] Rakesh Agrawal, Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules in Large Databases. Proceedings of the International Conference on Very Large Data Bases (VLDB), pages 487-499, Santiago, Chile, September 1994.
[AUL03] Yonatan Aumann, Yehuda Lindell. A Statistical Theory for Quantitative Association Rules. Journal of Intelligent Information Systems, Volume 20, Issue 3, pages 255-283, May 2003.
[LFH00] Hongjun Lu, Ling Feng, Jiawei Han. Beyond intratransaction association analysis: mining multidimensional intertransaction association rules. ACM Transactions on Information Systems (TOIS), Volume 18, Issue 4, pages 423-454, October 2000.
[SA96] Ramakrishnan Srikant, Rakesh Agrawal. Mining quantitative association rules in large relational tables. Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 1-12, Montreal, Quebec, Canada, 1996.