Dynamic Data Mining*
Vijay Raghavan and Alaaeldin Hafez1
(raghavan, ahafez)@cacs.louisiana.edu
The Center for Advanced Computer Studies
University of Louisiana at Lafayette
Lafayette, LA 70504, USA
Abstract. Business information received from advanced data analysis and data
mining is a critical success factor for companies wishing to maximize
competitive advantage. The use of traditional tools and techniques to discover
knowledge is ruthless and does not give the right information at the right time.
Data mining should provide tactical insights to support the strategic directions.
In this paper, we introduce a dynamic approach that uses knowledge discovered
in previous episodes. The proposed approach is shown to be effective for
solving problems related to the efficiency of handling database updates,
accuracy of data mining results, gaining more knowledge and interpretation of
the results, and performance. Our results do not depend on the approach used to
generate itemsets. In our analysis, we have used an Apriori-like approach as a
local procedure to generate large itemsets. We prove that the Dynamic Data
Mining algorithm is correct and complete.
1 Introduction
Data mining is the process of discovering potentially valuable patterns, associations,
trends, sequences and dependencies in data [1-4,10,14,17,20,21]. Key business
examples include web site access analysis for improvements in e-commerce
advertising, fraud detection, screening and investigation, retail site or product
analysis, and customer segmentation. Data mining techniques can discover
information that many traditional business analysis and statistical techniques fail to
deliver. Additionally, the application of data mining techniques further exploits the
value of a data warehouse by converting expensive volumes of data into valuable assets
for future tactical and strategic business development. Management information
systems should provide advanced capabilities that give the user the power to ask more
sophisticated and pertinent questions. Such systems empower the right people by
providing the specific information they need.
Many knowledge discovery applications [6,8,9,12,13,15,16,18,19], such as online services and world wide web applications, require accurate mining information
from data that changes on a regular basis. In such an environment, frequent or
occasional updates may change the status of some rules discovered earlier. More
information should be collected during the data mining process to allow users to gain
more complete knowledge of the significance or the importance of the generated data
mining rules.
* This research was supported in part by the U.S. Department of Energy, Grant No. DE-FG0297ER1220.
1 On leave from The Department of Computer Science and Automatic Control, Faculty of
Engineering, Alexandria University, Alexandria, Egypt.
Discovering knowledge is an expensive operation [2,4,5,6,9,10,11]. It requires
extensive access of secondary storage that can become a bottleneck for efficient
processing. Running data mining algorithms from scratch each time there is a change
in the data is obviously not an efficient strategy. Using previously discovered knowledge
along with new data updates to maintain discovered knowledge could solve many
problems that have faced data mining techniques, namely database updates, accuracy
of data mining results, gaining more knowledge and interpretation of the results, and
performance.
In this paper, we propose an approach that dynamically updates knowledge
obtained from the previous data mining process. Transactions over a long duration are
divided into a set of consecutive episodes. In our approach, information gained during
the current episode depends on the current set of transactions and the discovered
information during the last episode. Our approach discovers current data mining rules
by using updates that have occurred during the current episode along with the data
mining rules that have been discovered in the previous episode.
In section 2, a formal definition of the problem is given. The dynamic data
mining approach is introduced in section 3. In section 4, the dynamic data mining
approach is evaluated. The paper is summarized and concluded in section 5.
2 Problem Definition
Association mining, which discovers dependencies among values of an attribute, was
introduced by Agrawal et al. [1] and has emerged as an important research area. The
problem of association mining, also referred to as the market basket problem, is
formally defined as follows. Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of items and
$S = \{s_1, s_2, \ldots, s_m\}$ be a set of transactions, where each transaction $s_i \in S$ is a
set of items, that is, $s_i \subseteq I$. An association rule, denoted by $X \Rightarrow Y$, where
$X, Y \subseteq I$ and $X \cap Y = \emptyset$, describes the
existence of a relationship between the two itemsets $X$ and $Y$.
Several measures have been introduced to define the strength of the relationship
between itemsets X and Y such as SUPPORT, CONFIDENCE, and INTEREST
[1,2,5,7]. The definitions of these measures, from a probabilistic viewpoint, are given
below.
I. $\mathrm{SUPPORT}(X \Rightarrow Y) = P(X, Y)$, or the percentage of transactions in the database that
contain both $X$ and $Y$.
II. $\mathrm{CONFIDENCE}(X \Rightarrow Y) = P(X, Y) / P(X)$, or the percentage of transactions
containing $Y$ among those transactions containing $X$.
III. $\mathrm{INTEREST}(X \Rightarrow Y) = P(X, Y) / (P(X) P(Y))$, which represents a test of statistical
independence.
SUPPORT for an itemset $S$ is calculated as
$$\mathrm{SUPPORT}(S) = \frac{F(S)}{F}$$
where $F(S)$ is the number of transactions having $S$, and $F$ is the total number of
transactions.
For a minimum SUPPORT value MINSUP, $S$ is a large (or frequent) itemset if
$\mathrm{SUPPORT}(S) \ge \mathrm{MINSUP}$, or $F(S) \ge F * \mathrm{MINSUP}$.
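The three measures and the large-itemset test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the toy baskets are hypothetical and do not come from the paper, and exact fractions are used so that comparisons against MINSUP are not affected by rounding.

```python
from fractions import Fraction


def support(transactions, itemset):
    """SUPPORT(S) = F(S) / F: share of transactions that contain every item of S."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= frozenset(t))
    return Fraction(hits, len(transactions))


def confidence(transactions, x, y):
    """CONFIDENCE(X => Y) = P(X, Y) / P(X)."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


def interest(transactions, x, y):
    """INTEREST(X => Y) = P(X, Y) / (P(X) * P(Y)); 1 means X and Y look independent."""
    joint = support(transactions, set(x) | set(y))
    return joint / (support(transactions, x) * support(transactions, y))


def is_large(transactions, itemset, minsup):
    """S is large (frequent) when SUPPORT(S) >= MINSUP."""
    return support(transactions, itemset) >= minsup


# Hypothetical toy baskets, not taken from the paper.
baskets = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b", "d"}]

print(support(baskets, {"a", "b"}))                   # 3/5
print(confidence(baskets, {"a"}, {"b"}))              # 3/4
print(interest(baskets, {"a"}, {"b"}))                # 15/16
print(is_large(baskets, {"a", "b"}, Fraction(1, 2)))  # True
```

In this toy data the interest of a => b is slightly below 1, so a and b co-occur a little less often than independence would predict, even though the rule has high confidence.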
Suppose we have divided the transaction set T into two subsets T1 and T2,
corresponding to two consecutive time intervals, where F1 is the number of
transactions in T1 and F2 is the number of transactions in T2, (F=F1+F2), and F1(S) is
the number of transactions having S in T1 and F2(S) is the number of transactions
having S in T2, (F(S)=F1(S)+F2(S)). By calculating the SUPPORT of S in each of the
two subsets, we get
$$\mathrm{SUPPORT}_1(S) = \frac{F_1(S)}{F_1} \quad \text{and} \quad \mathrm{SUPPORT}_2(S) = \frac{F_2(S)}{F_2}$$
S is a large itemset if
$$\frac{F_1(S) + F_2(S)}{F_1 + F_2} \ge \mathrm{MINSUP}$$
or
$$F_1(S) + F_2(S) \ge (F_1 + F_2) * \mathrm{MINSUP}$$
In order to find out whether S is a large itemset or not, we consider four cases:
• S is a large itemset in T1 and also a large itemset in T2, i.e.,
$F_1(S) \ge F_1 * \mathrm{MINSUP}$ and $F_2(S) \ge F_2 * \mathrm{MINSUP}$.
• S is a large itemset in T1 but a small itemset in T2, i.e.,
$F_1(S) \ge F_1 * \mathrm{MINSUP}$ and $F_2(S) < F_2 * \mathrm{MINSUP}$.
• S is a small itemset in T1 but a large itemset in T2, i.e.,
$F_1(S) < F_1 * \mathrm{MINSUP}$ and $F_2(S) \ge F_2 * \mathrm{MINSUP}$.
• S is a small itemset in T1 and also a small itemset in T2, i.e.,
$F_1(S) < F_1 * \mathrm{MINSUP}$ and $F_2(S) < F_2 * \mathrm{MINSUP}$.
In the first and fourth cases, S is a large itemset and a small itemset in the transaction
set T, respectively, while in the second and third cases it is not immediately clear whether S
is a small itemset or a large itemset. Formally speaking, let $\mathrm{SUPPORT}(S) = \mathrm{MINSUP} + \delta$,
where $\delta \ge 0$ if S is a large itemset, and $\delta < 0$ if S is a small itemset. Writing
$\delta_1$ and $\delta_2$ for the values of $\delta$ in T1 and T2, the above
four cases have the following characteristics:
• $\delta_1 \ge 0$ and $\delta_2 \ge 0$
• $\delta_1 \ge 0$ and $\delta_2 < 0$
• $\delta_1 < 0$ and $\delta_2 \ge 0$
• $\delta_1 < 0$ and $\delta_2 < 0$
S is a large itemset if
$$\frac{F_1 * (\mathrm{MINSUP} + \delta_1) + F_2 * (\mathrm{MINSUP} + \delta_2)}{F_1 + F_2} \ge \mathrm{MINSUP}$$
or
$$F_1 * (\mathrm{MINSUP} + \delta_1) + F_2 * (\mathrm{MINSUP} + \delta_2) \ge \mathrm{MINSUP} * (F_1 + F_2)$$
which can be written as $F_1 * \delta_1 + F_2 * \delta_2 \ge 0$.
Generally, let the transaction set T be divided into n transaction subsets $T_i$, $1 \le i \le n$.
S is a large itemset if
$$\sum_{i=1}^{n} F_i * \delta_i \ge 0$$
where $F_i$ is the number of transactions in $T_i$ and
$\delta_i = \mathrm{SUPPORT}_i(S) - \mathrm{MINSUP}$, with $-\mathrm{MINSUP} \le \delta_i \le 1 - \mathrm{MINSUP}$, $1 \le i \le n$.
For those cases where $\sum_{i=1}^{n} F_i * \delta_i < 0$, there are two options, either
• discard S as a large itemset (a small itemset with no history record
maintained), or
• keep it for future calculations (a small itemset with history record
maintained). In this case, we are not going to report it as a large itemset, but
its $\sum_{i=1}^{n} F_i * \delta_i$ formula will be maintained and checked through the future
intervals.
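The test on the accumulated sum can be checked numerically. The sketch below is an illustration only: the function names and the (count, size) pairs are hypothetical. It uses the identity $F_i * \delta_i = F_i(S) - F_i * \mathrm{MINSUP}$, which follows directly from the definition of $\delta_i$, and compares the result with the support computed on the pooled transactions.

```python
from fractions import Fraction


def episode_term(count, size, minsup):
    """F_i * delta_i, which equals F_i(S) - F_i * MINSUP for one transaction subset."""
    return count - size * minsup


def is_large_over_episodes(episodes, minsup):
    """episodes: list of (F_i(S), F_i) pairs; large iff the sum of F_i * delta_i >= 0."""
    return sum(episode_term(c, f, minsup) for c, f in episodes) >= 0


def is_large_pooled(episodes, minsup):
    """Same decision computed directly from the pooled counts, for comparison."""
    total_count = sum(c for c, _ in episodes)
    total_size = sum(f for _, f in episodes)
    return Fraction(total_count, total_size) >= minsup


MINSUP = Fraction(35, 100)
# Hypothetical (F_i(S), F_i) pairs: large in the first subset, small in the second.
borderline = [(40, 100), (30, 100)]   # terms +5 and -5, sum 0
declining = [(40, 100), (20, 100)]    # terms +5 and -15, sum -10

print(is_large_over_episodes(borderline, MINSUP))  # True
print(is_large_over_episodes(declining, MINSUP))   # False
```

Only one running total per itemset has to be stored between episodes, which is what makes the incremental test cheap compared with recounting over the whole file.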
3 The Dynamic Data Mining Approach
For $\sum_{i=1}^{n} F_i * \delta_i < 0$, the two options described above could be combined into a
single decision rule that says discard S if
$$\frac{\sum_{i=k}^{n} F_i * (\mathrm{MINSUP} + \delta_i)}{\sum_{i=k}^{n} F_i} < \frac{\mathrm{MINSUP}}{\alpha}$$
where $1 \le \alpha < \infty$ and $k \ge 1$. The rule has two outcomes:
• Discard S from the set of large itemsets (it becomes a small itemset with no history record). With $\alpha = 1$, an itemset whose accumulated support falls below MINSUP is always discarded.
• Keep it for future calculations (it becomes a small itemset with a history record).
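A minimal sketch of this decision rule follows, assuming the accumulated count $\sum F_i(S) = \sum F_i * (\mathrm{MINSUP} + \delta_i)$ and the accumulated size $\sum F_i$ since subset $T_k$ are already available. The function name, the return labels, and the numbers are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def history_decision(count_sum, size_sum, minsup, alpha):
    """Apply the discard/keep rule to counts accumulated since subset T_k.

    count_sum is the accumulated F_i(S); size_sum is the accumulated F_i.
    """
    if alpha < 1:
        raise ValueError("alpha must satisfy 1 <= alpha")
    ratio = Fraction(count_sum, size_sum)
    if ratio >= minsup:
        return "large"
    if ratio < minsup / alpha:
        return "discard"   # small itemset, no history record
    return "keep"          # small itemset, history record maintained


MINSUP = Fraction(35, 100)
# Illustrative accumulated values: 17 occurrences in 83 transactions (about 0.20).
print(history_decision(17, 83, MINSUP, alpha=1))  # discard
print(history_decision(17, 83, MINSUP, alpha=2))  # keep
```

Larger values of alpha lower the threshold MINSUP / alpha, so more small itemsets keep a history record.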
The value of $\alpha$ determines how much history information would be carried. This
history information, along with the calculated values of locality, can be used to
• determine the significance or the importance of the generated emerged-large
itemsets,
• determine the significance or the importance of the generated declined-large
itemsets, and
• generate large itemsets with lower SUPPORT values without having to rerun
the mining procedure.
The choice of the value of $\alpha$ is the essence of our approach. If $\alpha$ is chosen
near 1, we will have fewer declined-large itemsets and
more emerged-large itemsets, and those emerged-large itemsets are more likely to
have become large in the most recent episodes. If $\alpha$ is
chosen far from 1, we will have more declined-large itemsets and
fewer emerged-large itemsets, and those emerged-large itemsets are more likely to also be large
itemsets under the Apriori-like approach.
In this section, we introduce the notions of declined-large itemset, emerged-large
itemset, and locality.
Definition 3.1: Let S be a large itemset (or an emerged-large itemset; see
Definition 3.2) in a transaction subset $T_l$, $l \ge 1$. S is called a declined-large itemset in
transaction subset $T_n$, $n > l$, if
$$\mathrm{MINSUP} > \frac{\sum_{i=k}^{m} F_i * (\mathrm{MINSUP} + \delta_i)}{\sum_{i=k}^{m} F_i} \ge \frac{\mathrm{MINSUP}}{\alpha}$$
for all $l < m \le n$, where $1 \le k \le m$ and $1 \le \alpha < \infty$.
Definition 3.2: S is called an emerged-large itemset in transaction subset $T_n$, $n > 1$, if
S was a small itemset in transaction subset $T_{n-1}$ and $F_n * \delta_n \ge 0$, or S was a
declined-large itemset in transaction subset $T_{n-1}$, $n > 1$, and
$\sum_{i=k}^{n} F_i * \delta_i \ge 0$, $k \ge 1$.
Definition 3.3: For an itemset S and a transaction subset $T_n$, locality(S) is defined as
the ratio of the total size of those transaction subsets where S is either a large itemset
or an emerged-large itemset to the total size of the transaction subsets $T_i$, $1 \le i \le n$:
$$\mathrm{locality}(S) = \frac{\sum_{i \,:\, S \text{ is a large or an emerged-large itemset in } T_i} F_i}{\sum_{i=1}^{n} F_i}$$
Clearly, locality(S) = 1 for all large itemsets S.
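Locality is a simple size-weighted ratio, as the following sketch shows. The function name, the subset sizes, and the flags are hypothetical and serve only as an illustration of Definition 3.3.

```python
from fractions import Fraction


def locality(sizes, is_large_or_emerged):
    """sizes[i] is F_i; is_large_or_emerged[i] tells whether S is large or
    emerged-large in transaction subset T_i."""
    if len(sizes) != len(is_large_or_emerged):
        raise ValueError("one flag per transaction subset is required")
    covered = sum(f for f, flag in zip(sizes, is_large_or_emerged) if flag)
    return Fraction(covered, sum(sizes))


# Hypothetical subset sizes F_1..F_3 and flags for one itemset.
sizes = [40, 30, 50]
print(locality(sizes, [True, True, True]))    # 1
print(locality(sizes, [False, True, False]))  # 1/4
```

A locality well below 1 indicates that the itemset was large in only a small part of the transaction history.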
The dynamic data mining approach generates three sets of itemsets:
• large itemsets, which satisfy the rule $\sum_{i=1}^{n} F_i * \delta_i \ge 0$, where n is the number of
intervals carried out by the dynamic data mining approach;
• declined-large itemsets, which were large at previous intervals and still
maintain the rule
$$\mathrm{MINSUP} > \frac{\sum_{i=k}^{m} F_i * (\mathrm{MINSUP} + \delta_i)}{\sum_{i=k}^{m} F_i} \ge \frac{\mathrm{MINSUP}}{\alpha}$$
for some value $\alpha$;
• emerged-large itemsets, which were
- either small itemsets that, at a transaction subset $T_k$, satisfied the
rule $F_k * \delta_k \ge 0$ and still satisfy the rule $\sum_{i=k}^{n} F_i * \delta_i \ge 0$,
- or declined-large itemsets that, at a transaction subset $T_m$,
satisfied the rule $\sum_{i=k}^{m} F_i * \delta_i \ge 0$ and still satisfy the rule $\sum_{i=k}^{n} F_i * \delta_i \ge 0$.
Example: Let I={a,b,c,d,e,f,g,h} be a set of items, MINSUP=0.35, and T be a set of
transactions, divided into the transaction subsets T1, T2, and T3 listed below.

Transaction Subset T1      Transaction Subset T2      Transaction Subset T3
Transactions   count       Transactions   count       Transactions   count
{a,b,g,h}      3           {c,h}          12          {a}            10
{b,c,d}        10          {b,d,g}        8           {a,b,g,h}      5
{a,c}          2           {a,c}          9           {b,c,d}        10
{c,g}          4           {b,c}          1           {a,c}          2
{d,e,f}        1           {g,h}          4           {c,g}          4
{e,g,h}        4                                      {d,e,f}        1
{a,b,d}        2                                      {e,g,h}        4
{b,d,f}        1                                      {a,b,d}        2
{d,f,h}        5                                      {b,d,f}        1
{c,h}          5                                      {d,f,h}        5
                                                      {c,h}          5

For $\alpha$ = 1,

Transaction   large or emerged-   count   SUPPORT   status                  locality
Subset        large itemsets
T1            {b}                 16      0.43      large itemset           1
T1            {c}                 21      0.57      large itemset           1
T1            {d}                 14      0.38      large itemset           1
T1            {h}                 17      0.46      large itemset           1
T1            {bd}                13      0.35      large itemset           1
T2            {b}                 25      0.35      large itemset           1
T2            {c}                 43      0.60      large itemset           1
T2            {h}                 33      0.46      large itemset           1
T2            {ch}                12      0.35      emerged-large itemset
T3            {a}                 19      0.39      emerged-large itemset   0.41
T3            {b}                 43      0.36      large itemset           1
T3            {c}                 64      0.53      large itemset           1
T3            {h}                 52      0.43      large itemset           1
For $\alpha$ = 2,

Transaction   large or emerged-   count   SUPPORT   status                   locality
Subset        large itemsets
T1            {b}                 16      0.43      large itemset            1
T1            {c}                 21      0.57      large itemset            1
T1            {d}                 14      0.38      large itemset            1
T1            {h}                 17      0.46      large itemset            1
T1            {bd}                13      0.35      large itemset            1
T2            {b}                 25      0.35      large itemset            1
T2            {c}                 43      0.60      large itemset            1
T2            {d}                 22      0.31      declined-large itemset   0.52
T2            {g}                 12      0.35      emerged-large itemset    0.48
T2            {h}                 33      0.46      large itemset            1
T2            {bd}                18      0.25      declined-large itemset   0.52
T2            {ch}                12      0.35      emerged-large itemset    0.48
T3            {a}                 19      0.39      emerged-large itemset    0.41
T3            {b}                 43      0.36      large itemset            1
T3            {c}                 64      0.53      large itemset            1
T3            {d}                 36      0.30      declined-large itemset   0.31
T3            {g}                 25      0.30      declined-large itemset   0.28
T3            {h}                 52      0.43      large itemset            1
T3            {bd}                31      0.26      declined-large itemset   0.31
T3            {ch}                17      0.20      declined-large itemset   0.28
When applying an Apriori-like algorithm to the whole file, the resulting large
itemsets are

large itemsets   count   SUPPORT
{b}              43      0.39
{c}              64      0.58
{h}              52      0.47
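The behaviour of itemset {b} in the example can be traced with the accumulated-sum test. The sketch below is an illustration under stated assumptions: the subset sizes 37, 34, and 49 are obtained here by summing the transaction counts listed for T1, T2, and T3 (they are not stated explicitly in the text), and the cumulative counts 16, 25, and 43 are the ones reported for {b} in the dynamic tables.

```python
from fractions import Fraction
from itertools import accumulate

MINSUP = Fraction(35, 100)

# Subset sizes F_1..F_3: sums of the transaction counts listed for T1, T2, T3
# (derived here, not stated in the text).
sizes = [37, 34, 49]
# Cumulative counts reported for {b} after T1, T2, T3.
cumulative_b = [16, 25, 43]

# Per-subset counts F_i({b}) recovered from the cumulative figures.
per_subset_b = [cumulative_b[0]] + [
    later - earlier for earlier, later in zip(cumulative_b, cumulative_b[1:])
]

# F_i * delta_i = F_i(S) - F_i * MINSUP, followed by its running sum.
terms = [count - size * MINSUP for count, size in zip(per_subset_b, sizes)]
running = list(accumulate(terms))

for n, (term, total) in enumerate(zip(terms, running), start=1):
    support = Fraction(cumulative_b[n - 1], sum(sizes[:n]))
    print(f"T{n}: term={float(term):+.2f} running={float(total):+.2f} "
          f"support={float(support):.2f}")
```

Under these assumptions {b} is small within T2 taken alone, but the surplus carried over from T1 keeps the running sum non-negative, so it remains a large itemset with locality 1 throughout.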
By comparing the results in the previous example, we can draw some intuitions
about the proposed approach, which can be summarized as follows:
• The set of large itemsets and emerged-large itemsets generated by our
Dynamic approach is a superset of the set of large itemsets generated by
the Apriori-like approach.
• If there is an itemset generated by our Dynamic approach but not generated
by the Apriori-like approach as a large itemset, then this itemset should be
large at the latest consecutive time intervals, i.e., an emerged-large itemset.
In Lemmas 3.1 and 3.2, we prove the above intuitions.
Lemma 3.1: For a transaction set T, the set of large itemsets and emerged-large
itemsets generated by our Dynamic approach is a superset of the set of large itemsets
generated by the Apriori-like approach.
Proof: Let $\bigcup_i T_i = T$, $1 \le i \le n$, $F_i = |T_i|$, and S be a large itemset that is generated by the
Apriori-like approach, i.e., $\sum_{i=1}^{n} F_i * \delta_i \ge 0$, and not by our Dynamic approach. There
are two cases to consider.
Case 1: $\alpha = 1$
For a transaction subset $T_k$, $1 \le k \le n$, S is discarded from the set of large
itemsets if it becomes a small itemset, i.e., $\sum_{i=m}^{k} F_i * \delta_i < 0$, $1 \le m \le k$, and no history
is recorded. Since no history is recorded before m, that means
$\sum_{i=1}^{m-1} F_i * \delta_i < 0$. That leads to $\sum_{i=1}^{k} F_i * \delta_i < 0$. For k=n, we have
$\sum_{i=1}^{n} F_i * \delta_i < 0$, which contradicts our
assumption.
Case 2: $\alpha > 1$
For a transaction subset $T_k$, $1 \le k \le n$, S is discarded from the set of large
itemsets if it becomes a small itemset, i.e., $\sum_{i=m}^{k} F_i * \delta_i < 0$, $1 \le m \le k$, and,
depending on the value of $\alpha$, its history starts to be recorded in transaction
subset $T_m$. Since no history is recorded before m, that means
$\sum_{i=1}^{m-1} F_i * \delta_i < 0$. That leads to $\sum_{i=1}^{k} F_i * \delta_i < 0$. For k=n, we have
$\sum_{i=1}^{n} F_i * \delta_i < 0$, which contradicts our
assumption.
Lemma 3.2: If there is an itemset generated by our Dynamic approach but not
generated by the Apriori-like approach as a large itemset, then this itemset should be
large at the latest consecutive time intervals, i.e., an emerged-large itemset.
Proof: The proof follows directly from the proof of Lemma 3.1.
Algorithm DynamicMining ($T_n$)
$f_1(T_n)$ is the set of large and emerged-large itemsets.
$f_1^*(T_n)$ is the set of declined-large itemsets.
$\Delta_x$ is the accumulated value of $F_i * \delta_i^x$. $\Phi$ is the accumulated value of $F_i$.
$Cl_x$ is the accumulated value of $F_i$ where itemset x is large.
begin
  $\Phi = \Phi + F_n$
  $f_1(T_n) = \{ (x, Cl_x),\ Cl_x = F_n \mid x \notin f_1(T_{n-1}) \wedge x \notin f_1^*(T_{n-1}) \wedge F_n * \delta_n^x \ge 0 \} \cup \{ (x, Cl_x),\ Cl_x = Cl_x + F_n \mid \Delta_x + F_n * \delta_n^x \ge 0 \}$   //large or emerged-large itemset
  $f_1^*(T_n) = \{ (x, Cl_x) \mid \mathrm{MINSUP} > \frac{\Phi * \mathrm{MINSUP} + \Delta_x}{\Phi} \ge \frac{\mathrm{MINSUP}}{\alpha} \}$   //was large itemset
  for (k=2; $f_{k-1}(T_n) \ne \emptyset$; k++) do
  begin
    $C_k$ = AprioriGen($f_{k-1}(T_n) \cup f_{k-1}^*(T_n)$)
    forall transactions $t \in T_n$ do
      forall candidates $c \in C_k$ do
        if $c \subseteq t$ then c.count++
    $f_k(T_n) = \{ (x, Cl_x),\ x \in C_k,\ Cl_x = F_n \mid x \notin f_k(T_{n-1}) \wedge x \notin f_k^*(T_{n-1}) \wedge F_n * \delta_n^x \ge 0 \} \cup \{ (x, Cl_x),\ x \in C_k,\ Cl_x = Cl_x + F_n \mid \Delta_x + F_n * \delta_n^x \ge 0 \}$
    $f_k^*(T_n) = \{ (x, Cl_x) \mid \mathrm{MINSUP} > \frac{\Phi * \mathrm{MINSUP} + \Delta_x}{\Phi} \ge \frac{\mathrm{MINSUP}}{\alpha} \}$
  end
  return $f_k(T_n)$ and $f_k^*(T_n)$
end
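The bookkeeping that the algorithm performs for a single itemset can be sketched as follows. This is one possible reading of Definitions 3.1 to 3.3, not the authors' implementation: the class name, attribute names, and status labels are hypothetical, and the history is assumed to restart from the subset in which a discarded or previously small itemset becomes large again. The per-subset counts for {ch} (5, 12, 5) and the subset sizes (37, 34, 49) are derived here from the transactions listed in the example.

```python
from fractions import Fraction


class ItemsetHistory:
    """Bookkeeping for one itemset across transaction subsets (hypothetical helper)."""

    def __init__(self, minsup, alpha):
        self.minsup = minsup
        self.alpha = alpha
        self.total_size = 0   # sum of F_i over all subsets seen so far
        self.tracked = False  # True while a history record is maintained
        self.status = "small"
        self.count_acc = 0    # sum of F_i(S) since the history started (subset T_k)
        self.size_acc = 0     # sum of F_i since the history started
        self.large_size = 0   # sum of F_i over subsets where S was large or emerged-large

    def update(self, count, size):
        """Process subset T_n with F_n(S)=count and F_n=size; return (status, locality)."""
        first_subset = self.total_size == 0
        self.total_size += size
        if not self.tracked:
            if count >= size * self.minsup:              # F_n * delta_n >= 0
                self.tracked = True
                self.count_acc, self.size_acc, self.large_size = count, size, size
                self.status = "large" if first_subset else "emerged-large"
            else:
                self.status = "small"
        else:
            self.count_acc += count
            self.size_acc += size
            if self.count_acc >= self.size_acc * self.minsup:   # accumulated sum >= 0
                self.large_size += size
                if self.status == "declined-large":
                    self.status = "emerged-large"
            elif self.count_acc * self.alpha >= self.size_acc * self.minsup:
                self.status = "declined-large"           # MINSUP > ratio >= MINSUP / alpha
            else:
                self.tracked = False                     # discard: no history record kept
                self.count_acc = self.size_acc = self.large_size = 0
                self.status = "small"
        return self.status, Fraction(self.large_size, self.total_size)


def trace(episodes, minsup, alpha):
    """Run one itemset through a list of (F_i(S), F_i) pairs."""
    history = ItemsetHistory(minsup, alpha)
    return [history.update(count, size) for count, size in episodes]


MINSUP = Fraction(35, 100)
ch_episodes = [(5, 37), (12, 34), (5, 49)]  # derived from the example transactions

for alpha in (1, 2):
    print(alpha, [(s, round(float(loc), 2)) for s, loc in trace(ch_episodes, MINSUP, alpha)])
```

Under these assumptions {ch} is emerged-large in T2 with locality 34/71 (about 0.48) for both values of alpha; in T3 it is discarded for alpha = 1 and kept as a declined-large itemset with locality 34/120 (about 0.28) for alpha = 2.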
function AprioriGen($f_{k-1}$)
  insert into $C_k$
    select $l_1, l_2, \ldots, l_{k-1}, c_{k-1}$
    from $f_{k-1}$ l, $f_{k-1}$ c
    where $l_1 = c_1 \wedge l_2 = c_2 \wedge \ldots \wedge l_{k-2} = c_{k-2} \wedge l_{k-1} < c_{k-1}$
  delete all itemsets $c \in C_k$ such that some (k-1)-subset of c is not in $f_{k-1}(T_n)$
  return $C_k$
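The join and prune steps of the candidate generation can be sketched in Python as below. The function name `apriori_gen` and the toy input are hypothetical; itemsets are represented as frozensets and sorted into tuples for the join, which is an implementation choice of this sketch rather than something prescribed by the text.

```python
from itertools import combinations


def apriori_gen(prev_level):
    """prev_level: set of frozensets, all of size k-1. Returns candidate k-itemsets."""
    prev_sorted = sorted(tuple(sorted(s)) for s in prev_level)
    candidates = set()
    # Join step: pair itemsets that agree on the first k-2 items and whose
    # last items are in increasing order.
    for left in prev_sorted:
        for right in prev_sorted:
            if left[:-1] == right[:-1] and left[-1] < right[-1]:
                candidates.add(frozenset(left + (right[-1],)))
    # Prune step: drop a candidate if some (k-1)-subset is not in prev_level.
    size = len(prev_sorted[0]) if prev_sorted else 0
    return {
        c for c in candidates
        if all(frozenset(sub) in prev_level for sub in combinations(c, size))
    }


# Hypothetical set of large 2-itemsets.
level2 = {frozenset(p) for p in ({"a", "b"}, {"a", "c"}, {"b", "c"}, {"b", "d"})}
print(apriori_gen(level2))  # only {a, b, c} survives; {b, c, d} is pruned because {c, d} is missing
```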
Lemma 3.3: The Dynamic Data Mining approach is correct.
Proof: (See Lemmas 3.1 and 3.2)
4 Analysis and Performance Study
In the DynamicMining algorithm, we used an Apriori-like approach as a local
procedure to generate large or emerged-large itemsets. We would like to emphasize
the fact that our approach does not depend on the approach used to generate itemsets.
The main contribution of our approach is to dynamically generate large itemsets
using only the transaction updates and the information collected in the previous data
mining episode.
Assuming that an Apriori-like procedure is used as a local procedure, the total
number of disk accesses needed for performing the DynamicMining algorithm is
$$\sum_{i=1}^{n} K_i N_i$$
where $N_i$ is the size (number of disk blocks) of the transaction subset $T_i$, $1 \le i \le n$,
and $K_i$ is the length of the longest large itemset. On the other hand, the total number of
disk accesses needed for performing an Apriori-like algorithm, which is carried out each
time on the whole transaction file, is
$$\sum_{i=1}^{n} K_i \sum_{j=1}^{i} N_j \quad \text{or} \quad \sum_{i=1}^{n} K_i * (\text{\# of disk blocks of } T)$$
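The two cost expressions can be compared numerically. In the sketch below the function names and the values of $K_i$ and $N_i$ are hypothetical, and the same $K_i$ is assumed for both approaches purely to keep the comparison simple.

```python
from itertools import accumulate


def dynamic_disk_accesses(K, N):
    """Sum of K_i * N_i: each episode scans only its own subset T_i."""
    return sum(k * n for k, n in zip(K, N))


def rerun_disk_accesses(K, N):
    """Sum of K_i * (N_1 + ... + N_i): each run scans the whole file so far."""
    return sum(k * total for k, total in zip(K, accumulate(N)))


# Hypothetical values: three episodes of 100 disk blocks each.
K = [2, 2, 1]
N = [100, 100, 100]
print(dynamic_disk_accesses(K, N))  # 500
print(rerun_disk_accesses(K, N))    # 900
```

The gap widens with the number of episodes, since the rerun cost of episode i grows with the size of everything accumulated before it.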
In our preliminary experimental results, the Dynamic Mining algorithm has
shown significant potential. Four main factors have been considered in our
study, namely,
• The performance of the Dynamic Mining algorithm in terms of disk access and CPU time.
• The knowledge gained by using different values of $\alpha$.
• The effect of the locality values on the knowledge discovered through the data mining process.
• The generation of the emerged-large itemsets and declined-large itemsets and the significance
of having this information.
5 Conclusions and Future Work
In this paper, we have introduced a Dynamic Data Mining approach. The proposed
approach periodically performs the data mining process on the data updates of the
current episode and uses the knowledge captured in the previous episode to produce
data mining rules. We have introduced the concept of locality along with the
definitions of emerged-large itemsets and declined-large itemsets. The new approach
addresses some of the problems that current data mining techniques suffer from, such as
database updates, accuracy of data mining results, gaining more knowledge and
interpretation of the results, and performance.
In our approach, we dynamically update knowledge obtained from the previous data
mining process. The transaction domain is treated as a set of consecutive episodes, and
the information gained during the current episode depends on the current set of transactions
and the information discovered during the previous episode. In our preliminary
experimental results, the Dynamic Mining algorithm has shown significant potential.
We have discussed the efficiency of the Dynamic Mining algorithm in terms of
disk accesses. We have also shown the significance of the knowledge discovered by
using different values of $\alpha$, and the effect of the locality values, along with the
generation of the emerged-large itemsets and declined-large itemsets, on that
knowledge. Finally, we have proved that the Dynamic Data Mining algorithm is
correct.
As future work, the Dynamic approach will be tested with different datasets that
cover a large spectrum of data mining applications, such as web site access
analysis for improvements in e-commerce advertising, fraud detection, screening and
investigation, retail site or product analysis, and customer segmentation.
References
[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of
Items in Large Databases," Proc. of the ACM SIGMOD Int'l Conf. on Management of
Data, May 1993.
[2] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. of
the 20th VLDB Conference, Santiago, Chile, 1994.
[3] R. Agrawal and J. Shafer, "Parallel Mining of Association Rules," IEEE Transactions on
Knowledge and Data Engineering, Vol. 8, No. 6, Dec. 1996.
[4] C. Aggarwal and P. Yu, "Mining Large Itemsets for Association Rules," Bulletin of the
IEEE Computer Society Technical Committee on Data Engineering, 1997.
[5] S. Brin, R. Motwani, et al., "Dynamic Itemset Counting and Implication Rules for Market
Basket Data," SIGMOD Record (ACM Special Interest Group on Management of Data),
26, 2, 1997.
[6] S. Chaudhuri, "Data Mining and Database Systems: Where is the Intersection," Bulletin
of the IEEE Computer Society Technical Committee on Data Engineering, 1997.
[7] M. Chen, J. Han, and P. Yu, "Data Mining: An Overview from a Database Perspective,"
IEEE Trans. Knowledge and Data Engineering, 8, 1996.
[8] M. Chen, J. Park, and P. Yu, "Data Mining for Path Traversal Patterns in a Web
Environment," Proc. 16th Intl. Conf. Distributed Computing Systems, May 1996.
[9] D. Cheung, J. Han, et al., "Maintenance of Discovered Association Rules in Large
Databases: An Incremental Updating Technique," In Proc. 12th Intl. Conf. on Data
Engineering, New Orleans, Louisiana, 1996.
[10] U. Fayyad, G. Piatetsky-Shapiro, et al., "Advances in Knowledge Discovery and Data Mining,"
AAAI/MIT Press, 1996.
[11] A. Hafez, J. Deogun, and V. Raghavan, "The Item-Set Tree: A Data Structure for Data
Mining," DaWaK '99 Conference, Firenze, Italy, Aug. 1999.
[12] C. Kurzke, M. Galle, and M. Bathelt, "WebAssist: a user profile specific information
retrieval assistant," Seventh International World Wide Web Conference, Brisbane,
Australia, April 1998.
[13] M. Langheinrich, A. Nakamura, et al., "Un-intrusive Customization Techniques for Web
Advertising," The Eighth International World Wide Web Conference, Toronto, Canada,
May 1999.
[14] H. Mannila, H. Toivonen, and A. Verkamo, "Efficient Algorithms for Discovering
Association Rules," AAAI Workshop on Knowledge Discovery in Databases (KDD-94),
July 1994.
[15] M. Perkowitz and O. Etzioni, "Adaptive Sites: Automatically Learning from User Access
Patterns," In Proc. 6th Int. World Wide Web Conf., Santa Clara, California, April 1997.
[16] J. Pitkow, "In Search of Reliable Usage Data on the WWW," In Proc. 6th Int. World
Wide Web Conf., Santa Clara, California, April 1997.
[17] G. Rossi, D. Schwabe, and F. Lyardet, "Improving Web Information Systems with
Navigational Patterns," The Eighth International World Wide Web Conference, Toronto,
Canada, May 1999.
[18] N. Serbedzija, "The Web Supercomputing Environment," Seventh International World
Wide Web Conference, Brisbane, Australia, April 1998.
[19] T. Sullivan, "Reading Reader Reaction: A Proposal for Inferential Analysis of Web
Server Log Files," In Proc. 3rd Conf. Human Factors & The Web, Denver, Colorado, June
1997.
[20] C. Wills and M. Mikhailov, "Towards a Better Understanding of Web Resources and
Server Responses for Improved Caching," The Eighth International World Wide Web
Conference, Toronto, Canada, May 1999.
[21] M. Zaki, S. Parthasarathy, et al., "New Algorithms for Fast Discovery of Association
Rules," Proc. of the 3rd Int'l Conf. on Knowledge Discovery and Data Mining (KDD-97), AAAI Press, 1997.