Association Rule Mining

advertisement
Association Rule Mining
COMP 790-90 Seminar
GNET 713 BCB Module
Spring 2007
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Approximate match
Compatibility Matrix
When you observe d1
Spread count as
d1: 90%, d2: 5%, d3: 5%
2
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Match
The degree to which pattern P is retained/reflected in S
M(P,S) = P(P|S)= C(p,s) when when lS=lP
M(P,S) = max over all possible when lS>lP
Example
P
S
M
d1d1 d1d3
0.9*0
d1d2
d1d2
d1d2
d1d2
3
d1d2
d1d3
d2d3
d1d2d3
0.9*0.8
0.9*0.05
0.1*0.05
0.9*0.8
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Calculate Max over all
Dynamic Programming
M(p1p2..pi, s1s2…sj)= Max of
M(p1p2..pi-1, s1s2…sj-1) * C(pi,sj)
M(p1p2..pi, s1s2…sj-1)
O(lP*lS)
When compatibility Matrix is sparse O(lS)
4
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Match in D
Average over all sequences in D
5
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Spread of match
If compatibility matrix is identity matrix
Match = support
6
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Anti-Monotone
The match of a pattern P in a symbol sequence S is
less than or equal to the match of any subpattern
of P in S
The match of a pattern P in a sequence database D
is less than or equal to the match of any subpattern
of P in D
Can use any support based algorithm
More patterns match so require efficient solution
Sample based algorithms
Border collapsing of ambiguous patterns
7
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Chernoff Bound
Given sample size=n, range R,
with probability 1-
true value:  
 = sqrt([R2ln(1/)]/2n)
Distribution free
More conservative
Sample size : fit in memory
Restricted spread :
Frequent Patterns
min_match + 
min_match - 
Infrequent patterns
For pattern P= p1p2..pL
R=min (match[pi]) for all 1  i L
8
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Algorithm
Scan DB: O(N*min (Ls*m, Ls+m2))
Find the match of each individual symbol
Take a random sample of sequences
Identify borders that embrace the set of ambiguous patterns
O(mLp * |S| * Lp * n)
Min_match  
existing methods for association rule mining
Locate the border of frequent patterns
in the entire DB
via border collapsing
9
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Border Collapsing
If memory can not hold the counters for all
ambiguous counters
Probe-and-collapse : binary search
Probe patterns with highest collapsing power until
memory is filled
If memory can hold all patterns up to the 1/x layer
the space of of ambiguous patterns can be narrowed to
at least 1/x of the original one
where x is a power of 2
If it takes a level-wise search y scans of the DB, only
O(logxy) scans are necessary when the border
collapsing technique is employed
10
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Periodic Pattern
Full periodic pattern
ABC ABC ABC
Partial periodic pattern
ABC ADC ACC ABC
Pattern hierarchy
ABC ABC ABC DE DE DE DE ABC ABC
ABC DE DE DE DE ABC ABC ABC DE DE
DE DE
12
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Periodic Pattern
Recent Achievements
Partial Periodic Pattern
Asynchronous Periodic Pattern
Meta Pattern
InfoMiner/InfoMiner+/STAMP
13
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Clustering Sequential Data
CLUSEQ
ApproxMAP
14
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
References: Frequent-pattern
Mining Methods
R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection
algorithm for generation of frequent itemsets. In Journal of Parallel and
Distributed Computing (Special Issue on High Performance Data
Mining), 2000.
R. Agrawal, T. Imielinski, and A. Swami. Mining association rules
between sets of items in large databases. SIGMOD'93, 207-216,
Washington, D.C.
R. Agrawal and R. Srikant. Fast algorithms for mining association
rules. VLDB'94 487-499, Santiago, Chile.
J. Han, J. Pei, and Y. Yin: “Mining frequent patterns without candidate
generation”. In Proc. ACM-SIGMOD’2000, pp. 1-12, Dallas, TX, May
2000.
H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for
discovering association rules. KDD'94, 181-192, Seattle, WA, July
1994.
15
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
References: Frequent-pattern
Mining Methods
A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for
mining association rules in large databases. VLDB'95, 432-443,
Zurich, Switzerland.
C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques
for mining causal structures. VLDB'98, 594-605, New York, NY.
R. Srikant and R. Agrawal. Mining generalized association rules.
VLDB'95, 407-419, Zurich, Switzerland, Sept. 1995.
R. Srikant and R. Agrawal. Mining quantitative association rules in
large relational tables. SIGMOD'96, 1-12, Montreal, Canada.
H. Toivonen. Sampling large databases for association rules.
VLDB'96, 134-145, Bombay, India, Sept. 1996.
M.J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms
for fast discovery of association rules. KDD’97. August 1997.
16
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
References: Performance
Improvements
S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset
counting and implication rules for market basket analysis.
SIGMOD'97, Tucson, Arizona, May 1997.
D.W. Cheung, J. Han, V. Ng, and C.Y. Wong. Maintenance of
discovered association rules in large databases: An incremental
updating technique. ICDE'96, New Orleans, LA.
T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining
using two-dimensional optimized association rules: Scheme,
algorithms, and visualization. SIGMOD'96, Montreal, Canada.
E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for
association rules. SIGMOD'97, Tucson, Arizona.
J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm
for mining association rules. SIGMOD'95, San Jose, CA, May 1995.
17
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
References: Performance
Improvements
G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong
rules. In G. Piatetsky-Shapiro and W. J. Frawley, Knowledge
Discovery in Databases,. AAAI/MIT Press, 1991.
J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm
for mining association rules. SIGMOD'95, San Jose, CA.
S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule
mining with relational database systems: Alternatives and implications.
SIGMOD'98, Seattle, WA.
K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama.
Computing optimized rectilinear regions for association rules.
KDD'97, Newport Beach, CA, Aug. 1997.
M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm
for discovery of association rules. Data Mining and Knowledge
Discovery, 1:343-374, 1997.
18
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
References. Extensions of
Scopes
S. Brin, R. Motwani, and C. Silverstein. Beyond market basket:
Generalizing association rules to correlations. SIGMOD'97, 265-276,
Tucson, Arizona.
J. Han and Y. Fu. Discovery of multiple-level association rules from
large databases. VLDB'95, 420-431, Zurich, Switzerland.
M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I.
Verkamo. Finding interesting rules from large sets of discovered
association rules. CIKM'94, 401-408, Gaithersburg, Maryland.
F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new
paradigm for fast, quantifiable data mining. VLDB'98, 582-593, New
York, NY.
Wei Wang, Jiong Yang, Philip S. Yu: Efficient mining of weighted
association rules (WAR). KDD 2000: 270-274
Wei Wang, Jiong Yang, Richard R. Muntz: TAR: Temporal Association
Rules on Evolving Numerical Attributes. ICDE 2001: 283-292
19
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
References: Extensions of
Scopes
B. Lent, A. Swami, and J. Widom. Clustering association rules.
ICDE'97, 220-231, Birmingham, England.
R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining
association rules. VLDB'96, 122-133, Bombay, India.
R.J. Miller and Y. Yang. Association rules over interval data.
SIGMOD'97, 452-461, Tucson, Arizona.
A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong
negative associations in a large database of customer transactions.
ICDE'98, 494-502, Orlando, FL, Feb. 1998.
D. Tsur, J. D. Ullman, S. Abitboul, C. Clifton, R. Motwani, and S.
Nestorov. Query flocks: A generalization of association-rule mining.
SIGMOD'98, 1-12, Seattle, Washington.
J. Pei, A.K.H. Tung, J. Han. Fault-Tolerant Frequent Pattern Mining:
Problems and Challenges. SIGMOD DMKD’01, Santa Barbara, CA.
20
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
References: Mining Maxpatterns and Closed Itemsets
R. J. Bayardo. Efficiently mining long patterns from databases.
SIGMOD'98, 85-93, Seattle, Washington.
J. Pei, J. Han, and R. Mao, "CLOSET: An Efficient Algorithm for
Mining Frequent Closed Itemsets", Proc. 2000 ACM-SIGMOD Int.
Workshop on Data Mining and Knowledge Discovery (DMKD'00),
Dallas, TX, May 2000.
N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent
closed itemsets for association rules. ICDT'99, 398-416, Jerusalem,
Israel, Jan. 1999.
M. Zaki. Generating Non-Redundant Association Rules. KDD'00.
Boston, MA. Aug. 2000.
M. Zaki. CHARM: An Efficient Algorithm for Closed Association
Rule Mining, TR99-10, Department of Computer Science, Rensselaer
Polytechnic Institute.
M. Zaki, Fast Vertical Mining Using Diffsets, TR01-1, Department of
Computer Science, Rensselaer Polytechnic Institute.
21
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
References: Constraint-base
Frequent-pattern Mining
G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of
constrained correlated sets. ICDE'00, 512-521, San Diego, CA, Feb.
2000.
Y. Fu and J. Han. Meta-rule-guided mining of association rules in
relational databases. KDOOD'95, 39-46, Singapore, Dec. 1995.
J. Han, L. V. S. Lakshmanan, and R. T. Ng, "Constraint-Based,
Multidimensional Data Mining", COMPUTER (special issues on Data
Mining), 32(8): 46-50, 1999.
L. V. S. Lakshmanan, R. Ng, J. Han and A. Pang, "Optimization of
Constrained Frequent Set Queries with 2-Variable Constraints",
SIGMOD’99
22
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
References: Constraint-base
Frequent-pattern Mining
R. Ng, L.V.S. Lakshmanan, J. Han & A. Pang. “Exploratory mining
and pruning optimizations of constrained association rules.”
SIGMOD’98
J. Pei, J. Han, and L. V. S. Lakshmanan, "Mining Frequent Itemsets
with Convertible Constraints", Proc. 2001 Int. Conf. on Data
Engineering (ICDE'01), April 2001.
J. Pei and J. Han "Can We Push More Constraints into Frequent
Pattern Mining?", Proc. 2000 Int. Conf. on Knowledge Discovery and
Data Mining (KDD'00), Boston, MA, August 2000.
R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item
constraints. KDD'97, 67-73, Newport Beach, California.
23
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
References: Sequential
Pattern Mining Methods
R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, 3-14,
Taipei, Taiwan.
R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations
and performance improvements. EDBT’96.
J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.-C. Hsu,
"FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining",
Proc. 2000 Int. Conf. on Knowledge Discovery and Data Mining
(KDD'00), Boston, MA, August 2000.
H. Mannila, H Toivonen, and A. I. Verkamo. Discovery of frequent
episodes in event sequences. Data Mining and Knowledge Discovery,
1:259-289, 1997.
J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu,
"PrefixSpan: Mining Sequential Patterns Efficiently by PrefixProjected Pattern Growth", Proc. 2001 Int. Conf. on Data Engineering
(ICDE'01), Heidelberg, Germany, April 2001.
24
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
References: Sequential
Pattern Mining Methods
B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association
rules. ICDE'98, 412-421, Orlando, FL.
S. Ramaswamy, S. Mahajan, and A. Silberschatz. On the discovery of
interesting patterns in association rules. VLDB'98, 368-379, New
York, NY.
M.J. Zaki. Efficient enumeration of frequent sequences. CIKM’98.
Novermber 1998.
M.N. Garofalakis, R. Rastogi, K. Shim: SPIRIT: Sequential Pattern
Mining with Regular Expression Constraints. VLDB 1999: 223-234,
Edinburgh, Scotland.
Wei Wang, Jiong Yang, Philip S. Yu: Mining Patterns in Long
Sequential Data with Noise. SIGKDD Explorations 2(2): 28-33 (2000)
Jiong Yang, Wei Wang, Philip S. Yu, Jiawei Han: Mining Long
Sequential Patterns in a Noisy Environment. SIGMOD Conference
2002
25
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
References: Periodic Pattern
Mining Methods
Jiawei Han, Wan Gong, Yiwen Yin: Mining Segment-Wise Periodic
Patterns in Time-Related Databases. KDD 1998: 214-218
Jiawei Han, Guozhu Dong, Yiwen Yin: Efficient Mining of Partial
Periodic Patterns in Time Series Database. ICDE 1999: 106-115
Jiong Yang, Wei Wang, Philip S. Yu: Mining asynchronous periodic
patterns in time series data. KDD 2000: 275-279
Wei Wang, Jiong Yang, Philip S. Yu: Meta-patterns: Revealing Hidden
Periodic Patterns. ICDM 2001: 550-557
Jiong Yang, Wei Wang, Philip S. Yu: Infominer: mining surprising
periodic patterns. KDD 2001: 395-400
26
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
References: Mining Various
Databases
K. Koperski, J. Han, and G. B. Marchisio, "Mining Spatial and Image
Data through Progressive Refinement Methods", Revue internationale
de gomatique (European Journal of GIS and Spatial Analysis),
9(4):425-440, 1999.
A. K. H. Tung, H. Lu, J. Han, and L. Feng, "Breaking the Barrier of
Transactions: Mining Inter-Transaction Association Rules", Proc. 1999
Int. Conf. on Knowledge Discovery and Data Mining (KDD'99), San
Diego, CA, Aug. 1999, pp. 297-301.
27
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
References: Mining Various
Databases
H. Lu, L. Feng, and J. Han, "Beyond Intra-Transaction Association
Analysis:Mining Multi-Dimensional Inter-Transaction Association
Rules", ACM Transactions on Information Systems (TOIS’00), 18(4):
423-454, 2000.
O. R. Zaiane, M. Xin, J. Han, "Discovering Web Access Patterns and
Trends by Applying OLAP and Data Mining Technology on Web
Logs," Proc. Advances in Digital Librar ies Conf. (ADL'98), Santa
Barbara, CA, April 1998, pp. 19-29
O. R. Zaiane, J. Han, and H. Zhu, "Mining Recurrent Items in
Multimedia with Progressive Resolution Refinement", Proc. 2000 Int.
Conf. on Data Engineering (ICDE'00), San Diego, CA, Feb. 2000, pp.
461-470.
28
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
References: Frequent-pattern Mining
for Classification and Data Cube
K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and
iceberg cubes. SIGMOD'99, 359-370, Philadelphia, PA, June 1999.
M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D.
Ullman. Computing iceberg queries efficiently. VLDB'98, 299-310,
New York, NY, Aug. 1998.
J. Han, J. Pei, G. Dong, and K. Wang, “Computing Iceberg Data Cubes
with Complex Measures”, Proc. ACM-SIGMOD’2001, Santa Barbara,
CA, May 2001.
M. Kamber, J. Han, and J. Y. Chiang. Metarule-guided mining of
multi-dimensional association rules using data cubes. KDD'97, 207210, Newport Beach, California.
K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and
iceberg cubes. SIGMOD’99
T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades:
Generalizing association rules. Technical Report, Aug. 2000
29
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
THANK YOU!
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Download