Pattern Finding Techniques: A review

Aditi Gupta
Department of Computer Science and Engineering
Krishna Institute of Engineering And Technology, Ghaziabad
Guptaaditi91@gmail.com
Seema Maitrey
Department of Computer Science and Engineering
Krishna Institute of Engineering And Technology, Ghaziabad
Seema.maitrey@gmail.com
ABSTRACT
Pattern mining is a central step of the data mining and
knowledge discovery process. It consists of finding patterns
in data, where a pattern can be a subsequence, a
substructure, or a set of items. Finding the frequent
patterns is especially important. The first research in this
direction was done by Agrawal et al. (1993) [1] in the form
of association rules. In this paper, we discuss the
available methods and techniques for finding frequent
patterns, including maximal frequent patterns. We start
from the original work on association rules and then review
the many extensions that followed it.
INTRODUCTION
Researchers have proposed many algorithms and techniques
for finding patterns. Pattern finding is a crucial task in
data mining and is one of the steps of the knowledge
discovery in databases (KDD) process. Several algorithms
can find the frequent patterns in large datasets
efficiently, and a large body of research describes these
methods. Frequent patterns are patterns that occur
frequently in a database; they can be subsequences,
substructures, or sets of items. The first method for
finding frequent patterns was given by Agrawal et al. (1993)
[1] for market basket analysis. Frequent pattern mining is
used in many tasks, such as classification, correlation
analysis, and cluster analysis, and in various other areas
of data mining.
Agrawal et al. (1993) [1] introduced association rules for
finding patterns in the transactions of market basket data.
The analysis determines which items are frequently
purchased together, so that a shopkeeper can arrange the
shop accordingly, which can lead to an increase in sales.
TECHNIQUES FOR PATTERN MINING
Association rules were first given by Agrawal et al. (1993)
[1]. They rest on the concepts of support and confidence,
and all of the work is driven by a minimum support
threshold: a pattern is frequent only if its support meets
the minimum support threshold. They gave the basic Apriori
algorithm for finding the patterns, and various researchers
later proposed extensions of the algorithm.
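The two measures can be illustrated with a small sketch.
The basket data and the function names below are
hypothetical and are not taken from [1]; support is computed
as a fraction of transactions.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= frozenset(t))
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    both = frozenset(antecedent) | frozenset(consequent)
    base = support(antecedent, transactions)
    # A rule whose antecedent never occurs has no meaningful confidence;
    # returning 0.0 here is a choice made for this sketch.
    return support(both, transactions) / base if base else 0.0


# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

print(support({"bread", "milk"}, baskets))       # 0.5
print(confidence({"bread"}, {"milk"}, baskets))  # 2 of the 3 bread baskets hold milk
```

A rule such as bread -> milk would be reported only if both
values reach the thresholds chosen by the user.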
BASIC ALGORITHMS
Apriori and its extensions are the basic algorithms;
Apriori was first given by Agrawal et al. (1993) [1]. The
algorithm first scans the database to determine the
frequent 1-itemsets. It then repeatedly generates larger
candidate itemsets from the frequent itemsets of the
previous pass, checks the support of every candidate, and
removes the candidates whose support is below the minimum
support threshold. The process stops when no new frequent
itemset is found.
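A minimal sketch of this level-wise loop is given below. It
assumes support is an absolute count and that an itemset is
frequent when its count is at least min_count; the function
name and the toy data are illustrative, and none of the
optimizations of the published algorithm are included.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            # Prune step: every (k-1)-subset of a candidate must be frequent.
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Count step: one scan of the database per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent


data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori(data, min_count=3)
```

On this toy data the three single items and the three pairs
are frequent, while {a, b, c} is generated as a candidate
and then discarded because only two transactions contain it.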
Apriori relies on candidate generation to find the
patterns, which is very costly and time consuming for large
databases. The frequent pattern tree (FP-tree) was
therefore proposed by Han et al. (2000) [2]. This approach
achieves efficiency in three ways:
i. It uses a pattern fragment growth approach, which avoids
the cost of generating a large number of candidates.
ii. It uses a divide-and-conquer approach, which breaks the
mining task into smaller tasks.
iii. It converts the large database into a compressed form
(the FP-tree data structure), which avoids costly and
time-consuming repeated database scans.
Han et al. [2] report that FP-growth is about an order of
magnitude faster than Apriori. Constructing the FP-tree
needs exactly two scans of the transaction database DB: the
first collects the set of frequent items, and the second
constructs the FP-tree. The cost of inserting a transaction
Trans into the FP-tree is O(|Trans|), where |Trans| is the
number of frequent items in Trans. The FP-tree contains the
complete information needed for frequent pattern mining.
After the FP-tree is constructed, the patterns are found by
using the node-link property and the prefix path property,
which the FP-growth algorithm applies repeatedly.
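The two-scan construction and the use of node-links to
collect prefix paths can be sketched as follows. The class
and function names are assumptions made for this example,
ties between equally frequent items are broken
alphabetically, and the recursive mining step of FP-growth
is left out.

```python
from collections import Counter


class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_count):
    # Scan 1: count the items and keep the frequent ones.
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in item_counts.items() if c >= min_count}

    root = FPNode(None, None)
    header = {i: [] for i in frequent}  # item -> its nodes (node-links)

    # Scan 2: insert the frequent items of each transaction,
    # ordered by descending frequency, sharing common prefixes.
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header


def prefix_paths(item, header):
    """Prefix path and count for every node that carries `item`."""
    paths = []
    for node in header[item]:
        path = []
        parent = node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        paths.append((path[::-1], node.count))
    return paths


db = [["a", "b"], ["b", "c"], ["a", "b", "c"], ["b"]]
root, header = build_fp_tree(db, min_count=2)
print(prefix_paths("c", header))  # [(['b'], 1), (['b', 'a'], 1)]
```

Each insertion walks one path of at most |Trans| nodes,
which matches the O(|Trans|) insertion cost stated above.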
FINDING MAXIMAL FREQUENT ITEMSETS
Association rules are used for finding relations between
items in large databases. To find the itemsets, most
algorithms use either a level-wise bottom-up approach or a
combination of bottom-up and top-down search. All of them
build on the idea of the subset lattice, also called the
itemset lattice.
For finding the frequent closed itemsets, a new approach
called A-Close was given by Pasquier et al. (1999) [3].
They introduced the closed itemset lattice, which is
smaller than the itemset lattice and is closely related to
the concept lattice (Galois lattice). Once the frequent
closed itemsets are determined, the frequent itemsets can
be derived from them easily, which reduces the
computational cost. Using A-Close, we obtain a reduced set
of frequent itemsets and a valid set of association rules.
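The notion of a closed itemset can be made concrete with a
short sketch. It is not the A-Close algorithm: it only
computes the closure of an itemset as the intersection of
all transactions that contain it. The function names and
the data are illustrative.

```python
def closure(itemset, transactions):
    """Intersection of all transactions that contain `itemset`."""
    itemset = frozenset(itemset)
    covering = [frozenset(t) for t in transactions if itemset <= frozenset(t)]
    if not covering:
        # No transaction contains the itemset; leave it unchanged.
        return itemset
    return frozenset.intersection(*covering)


def is_closed(itemset, transactions):
    """An itemset is closed when it equals its own closure."""
    return closure(itemset, transactions) == frozenset(itemset)


db = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c", "d"}]

print(closure({"a"}, db) == {"a", "b"})  # True: "a" never occurs without "b"
print(is_closed({"a", "b"}, db))         # True
```

Because {a} and its closure {a, b} occur in exactly the
same transactions, they have the same support, so only the
closed itemset needs to be stored.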
Apriori finds the frequent itemsets with a bottom-up search
that enumerates every frequent itemset one by one. A
frequent pattern of length k has 2^k - 1 non-empty subsets,
all of which are frequent as well, so the cost grows
exponentially with the pattern length and the approach is
practical only for short patterns.
To solve this problem, the Max-Miner algorithm was proposed
by Bayardo (1998) [4]; it finds the maximal frequent
itemsets. Max-Miner handles long patterns well because it
uses “look-ahead” instead of a purely bottom-up search.
Max-Miner performs a breadth-first search of the
set-enumeration tree, which limits the number of passes
over the data. Like Apriori, it prunes based on subset
infrequency, and it additionally prunes based on superset
frequency.
Max-Miner uses item ordering policies to increase the
effectiveness of superset-frequency pruning. It uses the
same data structures as Apriori: a hash tree, which is also
used for finding the subsets of frequent itemsets, and, in
the second pass, a 2-D array for fast computation of the
support of itemsets.
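The definition of a maximal frequent itemset and the idea
behind superset-frequency look-ahead can be sketched as
follows. This is not Max-Miner itself; the function names,
the head/tail arguments, and the data are assumptions made
for the illustration.

```python
def maximal_itemsets(frequent):
    """Keep the frequent itemsets that have no frequent proper superset."""
    frequent = [frozenset(s) for s in frequent]
    return [s for s in frequent if not any(s < other for other in frequent)]


def lookahead_is_frequent(head, tail, transactions, min_count):
    """Superset-frequency test on the union of head and tail.

    If the union is frequent, every itemset made of the head plus some
    tail items is frequent too, so that subtree need not be expanded.
    """
    candidate = frozenset(head) | frozenset(tail)
    count = sum(1 for t in transactions if candidate <= frozenset(t))
    return count >= min_count


frequent = [{"a"}, {"b"}, {"c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(len(maximal_itemsets(frequent)))  # 3: only the pairs are maximal

db = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b", "c"}, {"a"}]
print(lookahead_is_frequent({"a"}, {"b", "c"}, db, min_count=3))  # True
```

In the second example a single support count on {a, b, c}
replaces the separate counts for {a, b} and {a, c}.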
PATTERNS FROM HIGH-DIMENSIONAL DATASETS
The growth of bioinformatics poses a great challenge for
pattern discovery algorithms that find frequent itemsets,
because their running time depends exponentially on the
average row length. Compared with transactional data,
microarray data has a small number of rows (samples) but a
large number of columns (genes). Many existing algorithms
have running times that grow exponentially with the average
row length, which makes them impractical for
high-dimensional data. Pan et al. (2003) [5] therefore
proposed the CARPENTER algorithm, which handles data with
many attributes and relatively few rows. CARPENTER
discovers the frequent closed patterns using depth-first
row-wise enumeration. Unlike other algorithms, it searches
by enumerating row sets rather than item sets.
By imposing a total order, such as lexicographic order, on
the row sets, a systematic search for closed patterns
becomes possible. The data is kept in a table, and the
table is transposed: each tuple of the transposed table
corresponds to an item and lists the rows of the original
table that contain it. While the table is transposed,
infrequent items are removed. This reduces the size of the
transposed table and helps it fit in main memory.
CARPENTER recursively generates conditional transposed
tables by using the subroutine MINEPATTERN. Pruning is then
applied to stop useless traversal of the row enumeration
tree.
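The transposed table and the idea of searching over row
sets can be sketched as follows. This is a brute-force
enumeration without the conditional tables or the pruning
of CARPENTER; the function names and the gene labels are
hypothetical.

```python
from itertools import combinations


def transpose(rows):
    """Map each item to the set of row ids that contain it."""
    table = {}
    for rid, items in rows.items():
        for item in items:
            table.setdefault(item, set()).add(rid)
    return table


def closed_patterns_by_rows(rows, min_count):
    """Intersect the items of every row set and keep the frequent results."""
    tt = transpose(rows)
    found = {}
    row_ids = sorted(rows)
    for size in range(1, len(row_ids) + 1):
        for row_set in combinations(row_ids, size):
            # Items shared by all rows of the row set.
            pattern = frozenset.intersection(
                *(frozenset(rows[r]) for r in row_set))
            if not pattern:
                continue
            # Rows that contain every item of the pattern.
            support_rows = set.intersection(*(tt[i] for i in pattern))
            if len(support_rows) >= min_count:
                found[pattern] = len(support_rows)
    return found


samples = {1: {"g1", "g2", "g3"}, 2: {"g1", "g2"}, 3: {"g2", "g3"}}
patterns = closed_patterns_by_rows(samples, min_count=2)
```

The loop runs over subsets of rows, so its cost depends on
the number of rows and not on the number of items, which is
the reason row enumeration suits data with few samples and
many genes.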
Next came a row enumeration method with top-down search,
which takes full advantage of the pruning power of the
minimum support threshold to reduce the search space. Using
this strategy, the TD-Close algorithm was given by Liu et
al. (2006) [6] to find the frequent closed patterns. It
also uses a closeness-checking method that avoids multiple
scans of the data. Row enumeration is suited to
high-dimensional data, where it is faster than column
enumeration. Bottom-up search visits row combinations from
smallest to largest, so it cannot make full use of the
minimum support threshold; this is why the top-down
strategy is combined with row enumeration. TD-Close also
takes less memory than bottom-up methods, for which memory
is costly.
The TD-Close algorithm starts with the table and builds its
transposed table. It then initializes the set of frequent
closed patterns and sets ExcludedSize to 0. The
TopDownMine method is used to find the frequent closed
patterns; it is called recursively and applies different
pruning strategies under different conditions.
FINDING STRUCTURAL PATTERNS
Many databases carry structural information. Finding the
interesting substructures is an important task because they
support a better interpretation of the data. A structure
can be represented as a labeled graph. Different
researchers have proposed many systems for this task, and
clustering systems have also been extended to structured
objects and data.
The SUBDUE system for finding substructures was given by
Holder et al. (1994) [8]. It uses the minimum description
length (MDL) principle, which is also used in other fields
of computer science such as image processing and decision
tree induction. SUBDUE takes a labeled graph as input. The
algorithm that finds the substructures also accounts for
costs such as a distortion cost, which may vary up to a
threshold value.
The SUBDUE discovery algorithm starts with substructures
that consist of a single vertex and iteratively extends
them to discover larger substructures. The cost is checked
at each step, and the MDL principle is used to select the
best substructure. In addition to the MDL principle, SUBDUE
can use background knowledge, expressed as rules, to find
the substructures; compactness and coverage are examples of
such rules. After a substructure is found, each of its
instances is replaced by a single vertex, which compresses
the data. The compressed data still carries the appropriate
information and forms a hierarchy of concepts, which
improves the performance of the SUBDUE system.
Finding patterns in chemical compounds is a challenging
task for medical science. Predicting the carcinogenicity of
a chemical compound is complex, and the research is
significant for data mining because cancer is a leading
cause of death. The toxicity of chemical compounds can be
studied by finding patterns with the association rules of
data mining. Dehaspe et al. (1998) [9] contributed to this
problem. They used the DATALOG formalism and the WARMR
algorithm, which performs a level-wise search.
FINDING PATTERNS BASED ON SOME CONSTRAINTS
Although frequent patterns can be found using association
rules, users often want only specific patterns that satisfy
some constraints. Mining under such constraints is known as
constraint-based mining. Plain association rule mining
suffers from problems such as a limited notion of
relationship, lack of focus, and lack of user control.
An architecture that supports human-centered exploratory
mining was given by Ng et al. (1998) [10]. It provides a
set of constraint constructs, including domain and class
constraints, and introduces constrained association
queries. The authors categorized the constraints according
to two properties, anti-monotonicity and succinctness, and
developed the mining algorithm CAP, which maximizes the
degree of pruning for every category. The architecture is a
two-phase exploratory mining architecture and is downward
compatible. Frequent sets can also be found with the
algorithms APRIORI+ and HYBRID(m), but these algorithms do
not take advantage of anti-monotonicity and succinctness,
which is why the algorithm CAP was given.
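The pruning power of an anti-monotone constraint can be
illustrated with a short sketch. It is not the CAP
algorithm; the prices, the budget, and the function names
are hypothetical. The constraint "total price at most the
budget" is anti-monotone because adding an item can only
increase the total.

```python
prices = {"pen": 2, "book": 12, "bag": 30}  # hypothetical item prices


def within_budget(itemset, budget=40):
    """Anti-monotone: once violated, every superset is violated too."""
    return sum(prices[i] for i in itemset) <= budget


def satisfying_itemsets(items, constraint):
    """Level-wise search that never extends a violating itemset."""
    level = [frozenset([i]) for i in sorted(items)
             if constraint(frozenset([i]))]
    result = list(level)
    while level:
        next_level = set()
        for s in level:
            for i in items:
                if i not in s:
                    candidate = s | {i}
                    if candidate not in next_level and constraint(candidate):
                        next_level.add(candidate)
        level = list(next_level)
        result.extend(level)
    return result


valid = satisfying_itemsets(prices, within_budget)
print(len(valid))  # 5
```

The pair {book, bag} costs 42 and is rejected, so the triple
that contains it is never generated.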
Frequent itemsets are used for classification, so
association rules are useful for finding patterns for
classification purposes. In this context, Dong and Li
(1999) [11] introduced emerging patterns (EPs). EPs are
itemsets whose support changes significantly from one
dataset to another, and they are useful for building
classifiers.
The Apriori algorithm is not suitable for finding EPs,
because it is too costly for large or high-dimensional
databases. To solve this problem, Dong and Li (1999) [11]
described large collections of itemsets concisely by their
borders. Their EP mining algorithm works on these borders:
the borders are obtained with the Max-Miner algorithm and
are then processed by a border-differential algorithm. EPs
are useful for finding patterns in business and demographic
data, and they give significant results when applied to
data with classes. An EP is defined by its growth rate,
which is the ratio of the support of the itemset in one
dataset, D2, to its support in the other dataset, D1.
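The growth rate can be computed directly from the two
supports. The sketch below follows the usual convention
that the growth rate is 0 when both supports are 0 and
infinite when only the support in D1 is 0; the function
names, the threshold rho, and the data are illustrative.

```python
def support(itemset, dataset):
    """Fraction of transactions in `dataset` that contain `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in dataset if itemset <= frozenset(t)) / len(dataset)


def growth_rate(itemset, d1, d2):
    """Support in d2 divided by support in d1."""
    s1, s2 = support(itemset, d1), support(itemset, d2)
    if s1 == 0:
        return 0.0 if s2 == 0 else float("inf")
    return s2 / s1


def is_emerging(itemset, d1, d2, rho):
    """An itemset is an EP from d1 to d2 if its growth rate is at least rho."""
    return growth_rate(itemset, d1, d2) >= rho


d1 = [{"x", "y"}, {"y"}, {"z"}, {"y", "z"}]
d2 = [{"x", "y"}, {"x"}, {"x", "z"}, {"y"}]

print(growth_rate({"x"}, d1, d2))  # 3.0
```

Here {x} occurs in one of four transactions of D1 and in
three of four transactions of D2, so it is an EP for any
threshold up to 3.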
A CART-based method [12] for finding emerging patterns has
been applied to data such as medical data. It uses decision
trees to find the EPs and Fisher’s exact test to check
them. Maximum-likelihood linear discriminant analysis is
also used.
SEQUENTIAL PATTERN MINING
Sequential pattern mining was introduced by Agrawal and
Srikant (1995) [13] and is an important challenge in data
mining. It is the mining of ordered events, without a
concrete notion of time, as patterns, for example in
customer shopping sequences. It operates on a sequence
database, where a sequence is a list of transactions
ordered by time and each transaction contains a set of
items. The problem is to find the sequential patterns that
meet a minimum support threshold. Such patterns can be
used, for example, in medical science for finding diseases.
The approach of Agrawal and Srikant (1995) had some
limitations:
i. Users want to specify maximum and minimum time gaps
between events, but this is not possible.
ii. The items of a pattern element must come from the same
transaction.
iii. Item hierarchies (taxonomies) are not supported, so
patterns cannot include items from all levels of a
hierarchy.
These problems were addressed by Srikant and Agrawal (1996)
[14] with time constraints, sliding time windows, and
hierarchies (taxonomies). For this purpose they gave the
GSP (Generalized Sequential Patterns) algorithm. The
earlier work [13] gave three algorithms; two of them find
only maximal sequential patterns, and the third,
AprioriAll, is more useful than the other two. AprioriAll
is computationally expensive. It can handle time
constraints and taxonomies, but not sliding time windows.
GSP extends AprioriAll, handles all three, and scales
linearly with the number of data sequences. In GSP,
candidate sequences are generated much as in the Apriori
algorithm, subject to the sliding window, time constraints,
and taxonomies. It also uses a hash tree to reduce the
number of candidates checked during counting.
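The basic containment test behind support counting in a
sequence database can be sketched as follows. It covers
only plain subsequence matching; the time constraints,
sliding windows, and taxonomies of GSP are not implemented,
and the function names and the data are illustrative.

```python
def contains(sequence, pattern):
    """True if `pattern` is a subsequence of `sequence`.

    Both are lists of itemsets. Each pattern element must be a subset of
    a distinct transaction, and the matched transactions keep their order.
    """
    pos = 0
    for element in pattern:
        element = frozenset(element)
        # Greedily take the earliest transaction that holds the element.
        while pos < len(sequence) and not element <= frozenset(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def sequence_support(pattern, database):
    """Number of data sequences that contain the pattern."""
    return sum(1 for seq in database if contains(seq, pattern))


customers = [
    [{"a"}, {"b", "c"}, {"d"}],
    [{"a", "b"}, {"c"}],
    [{"b"}, {"a"}, {"c", "d"}],
]

print(sequence_support([{"a"}, {"c"}], customers))  # 3
print(sequence_support([{"b"}, {"d"}], customers))  # 2
```

A pattern is a sequential pattern when this count reaches
the minimum support threshold.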
Pei et al. (2001) [16] gave the PrefixSpan method for
finding sequential patterns.
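The prefix-projection idea named in the title of [16] can
be sketched for the simplified case in which every element
of a sequence is a single item. The function name and the
data are illustrative, and the optimizations of the
published method are not included.

```python
def prefixspan(sequences, min_count):
    """Frequent sequential patterns over sequences of single items."""
    results = []

    def mine(prefix, projected):
        # Count each item once per projected sequence.
        counts = {}
        for seq in projected:
            for item in set(seq):
                counts[item] = counts.get(item, 0) + 1
        for item in sorted(counts):
            if counts[item] < min_count:
                continue
            new_prefix = prefix + [item]
            results.append((new_prefix, counts[item]))
            # Projection: keep the suffix after the first occurrence of item.
            new_projected = [seq[seq.index(item) + 1:]
                             for seq in projected if item in seq]
            mine(new_prefix, new_projected)

    mine([], [list(s) for s in sequences])
    return results


db = [["a", "b", "c"], ["a", "c"], ["b", "c"]]
found = prefixspan(db, min_count=2)
```

Each recursive call works only on the suffixes that follow
the current prefix, so no candidate sequences are generated
and tested against the full database.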
CONCLUSION
In this paper, we discussed the work of various researchers
in the field of data mining on finding patterns. We may not
have covered all available methods and techniques, because
a great deal of research has been done since the first work
on the topic. We hope this paper helps the reader
understand pattern mining and its techniques. Much work
remains to be done in the field of data mining.
REFERENCES
[1] Agrawal R, Imielinski T, Swami A (1993) Mining
association rules between sets of items in large
databases. In: Proceedings of the 1993 ACM-SIGMOD
international conference on management of data
(SIGMOD’93), Washington, DC, pp 207–216
[2] Han J, Pei J, Yin Y (2000) Mining frequent patterns
without candidate generation. In: Proceeding of the
2000 ACM-SIGMOD international conference on
management of data (SIGMOD’00), Dallas, TX, pp 1–12
[3] Pasquier N, Bastide Y, Taouil R, Lakhal L (1999)
Discovering frequent closed itemsets for association
rules. In: Proceeding of the 7th international
conference on database theory (ICDT’99), Jerusalem,
Israel, pp 398–416
[4] Bayardo RJ (1998) Efficiently mining long patterns
from databases. In: Proceeding of the 1998 ACM-SIGMOD
international conference on management of data
(SIGMOD’98), Seattle, WA, pp 85–93
[5] Pan F, Cong G, Tung AKH, Yang J, Zaki M (2003)
CARPENTER: finding closed patterns in long
biological datasets. In: Proceeding of the 2003
ACM-SIGKDD international conference on knowledge
discovery and data mining (KDD’03), Washington,
DC, pp 637–642
[6] Liu H, Han J, Xin D, Shao Z (2006) Mining frequent
patterns on very high dimensional data: a top-down
row enumeration approach. In: Proceeding of the 2006
SIAM international conference on data mining
(SDM’06), Bethesda, MD, pp 280–291
[7] Pei J, Han J, Mao R (2000) CLOSET: an efficient
algorithm for mining frequent closed itemsets. In:
Proceeding of the 2000 ACM-SIGMOD international
workshop data mining and knowledge discovery
(DMKD’00), Dallas, TX, pp 11–20
[8] Holder LB, Cook DJ, Djoko S (1994) Substructure
discovery in the subdue system. In: Proceeding of the
AAAI’94 workshop knowledge discovery in databases
(KDD’94), Seattle, WA, pp 169–180
[9] Dehaspe L, Toivonen H, King R (1998) Finding
frequent substructures in chemical compounds. In:
Proceeding of the 1998 international conference on
knowledge discovery and data mining (KDD’98),
New York, NY, pp 30–36
[10] Ng R, Lakshmanan LVS, Han J, Pang A (1998)
Exploratory mining and pruning optimizations of
constrained associations rules. In: Proceeding of the
1998 ACM-SIGMOD international conference on
management of data (SIGMOD’98), Seattle, WA, pp
13–24
[11] Dong G, Li J (1999) Efficient mining of emerging
patterns: discovering trends and differences. In:
Proceeding of the 1999 international conference on
knowledge discovery and data mining (KDD’99), San
Diego, CA, pp 43–52
[12] Anne-Laure Boulesteix, Gerhard Tutz, and Korbinian
Strimmer (2003) A CART-based approach to discover
emerging patterns in microarray data.
[13] Agrawal R, Srikant R (1995) Mining sequential
patterns. In: Proceedings of the 1995 international
conference on data engineering (ICDE’95), Taipei,
Taiwan, pp 3–14
[14] Srikant R, Agrawal R (1996) Mining sequential
patterns: generalizations and performance
improvements. In: Proceeding of the 5th international
conference on extending database technology
(EDBT’96), Avignon, France, pp 3–17
[15] Jiawei Han, Hong Cheng, Dong Xin, Xifeng Yan
Frequent pattern mining: current status and future
directions.
[16] Pei J, Han J, Mortazavi-Asl B, Pinto H, Chen Q,
Dayal U, Hsu M-C (2001) PrefixSpan: mining
sequential patterns efficiently by prefix-projected
pattern growth. In: Proceeding of the 2001
international conference on data engineering
(ICDE’01), Heidelberg, Germany, pp 215–224