Optimized Association Rules with Effective Algorithm for Large Database
Priyanka Mandrai1, Raju Barskar2
1 Student, CSE Department, U.I.T. RGPV, Bhopal, Madhya Pradesh, India
priyanka.mandrai1988@gmail.com
2 Assistant Professor, CSE Department, U.I.T. RGPV, Bhopal, Madhya Pradesh, India
raju_barskar53@rediffmail.com
Abstract
Association rule mining is an efficient technique for finding correlations among data sets, and these correlations make the extraction process meaningful. A variety of algorithms are used for mining rules, such as the Apriori algorithm and tree-based algorithms. Some of these algorithms perform well but generate redundant association rules and also suffer from the multi-scan problem. In this paper we propose K-Apriori-GA, an association rule mining algorithm based on the genetic algorithm and the K-map formulation. In this method a K-map binary table is used to partition the data table into 0s and 1s, and this partitioning reduces the scanning time of the database. The proposed algorithm is a combination of K-Apriori and the genetic algorithm. The support weight key is a vector value derived from the transaction data set. Rule optimization is carried out with a genetic algorithm. For evaluation we use a real-world dataset from The National Rural Employment Guarantee Act (NREGA), Department of Rural Development, Government of India, and we compare the standard Apriori algorithm and the K-Apriori algorithm with our proposed algorithm.
Keywords: Association rule mining, Redundant rules, Multi-pass, K-map, Genetic algorithm, NREGA
1. Introduction
Association rule mining is a technique for detecting unknown facts in large datasets and for describing inferences about how subsets of items influence the presence of other subsets. Association rule mining aims to discover strong relations between attributes. Generating all frequent patterns is not very efficient, because a portion of the frequent patterns is redundant in association rule mining, so rule generation takes more time. This redundant-rule drawback can be overcome with the help of a genetic algorithm. Most data mining approaches use a greedy algorithm instead of a genetic algorithm; the genetic algorithm is somewhat better than the greedy algorithm because it performs a global search and copes better with attribute interaction. A genetic algorithm simulates the evolution of a population: it is a biologically inspired technique that uses the chromosome as the element on which solutions (individuals) are manipulated. Another drawback is multi-passing, which can be overcome with the help of the K-map formulation. The K-map binary table partitions the data table into 0s and 1s, and this partitioning reduces the scanning time of the database. The proposed algorithm is a combination of K-Apriori and the genetic algorithm. In this method the support weight key is a vector value derived from the transaction data set. In this paper we compare the Apriori and K-Apriori algorithms with our proposed algorithm. Association rule mining is applied to market basket data, weather prediction, multimedia data, etc.
The rest of the paper is organized as follows. Section 2 discusses association rule mining. Section 3 discusses association rule mining algorithms. Section 4 describes the literature survey. Section 5 discusses previous algorithms. Section 6 discusses the proposed algorithm. Finally, we explain the experimental results in Section 7 and conclude in Section 8.
2. Association Rule Mining
2.1 Problem Definition
The problem of discovering association rules is expressed as follows: given a database of sales transactions [1], it is useful to find the important associations between items, such that the presence of some items in a transaction implies the presence of other items in the same transaction. An example of an association rule is: “37% of transactions that contain bread also contain milk; 7% of all transactions contain both items”. Here, 37% is called the confidence of the rule, and 7% the support of the rule. The problem is to determine all association rules that satisfy user-specified minimum support and minimum confidence constraints. The problem of mining association rules was first introduced in [2], and the following formal definition was proposed in [3] to address it.
2.2 Definitions: Association Rule Mining
Let 𝐼 = {𝑖1, 𝑖2, …, π‘–π‘š} be a set of items. Let 𝐷 be a set of transactions, where each transaction 𝑇 contains a set of items in 𝐼, such that 𝑇 ⊆ 𝐼. Every transaction is associated with a unique identifier, called its 𝑇𝐼𝐷. Let 𝑋 be a set of items. A transaction 𝑇 is said to contain 𝑋 if and only if 𝑋 ⊆ 𝑇. An association rule is an expression 𝑋 ⇒ π‘Œ, where 𝑋 and π‘Œ are non-empty itemsets (i.e. 𝑋 ⊆ 𝐼, π‘Œ ⊆ 𝐼) with 𝑋 ∩ π‘Œ = ∅; 𝑋 is termed the antecedent and π‘Œ the consequent. The rule 𝑋 ⇒ π‘Œ holds in the transaction set 𝐷 with support 𝑠, where 𝑠% of the transactions in 𝐷 contain 𝑋 ∪ π‘Œ. The rule 𝑋 ⇒ π‘Œ has confidence 𝑐 in the transaction set 𝐷, where 𝑐% of the transactions in 𝐷 that contain 𝑋 also contain π‘Œ.
Support: The rule 𝑋 ⇒ π‘Œ has support 𝑠 in the transaction set 𝐷 if 𝑠% of the transactions in 𝐷 contain 𝑋 ∪ π‘Œ. The user-specified value that 𝑠 must meet or exceed is termed the minimum support threshold (π‘šπ‘–π‘›_𝑠𝑒𝑝).
π‘†π‘’π‘π‘π‘œπ‘Ÿπ‘‘(𝑋 ⇒ π‘Œ) = π‘†π‘’π‘π‘π‘œπ‘Ÿπ‘‘ (𝑋 ∪ π‘Œ)
= 𝑃(𝑋 ∪ π‘Œ)
Confidence: The rule 𝑋 ⇒ π‘Œ has confidence 𝑐 in the transaction set 𝐷 if 𝑐% of the transactions in 𝐷 that contain 𝑋 also contain π‘Œ. The user-specified value that 𝑐 must meet or exceed is termed the minimum confidence threshold (π‘šπ‘–π‘›_π‘π‘œπ‘›π‘“).
πΆπ‘œπ‘›π‘“π‘–π‘‘π‘’π‘›π‘π‘’(𝑋 ⇒ π‘Œ) = π‘†π‘’π‘π‘π‘œπ‘Ÿπ‘‘(𝑋 ∪ π‘Œ) / π‘†π‘’π‘π‘π‘œπ‘Ÿπ‘‘(𝑋) = 𝑃(π‘Œ | 𝑋)
Typically, large confidence values and a smaller support are used. Rules that satisfy both the minimum support and the minimum confidence are known as strong rules. Since the database is large and users care only about the frequently purchased items, thresholds on support and confidence are usually predefined by users to drop the rules that are not remarkable or useful.
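To make the support and confidence definitions above concrete, here is a minimal Python sketch (illustrative only; the toy transactions and helper names are ours, not part of the paper) that computes the support and confidence of a candidate rule over a small transaction list.

```python
# A toy transaction database: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support(X ∪ Y) / Support(X) for the rule X => Y."""
    x, y = set(antecedent), set(consequent)
    return support(x | y, transactions) / support(x, transactions)

# Rule {bread} => {milk}: support 0.5, confidence 2/3.
print(support({"bread", "milk"}, transactions))
print(confidence({"bread"}, {"milk"}, transactions))
```

A rule would then be kept only if these two values meet the user-specified minimum support and minimum confidence thresholds.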
The problem of discovering all association rules can be divided into two sub-problems [3]:
(1) Find all sets of items (itemsets) whose transaction support is higher than the minimum support. These are the frequent itemsets; all other itemsets are referred to as infrequent itemsets.
(2) Use the frequent itemsets to generate the desired rules.
There is broad agreement in the literature that the first sub-problem is the more important of the two. This is because it is more time consuming, owing to the enormous search space, whereas the rule generation step can be done in main memory in a very simple manner once the frequent itemsets are found. That is the reason for the great attention researchers have paid to this problem in recent years.
3. Association Rule Mining Algorithms
3.1 AIS Algorithm
The AIS (Agrawal, Imielinski, Swami) algorithm was the first algorithm for mining association rules, proposed by Agrawal et al. in 1993 [2]. It focuses on enhancing databases with the functionality needed to process decision support queries. In this algorithm only single-consequent association rules are generated, which means that the consequent of each rule contains only one item; for example, it generates rules like 𝑋 ∩ π‘Œ ⇒ 𝑍 but not rules like 𝑋 ⇒ π‘Œ ∩ 𝑍. In AIS the database is scanned again and again to obtain the frequent itemsets.
To make this algorithm more efficient, an estimation technique was proposed to prune those candidate itemsets that have no hope of being large, so that the unnecessary effort of counting them can be avoided. Since all the candidate itemsets and frequent itemsets are assumed to be stored in main memory, memory management was also proposed for AIS for the case when memory is not enough. The main disadvantage of the AIS algorithm is that too many candidate itemsets that finally turn out to be small are generated, which requires more space and wastes effort that turns out to be useless. At the same time, this algorithm requires too many passes over the whole database.
3.2 Apriori Algorithm
Apriori is a great improvement in the history of association rule mining; the Apriori algorithm was first introduced by Agrawal and Srikant in 1994 [3]. AIS presented a straightforward approach that requires several passes over the database, generating a large number of candidate itemsets and storing counters for every candidate while the majority of them turn out not to be frequent. Apriori is more efficient during the candidate generation process for two reasons: it employs a different candidate generation technique and a new pruning technique.
The Apriori algorithm still inherits the problem of scanning the entire database several times. Based on the Apriori algorithm, various new algorithms were proposed with some modifications or improvements. Typically there were two approaches: one is to reduce the number of passes over the entire database, or to replace the complete database with only part of it based on the current frequent itemsets; the other approach is to explore different kinds of pruning techniques that make the number of candidate itemsets much smaller. Apriori-TID and Apriori-Hybrid [3], DHP [4], and SON [5] are modifications of the Apriori algorithm.
Most of the algorithms introduced above are based on the Apriori algorithm and try to improve efficiency by making some modifications, such as reducing the number of passes over the database, reducing the portion of the database to be scanned in each pass, pruning the candidates by different techniques, and using sampling techniques. But there are two drawbacks of the Apriori algorithm. One is the complex candidate generation process, which uses most of the time, space and memory. The other drawback is the multiple scans of the database.
3.3 Partition Algorithm
This approach was exploited in the Partition algorithm, introduced by Savasere et al. in 1995 [5]. The algorithm decreases the database activity by computing all frequent itemsets in two passes over the database. The algorithm also works in a level-wise manner, but the idea is to partition the database into sections small enough to be handled in main memory. That is, a partition is read once from the disk, and level-wise generation and counting of candidates for that partition are performed in main memory with no further database activity. The main accomplishment of Partition is the reduction of database activity. It was shown that this reduction was not obtained at the expense of more CPU utilization, which is another success.
3.4 Pincer-Search Algorithm
The Apriori algorithm has to go through several iterations and, as a result, its performance degrades. One way to overcome this is to integrate a bi-directional search, which takes advantage of both the bottom-up and the top-down approach. The Pincer-Search algorithm, proposed by Lin et al. in 1997 [6], is based on this principle. It attempts to find the frequent itemsets in a bottom-up manner, but at the same time it maintains a list of maximal frequent itemsets. In most cases the algorithm not only decreases the number of passes over the database but can also decrease the number of candidates (for which support is counted). In such cases, both I/O time and CPU time are reduced by eliminating the candidates that are subsets of maximal frequent itemsets found in the maximum frequent candidate set (MFCS). Pincer-Search shows an improvement over the Apriori algorithm when the largest frequent itemset is long.
3.5 Dynamic Itemset Counting Algorithm
Another algorithm, called DIC (Dynamic Itemset Counting), was introduced by Brin et al. in 1997 [7]. It tries to reduce the database activity by counting candidate itemsets earlier than Apriori does. In Apriori, candidate (π‘˜+1)-itemsets are not counted until the (π‘˜+1)th pass over the database. In DIC, on the other hand, a candidate (π‘˜+1)-itemset is counted as soon as the algorithm discovers that all its subsets of size π‘˜ have exceeded the support threshold and will be frequent. This is accomplished by stopping at several points in the database to consider adding further itemsets to the counting process. It has been established that this technique, with practical settings of the number of transactions passed before stopping for recalculation, can decrease the number of database passes considerably while keeping the number of candidate sets that need to be counted rather low compared to other proposed techniques.
The algorithm, although generally more efficient than Apriori, was found to be sensitive to data characteristics. A remedy was recommended for this difficulty, with some success. It was also shown that by reordering the items according to certain criteria, the performance of the algorithm increases dramatically. However, this item reordering strategy incurs a lot of overhead.
3.6 FP-Tree (Frequent Pattern Tree) Algorithm
To overcome the two drawbacks of the Apriori algorithm, association rule mining using a tree structure has been considered. FP-Tree [8], for frequent pattern mining, is another milestone in the development of association rule mining, and it removes the two drawbacks of Apriori. The frequent itemsets are generated with only two passes over the database and without any candidate generation step. FP-Tree was proposed by Han et al. in 2000 [8]. By avoiding the candidate generation step and using fewer passes over the database, FP-Tree is an order of magnitude faster than the Apriori algorithm. The frequent pattern generation process consists of two sub-processes: constructing the FP-Tree, and generating the frequent patterns from the FP-Tree.
Each algorithm has its limitations; FP-Tree is not easy to use in an interactive mining system. During the interactive mining process, users may change the support threshold according to the rules obtained, but for FP-Tree a change of support may lead to a repetition of the entire mining process. Another limitation is that FP-Tree is not suitable for incremental mining. As time goes on, databases keep changing and new records may be inserted into the database; such insertions may also lead to a repetition of the whole process if we employ the FP-Tree algorithm.
3.7 Rapid Association Rule Mining (RARM) Algorithm
RARM, introduced by Das et al. in 2001 [9], is another association rule mining method that uses a tree structure to represent the original database and avoids the candidate generation step. RARM is claimed to be much faster than the FP-Tree algorithm, according to the experimental results reported in the original paper [9]. By using the SOTrieIT structure, RARM can generate large 1-itemsets and 2-itemsets rapidly, without scanning the database a second time and without candidate generation. Similar to the FP-Tree, each node of the SOTrieIT contains one item and the corresponding support count.
Since generating the large 2-itemsets is the most costly step of the mining process, experiments in [9] showed that the efficient generation of large 1-itemsets and 2-itemsets in the SOTrieIT structure improves performance dramatically. SOTrieIT is much faster than FP-Tree, but it also faces the same difficulties as the FP-Tree.
3.8 Generalized Association Rule Mining
Generalized association rules [10] use the existence of a hierarchical classification (concept hierarchy) of the data to produce association rules at different levels of the classification [Srikant1995]. Association rules, and the support-confidence framework used to mine them, are well suited to the market-basket problem. When association rules are generated, they can be generated at any of the hierarchical levels present. As would be expected, when rules are generated for items at a higher level in the classification, both the support and the confidence increase. A generalized association rule 𝑋 ⇒ π‘Œ is defined identically to a regular association rule, except that no item in π‘Œ may be an ancestor of any item in 𝑋 [Srikant1995]. A supermarket may want to find associations relating to soft drinks in general, or may want to identify those for an exact brand or type of soft drink (such as a cola). Generalized association rules allow this to be accomplished and also ensure that all association rules (even those that cross levels of the taxonomy) are found.
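As an illustration of the ancestor constraint above, the following minimal Python sketch (the toy taxonomy and function names are hypothetical, introduced here only for illustration) checks whether a candidate generalized rule X ⇒ Y is valid, i.e. that no item in Y is an ancestor of any item in X.

```python
# A toy concept hierarchy: child -> parent (None marks a top-level category).
parent = {
    "cola": "soft drink",
    "lemon soda": "soft drink",
    "soft drink": "beverage",
    "beverage": None,
    "chips": "snack",
    "snack": None,
}

def ancestors(item):
    """Return the set of ancestors of `item` in the concept hierarchy."""
    result = set()
    p = parent.get(item)
    while p is not None:
        result.add(p)
        p = parent.get(p)
    return result

def is_valid_generalized_rule(antecedent, consequent):
    """Valid if no item in the consequent is an ancestor of an antecedent item."""
    forbidden = set()
    for item in antecedent:
        forbidden |= ancestors(item)
    return not (set(consequent) & forbidden)

print(is_valid_generalized_rule({"cola"}, {"chips"}))       # True
print(is_valid_generalized_rule({"cola"}, {"soft drink"}))  # False: ancestor of "cola"
```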
4. Literature Survey of the Association Rule Mining Problem
The computational cost of association rule mining can be reduced in three ways:
• by reducing the number of passes over the database;
• by removing redundant association rules;
• by removing negative association rules.
4.1 Reducing the Number of Passes
Two algorithms, Apriori and AprioriTid, which discover all significant association rules between items in a large database of transactions, were introduced by Agrawal et al. [3]. The best features of the two proposed algorithms can be combined into a hybrid algorithm, called AprioriHybrid. Scale-up experiments established that AprioriHybrid scales linearly with the number of transactions. In addition, the execution time decreases slightly as the number of items in the database increases. As the average transaction size increases (while keeping the database size constant), the execution time increases only gradually. AIS and SETM have always been outperformed by the Apriori and AprioriTid algorithms, and the performance gap increased considerably with the problem size, ranging from a factor of three for small problems to more than an order of magnitude for large ones. J.S. Park et al. [4] proposed the Direct Hashing and Pruning (DHP) algorithm for efficient large-itemset generation. The proposed algorithm has two major features: one is the efficient generation of large itemsets and the other is an effective reduction of the transaction database size. Using hash techniques, DHP is very efficient for the generation of the candidate large 2-itemsets. Hidber [11] presented a novel algorithm named the Continuous Association Rule Mining Algorithm (CARMA), which computes large itemsets online. The algorithm continuously produces large itemsets along with a shrinking support interval for each itemset. CARMA's itemset lattice quickly approximates a superset of all large itemsets, while the sizes of the corresponding support intervals shrink rapidly. The memory efficiency of CARMA was an order of magnitude greater than that of Apriori, and both Apriori and DIC (Dynamic Itemset Counting) [7] fell behind CARMA at low support thresholds; besides, CARMA has been found to be sixty times more memory efficient. Wang et al. [12] proposed a new class of interesting problems called weighted association rules (WAR), which mines WARs by first ignoring the weights and finding the frequent itemsets, followed by introducing the weights during rule generation; this approach shortens the average execution time. Bodon [13] analyzed, theoretically and experimentally, Apriori [3], the most established algorithm for frequent itemset mining. Implementations of the Apriori algorithm display large differences in running time and memory need. Bodon modified Apriori and named the result Apriori_Brave, which appears to be faster than the original algorithm. Li et al. [14] proposed a new single-pass algorithm, called Data Stream Mining for Frequent Itemsets (DSM-FI), which mines all frequent itemsets over the entire history of data streams. DSM-FI outperforms Lossy Counting [15] in terms of execution time and memory usage on large data sets. Ye et al. [16] implemented a parallel Apriori algorithm based on Bodon's work [13] and analyzed its performance on a parallel computer. Their implementation is a partition-based Apriori algorithm that partitions the transaction database. Partitioning the transaction database also improves the performance of frequent itemset mining by fitting each partition into limited main memory for quick access and by allowing incremental generation of frequent itemsets. Another algorithm for efficiently generating large frequent candidate sets was proposed by Yuan [17] and is called the Matrix Algorithm. The algorithm generates a matrix with entries 1 or 0 by passing over the original database only once, and the frequent candidate sets are then obtained from the resulting matrix. Finally, association rules are mined from the frequent candidate sets. This algorithm is more effective than the Apriori algorithm.
Huan Wu et al. [18] proposed an Apriori-based algorithm, IAA. IAA adopts a new count-based method to prune candidate itemsets and decreases the amount of scanned data by means of a candidate generation record; this algorithm can reduce redundant operations while generating frequent itemsets and association rules from the database. Paul [19] proposed an Optimized Distributed Association Mining Algorithm for the mining process in a distributed environment. The response time, together with the communication and computation factors, is considered in order to achieve an improved response time. The performance analysis is done by increasing the number of processors in the distributed environment; as the mining process is done in parallel, an optimal solution is obtained. Gautam et al. [20] proposed a multilevel association rule mining algorithm based on a Boolean matrix (MLBM). It adopts Boolean relational calculus to discover maximum frequent itemsets at lower levels. The algorithm scans the database once and then generates the association rules; the Apriori property is used to prune the itemsets, and it is not necessary to scan the database again. In addition, it stores all transaction data in bits, so it needs less memory space and can be applied to mining large transaction databases. Ghosh et al. [21] used a genetic algorithm in the discovery of frequent itemsets; the advantage is that the GA performs a global search and its time complexity is lower compared to other algorithms that are based on a greedy approach. In this method all the frequent itemsets are found from the given data sets using a genetic algorithm. Kamal et al. [22] propose a novel method, the transactional pattern base, where transactions with the same pattern are added as their frequency is increased; subsequent scanning then requires only this compact dataset, which increases the efficiency of the respective methods. A two-dimensional matrix is used instead of the FP-Growth method. The matrix, used to generate association rules, is very small in size compared to the FP-Tree, and the rules are generated more quickly. The size of the matrix is not directly proportional to the number of transactions; if the frequency of transactions is high, the size of the matrix will be even smaller. The Matrix Apriori algorithm has the problem that it must be repeated with every single update. Therefore, Oguz [23] proposed a dynamic frequent itemset mining algorithm, called Dynamic Matrix Apriori. It handles additions and deletions as well, and it also manages the challenges of new items and support changes. The main advantage of the algorithm is that it avoids scanning the entire database when the database is updated; it scans only the increments.
4.2 Redundant Association Rules
To deal with the problem of rule redundancy, various types of research on mining association rules have been performed. Cristofor and Simovici [24] proposed inference rules, or inference systems, to prune redundant rules and thus present smaller, and usually more understandable, sets of association rules to the user. Ashrafi et al. [25] presented several methods to eliminate redundant rules and to produce a small number of rules from any given set of frequent or frequent closed itemsets. Ashrafi et al. [26] presented additional redundant-rule elimination methods that first identify the rules that have similar meaning and then eliminate those rules; furthermore, their methods eliminate redundant rules in such a way that they never drop any higher-confidence or interesting rules from the resultant rule set. Another approach, called MTRFMA (modified transaction reduction based frequent itemset mining algorithm), developed by Thevar et al. [27], maintains its performance even at relatively low supports.
AL-Zawaidah et al. [28] presented a novel approach, the Feature-Based Association Rule Mining Algorithm (FARMA). In this algorithm a leverage measure with a minimum leverage threshold is used, which at the same time incorporates an implicit frequency constraint: all itemsets with minimum support are found first, and the found itemsets are then filtered using the leverage constraint. This algorithm reduces the generation of candidate itemsets and thus reduces the memory required to store a huge number of useless candidates. Another approach based on the genetic algorithm is the Mining Optimized Association Rules algorithm (MOAR) [29] proposed by Wakabi et al. In this algorithm, using the standard SsTs-dominance relations causes a large number of rules, some of them interesting, to be found; when dominance is determined solely through support and confidence, there is a high chance of eliminating interesting rules, so a mechanism is needed for managing their large numbers and for significantly improving the response time of the algorithm. To overcome the problems of negative rules and dominance, a genetic algorithm is used on top of the Apriori algorithm, which yields optimized association rules. Jain and Sharma [30] proposed a new algorithm that combines the support weight value, the near distance of the superior candidate key and a parity-based selection of rules based on the group value of the rule; this technique was applied to a synthetic database and generated the desired rules. Rangaswamy and Shobha [31] implemented association rule mining using genetic algorithms. It improves the performance of accessing information from databases (log files) maintained on the server machine. The system finds all the possible optimized rules from a given data set, using the genetic algorithm to minimize the time required for scanning huge databases. Indira and Kanmani [32] analyzed the performance of the GA in mining association rules effectively, based on variations and modifications of the GA parameters. The fitness function, crossover rate and mutation rate are shown to be the primary parameters in an implementation of genetic algorithms.
5. Previous Algorithms
5.1 Apriori Algorithm
The Apriori algorithm was proposed by Agrawal and Srikant [3] and is considered one of the most important contributions to association rule mining. Its main algorithm, Apriori, has influenced not only the association rule mining community but other data mining fields as well. The Apriori algorithm for finding all large itemsets makes multiple passes over the database. In the first pass, the algorithm counts item occurrences to determine the large 1-itemsets. A subsequent pass, say pass π‘˜, consists of two steps. First, the large itemsets πΏπ‘˜−1 found in the (π‘˜ − 1)th pass are used to generate the candidate itemsets πΆπ‘˜. Then, all those itemsets that have some (π‘˜ − 1)-subset not in πΏπ‘˜−1 are deleted, yielding πΆπ‘˜.
Algorithm: Apriori algorithm
1) Begin
2) 𝐿1 = {frequent 1-item sets};
3) for ( k = 2; πΏπ‘˜−1 ≠ ∅; k++ ) do begin
4) πΆπ‘˜ = Apriori-Gen(πΏπ‘˜−1 );
5) for all transactions 𝑑 πœ– 𝐷 do begin
6) for all candidates 𝑐 πœ– πΆπ‘˜ contained in t do
7) c.count++;
8) end
9) πΏπ‘˜ = {𝑐 πœ– πΆπ‘˜ | c.count ≥ minsup}
10) end
11) end
12) Answer = ∪π‘˜ πΏπ‘˜ ;
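For concreteness, the following minimal Python sketch implements the loop above (it is illustrative only; the function and variable names are ours and not part of the paper). It shows the join-and-prune candidate generation and the counting pass over the transactions.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all frequent itemsets (as frozensets) with support count >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {c for c, n in counts.items() if n >= minsup}
    answer = set(L)
    k = 2
    while L:
        # Apriori-Gen: join step followed by the (k-1)-subset prune step
        candidates = set()
        for a in L:
            for b in L:
                union = a | b
                if len(union) == k and all(
                    frozenset(s) in L for s in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Count the surviving candidates in one pass over the transactions
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        L = {c for c, n in counts.items() if n >= minsup}
        answer |= L
        k += 1
    return answer

# Example usage; minsup is an absolute count here (e.g. 2 transactions).
data = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
print(apriori(data, minsup=2))
```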
The main drawback of the Apriori algorithm is the repeated scanning of the database. Further research evolved around the concept of the Karnaugh map. A Karnaugh map provides a pictorial method of grouping together expressions with common factors and thereby eliminating unwanted variables.
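As we read it, the role the K-map plays here is essentially that of a compact binary (0/1) table over the transactions, from which support counts can be read without rescanning the raw database. The following minimal sketch (an interpretation under that assumption; the helper names are hypothetical) builds such a binary table in one pass and then answers support queries from it alone.

```python
def build_binary_table(transactions, items):
    """One pass over the database: one 0/1 row per transaction, one column per item."""
    index = {item: j for j, item in enumerate(items)}
    table = []
    for t in transactions:
        row = [0] * len(items)
        for item in t:
            row[index[item]] = 1
        table.append(row)
    return table, index

def support_from_table(table, index, itemset):
    """Support count of `itemset`, computed from the binary table only."""
    cols = [index[item] for item in itemset]
    return sum(1 for row in table if all(row[j] == 1 for j in cols))

items = ["bread", "milk", "butter"]
data = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}]
table, index = build_binary_table(data, items)
print(support_from_table(table, index, {"bread", "milk"}))  # 2
```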
5.2 K-Apriori Algorithm
Using the concept of the K-map, a new algorithm was developed to reduce the number of database scans to one. In this algorithm, K-Apriori, a Karnaugh map is used to store the database transactions in reduced form, which needs only one scan of the database, and the Apriori algorithm is then used to identify the frequent sets. The support count can be calculated directly from the K-map, so no further scanning of the database is required. The K-map binary table partitions the data in the table into 0s and 1s, and this partitioning reduces the scanning time of the database.
Algorithm: K-Apriori algorithm
1) Begin
2) Initialize: n = number of partitions required
   N = number of transactions to be analyzed
   m = N/n // number of transactions in each partition
// Phase I
3) for i = 1 to n do begin
4)   for j = 1 to m do begin
5)     p = read_partition(T_j in p_i)
6)   end
7)   kmap_i = generate_kmap(p)
8)   L_i = Apriori(kmap_i)
9) end
// Merge phase
10) for (k = 2; L_k^i ≠ ∅ for i = 1, 2, ..., n; k++) do begin
11)   C_k^G = ∪_{i=1 to n} L_k^i
12) end
// Phase II
13) for i = 1 to n do begin
14)   kmap = kmap + kmap_i
15) end
16) for all candidates c ∈ C_G compute s(c) using kmap
17) L_G = {c ∈ C_G | s(c) ≥ sigma}
18) Answer = L_G
19) End

In this paper we attempt to combine the concept of the genetic algorithm with the K-Apriori algorithm, so that the number of database scans can be reduced to one and the redundant rules are also optimized.

6. Proposed Methodology
We propose a novel algorithm for the optimization of association rule mining. The proposed algorithm resolves the problem of redundant association rules and also optimizes away the multi-pass processing of rules; multiple passes over large datasets are a great challenge for association rule mining. In this paper we propose K-Apriori-GA. In this algorithm the K-map is used to partition the dataset logically: the database is divided into two sections, mapped data and unmapped data. The mapped data is logically assigned 1 and the unmapped data is logically assigned 0 for the scanning process. This partitioning reduces the scanning time of the database. The algorithm combines K-Apriori with the genetic algorithm. The support weight key is a vector value derived from the transaction data set. The support value is passed as a vector for finding the near distance between K-map candidate keys. After a K-map candidate key is found, the nearest distances are divided into two classes: one class takes the higher-order values and the other class takes the lower values for the rule generation process. This class selection also reduces the number of passes over the data set. After finding the lower and higher classes for the given support value, the distance weight vector values are compared; this distance weight vector works as the fitness function for the selection process of the genetic algorithm. Here we present the steps of the algorithm; an illustrative code sketch follows the step list.
Steps of the algorithm (K-Apriori-GA):
1. Select the data set.
2. Set the values of support and confidence.
3. Logically divide the dataset into two parts, 0 and 1.
4. Assign 1 to the mapped part and 0 to the unmapped part.
5. Start scanning the transaction table.
6. Count the frequent items.
7. Generate the frequent itemsets.
8. Check whether the transaction set of data is null.
9. Use the value of support as the weight.
10. Compute the distance with the Euclidean distance formula.
11. Generate the distance vector values for the selection process.
12. Initialize a population set (t = 1).
13. Compare the value of the distance vector with the population set.
14. If the value of support is greater than the vector value,
15. proceed to encode the data.
16. The encoding format is binary.
17. After encoding, offspring are produced.
18. Set the mutation probability; the value used is 0.006.
19. A set of rules is generated.
20. Check the K-map value of the table.
21. If the rule is not in the K-map, it goes into the selection process.
22. Otherwise, the optimized rule is generated.
23. Exit.
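The following minimal Python sketch illustrates the genetic-algorithm portion of the steps above as we interpret them; it is not the authors' implementation. Candidate rules are binary-encoded, a distance-weight value serves as the fitness (steps 9-13), and mutation is applied with probability 0.006 as stated in step 18. All names and data here are hypothetical.

```python
import math
import random

MUTATION_PROB = 0.006  # step 18 of the algorithm

def fitness(chromosome, support_weights):
    """Distance-weight fitness: negative Euclidean distance between the rule's
    0/1 encoding and the support-weight vector (steps 9-11)."""
    return -math.dist([float(g) for g in chromosome], support_weights)

def crossover(a, b):
    """Single-point crossover producing one offspring (step 17)."""
    point = random.randint(1, len(a) - 1)
    return a[:point] + b[point:]

def mutate(chromosome):
    """Flip each bit with probability 0.006 (step 18)."""
    return [1 - g if random.random() < MUTATION_PROB else g for g in chromosome]

def optimize_rules(population, support_weights, generations=50):
    """Evolve binary-encoded rules, keeping the fitter half each generation."""
    for _ in range(generations):
        ranked = sorted(population, key=lambda c: fitness(c, support_weights),
                        reverse=True)
        parents = ranked[: max(2, len(ranked) // 2)]
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(len(population) - len(parents))]
        population = parents + children
    return population

# Toy example: four candidate rules over five items and a support-weight vector.
rules = [[random.randint(0, 1) for _ in range(5)] for _ in range(4)]
weights = [0.35, 0.40, 0.45, 0.50, 0.55]
print(optimize_rules(rules, weights)[0])
```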
7. Experiments and Results
The experiments were conducted on a real-world dataset from The National Rural Employment Guarantee Act (NREGA), Department of Rural Development, Government of India. The objective of the act is to enhance livelihood security in rural areas by providing at least 100 days of guaranteed wage employment in a financial year to every household whose adult members volunteer to do unskilled manual work. Proper maintenance of records is one of the critical success factors in the implementation of NREGA. Information on critical inputs, processes, outputs and outcomes has to be meticulously recorded in prescribed registers at the levels of the district program coordinator, program officer, gram Panchayat and other implementing agencies. The computer-based management information system also captures the same information electronically. To evaluate the algorithm we used the Janpad Panchayat records of March 2012 from Sehore district (M.P.). Details of the monthly squaring of accounts are made publicly available on the internet at all levels of aggregation through the website (http://nrega.nic.in). This database contains data pertaining to the name of the sub-engineer, name of the gram Panchayat, type of work, technical sanction, administrative sanction, cost of work, completion date, labor expenditure, material expenditure, total expenditure of the work and physical report of the work.
To evaluate the efficiency of the proposed method we have extensively studied our algorithm's performance by comparing it with the standard Apriori algorithm as well as the K-Apriori algorithm. NREGA has many records, but we randomly selected 731 transactions; this dataset has 11 attributes and 731 instances. For these experiments we used 4 attributes and 20 instances. The experiments were executed on an Intel(R) Celeron(R) M CPU 1.73 GHz machine, and the software used was MATLAB 7.8.0 (2009). The tables below present the number of generated rules and the execution time for particular support and confidence values. Table 1 reports the execution time of all three algorithms.
Table 1. Execution time of the three algorithms for different minimum support and confidence values

Attributes   Minimum Support   Minimum Confidence   Apriori algorithm   K-Apriori algorithm   K-Apriori-GA algorithm
A            0.35              0.40                 15.1                13.3                  12.1
B            0.40              0.47                 11.8                13.4                  13.5
C            0.45              0.55                 14.5                13.4                  12.7
D            0.50              0.60                 14.1                13.7                  12.3
E            0.55              0.65                 13.1                13.4                  12.3
[Chart omitted: execution time vs. minimum support (A-E) for the Apriori, K-Apriori and K-Apriori-GA algorithms.]
Fig. 1 Comparative analysis of execution time
Table 2 compares the number of rules generated by the three algorithms after optimization.

Table 2. Number of generated rules for different minimum support and confidence values

Attributes   Minimum Support   Minimum Confidence   Apriori algorithm   K-Apriori algorithm   K-Apriori-GA algorithm
A            0.35              0.40                 226                 163                   154
B            0.40              0.47                 177                 154                   140
C            0.45              0.55                 154                 140                   126
D            0.50              0.60                 140                 140                   123
E            0.55              0.65                 123                 126                   122
[Chart omitted: number of generated rules vs. minimum support (A-E) for the Apriori, K-Apriori and K-Apriori-GA algorithms.]
Fig. 2 Comparative analysis of the rules
optimization process
8. Conclusion
The aim of this paper is to enhance the performance of the standard Apriori algorithm for mining association rules by presenting a fast and scalable algorithm for discovering association rules in massive databases. The approach taken to achieve the required improvement is to build a more efficient algorithm out of the standard one by adding new features to the Apriori approach. The proposed mining algorithm efficiently discovers the association rules between the data items in massive databases. In particular, redundant rules are eliminated and the scanning time of the algorithm is reduced considerably. We compared our proposed algorithm to the previously proposed algorithms, and the findings from different experiments confirmed that our proposed approach is the most effective among them. It speeds up the data mining process considerably, as demonstrated in the performance comparison, and we demonstrated the effectiveness of our algorithm using real-world data sets. We also developed a visualization module to provide users with helpful information about the database to be mined and to assist the user in managing and understanding the association rules.
References
[1] M.S. Chen, J. Han, and P.S. Yu, "Data Mining: An Overview from a Database Perspective," IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, December 1996, pp. 866-883.
[2] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," Proceedings of the ACM SIGMOD Conference, Washington DC, USA, May 1993.
[3] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994, pp. 487-499.
[4] J.S. Park, M.S. Chen, and P.S. Yu, "An Effective Hash Based Algorithm for Mining Association Rules," Proceedings of the ACM SIGMOD International Conference on Management of Data, 1995, pp. 175-186.
[5] A. Savasere, E. Omiecinski, and S. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases," Proceedings of the 20th International Conference on VLDB, 1995.
[6] D. Lin and Z.M. Kedem, "Pincer Search: A New Algorithm for Discovering the Maximum Frequent Set," September 1997.
[7] S. Brin, R. Motwani, J. Ullman, and S. Tsur, "Dynamic Itemset Counting and Implication Rules for Market Basket Data," Proceedings of the ACM SIGMOD Conference on Management of Data, 1997, pp. 255-264.
[8] J. Han and J. Pei, "Mining Frequent Patterns by Pattern-Growth: Methodology and Implications," ACM SIGKDD Explorations Newsletter, 2000, pp. 14-20.
[9] A. Das, W.K. Ng, and Y.K. Woon, "Rapid Association Rule Mining," Proceedings of the 10th International Conference on Information and Knowledge Management, ACM Press, 2001, pp. 474-481.
[10] R. Srikant and R. Agrawal, "Mining Generalized Association Rules," Proceedings of the 21st VLDB Conference, Zurich, Switzerland, 1995, pp. 407-419.
[11] C. Hidber, 1999. “Online association rule
mining”. In A. Delis, C. Faloutsos, and S.
Ghandeharizadeh, editors, Proceedings of the
1999 ACM SIGMOD International Conference
on Management of Data, volume 28(2) of
SIGMOD Record, pp. 145–156. ACM Press.
[12] W. Wang, J. Yang, and P.S. Yu, "Efficient Mining of Weighted Association Rules (WAR)," Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000, pp. 270-274.
[13] F. Bodon, “A Fast Apriori Implementation,”
Proceedings of the IEEE ICDM Workshop on
Frequent Itemset Mining Implementations,
2003 Vol. 90 of CEUR Workshop Proceedings.
[14] H.F. Li, S.Y. Lee and M.K. Shan, “An
Efficient Algorithm for Mining Frequent
Itemsets over the Entire History of Data
Streams,” Proceedings of the 1st International
Workshop on Knowledge Discovery in Data
Streams, 2004 pp. 20- 24.
[15] G.S. Manku and R. Motwani, "Approximate Frequency Counts over Data Streams," Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.
[16] Y. Ye and C.C. Chiang, "A Parallel Apriori Algorithm for Frequent Itemsets Mining," 4th International Conference on Software Engineering Research, Management and Applications, 2006, pp. 87-94.
[17] Y. Yuan, and T. Huang, “A Matrix Algorithm
for Mining Association Rules,” Lecture Notes
in Computer Science, Springer-Verlag Berlin
Heidelberg , Vol. 3644, Sep 2005, pp. 370 –
379.
[18] H. Wu, Z. Lu, L. Pan, R. Xu, and W. Jiang,
“An Improved Apriori-based Algorithm for
Association Rules Mining,” 6th International
Conference on Fuzzy Systems and Knowledge
Discovery,IEEE 2009 pp. 51-55.
[19] S. Paul, "An Optimized Distributed Association Rule Mining Algorithm in Parallel and Distributed Data Mining with XML Data for Improved Response Time," International Journal of Computer Science and Information Technology, Vol. 2, No. 2, April 2010, pp. 88-101.
[20] P. Gautam, and K. R. Pardasani, “A Fast
Algorithm for Mining Multilevel Association
Rule Based on Boolean Matrix,” International
Journal on Computer Science and Engineering
(IJCSE), Vol. 2, No. 3,2010 pp. 746-752.
[21] S. Ghosh, S. Biswas, D. Sarkar, and P.P.
Sarkar, “Mining Frequent Itemsets Using
Genetic Algorithm,” International Journal of
Artificial Intelligence & Applications (IJAIA),
Vol.1, No.4, October 2010 pp. 133-143.
[22] S. Kamal, R. Ibrahim and Zia-ud-Din,
“Association Rule Mining through Matrix
Manipulation using Transaction Patternbase,”
Journal of Basic & Applied Sciences, Vol. 8,
2012 pp. 187-195.
[23] D. Oguz, “Dynamic Frequent Itemset Mining
Based on Matrix Apriori Algorithm,” Master
Thesis, The Graduate School of Engineering
and Sciences of Izmir Institute of Technology,
June 2012.
[24] L. Cristofor and D. Simovici, "Generating an Informative Cover for Association Rules," Proceedings of the IEEE International Conference on Data Mining, 2002.
[25] M. Ashrafi, D. Taniar, and K. Smith, "A New Approach of Eliminating Redundant Association Rules," Lecture Notes in Computer Science, Vol. 3180, 2004, pp. 465-474.
[26] M. Ashrafi, D. Taniar, and K. Smith, "Redundant Association Rules Reduction Techniques," Lecture Notes in Computer Science, Vol. 3809, 2005, pp. 254-263.
[27] R.E. Thevar and R. Krishnamoorthy, "A New Approach of Modified Transaction Reduction Algorithm for Mining Frequent Itemset," ICCIT 2008, 11th Conference on Computer and Information Technology.
[28] F. H. AL-Zawaidah, and Y.H. Jbara, “An
Improved Algorithm for Mining Association
Rules in Large Databases,” World of Computer
Science and Information Technology Journal
(WCSIT), Vol. 1, No. 7, 2011 pp. 311-316.
[29] P.P. Wakabi-Waiswa, V. Baryamureeba, and K. Sarukesi, "Optimized Association Rule Mining with Genetic Algorithms," 7th International Conference on Natural Computation, IEEE, 2011, pp. 1116-1120.
[30] N. Jain, and V. Sharma, “Distance Weight
Parity Optimization of Association Rule
Mining With Genetic Algorithm,” International
Journal of Engineering Research & Technology
(IJERT), Vol. 1, Issue 8, October – 2012 pp. 14.
[31] S. Rangaswamy and Shobha G. “Optimized
Association Rule Mining Using Genetic
Algorithm,” Journal of Computer Science
Engineering and Information Technology
Research (JCSEITR) Vol.2, Issue 1 Sep 2012
pp.1-9.
[32] Indira K. and Kanmani S. “Performance
Analysis of Genetic Algorithm for Mining
Association Rules,” International Journal of
Computer Science (IJCSI) Issues, Vol. 9, Issue
2, No 1, March 2012 pp. 368-376.