International Conference on Global Trends in Engineering, Technology and Management (ICGTETM-2016)
Frequent Itemset Mining Algorithms for Big Data using
MapReduce Technique - A Review
Mahesh A. Shinde#1, K. P. Adhiya*2
#1PG Student, *2Associate Professor
Department of Computer Engineering, SSBT College of Engineering and Technology
Bambhori, NMU Jalgaon, Maharashtra, India
Abstract: A very large quantity of data, known as “Big
Data”, is continuously generated from a variety of
sources such as IT industries, internet applications,
hospital records, microphones, sensor networks and
social media feeds. Traditional tools cannot handle big
data because of its variety. Numerous data mining
techniques have been developed to derive association
rules and frequently occurring itemsets, but with the
arrival of the big data era traditional data mining
algorithms can no longer meet the requirements of
large-dataset analysis. The size, complexity and
variability of big data are the major challenges in
recognizing association rules and frequent itemsets.
MREclat addresses the problem of insufficient memory
and computational capability. ClustBigFIM and
MRPrePost provide scalability and speed for mining
large datasets. The MapReduce framework is widely
used for parallel processing of Big Data; its high
scalability and robustness help to handle large
datasets. In this paper, we present a detailed review of
different frequent itemset mining (FIM) techniques.
Keywords: Big data, Data mining, Frequent Itemset
Mining, Association Rule mining, MapReduce.
I. INTRODUCTION
Due to the growth of IT industries, services,
technologies and data, a huge amount of complex
data is generated from various sources and in
various forms. Such complex and massive data is
difficult to handle and process. It contains billions of
records about millions of users and products,
including online sales data, audio, images and videos
from social media, news feeds, product prices and
specifications. The need for big data arises
from world-famous companies such as Google,
Yahoo, Weibo, Facebook, Microsoft and Twitter,
which must analyze huge volumes of data that may be
unstructured. Google, for example, holds a
massive amount of data. Big data analytics is needed
to handle and process such data. Big data
analytics analyzes huge amounts of information and
reveals association rules, hidden patterns, trends
and other meaningful information.
In 1998, John Mashey introduced the term
“Big Data” [1]. Big data is a
collection of large data that consists of different types
of data. In the same year, Weiss and Indurkhya [2]
published a data mining book, “Predictive Data
Mining: A Practical Guide”, that mentions big data.
Data is called big data
because large quantities of it are generated
every day. At the BigMine'12 workshop held at
KDD, Usama Fayyad [3] presented some striking
figures about internet usage: Google
handles more than one billion queries every day,
Twitter and Facebook handle more than 250 million
tweets and 800 million updates/comments every day
respectively, and YouTube receives more than
4 billion views every day.
ISSN: 2231-5381
Doug Laney [4], VP and Distinguished Analyst at
Gartner Research, was the first to present the three
V's of big data management. These three V's are:
Volume: The size of data is larger than ever
before and it is increasing continuously.
Traditional tools are not sufficient to handle
data of this size.
Variety: There are numerous varieties of
data, such as images, video and audio
in different formats, simple text, graphs,
tables, location or log files, sensor data, other
multimedia, and more.
Velocity: Data arrives continuously as a
stream, and the user's primary goal is to
extract only the meaningful data from it in real time.
Two further V's are:
Variability: This refers to changes in the
structure of the available useful information
and in how users want to interpret that
meaningful data.
Value: This refers to the business value that
gives an organization a compelling advantage,
due to the ability to make decisions based
on answers to questions that were previously
considered beyond reach.
Big data mainly consists of two types of data: 1.
Structured and 2. Unstructured. Structured data
includes digits and words that are not difficult to
categorize and analyze. Structured data is produced by
a number of sources such as mobile devices, aerial (remote
sensing) devices, software logs, cameras, microphones,
electronic devices, radio-frequency identification
readers, wireless sensor networks and global
positioning system devices. Structured data also
includes things like the balance and transaction
information of a bank account. The other type,
unstructured data, contains more complex data, such as
user reviews from the Flipkart website, tweets from Twitter,
images, videos, comments from Facebook
and other multimedia. It is a really difficult task to
categorize and analyze such complex data.
Frequent itemset mining is an important part of
data analysis and data mining. The main goal of FIM
is to extract information and reveal patterns from
massive datasets on the basis of frequent occurrence,
i.e., an event or a set of events is interesting
if it occurs frequently in the data,
according to a user-given minimum frequency
threshold. Many techniques have been developed to
mine frequent itemsets from databases. These
techniques work well in practice on typical datasets,
but they are not applicable to real Big Data. Applying
frequent itemset mining to massive
databases is not an easy task, and there are a number of
difficulties. First, databases with a large
number of records do not fit into main memory. In
such cases, one solution is to use level-wise breadth-first
search based algorithms, such as the Apriori algorithm, in
which frequency counts are obtained by reading
the dataset over and over again for each size of
candidate itemsets. Unfortunately, the memory
requirements for handling the complete set of
candidate itemsets blow up fast and render Apriori-based
schemes very inefficient on single
machines. Second, current approaches tend to keep
the output and runtime under control by increasing the
minimum frequency threshold, which automatically reduces
the number of candidate and frequent itemsets.
Google [5] proposed the MapReduce framework, which is
used for parallel processing of large datasets
and works on key-value pairs. Frequent itemset
mining needs to calculate support and confidence,
which can be done in parallel using the MapReduce
programming model. Faster processing can be
achieved by calculating item frequencies with map
functions that execute in parallel on a Hadoop
cluster, and with reduce functions that combine the
local frequent items into global frequent items.
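The map and reduce roles described above can be sketched in a few lines of plain Python. This is a minimal single-process illustration, not Hadoop code: the function names (map_count, merge_counts, frequent_items) and the sample partitions are assumptions made for the example, and min_support is an absolute count.

```python
from collections import Counter
from functools import reduce


def map_count(partition):
    """Map step: count each item once per transaction in one data partition."""
    local = Counter()
    for transaction in partition:
        local.update(set(transaction))
    return local


def merge_counts(left, right):
    """Reduce step: merge two partial counters into one."""
    return left + right


def frequent_items(partitions, min_support):
    # On a cluster, each map_count call would run on a different node.
    partial = [map_count(p) for p in partitions]
    total = reduce(merge_counts, partial, Counter())
    return {item: n for item, n in total.items() if n >= min_support}


# Hypothetical data: two partitions of a small transaction database.
partitions = [
    [["bread", "milk"], ["bread", "butter"]],
    [["milk", "butter", "bread"], ["milk"]],
]
result = frequent_items(partitions, min_support=3)
```

With this sample data, bread and milk each reach the threshold of 3, while butter (count 2) is dropped.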
The paper is organized as follows.
Section II gives the background, literature survey and
comparative analysis of FIM techniques. Section III
explains the techniques and tools necessary for big data
mining and the MapReduce framework. The conclusion
is presented in Section IV.
II. BACKGROUND
The size, complexity and variability of Big Data are big
challenges for recognizing association rules and frequent
itemsets. The “Market-Basket” model is the best-known
example of association rule mining, which is based on
relationships among elements [6]. Association rule
mining and frequent itemset mining are well-known
data mining techniques. They discover how frequently
items are purchased together. FIM requires scanning the
whole database, which becomes a challenge as
datasets scale, because large datasets do not fit
into memory. Several approaches exist for association
rule mining [7], [8], [9]. Frequent itemsets play an
essential role in finding correlations, clusters, episodes
and in many other data mining tasks. The value discovered
from frequent itemsets can be used to make decisions
in marketing.
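As a small worked example of the market-basket idea, the following sketch computes the two standard measures, support and confidence, using their usual definitions: support(X) is the fraction of transactions containing X, and confidence(X -> Y) is support(X and Y together) divided by support(X). The basket data and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
s = support(baskets, {"bread", "milk"})       # 2 of 4 baskets
c = confidence(baskets, {"bread"}, {"milk"})  # 2 of the 3 bread baskets
```

Here {bread, milk} has support 0.5, and the rule bread -> milk has confidence 2/3.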
Agrawal [6] first posed the problem of mining itemsets
from customer transaction databases in 1993; since then
FIM (frequent itemset mining) has become an essential
part of data mining. Most current algorithms are
classified into two groups: Apriori-like algorithms and
FP-Growth (frequent pattern growth) algorithms. Apriori
rejects candidate sets by repeatedly scanning the
database. The main advantage of the FP-Growth algorithm
is the FP-Tree. Neither algorithm adapts well to
large data. One workaround for both
algorithms is to consider only a large
threshold value, which reduces the number of
candidates, but this makes the mined
association rules inaccurate due to low utilization.
The mining of frequent itemsets is a basic and
essential problem in many data mining applications.
Algorithms for mining frequent itemsets can be
classified into two types: algorithms
based on a horizontal data layout, such as the Apriori
and FP-Growth algorithms, and
algorithms based on a vertical data layout, such as the
Eclat algorithm. Eclat has an advantage over
algorithms based on the horizontal layout: it
saves much time because it does not need to
scan the whole database repeatedly.
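The vertical layout and the tid-list intersection that Eclat relies on can be sketched as follows. This is a minimal in-memory illustration of the general Eclat idea, not the implementation of any paper reviewed here; the function names and sample transactions are assumptions.

```python
def to_vertical(transactions):
    """Transpose {tid: items} into {item: set of tids} (vertical layout)."""
    tidlists = {}
    for tid, items in transactions.items():
        for item in items:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists


def eclat(prefix, items, min_support, out):
    """Depth-first search; `items` holds (item, tidset) pairs extending `prefix`."""
    for i, (item, tids) in enumerate(items):
        itemset = prefix + (item,)
        out[itemset] = len(tids)
        # Support of an extension is the size of the tid-set intersection,
        # so no further scan of the database is needed.
        extensions = []
        for other, other_tids in items[i + 1:]:
            shared = tids & other_tids
            if len(shared) >= min_support:
                extensions.append((other, shared))
        if extensions:
            eclat(itemset, extensions, min_support, out)


def mine(transactions, min_support):
    vertical = to_vertical(transactions)
    start = sorted(
        ((item, tids) for item, tids in vertical.items() if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    out = {}
    eclat((), start, min_support, out)
    return out


# Hypothetical horizontal database: transaction id -> items.
db = {1: ["a", "b", "c"], 2: ["a", "b"], 3: ["a", "c"], 4: ["b", "c"], 5: ["a", "b", "c"]}
found = mine(db, min_support=3)
```

With a minimum support of 3, the three single items and the three pairs are frequent, while {a, b, c} (support 2) is not.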
Apriori is the most classical algorithm in the history of
data mining. The main idea behind the Apriori
algorithm is to generate candidate (k+1)-itemsets from
the frequent k-itemsets, traverse the database to
count the support of each candidate, and then discard
the candidates that fall below the support threshold.
The pruning strategy for candidate itemsets is that if an
itemset is not frequent, then none of its supersets is
frequent either. The algorithm is very simple, but its
main drawback is that it traverses the database
too many times and produces a large number
of candidate sets, so time and memory overhead
become a bottleneck. Compared with the Apriori
algorithm, FP-Growth is an improved algorithm. The
main advantage of FP-Growth is that it only needs to
scan the database twice to construct a compressed
data structure, the FP-Tree, which reduces the search space;
it generates no candidate sets and improves memory utilization.
FP-Growth adopts a depth-first search strategy.
However, it constructs a large number of conditional
pattern trees during recursion; when faced with huge
amounts of data, memory cannot hold all of
the pattern trees, and the tree traversal has a
higher time complexity. PFP is a parallel algorithm
based on Hadoop (the MapReduce framework).
PFP groups the items, and the conditional
databases are partitioned and distributed to the nodes
accordingly; each node independently builds an FP-Tree
and mines frequent itemsets from its own
partition of the database. PFP minimizes the traffic
between nodes and increases the degree of cohesion
within each node. However, the algorithm is not efficient if the
database is discrete.
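The level-wise loop of Apriori described above (join, prune, count) can be sketched as below. This is a compact single-machine illustration under simplifying assumptions: the whole database fits in memory, min_support is an absolute count, and the function name and sample data are made up for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 1
    while frequent:
        # Join step: unions of two frequent k-itemsets that have k+1 items.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k + 1:
                # Prune step: every k-subset of a candidate must be frequent.
                if all(frozenset(sub) in frequent for sub in combinations(union, k)):
                    candidates.add(union)
        # Count step: one pass over the database per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical transactions.
data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
itemsets = apriori(data, min_support=3)
```

The repeated pass over the database in the count step is the cost that the text identifies as Apriori's bottleneck on large data.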
The grouping strategy of PFP has problems with
memory and speed. To balance the groups of PFP,
Zhou et al. [10] proposed an algorithm for faster
execution using single items, which is also not an
efficient approach. Xia et al. [11] proposed an
Improved PFP algorithm for mining frequent itemsets
from datasets made of massive numbers of small files,
using a small-files processing strategy.
A number of hybrid methods have been developed
for mining frequent itemsets. MRPrePost is a hybrid
method for frequent itemset mining that combines the
DistEclat and PrePost algorithms. MREclat is also a
hybrid method for frequent itemset mining.
ClustBigFIM is a modified BigFIM algorithm for
generating frequent itemsets; it uses parallel k-means, Eclat for finding potential extensions, and
Apriori for producing k-FIs.
A. Literature Survey
There are three classic frequent itemset
mining algorithms that run on a single node. Apriori [6]
is built around an iterative loop:
loop k produces the frequent itemsets
of length k, and loop k+1 uses the output of loop k,
together with the following property, to compute
candidate itemsets. The property is: every
subset of a frequent itemset must also be frequent.
The FP-Growth [12] algorithm creates an FP-Tree with two
scans of the whole dataset, and frequent itemsets
are then mined from the frequent pattern tree. The Eclat[4]
algorithm transposes the whole dataset into a new
table in which every row contains the sorted list of
transaction IDs of the respective item. Finally,
frequent itemsets are extracted by intersecting the
transaction lists of the items.
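The two database scans that FP-Growth uses to build its FP-Tree can be illustrated with the sketch below. It covers tree construction only, not the recursive mining of conditional pattern trees, and the Node class, function name and sample data are assumptions for the example. Ties in item frequency are broken alphabetically here, which is one common but arbitrary choice.

```python
from collections import Counter


class Node:
    """One FP-Tree node: an item, its count, and its child nodes."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_support):
    # Scan 1: count items and keep only the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    root = Node(None)
    # Scan 2: insert each transaction with items in descending frequency,
    # so that transactions sharing frequent items share a path prefix.
    for t in transactions:
        ordered = sorted(
            (i for i in set(t) if i in frequent),
            key=lambda i: (-frequent[i], i),
        )
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = Node(item, node)
            node = node.children[item]
            node.count += 1
    return root, frequent


# Hypothetical transactions.
tx = [["a", "b"], ["a", "c"], ["a", "b", "c"], ["b"]]
root, freq = build_fp_tree(tx, min_support=2)
```

In this example three of the four transactions share the path prefix starting at item a, which is the compression effect the FP-Tree is designed for.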
Yahya et al. [15] presented two different ways of
converting the Apriori algorithm into MapReduce tasks. In
the first, all possible itemsets are extracted in
the Map phase, and in the Reduce phase the itemsets
that do not satisfy the minimum support threshold are
discarded. In the second, Apriori is converted
directly: every loop of the
Apriori algorithm becomes a MapReduce task.
These approaches are used by [13], [14]. In
both approaches a large amount of data is shuffled between the Map
and Reduce tasks [15]. To solve this problem, they
presented the MRApriori algorithm, an
improved MapReduce-based Apriori algorithm
that uses a two-phase structure.
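The first conversion described above (emit every itemset in the map phase, filter by support in the reduce phase) can be simulated in plain Python as follows. This is an illustrative sketch, not the code of the cited papers: the function names are assumptions, the shuffle is simulated with a dictionary, and the max_len cap is a hypothetical parameter added to keep the number of emitted subsets bounded.

```python
from collections import defaultdict
from itertools import combinations


def map_itemsets(transaction, max_len):
    """Map: emit (itemset, 1) for every non-empty subset up to max_len items."""
    items = sorted(set(transaction))
    for k in range(1, min(max_len, len(items)) + 1):
        for subset in combinations(items, k):
            yield subset, 1


def shuffle(pairs):
    """Group intermediate values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_support(itemset, ones, min_support):
    """Reduce: sum the counts and keep the itemset only if it is frequent."""
    total = sum(ones)
    return (itemset, total) if total >= min_support else None


# Hypothetical transactions.
transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
emitted = [pair for t in transactions for pair in map_itemsets(t, max_len=3)]
grouped = shuffle(emitted)
kept = [reduce_support(k, v, min_support=3) for k, v in grouped.items()]
frequent = dict(r for r in kept if r is not None)
```

Even for these five short transactions the map phase emits 23 intermediate pairs to find 6 frequent itemsets, which illustrates the shuffle cost mentioned in the text.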
Zhang et al. [16] presented an improved Eclat
algorithm to increase the efficiency of FIM on large
datasets. The parallel algorithm, based on the
MapReduce framework, is called MREclat.
MREclat also addresses the problems of
insufficient storage and computational capability
when mining frequent itemsets from large complex
datasets. MREclat has very high scalability
and better speedup in comparison with other algorithms.
MREclat consists of three steps. In the
initial step, all frequent 2-itemsets and their tid-lists
are obtained from the transaction database. The second is the
balanced group step, which partitions frequent 1-itemsets into
groups. The third is the parallel mining step, in which the data
obtained in the first step is redistributed to different
computing nodes according to the group their prefix
belongs to. Each node runs an improved Eclat to mine
frequent itemsets. Finally, MREclat collects the
output from all computing nodes and formats the
final result.
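The balanced grouping and prefix-based redistribution steps can be pictured with the following sketch. It is only an illustration of the general idea: the greedy "heaviest prefix to the lightest group" rule and the choice of weight (the number of 2-itemsets per prefix) are assumptions made here, not details taken from the MREclat paper, and all names are hypothetical.

```python
import heapq
from collections import Counter


def balanced_groups(prefix_weights, n_groups):
    """Greedily assign prefixes to groups: heaviest first, to the lightest group."""
    heap = [(0, g) for g in range(n_groups)]  # (total weight, group id)
    heapq.heapify(heap)
    assignment = {}
    ordered = sorted(prefix_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    for prefix, weight in ordered:
        load, group = heapq.heappop(heap)
        assignment[prefix] = group
        heapq.heappush(heap, (load + weight, group))
    return assignment


def route(two_itemsets, assignment):
    """Send each frequent 2-itemset and its tid-list to its prefix's group."""
    per_group = {}
    for (prefix, suffix), tids in two_itemsets.items():
        per_group.setdefault(assignment[prefix], []).append(((prefix, suffix), tids))
    return per_group


# Hypothetical frequent 2-itemsets with their tid-lists.
two_itemsets = {("a", "b"): {1, 2, 5}, ("a", "c"): {1, 3, 5}, ("b", "c"): {1, 4, 5}}
weights = Counter(prefix for prefix, _ in two_itemsets)
assignment = balanced_groups(weights, n_groups=2)
per_group = route(two_itemsets, assignment)
```

Each group then holds everything needed to mine the itemsets that start with its prefixes, so the groups can be processed independently.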
Moens et al. [17] proposed two methods for
frequent itemset mining for Big Data on MapReduce.
The first method, Dist-Eclat, is a distributed version of the pure
Eclat method that optimizes speed by distributing
the search space evenly among the mappers. The second
method, BigFIM, combines an Apriori-based method with
Eclat on projected databases that fit in memory to
extract frequent itemsets. The advantage of Dist-Eclat
is speed and that of BigFIM is scalability.
Dist-Eclat does not provide scalability,
and BigFIM is slower.
Riondato et al. [18] presented the Parallel
Randomized Algorithm (PARMA), which
finds a set of frequent itemsets in less time by using a
sampling method. PARMA mines frequent patterns
and association rules from precise data. Because it
works on samples, the mined frequent itemsets are
approximate but close to the original results. PARMA
draws random samples of the transaction dataset and
mines them in parallel. The main advantage of PARMA
is that it reduces data replication and that the algorithm
executes faster.
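The basic idea of approximating frequent items from a sample can be illustrated as below. This is a deliberately simplified sketch, assuming a single uniform random sample and a lowered threshold; it is not the PARMA algorithm itself, and the function name and the sample_size, slack and seed parameters are hypothetical.

```python
import random
from collections import Counter


def sample_frequent_items(transactions, min_freq, sample_size, slack, seed=0):
    """Estimate frequent items from one uniform random sample.

    The threshold is lowered by `slack` to reduce the chance of missing
    items that are frequent in the full dataset.
    """
    rng = random.Random(seed)
    sample = rng.sample(transactions, sample_size)
    counts = Counter(item for t in sample for item in set(t))
    threshold = (min_freq - slack) * sample_size
    return {item: n / sample_size for item, n in counts.items() if n >= threshold}


# Hypothetical dataset: "a" occurs in every transaction, "b" in half of them.
dataset = [["a", "b"]] * 50 + [["a"]] * 50
approx = sample_frequent_items(dataset, min_freq=0.4, sample_size=20, slack=0.1)
```

Only 20 of the 100 transactions are read, so the reported frequencies are estimates; the frequency of "a" is exact here only because it appears in every transaction.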
Liao et al. [19] presented the MRPrePost algorithm,
based on the MapReduce framework. MRPrePost is an
improved version of PrePost: the performance of the PrePost
algorithm is improved by including a prefix pattern.
On this basis, the MRPrePost algorithm is well suited
to mining association rules from large data. In terms of
performance, MRPrePost is superior to
PrePost and PFP. The stability and scalability of
MRPrePost are also better than those of PrePost and PFP.
The mining result of MRPrePost is approximate but
close to the original result.
BigFIM [17] overcomes the problems of Dist-Eclat,
namely that mining the sub-trees requires the entire database
to fit into main memory and that the entire dataset needs to be
communicated to most of the mappers. BigFIM is a
hybrid approach that uses the Apriori algorithm to
generate k-FIs (frequent itemsets of length k) and then
applies the Eclat algorithm to find frequent itemsets. The
limitation of using Apriori for generating k-FIs in BigFIM
is that candidate itemsets do not fit into memory at
greater depths; BigFIM is also slow.
To address the above limitation, Gole et al. [20]
proposed a method called ClustBigFIM. ClustBigFIM
provides a hybrid approach to frequent itemset mining
for large datasets using a combination of parallel k-means, the Apriori algorithm and the Eclat algorithm.
ClustBigFIM overcomes the limitation of BigFIM by
increasing scalability and performance. The
output of ClustBigFIM is approximate but
close to the original results, and it is obtained
faster. ClustBigFIM works in four steps that
are applied to large datasets: Find Clusters,
Finding k-FIs, Generate Single Global TID list and
Mining of Subtree.
B. Literature Review
Table I gives a comparative analysis of
different frequent itemset mining techniques that
work on the MapReduce framework.
Table I. Comparative Analysis of Different FIM Techniques Based
on MapReduce Framework
Authors | Technique | Advantages | Limitations
Zhou et al. [10] | FP Growth | Faster execution using singleton with balanced distribution | Partitioning of search space using single item is not best
Riondato et al. [18] | PARMA | Reduces data replication, faster execution, scaling | Mined frequent itemsets are approximate
Moens et al. [17] | Dist-Eclat | Speed | Does not provide scalability
Moens et al. [17] | BigFIM | Scalability | Lower speed
Liao et al. [19] | MRPrePost | Better stability & scalability, performance better than PFP & PrePost | Mined frequent itemsets are approximate, which are closer to original result
Gole et al. [20] | ClustBigFIM | Provides scalability & speed to mine frequent patterns, association rules, sequential patterns and correlations from massive datasets | Parallel k-means gives approximate results instead of truly frequent itemsets
III. TECHNIQUES AND TOOLS FOR BIG DATA MINING
Big Data is a collection of unstructured and
structured data. The term is closely related to the
open source software revolution. World-famous
companies such as Facebook, Yahoo!, Twitter and Microsoft
benefit from and contribute to open
source projects. Big Data infrastructure is built mainly
around Hadoop and related software:
Apache Hadoop [21]: The Apache Hadoop
software library is a framework that allows
for the distributed processing of large data
sets across clusters of computers using
simple programming models. It is designed to
scale up from single servers to thousands of
machines, each offering local computation
and storage. Hadoop allows writing
applications that rapidly process large
amounts of data in parallel on large clusters
of compute nodes. A MapReduce task
partitions the input dataset into a set of
independent subsets. These subsets
are processed individually by map
tasks. The result of the map phase
is then passed to the reduce phase to obtain the final
result of the task.
Many open source software packages exist for Big Data
mining. The most popular are the following:
Apache Mahout [22]: Apache Mahout is a
scalable open source machine learning and data mining
software library consisting mainly of
implementations of different data mining
algorithms such as frequent pattern mining,
clustering and classification.
R [23]: R is an open source programming
language and software environment designed
for statistical computing and visualization. In
1993, Ross Ihaka and Robert Gentleman
designed R at the University of Auckland, New
Zealand. R is used for statistical analysis of
very large data sets.
MOA [24]: MOA is open source software for stream
data mining. The main purpose of MOA
is to perform data mining in real time. The MOA
library includes machine learning
algorithms for classification, regression,
clustering, frequent itemset mining,
frequent graph mining, outlier detection and
concept drift detection.
MapReduce Framework: MapReduce, introduced by Google
in 2004 [25], made a great contribution to
distributed association rule mining.
Various algorithms have been proposed,
developed or modified to run on the
MapReduce framework. The framework
pools the storage and computation capacity
of many distributed commodity machines.
MapReduce can easily perform computation on
huge datasets, and it is well suited to
executing complex parallel algorithms that
make very limited use of communication.
The framework has two phases, the Map
phase and the Reduce phase, and users specify the map
and reduce functions for large parallel computations.
The map function takes a chunk of
data from HDFS in (key, value) pair format and
generates a set of intermediate (key’, value’)
pairs; it is formalized as
map :: (key, value) → (key’, value’)
The framework collects all
intermediate values bound to the same
intermediate key and passes them to the reduce
function, so the output of the map function is the
input of the reduce function. The reduce function
receives an intermediate key together with its values
and merges them. The intermediate values are supplied to
the reduce function through an iterator, which makes it
possible to handle lists of values that are too large to fit in
memory; it is formalized as
reduce :: (key’, list (value’)) → (key’’, value’’)
The output can consist of one or more files, which
are written to HDFS. Tasks such as Inverted
Index, Term Vector per Host, Distributed Sort,
Distributed Grep and Count of URL Access Frequency
can be completed with the MapReduce framework.
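The two signatures above can be made concrete with a small single-process simulation of the map, group-by-key and reduce steps, applied to the URL access frequency example. This is an illustrative sketch only: run_mapreduce, url_mapper and url_reducer are hypothetical names, the log lines are invented, and a real Hadoop job would distribute these steps across nodes and read from and write to HDFS.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Apply mapper to each (key, value), group by key, then apply reducer."""
    # Map phase: each input pair yields zero or more intermediate pairs.
    intermediate = defaultdict(list)
    for key, value in records:
        for k2, v2 in mapper(key, value):
            intermediate[k2].append(v2)
    # Reduce phase: each key is reduced with an iterator over its values.
    output = {}
    for k2 in sorted(intermediate):
        for k3, v3 in reducer(k2, iter(intermediate[k2])):
            output[k3] = v3
    return output


def url_mapper(offset, log_line):
    """map :: (offset, line) -> (url, 1)"""
    url = log_line.split()[0]
    yield url, 1


def url_reducer(url, counts):
    """reduce :: (url, list(1)) -> (url, total)"""
    yield url, sum(counts)


# Hypothetical access log: (line offset, "url status").
log = [(0, "/index.html 200"), (1, "/about.html 200"), (2, "/index.html 304")]
url_counts = run_mapreduce(log, url_mapper, url_reducer)
```

For this log the result maps /index.html to 2 and /about.html to 1. The reducer receives an iterator rather than a list, mirroring the point that values need not all be held in memory at once.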
IV. CONCLUSION
Frequent itemset mining is an important research
topic because it is widely applied in the real world to find
frequent itemsets and to mine human behavior patterns
and trends. This paper presents a comparative study of a number
of FIM techniques. The FIM process is both
memory and compute intensive. Various FIM
techniques have been proposed and developed over the last
couple of years to overcome the problem of
insufficient memory and computational capability
when mining frequent itemsets from massive datasets.
Hybrid approaches also improve the performance,
stability and scalability of the algorithms.
Efficiency and scalability are crucial when designing an
FIM algorithm for large datasets.
However, current distributed FIM algorithms often
suffer from generating huge intermediate data or
scanning the whole transaction database to
identify the frequent itemsets. In future, the search
space should be reduced, and truly frequent patterns,
rather than approximate ones, should be mined in
less time.
REFERENCES
[1] F. Diebold. On the Origin(s) and Development of the Term
“Big Data”. Pier working paper archive, Penn Institute for
Economic Research, Department of Economics, University
of Pennsylvania, 2012.
[2] S. M. Weiss and N. Indurkhya, “Predictive data mining: a
practical guide,” Morgan Kaufmann Publishers Inc., San
Francisco, CA, USA, 1998.
[3] U. Fayyad. “Big Data Analytics: Applications and Opportunities in On-line Predictive Modeling,” [Online].
Available:, 2012.
[4] D. Laney. “3-D Data Management: Controlling Data Volume,
Velocity and Variety.” META Group Research Note,
February 6, 2001.
[5] J. Dean and S. Ghemawat. “MapReduce: Simplified data
processing on large clusters”. In Proc. OSDI. USENIX
Association, 2004.
[6] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami,
“Mining association rules between sets of items in large
databases,” SIGMOD Rec. 22, 2 (June 1993), 207-216.
[7] Jochen Hipp, Ulrich Güntzer, and Gholamreza Nakhaeizadeh.
“Algorithms for association rule mining - a general survey
and comparison”. SIGKDD Explor. Newsl. 2, 1 (June 2000),
[8] Woo Sik Seol, Hwi Woon Jeong, Byungjun Lee, and Hee
Yong Youn, “Reduction of Association Rules for Big Data
Sets in Socially-Aware Computing,” Computational Science
and Engineering (CSE), 2013 IEEE 16th International
Conference on, vol., no., pp.949,956, 3-5 Dec. 2013.
[9] Jiawei Han. 2005. “Data Mining: Concepts and Techniques”.
Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
[10] L. Zhou, Z. Zhong, J. Chang, J. Li, J. Huang, and S. Feng.
“Balanced parallel FP Growth with MapReduce”. In Proc.
YC-ICT, pages 243–246, 2010.
[11] Dawen Xia, Yanhui Zhou, Zhuobo Rong, and Zili Zhang,
“IPFP: an improved parallel FP-Growth Algorithm for
Frequent Itemset Mining,”, 2013.
[12] J. Han, J. Pei, and Y. Yin, “Mining Frequent Patterns
Without Candidate Generation,” in Proceedings of the 2000
ACM SIGMOD International Conference on Management of
Data, ser. SIGMOD ’00. New York, NY, USA: ACM, 2000,
pp. 1–12.
[13] M.-Y. Lin, P.-Y. Lee, and S.-C. Hsueh, “Apriori-based
Frequent Itemset Mining Algorithms on MapReduce,” in
Proceedings of the 6th International Conference on
Ubiquitous Information Management and Communication,
ser. ICUIMC ’12. New York, NY, USA: ACM, 2012, pp.
[14] N. Li, L. Zeng, Q. He, and Z. Shi, “Parallel Implementation
of Apriori Algorithm Based on MapReduce,” in Proceedings
of the 2012 13th ACIS International Conference on Software
Engineering, Artificial Intelligence, Networking and
Parallel/Distributed Computing, ser. SNPD ’12. Washington,
DC, USA: IEEE Computer Society, 2012, pp. 236–241.
[15] O. Yahya, O. Hegazy, and E. Ezat, “An efficient
implementation of Apriori algorithm based on Hadoop
MapReduce model,” International Journal of Reviews in
Computing, vol. 12, pp. 59–67, 12 2012.
[16] Zhigang Zhang, Genlin Ji, and Mengmeng Tang, “MREclat:
an Algorithm for Parallel Mining Frequent Itemsets,” 2013
International Conference on Advanced Cloud and Big Data,
DOI 10.1109/CBD.2013.22
[17] S. Moens, E. Aksehirli, B. Goethals, “Frequent Itemset
Mining for Big Data,” Big Data, 2013 IEEE International
Conference on, vol., no., pp.111,118, 6-9 Oct. 2013 DOI:
[18] M. Riondato, J. A. DeBrabant, R. Fonseca, and E. Upfal.
“PARMA: a parallel randomized algorithm for approximate
association rules mining in MapReduce”. In Proc. CIKM,
pages 85–94. ACM, 2012.
[19] Jinggui Liao, Yuelong Zhao, and Saiqin Long, “MRPrePost - A Parallel algorithm adapted for mining big data,” IEEE
Workshop on Electronics, Computer and Applications, 2014.
[20] Sheela Gole and Bharat Tidke, “Frequent Itemset Mining for
Big Data in social media using ClustBigFIM algorithm,”
International Conference on Pervasive Computing
[21] Apache Hadoop. [Online]. Available:
[22] Apache Mahout. [Online]. Available:
[23] R Core Team. “R: A Language and Environment for
Statistical Computing,” R Foundation for Statistical
Computing, Vienna, Austria, 2012. ISBN 3-900051-07-0.
[24] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. “MOA:
Massive Online Analysis,” Available: , Journal of Machine Learning
Research (JMLR), 2010.
[25] J. Dean and S. Ghemawat, “MapReduce: simplified data
processing on large clusters,” in Communications of the
ACM, p. 107-113, 2008.