International Journal of Engineering Trends and Technology (IJETT) – Volume 36 Number 1- June 2016
The Survey on Approaches to Efficient
Clustering and Classification Analysis of Big
Data
Bhagyashri S. Gandhi 1, Leena A. Deshpande 2
Department of Computer Engineering,
Vishwakarma Institute of Information Technology, Pune, India.
Abstract— Analysts in many fields have shown keen interest in data mining, the process of inferring useful patterns from huge amounts of data. Classical statistical models, however, are restrictive with regard to data storage and management. Big data is a popular term, frequently discussed today, used to describe the enormous quantity of data that may exist in any format. These data are so complex and dynamic in nature that typical manipulation strategies find it laborious to mine appropriate knowledge from them. The fundamental purpose of this paper is to provide a comprehensive analysis of the various techniques involved in mining big data and to acknowledge the challenges associated with big data processing and management.
Keywords— Big data, Data mining, Complex,
Dynamic, Ensemble, Classification, Clustering,
Distributed environment.
I. INTRODUCTION
The era of big data is here: data grows every year. Scientific and technological revolutions have driven a daily increase in data size, much of it aimed at improving profitable activities. Information retrieval and browsing have brought an entirely progressive transformation by capturing what is readily available in cyberspace and delivering it to those who need it in useful ways. This accumulates billions of bytes of data per day and casts out fresh data at regular intervals. Since newly arriving data may be structured, unstructured, or complex and dynamic in nature, existing tools and techniques do not lend themselves to familiar data analysis mechanisms that are commensurate with user requirements.
As business and technology go hand in hand, their increasing interdependence ensures that data will continue to grow at an ever faster rate. Data arrive from diverse sources, and at a particular instant one may want to use all of the available information to obtain an optimal classification solution. Integrating these information sources is difficult because they have different structural formats, and approaches to either classification or clustering must also address privacy concerns.
Traditionally, single conventional classification models were applied to big data challenges. Such models require extensive labelling effort, yet it is common to have only about 20% of instances labelled for training a classifier model, while the remaining 80% are unlabelled and available only for forming clusters from big data. This paper surveys classification and clustering techniques and classifies unstructured data to enhance the prediction accuracy of data classification.
In this paper, we present an ensemble classifier
system for big data analysis. The rest of the paper is
structured as follows. Section II presents related
work. Section III presents the significance of the
topic. Section IV presents the experimental analysis
and gives its performance results. Section V
concludes the work presented in this paper and
Section VI draws directions for future work.
II. RELATED WORK
Eugenio Cesario, et al. [1] proposed a bagging technique, also known as bootstrap aggregating, a popular ensemble learning mechanism. In this process, several bootstrap samples are drawn from the original training data, and base classifiers are trained locally on these samples. A voting function is then applied across the classifiers to evaluate new sample data: a new sample is labelled with the class that receives the most votes from the participating classifiers. Because sampling is done with replacement, redundant occurrences may appear in the same bootstrap sample while other instances fail to appear at all, causing some deterioration of accuracy.
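The bagging procedure described above can be sketched in a few lines of Python; the one-nearest-neighbour base learner and the tiny UP/DOWN dataset are illustrative assumptions, not details taken from [1]:

```python
import random
from collections import Counter

def nearest_neighbour_label(train, x):
    """Toy base classifier: return the label of the closest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def bagging_predict(data, x, n_bags=25, seed=0):
    """Train one base classifier per bootstrap sample, then majority-vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_bags):
        # Sampling *with replacement*: duplicates may appear in a sample,
        # and some original points may fail to appear at all.
        sample = [rng.choice(data) for _ in data]
        votes.append(nearest_neighbour_label(sample, x))
    return Counter(votes).most_common(1)[0][0]

data = [(1.0, "DOWN"), (1.2, "DOWN"), (2.9, "UP"), (3.1, "UP")]
print(bagging_predict(data, 3.0))  # almost surely "UP" across the 25 bags
```

A single bootstrap sample can miss both "UP" points and vote wrongly; the vote over many bags smooths this out, which is the accuracy argument made above.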
The ensemble clustering technique proposed by Liping Jing, et al. [2] integrates multiple clusterings obtained from a given data set into a single clustering of the same size, attaining results better than those of the individual clusterings. This approach is implemented in a centralized framework, with the use of a distributed computing environment held out as future work.
Jie Hu, et al. [3] proposed a cluster ensemble technique built on a generation function and a consensus function.
ISSN: 2231-5381, http://www.ijettjournal.org
In the generation function, a set of partitions of the original data set is composed using a generative mechanism. In the consensus function, a single data set is produced by aggregating all the partitions generated in the generation step.
Existing ensemble techniques carry basic disadvantages associated with the consensus function. Under unanimous consensus, the test instance is labelled only if all participating classifiers agree on the assigned class label; if any classifier disagrees, the test instance is rejected. Under majority voting, each participating classifier casts a vote for the class label, and the class with the maximum number of votes is treated as the final prediction of the multiple classifiers; if the vote is tied, the test instance is dropped. Clearly the former approach is more restrictive, so higher performance is expected at the expense of increased rejection rates, whereas the latter approach does not require total agreement among the classifiers and rejects fewer test instances [4].
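The two consensus rules can be made concrete with a short sketch; the UP/DOWN labels are only placeholders, and `None` stands for a rejected (dropped) test instance:

```python
from collections import Counter

def unanimous_vote(predictions):
    """Unanimity rule: accept only if every classifier agrees, else reject."""
    return predictions[0] if len(set(predictions)) == 1 else None

def majority_vote(predictions):
    """Majority rule: pick the most-voted label; reject on a tie."""
    counts = Counter(predictions).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie of votes: the test instance is dropped
    return counts[0][0]

votes = ["UP", "UP", "DOWN"]
print(unanimous_vote(votes))  # None: one classifier disagrees, so rejected
print(majority_vote(votes))   # "UP": the majority label wins
```

The sketch makes the trade-off visible: unanimity rejects this instance outright, while majority voting still produces a label.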
All the above approaches employ clustering alone as part of the proposed system. But clusters alone are not adequate for classification, as they deliver no definite class-label knowledge. Peng Zhang, et al. [5] proposed an ensemble of both clusters and classifiers to counter this issue. In this system, ensemble learning operates in a divide-and-conquer fashion: it first segregates the endless data stream into small data chunks and then creates weak base classifiers from those chunks. At the final stage, all the participating classifiers are combined for prediction.
An approach to classifying cattle behaviour patterns such as grazing, ruminating, resting and walking, with unsupervised clustering guiding a subsequent stage of supervised machine learning, was proposed by Ritaban Dutta, et al. [6], using sensor technology to identify and monitor behavioural changes in cattle.
By exploiting the combined approach of supervised and unsupervised learning, an ensemble model offers a number of advantages: it accommodates new test instances rapidly, attains lower deviations than an individual model, and can be easily composed. Since the proposed system is implemented in a centralized structure, however, it may handle CPU utilization and memory management poorly, making it difficult to process voluminous data with the expected accuracy. A distributed computing environment, by contrast, allows each classifier to execute concurrently and thus increases the performance of the system.
IntelliHealth, proposed by Saba Bashir, et al. [7], is a medical decision support application in which decisions are made by a weighted multi-layer classifier ensemble framework. The work compares state-of-the-art techniques against the multi-layer classifier ensemble on accuracy, sensitivity and specificity. The framework is evaluated on five heart disease datasets, four breast cancer datasets, two diabetes datasets, two liver disease datasets and one hepatitis dataset obtained from public repositories.
Recognizing emotions in text as part of sentiment analysis using an ensemble of classifiers was proposed by Isidoros Perikos, et al. [8]. In this approach, the ensemble's classification prediction is based on majority voting over all the participating classifiers, both to recognize the presence of emotion in text and to identify the polarity of the emotions.
Worapat Paireekreng, et al. [9] proposed an ensemble classification technique to address the problem of determining the grade of a real estate project. This approach helps lenders make decisions at the next stage of a loan.
Hence, there is motivation to use a multi-classifier model in a distributed environment to optimize class labels in big data and to improve accuracy as well as the efficiency of memory management. In the ensemble technique of the multi-classifier model, a huge, complex problem can be partitioned into many small problems that are easier to solve. This helps to achieve a consolidated solution and reduces the errors made by individual models. Multi-classifier models are more robust and stable, and improve classification accuracy over a single-model method.
Computer scientists Doug Cutting and Mike Cafarella created Hadoop back in 2005. As an open-source platform that supports the processing of large data sets in a distributed computing environment, it makes it possible to execute multiple classifiers simultaneously through MapReduce. Hadoop is Apache's free implementation of the MapReduce framework; the core idea behind MapReduce is to map a dataset into a collection of <key, value> pairs and then to reduce all pairs that share the same key.
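A minimal single-process sketch of the map/shuffle/reduce idea follows, using a word count (the canonical MapReduce example); real Hadoop distributes each of these phases across cluster nodes:

```python
from collections import defaultdict

def map_phase(records):
    """Map step: emit a <key, value> pair (word, 1) for every word."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle step: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: collapse each key's value list to one value (a sum)."""
    return {key: sum(values) for key, values in groups.items()}

records = ["up down up", "down down"]
print(reduce_phase(shuffle(map_phase(records))))  # {'up': 2, 'down': 3}
```

The same three-phase shape carries over to running classifiers in parallel: each mapper evaluates a classifier on its data chunk and the reducer aggregates the per-chunk predictions.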
III. SIGNIFICANCE
Big data has been used globally to carry all sorts of objectives: a tremendous amount of data; communal analytics; data storage, processing and management abilities; real-time and archival data; complex, dynamic and unstructured data; and much more. Given these features of big data, there are enormously significant applications to deal with. Big data in combination with data processing mechanisms can be utilized in the medical domain to examine data and label disease [7], [10], [11]. Beyond the need for big data analytics in the medical field, it is also suitable in the government domain for framing smart cities [12].
Several other applications include: classifying fingerprints as legitimate or illegitimate for security purposes [4]; controlling traffic at peak times based on live streaming data about vehicles; using a Bag-of-Features model to identify food contents for diabetic patients [13]; and monitoring remote sensors for weather prediction. Big data analytics can also be applied where software programs are at high risk of defects: an ensemble feature selection mechanism can be used to diagnose and correct fault predictions [14].
IV. IMPLEMENTATION
A. Dataset
The Electricity dataset, downloaded from the UCI repository, is used to test the individual and ensemble classifiers. It is a sample dataset whose class labels are UP and DOWN, and it contains five columns: Date, Day, Instances, Price and Demand.
B. Four classifier test beds
The Electricity dataset is implemented and tested over four classifiers: three individual classifiers, namely the NaiveBayes, J48 and BayesNet classifiers, and a fourth, an Ensemble classifier with NaiveBayes, J48 and BayesNet as its base classifiers.
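In miniature, this test bed looks like the following sketch. The three toy rules below merely stand in for NaiveBayes, J48 and BayesNet (which a Weka-based setup would supply as real implementations), and the thresholds are invented for illustration:

```python
from collections import Counter

# Three stand-in base classifiers predicting UP/DOWN from (price, demand).
def price_rule(x):       # decision-stump stand-in for J48
    return "UP" if x[0] > 1700 else "DOWN"

def demand_rule(x):      # stand-in for BayesNet
    return "UP" if x[1] > 7000 else "DOWN"

def combined_rule(x):    # stand-in for NaiveBayes
    return "UP" if x[0] + x[1] > 8700 else "DOWN"

BASE_CLASSIFIERS = [price_rule, demand_rule, combined_rule]

def ensemble_predict(x):
    """Majority vote over the three base classifiers."""
    votes = Counter(clf(x) for clf in BASE_CLASSIFIERS)
    return votes.most_common(1)[0][0]

print(ensemble_predict((1800, 7200)))  # all three base rules vote "UP"
print(ensemble_predict((1600, 7200)))  # two of three outvote the price rule: "UP"
```

With three voters there can be no tie on a two-class problem, which is one practical reason for choosing an odd number of base classifiers.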
C. Analysis of dataset on weekend
The dataset is analyzed considering only the 6th and 7th days of each week.
Fig. 1 NaiveBayes classifier output
D. Result analysis on weekend
There are many approaches to calculating algorithmic performance at the aggregate level. When the Electricity dataset is tested over the four classifiers, we obtain the accuracies summarized in TABLE I using the "recall" approach.
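Assuming the "recall" measure here means true positives over all actual positives of a class, it can be computed as follows; the label sequences are invented purely for illustration:

```python
def recall(y_true, y_pred, positive="UP"):
    """Recall = true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn)

y_true = ["UP", "UP", "DOWN", "UP", "DOWN"]
y_pred = ["UP", "DOWN", "DOWN", "UP", "UP"]
print(recall(y_true, y_pred))  # 2 of the 3 actual UP instances recovered
```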
TABLE I
PERFORMANCE CHARACTERISTICS OF CLASSIFIERS

Classifier   | Accuracy | Classifier approach
NaiveBayes   | 68.48%   | Individual
J48          | 93.48%   | Individual
BayesNet     | 93.48%   | Individual
Ensemble     | 93.48%   | Hybrid
Fig. 2 J48 classifier output
Fig. 4 Ensemble classifier output
Fig. 5 Weekend graph
E. Analysis of dataset on price and demand
The dataset is analyzed for a price value of 1700 and a demand value of 7000.
F. Result analysis on price and demand
The accuracy of the NaiveBayes classifier is 57.07%, that of the J48 classifier 79.61%, that of the BayesNet classifier 81.81%, and that of the Ensemble classifier 81.83%.
Fig. 3 BayesNet classifier output
Fig. 6 NaiveBayes classifier output
Fig. 7 J48 classifier output
Fig. 8 BayesNet classifier output
Fig. 9 Ensemble classifier output
One can also analyze the data for the remaining days of the week and for a particular time instance.
V. CONCLUSIONS AND FUTURE SCOPE
This survey has examined the practice of big data, which consists of huge, heterogeneous data that are ever changing and complicated in nature and gathered across several origins. These huge storehouses are beyond the capability of traditional database tools and techniques to recognize, evaluate and handle effortlessly. Availability and attainment of data are two of the most critical issues today, and they need to be addressed because conventional data processing systems are inadequate to provide the necessary support. To overcome these features of big data, an ensemble combining classification and clustering in a distributed computing environment is proposed.
The ensemble technique has the advantage over a single-component classification model of being more stable, robust and accurate against the many hardships inherent in the nature of the data. It provides maximal sensitivity and F-measure when compared with state-of-the-art techniques. Additional benefits can be obtained by enforcing the ensemble model in a distributed environment such as Hadoop with MapReduce, gaining faster results through the simultaneous execution of participating modules and scaling to huge quantities of data. This paper has detailed the approaches proposed for dealing with big data.
This paper focuses on both the clustering and the classification models of the ensemble technique to deal with the continual flow of data. The significant characteristic of endless data is its rate of arrival: acceleration in the arrival rate makes it difficult for conventional classification techniques to manage the data and store it efficiently in memory, because of two major issues:
A. Volume
The volume of data is so huge that only a small measure of it can be handled and refined while the rest of the information is overlooked.
B. Concept drift
Concept drift is the term for variations in the data that occur over time. Sudden drift, a further issue of interest, denotes accidental, instant and inevitable changes in the data that must be handled efficiently.
Thus, our future work is to deal with sudden drift in data and thereby process such data successfully to analyze the class labels of test data.
ACKNOWLEDGMENT
We are especially thankful to Vivek Ghule, who helped us during the writing of this paper.
REFERENCES
[1] Eugenio Cesario, Carlo Mastroianni, Domenico Talia, "Distributed volunteer computing for solving ensemble learning problems," Future Generation Computer Systems, Elsevier, 3 August 2015.
[2] Liping Jing, Kuang Tian, Joshua Z. Huang, "Stratified feature sampling method for ensemble clustering of high dimensional data," Pattern Recognition, Elsevier, 13 May 2015.
[3] Jie Hu, Tianrui Li, Hongjun Wang, Hamido Fujita, "Hierarchical cluster ensemble model based on knowledge granulation," Knowledge-Based Systems, Elsevier, 16 October 2015.
[4] Mikel Galar, Joaquín Derrac, Daniel Peralta, Isaac Triguero, Daniel Paternain, Carlos Lopez-Molina, Salvador García, José M. Benítez, Miguel Pagola, Edurne Barrenechea, Humberto Bustince, Francisco Herrera, "A survey of fingerprint classification part II: experimental analysis and ensemble proposal," Knowledge-Based Systems, Elsevier, 2015.
[5] Peng Zhang, Xingquan Zhu, Jianlong Tan, Li Guo, "Classifier and cluster ensembles for mining concept drifting data streams," IEEE International Conference on Data Mining, 2010.
[6] Ritaban Dutta, Daniel Smith, Richard Rawnsley, Greg Bishop-Hurley, James Hills, Greg Timms, Dave Henry, "Dynamic cattle behavioural classification using supervised ensemble classifiers," Computers and Electronics in Agriculture, Elsevier, 6 December 2014.
[7] Saba Bashir, Usman Qamar, Farhan Hassan Khan, "IntelliHealth: a medical decision support application using a novel weighted multi-layer classifier ensemble framework," Journal of Biomedical Informatics, Elsevier, 15 December 2015.
[8] Isidoros Perikos, Ioannis Hatzilygeroudis, "Recognizing emotions in text using ensemble of classifiers," Engineering Applications of Artificial Intelligence, 51, 191-201, Elsevier, 2016.
[9] Worapat Paireekreng, Worawat Choensawat, "An ensemble learning based model for real estate project classification," 6th International Conference on Applied Human Factors and Ergonomics (AHFE 2015) and the Affiliated Conferences, Elsevier, 2015.
[10] João Cunha, Catarina Silva, Mário Antunes, "Health Twitter big data management with Hadoop framework," Conference on ENTERprise Information Systems / International Conference on Project MANagement / Conference on Health and Social Care Information Systems and Technologies, CENTERIS / ProjMAN / HCist 2015, October 7-9, 2015, Elsevier, 2015.
[11] Saravana Kumar N, Eswari, Sampath and Lavanya, "Predictive methodology for diabetic data analysis in big data," 2nd International Symposium on Big Data and Cloud Computing (ISBCC'15), Elsevier, 2015.
[12] J. Archenaa and E. A. Mary Anita, "A survey of big data analytics in healthcare and government," 2nd International Symposium on Big Data and Cloud Computing (ISBCC'15), Elsevier, 2015.
[13] Marios M. Anthimopoulos, Lauro Gianola, Luca Scarnato, Peter Diem, and Stavroula G. Mougiakakou, "A food recognition system for diabetic patients based on an optimized Bag-of-Features model," IEEE Journal of Biomedical and Health Informatics, July 2014.
[14] Huanjing Wang, Taghi M. Khoshgoftaar, Amri Napolitano, "Software measurement data reduction using ensemble techniques," Neurocomputing, Elsevier, 12 March 2012.
[15] Timothy P. Jurka, Loren Collingwood, Amber E. Boydstun, Emiliano Grossman, Wouter van Atteveldt, "Automatic text classification via supervised learning," 19 February 2015.
[16] The New York Times Annotated Corpus, https://catalog.ldc.upenn.edu/LDC2008T19, last accessed 28 January 2016.
[17] Jing Gao, Wei Fan, Yizhou Sun, and Jiawei Han, "Heterogeneous source consensus learning via decision propagation and negotiation," 2009.
[18] Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, "A practical approach to classify evolving data streams: training with limited amount of labeled data," Eighth IEEE International Conference on Data Mining, 2008.
[19] Adil Fahad, Najla Alshatri, Zahir Tari, Abdullah Alamri, Ibrahim Khalil, Albert Y. Zomaya, Sebti Foufou, and Abdelaziz Bouras, "A survey of clustering algorithms for big data: taxonomy and empirical analysis," IEEE Transactions on Emerging Topics in Computing, 30 October 2014.
[20] E. Rahm and H. H. Do, "Data cleaning: problems and current approaches," IEEE Data Engineering Bulletin, 23, 2000.
[21] Han Hu, Yonggang Wen, Tat-Seng Chua and Xuelong Li, "Toward scalable systems for big data analytics: a technology tutorial," IEEE Access, July 8, 2014.
[22] Peipei Xia, Li Zhang, Fanzhang Li, "Learning similarity with cosine similarity ensemble," Information Sciences, Elsevier, 20 February 2015.
[23] Biao Qin, Yuni Xia, Shan Wang, Xiaoyong Du, "A novel Bayesian classification for uncertain data," Knowledge-Based Systems, Elsevier, 27 April 2011.
[24] Guodong Zhao, Yan Wu, Fuqiang Chen, Junming Zhang, Jing Bai, "Effective feature selection using feature vector graph for classification," Neurocomputing, Elsevier, 30 September 2014.
[25] Xue-Wen Chen and Xiaotong Lin, "Big data deep learning: challenges and perspectives," IEEE Access, May 28, 2014.