International Journal of Engineering Trends and Technology (IJETT) – Volume 36 Number 1 – June 2016, ISSN: 2231-5381

The Survey on Approaches to Efficient Clustering and Classification Analysis of Big Data

Bhagyashri S. Gandhi, Leena A. Deshpande
Department of Computer Engineering, Vishwakarma Institute of Information Technology, Pune, India.

Abstract— Analysts from many fields have shown a keen interest in data mining, the process of inferring useful patterns from huge amounts of data. Classical statistical models, however, are restrictive with respect to how data are stored and managed. Big data is a popular term, widely discussed today, that describes the enormous quantity of data that may exist in any format. These data are so complex and dynamic in nature that typical processing strategies struggle to mine the relevant knowledge from them. The fundamental purpose of this paper is to provide a comprehensive analysis of the techniques involved in mining big data and to acknowledge the challenges associated with big data processing and management.

Keywords— Big data, Data mining, Complex, Dynamic, Ensemble, Classification, Clustering, Distributed environment.

I. INTRODUCTION

The era of big data is here: data volumes grow every year. Scientific and technological progress has driven the size of data upward on a daily basis, with the aim of improving profitable activities. Information retrieval and browsing have brought an entirely new transformation by capturing everything readily available in cyberspace and delivering it to those who need it in useful ways. This accumulates billions of bytes of data per day and produces fresh data at regular intervals.
Since newly arriving data may be structured, unstructured, or even complex and dynamic in nature, existing tools and techniques do not lend themselves to familiar data analysis mechanisms that match user requirements. As business and technology go hand in hand, their increasing interdependence ensures that data will continue to grow at an ever faster rate. Diverse sources provide heterogeneous data, and at a given instant one may want to use all of the available information to obtain an optimal classification solution. Integrating these information sources is difficult because they have different structural formats, and any approach to classification or clustering of such data also raises concerns in the area of privacy. Traditionally, single conventional classification models were applied to big data challenges; these require extensive labelling effort, and frequently only about 20% of instances are labelled and available to train a classifier model, while the remaining 80% are unlabelled but can still be used to form clusters from big data. This paper surveys classification and clustering techniques and classifies unstructured data to enhance the prediction accuracy of data classification. In particular, we present an ensemble classifier system for big data analysis.

The rest of the paper is structured as follows. Section II presents related work. Section III presents the significance of the topic. Section IV presents the experimental analysis and its performance results. Section V concludes the work presented in this paper and draws directions for future work.

II. RELATED WORK

Eugenio Cesario et al. [1] proposed a bagging technique, also known as bootstrap aggregating, a popular ensemble learning mechanism. In this process, a number of bootstrap samples are drawn from the original training data, and base classifiers are trained locally on these samples.
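As an illustration, bootstrap sample generation can be sketched in a few lines of Python. This is a minimal sketch of the general bagging idea, not the code of the surveyed system; the function name `bootstrap_samples` and the fixed seed are our own choices.

```python
import random

def bootstrap_samples(data, n_samples, seed=0):
    """Draw n_samples bootstrap samples (with replacement) from data.

    Each sample has the same size as the original data set, so an
    instance may appear more than once in a sample, or not at all.
    """
    rng = random.Random(seed)
    return [rng.choices(data, k=len(data)) for _ in range(n_samples)]

# Each bootstrap sample would then be used to train one base classifier.
samples = bootstrap_samples(list(range(10)), n_samples=3)
print(len(samples), [len(s) for s in samples])  # 3 [10, 10, 10]
```

Sampling with replacement is what makes duplicate and missing instances possible within a sample, which is the source of the accuracy deterioration noted below.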
A voting function is then applied across these classifiers to evaluate new sample data: a new instance is labelled with the class that receives the most votes from the participating classifiers. Because sampling is done with replacement, redundant instances may appear in the same bootstrap sample while others fail to appear at all, which can cause accuracy to deteriorate.

The ensemble clustering technique proposed by Liping Jing et al. [2] integrates multiple clusterings obtained from a given data set into a single clustering of the same size, attaining results better than any individual clustering. This approach is implemented in a centralized framework, with distributed computing left as future work. Jie Hu et al. [3] proposed the generation and consensus functions of the cluster ensemble technique. In the generation function, a set of partitions of the original data set is composed using a generative mechanism. In the consensus function, a unique result is produced by aggregating all the partitions generated in the generation step. The existing ensemble technique carries a basic disadvantage associated with the consensus function: if all participating classifiers agree on the assigned class label, that label is used for the test instance; otherwise, if any classifier disagrees, the test instance is rejected. In majority voting, by contrast, each participating classifier casts a vote for the class label, and the class with the maximum number of votes is taken as the final prediction of the multiple classifiers; if there is a tie of votes, the test instance is dropped. Clearly the former approach is more restrictive, so higher performance is expected at the expense of an increase in the rejection rate.
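The two consensus schemes just described, unanimous voting and majority voting with tie rejection, can be sketched as follows. The helper names and the use of `None` to mark a rejected instance are illustrative assumptions, not part of any surveyed implementation; the UP/DOWN labels anticipate the Electricity dataset used later.

```python
from collections import Counter

def unanimous_vote(predictions):
    """Label only if every base classifier agrees; otherwise reject (None)."""
    return predictions[0] if len(set(predictions)) == 1 else None

def majority_vote(predictions):
    """Label with the most-voted class; reject (None) on a tie."""
    counts = Counter(predictions).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie of votes: drop the test instance
    return counts[0][0]

print(unanimous_vote(["UP", "UP", "UP"]))    # UP
print(unanimous_vote(["UP", "DOWN", "UP"]))  # None (any disagreement rejects)
print(majority_vote(["UP", "DOWN", "UP"]))   # UP
print(majority_vote(["UP", "DOWN"]))         # None (tie)
```

The sketch makes the trade-off concrete: unanimity rejects any instance with a single dissenting vote, while majority voting rejects only on an exact tie.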
The latter approach does not require total agreement among the classifiers and rejects fewer test instances [4]. All the above approaches employ clustering alone as part of the proposed system, but clusters alone are not adequate for classification because they provide no absolute class label information. Peng Zhang et al. [5] proposed an ensemble of both clusters and classifiers to address this issue. In this system, ensemble learning follows a divide-and-conquer approach: it first segregates the endless data stream into small data chunks and then builds weak base classifiers from those chunks. At the final stage, all the participating classifiers are combined for prediction.

Ritaban Dutta et al. [6] proposed an approach that classifies cattle behaviour patterns such as grazing, ruminating, resting, and walking, using unsupervised clustering to guide a subsequent stage of supervised machine learning, with sensor technology to identify and monitor behavioural changes in cattle. By exploiting this combined supervised and unsupervised learning, an ensemble model offers a number of assets: it accommodates rapidly to new test instances, attains lower variance than an individual model, and is easy to correlate. Since the proposed system is implemented in a centralized structure, it may fall short on CPU utilization and memory management, making it difficult to process voluminous data with the expected accuracy, unlike a distributed computing environment, which allows each classifier to execute concurrently and thus increases the performance of the system.

IntelliHealth is a medical decision support application in which decisions are made by a weighted multi-layer classifier ensemble framework, proposed by Saba Bashir et al. [7].
This work compares the state-of-the-art techniques with the multi-layer classifier ensemble technique on accuracy, sensitivity, and specificity. The framework is evaluated on five heart disease datasets, four breast cancer datasets, two diabetes datasets, two liver disease datasets, and one hepatitis dataset, all obtained from public repositories. Isidoros Perikos et al. [8] proposed recognizing emotions in text, as a part of sentiment analysis, using an ensemble of classifiers. In this approach, the ensemble's classification prediction is based on majority voting over all the participating classifiers, both to recognize the presence of emotion in text and to identify the polarity of the emotions. Worapat Paireekreng et al. [9] proposed an ensemble classification technique for determining the grade of a real estate project, which helps lenders make decisions at the later stages of a loan.

These results motivate using a multi-classifier model in a distributed environment to optimize class labels in big data and to improve accuracy as well as the efficiency of memory management. In the ensemble technique of the multi-classifier model, a huge complex problem can be partitioned into many small problems that are easier to solve. This helps to achieve a consolidated solution and reduces the errors made by individual models. Ensembles are more robust and stable, and they improve classification accuracy over single-model methods. Computer scientists Doug Cutting and Mike Cafarella created Hadoop back in 2005. As an open-source platform that supports processing of large data sets in a distributed computing environment, it helps execute multiple classifiers simultaneously through MapReduce.
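The MapReduce idea, mapping records into <key, value> pairs and then reducing all pairs that share a key, can be illustrated with a minimal single-process Python sketch of the classic word count. A real Hadoop job distributes these phases across a cluster; the function names here are our own, not the Hadoop API.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record, emitting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def reduce_phase(pairs, reducer):
    """Group pairs by key, then reduce each group's values to one result."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count: map each line to (word, 1) pairs, reduce by summing.
lines = ["big data is big", "data mining on big data"]
pairs = map_phase(lines, lambda line: [(w, 1) for w in line.split()])
counts = reduce_phase(pairs, lambda key, values: sum(values))
print(counts["big"], counts["data"])  # 3 3
```

Because each key's group is reduced independently, the reduce step (like the map step) can run concurrently for different keys, which is what lets multiple classifiers or data partitions be processed simultaneously.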
Hadoop is Apache's free implementation of a MapReduce framework; the core idea behind MapReduce is mapping a dataset into a collection of <key, value> pairs and then reducing all pairs that share the same key.

III. SIGNIFICANCE

The term big data has been used globally to cover all sorts of objectives: a tremendous amount of data; communal analytics; data storage, processing, and management abilities; real-time and archival data; complex, dynamic, and unstructured data; and much more. Given these features, there are many significant applications. Big data, combined with data processing mechanisms, can be used in the medical domain to examine and classify data in order to label diseases [7], [10], [11]. Beyond medical analytics, big data is also applicable in the government domain for framing smart cities [12]. Several other applications include: detecting fingerprints and classifying them as legitimate or illegitimate for security purposes [4]; controlling traffic at peak times based on live streaming data about vehicles; using a Bag-of-Features model to identify food contents for diabetic patients [13]; and monitoring remote sensors for weather prediction. It can also be applied where software programs are at high risk of defects: an ensemble of feature selection mechanisms can be used to diagnose and correct fault predictions [14].

IV. IMPLEMENTATION

A. Dataset

The Electricity dataset, downloaded from the UCI repository, is used to test the individual and ensemble classifiers. It is a sample dataset whose class labels are UP and DOWN.
The dataset has five columns: Date, Day, Instances, Price, and Demand.

B. Test beds

The Electricity dataset is implemented and tested over four classifiers: three individual classifiers, namely the NaiveBayes, J48, and BayesNet classifiers, and a fourth, an Ensemble classifier whose base classifiers are NaiveBayes, J48, and BayesNet.

C. Analysis of the dataset on weekends

The dataset is analyzed considering only the 6th and 7th days of each week.

D. Result analysis on weekends

There are many approaches to measuring algorithmic performance at the aggregate level. When the Electricity dataset is tested over the four classifiers, we obtain the accuracies summarized in TABLE I using the recall approach.

TABLE I
Performance Characteristics of Classifiers

Classifier    Accuracy    Approach
NaiveBayes    68.48%      Individual
J48           93.48%      Individual
BayesNet      93.48%      Individual
Ensemble      93.48%      Hybrid

(Figs. 1-4 show the NaiveBayes, J48, BayesNet, and Ensemble classifier outputs for the weekend analysis; Fig. 5 shows the weekend graph.)

E. Analysis of the dataset on price and demand

The dataset is analyzed for a price value of 1700 and a demand value of 7000.

F. Result analysis on price and demand

The accuracy of the NaiveBayes classifier is 57.07%, of the J48 classifier 79.61%, of the BayesNet classifier 81.81%, and of the Ensemble classifier 81.83%. (Figs. 6-9 show the NaiveBayes, J48, BayesNet, and Ensemble classifier outputs for this analysis.) One can also analyze the data for the rest of the weekdays and for a particular time instance.

V. CONCLUSIONS AND FUTURE SCOPE

Big data consists of huge heterogeneous data, ever-changing and complicated in nature, gathered from several origins. These huge repositories are beyond the capabilities of traditional database tools and techniques to recognize, evaluate, and handle effortlessly. Availability and attainment of data are two of the most critical issues today, and conventional data processing systems are inadequate to provide the necessary support. To address these features of big data, a combination of ensemble classification and clustering in a distributed computing environment is proposed. The ensemble technique has the advantage over a single-component classification model of being more stable, robust, and accurate against the many hardships inherent in the data, and it provides maximal sensitivity and F-measure when compared with state-of-the-art techniques. Additional benefits can be obtained by deploying the ensemble model in a distributed environment such as Hadoop with MapReduce, gaining faster results through the simultaneous execution of participating modules and scaling to huge quantities of data.

This paper surveys approaches proposed for dealing with big data, focusing on both the clustering and classification models of the ensemble technique to handle the continual flow of data. The significant characteristic of endless data is its rate of arrival: acceleration in the arrival rate makes it difficult for conventional classification techniques to manage the data and store it efficiently in memory, because of two major issues:

A. Volume

The volume of data is so huge that only a small measure of it can be handled and refined, while the rest of the information is overlooked.

B. Concept drift

Concept drift denotes the variations in data that occur over time. Sudden drift, a further issue of interest, denotes accidental, instant, and inevitable changes in the data that should be handled efficiently. Our future work is therefore to deal with sudden drift in data, so as to process it successfully and determine the class labels for test data.

ACKNOWLEDGMENT

We are especially thankful to Vivek Ghule, who helped us during the writing of this paper.

REFERENCES

[1] Eugenio Cesario, Carlo Mastroianni, Domenico Talia, "Distributed volunteer computing for solving ensemble learning problems," Future Generation Computer Systems, Elsevier, 3 August 2015.
[2] Liping Jing, Kuang Tian, Joshua Z. Huang, "Stratified feature sampling method for ensemble clustering of high dimensional data," Pattern Recognition, Elsevier, 13 May 2015.
[3] Jie Hu, Tianrui Li, Hongjun Wang, Hamido Fujita, "Hierarchical cluster ensemble model based on knowledge granulation," Knowledge-Based Systems, Elsevier, 16 October 2015.
[4] Mikel Galar, Joaquín Derrac, Daniel Peralta, Isaac Triguero, Daniel Paternain, Carlos Lopez-Molina, Salvador García, José M. Benítez, Miguel Pagola, Edurne Barrenechea, Humberto Bustince, Francisco Herrera, "A survey of fingerprint classification part II: experimental analysis and ensemble proposal," Knowledge-Based Systems, Elsevier, 2015.
[5] Peng Zhang, Xingquan Zhu, Jianlong Tan, Li Guo, "Classifier and cluster ensembles for mining concept drifting data streams," IEEE International Conference on Data Mining, 2010.
[6] Ritaban Dutta, Daniel Smith, Richard Rawnsley, Greg Bishop-Hurley, James Hills, Greg Timms, Dave Henry, "Dynamic cattle behavioural classification using supervised ensemble classifiers," Computers and Electronics in Agriculture, Elsevier, 6 December 2014.
[7] Saba Bashir, Usman Qamar, Farhan Hassan Khan, "IntelliHealth: a medical decision support application using a novel weighted multi-layer classifier ensemble framework," Journal of Biomedical Informatics, Elsevier, 15 December 2015.
[8] Isidoros Perikos, Ioannis Hatzilygeroudis, "Recognizing emotions in text using ensemble of classifiers," Engineering Applications of Artificial Intelligence, 51, 191-201, Elsevier, 2016.
[9] Worapat Paireekreng, Worawat Choensawat, "An ensemble learning based model for real estate project classification," 6th International Conference on Applied Human Factors and Ergonomics (AHFE 2015) and the Affiliated Conferences, Elsevier, 2015.
[10] João Cunha, Catarina Silva, Mário Antunes, "Health Twitter big data management with Hadoop framework," CENTERIS / ProjMAN / HCist 2015, October 7-9, 2015, Elsevier, 2015.
[11] Saravana Kumar N, Eswari, Sampath, Lavanya, "Predictive methodology for diabetic data analysis in big data," 2nd International Symposium on Big Data and Cloud Computing (ISBCC'15), Elsevier, 2015.
[12] J. Archenaa, E. A. Mary Anita, "A survey of big data analytics in healthcare and government," 2nd International Symposium on Big Data and Cloud Computing (ISBCC'15), Elsevier, 2015.
[13] Marios M. Anthimopoulos, Lauro Gianola, Luca Scarnato, Peter Diem, Stavroula G. Mougiakakou, "A food recognition system for diabetic patients based on an optimized Bag-of-Features model," IEEE Journal of Biomedical and Health Informatics, July 2014.
[14] Huanjing Wang, Taghi M. Khoshgoftaar, Amri Napolitano, "Software measurement data reduction using ensemble techniques," Neurocomputing, Elsevier, 12 March 2012.
[15] Timothy P. Jurka, Loren Collingwood, Amber E. Boydstun, Emiliano Grossman, Wouter van Atteveldt, "Automatic text classification via supervised learning," 19 February 2015.
[16] The New York Times Annotated Corpus, https://catalog.ldc.upenn.edu/LDC2008T19, last accessed 28 January 2016.
[17] Jing Gao, Wei Fan, Yizhou Sun, Jiawei Han, "Heterogeneous source consensus learning via decision propagation and negotiation," 2009.
[18] Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, "A practical approach to classify evolving data streams: training with limited amount of labeled data," Eighth IEEE International Conference on Data Mining, 2008.
[19] Adil Fahad, Najla Alshatri, Zahir Tari, Abdullah Alamri, Ibrahim Khalil, Albert Y. Zomaya, Sebti Foufou, Abdelaziz Bouras, "A survey of clustering algorithms for big data: taxonomy and empirical analysis," IEEE Transactions on Emerging Topics in Computing, 30 October 2014.
[20] E. Rahm, H. H. Do, "Data cleaning: problems and current approaches," IEEE Data Engineering Bulletin, 23, 2000.
[21] Han Hu, Yonggang Wen, Tat-Seng Chua, Xuelong Li, "Toward scalable systems for big data analytics: a technology tutorial," IEEE Access, July 8, 2014.
[22] Peipei Xia, Li Zhang, Fanzhang Li, "Learning similarity with cosine similarity ensemble," Information Sciences, Elsevier, 20 February 2015.
[23] Biao Qin, Yuni Xia, Shan Wang, Xiaoyong Du, "A novel Bayesian classification for uncertain data," Knowledge-Based Systems, Elsevier, 27 April 2011.
[24] Guodong Zhao, Yan Wu, Fuqiang Chen, Junming Zhang, Jing Bai, "Effective feature selection using feature vector graph for classification," Neurocomputing, Elsevier, 30 September 2014.
[25] Xue-Wen Chen, Xiaotong Lin, "Big data deep learning: challenges and perspectives," IEEE Access, May 28, 2014.