International Journal of Engineering Trends and Technology (IJETT) – Volume 34 Number 5 - April 2016

Feature Selection using Machine Learning for Event Discovery in Social Media

Dagmawet Tilahun Alemayhu 1, Prof. Shilpa Gite 2
Department of Computer Science / Information Technology, Symbiosis Institute of Technology, Pune, India.

Abstract - This paper introduces the concept of feature selection, its general procedure, evaluation criteria, and characteristics. Feature selection is an efficient technique for dimensionality reduction. Hybrid approaches are typically an effective compromise that combines the strengths of the filter and wrapper approaches while limiting the influence of their drawbacks. In this paper we propose a distinctive hybrid filter-wrapper approach that exploits the speed of the filter method followed by the accuracy of the wrapper. The paper analyzes feature selection for choosing essential features, using a machine learning algorithm for event discovery in social media with a filter stage; a wrapper stage then improves the efficiency of the hybrid method using a map-reduce algorithm.

Keywords: Machine learning, Feature selection algorithm, Filter, Wrapper, Map-reduce algorithm.

1. INTRODUCTION

The high dimensionality of today's real-world data poses a significant problem for traditional classifiers. Feature selection is therefore a standard pre-processing step in many data analysis algorithms. It prepares data for mining and machine learning, which aims to transform data into business intelligence or knowledge. Performing feature selection may have various motivations. In machine learning and statistics, feature selection, also referred to as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are mainly used for three reasons: a.
simplification of models, making them easier to interpret by researchers/users; b. shorter training time; and c. enhanced generalization by reducing overfitting (formally, reduction of variance). The central premise of using a feature selection technique is that the data contains many features that are either redundant or irrelevant and can thus be removed. Redundant and irrelevant features are two distinct notions, since one relevant feature may be redundant in the presence of another relevant feature with which it is strongly correlated.

ISSN: 2231-5381

Feature selection techniques should be distinguished from feature extraction. Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features [8]. Feature selection techniques are often used in domains where there are many features and comparatively few samples (or data points). Prototypical cases for the application of feature selection include the analysis of written texts and DNA microarray data, where there are many thousands of features and a few tens to hundreds of samples. A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets and an evaluation measure that scores the various subsets. The simplest algorithm is to test every possible subset of features and find the one that minimizes the error rate. This is an exhaustive search of the space and is computationally intractable for all but the smallest feature sets. The choice of evaluation metric heavily influences the algorithm, and it is these evaluation metrics that distinguish the three main categories of feature selection algorithms: wrapper, filter and hybrid approaches. i. The wrapper method uses a predictive model to score feature subsets. Each new subset is used to train a model, which is then tested on a hold-out set [16].
Counting the number of mistakes made on that hold-out set (the error rate of the model) gives the score for that subset [11]. Because wrapper methods train a new model for each subset, they are very computationally intensive, but they usually provide the best-performing feature set for that particular type of model [10][12]. ii. The filter method uses a proxy measure instead of the error rate to score a feature subset. This measure is chosen to be fast to compute while still capturing the usefulness of the feature set. Common measures include mutual information, pointwise mutual information, the Pearson product-moment correlation coefficient, inter/intra-class distance, and the scores of significance tests for each class/feature combination. Filters are usually less computationally intensive than wrappers, but they produce a feature set that is not tuned to a specific type of predictive model. This lack of tuning means a feature set from a filter is more general than one from a wrapper, usually giving lower prediction performance than a wrapper. However, the feature set does not embody the assumptions of a prediction model, and so it is more useful for exposing the relationships between the features. Many filters provide a feature ranking rather than an explicit best feature subset, and the cut-off point in the ranking is chosen via cross-validation [19]. Filter methods have also been used as a pre-processing step for wrapper methods, allowing a wrapper to be applied to larger problems [10]. In classical statistics, the most popular form of feature selection is stepwise regression, which is a wrapper technique.
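As a hedged illustration of a filter measure, the following sketch ranks features by their absolute Pearson product-moment correlation with the class label, one of the proxy measures listed above. The toy data and function names are invented for illustration; this is not the paper's implementation.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation coefficient of two sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def filter_rank(X, y):
    """Rank feature indices by |correlation with the class|, descending.
    This is a filter: no classifier is trained, only a fast proxy measure."""
    n_feats = len(X[0])
    scores = [abs(pearson([row[f] for row in X], y)) for f in range(n_feats)]
    return sorted(range(n_feats), key=lambda f: scores[f], reverse=True)

# Toy data: feature 0 tracks the label, feature 1 is noise.
X = [[1.0, 3.2], [0.9, 0.1], [0.1, 2.9], [0.2, 0.2]]
y = [1, 1, 0, 0]
print(filter_rank(X, y))  # feature 0 is ranked first
```

A wrapper would instead train a model on each candidate subset; the filter only computes the proxy score, which is why it is so much cheaper.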
It is a greedy algorithm that adds the best feature (or deletes the worst feature) at each round. The main control issue is deciding when to stop the algorithm [5]. In machine learning, this is typically done by cross-validation; in statistics, some criterion is optimized, which leads to the inherent problem of nesting. More robust methods have been explored, such as branch and bound and piecewise linear networks. Filter, wrapper and hybrid feature selection methods are well known and many have been proposed [12][13][16], several of which follow the principle of merging methods [1][9][13]. Existing systems, however, have not taken into account (i) the computational cost and (ii) which features are actually selected.

Problems in Social Media Analysis

Social media are computer-mediated tools that enable people or companies to create, share, or exchange information, career interests, ideas, and pictures/videos in virtual communities and networks. The variety of stand-alone and built-in social media services currently available introduces challenges of definition, but there are some common features: (1) social media are Web 2.0 internet-based applications; (2) user-generated content (UGC) is the lifeblood of the social media organism; (3) users create service-specific profiles for the site or app that are designed and maintained by the social media organization; and (4) social media facilitate the development of online social networks by connecting a user's profile with those of other individuals and/or groups. Social media depend on mobile and web-based technologies to create highly interactive platforms through which individuals and communities share, co-create, discuss, and modify user-generated content. In this work our focus is to improve the existing feature selection algorithms, i.e.
filter, wrapper and hybrid, and to propose a new approach for feature selection [17][18]. The remainder of this paper is organized as follows: Section II discusses related work conducted to date; Section III explains the proposed methodology; Section IV presents the results and discussion; Section V concludes and outlines future work based on the results achieved.

2. RELATED WORK

Our work is closely related to the study of feature selection methods, and we have reviewed the important related works in this area. The Mutual Information Genetic Algorithm (MI-GA) considers two feature selection methods together and studies a combined feature selection algorithm that follows mutual-information filtering, streamlined population initialization and individual fitness calculation. The MI-GA algorithm fails on our criteria because of its high complexity and long execution time. In contrast, we consider selecting relevant features for the event discovery problem on social media, where the selector is only allowed to access a small, fixed number of features; this is a significant and challenging problem in most of the studies. Researchers have made many efforts to develop efficient tools for performing the various pre-processing tasks in social media, and at present most social media based event discovery relies on some classification model and ends with poor performance [2][16]. Multi-label text classification deals with problems in which every document is associated with a set of categories. Such documents typically contain a large number of words, which can hinder the performance of learning algorithms. Feature selection is a common task for choosing representative words and removing unimportant ones, which can speed up learning and even improve learning performance. That work evaluates eight feature selection algorithms on text benchmark datasets [2]. Data pre-processing is a crucial and demanding step in the mining process and has a large impact on the success of data mining, e.g. for soil classification.
Data pre-processing is an initial step of the knowledge discovery in databases (KDD) process that reduces the complexity of the data and enables better analysis; with ANN (artificial neural network) training supported by the data collected from the field and from the soil testing laboratory, data analysis is performed more accurately and efficiently [11]. Data pre-processing is a difficult and tedious task because it involves exhaustive manual effort and time in developing the data operation scripts [3]. The multi-dimensional classification problem is a generalization of the recently popularized task of multi-label classification, where each data instance is associated with multiple class variables. There has been relatively little research carried out specific to multi-dimensional classification and, though one of the core goals is similar (modeling dependencies among classes), there are important differences, notably a much larger number of possible classifications. That paper proposes a method for multi-dimensional classification, drawing on the most relevant multi-label research and combining it with novel methods: using a fast method to model the conditional dependence between class variables, the authors form super-class partitions and use them to build multi-dimensional learners, learning each super-class as an ordinary class and then explicitly modeling class dependencies [4]. An analysis needs to collect data from various sources and analyze those data with appropriate techniques for prediction or decision making. After collecting the data, the main task is to maintain it and apply transformation and pre-processing to the large data sets, for which processing tools are needed.
The data mining tools currently in use are available either as open-source or commercial software [10]. Storage technology and data collection have made it possible for any organisation to assemble large volumes of data at low cost. In order to obtain useful and convenient information, it is necessary to utilize the stored data, and this overall process leads to the idea of data mining. Today, data mining is a new and important area that plays a role in various fields such as business, education, finance, the health sector, etc. The main motive of data mining is to examine the data from different perspectives, then label and summarize it so as to derive useful information by exploiting various new techniques and tools. Many data mining tools are available today for researchers to evaluate their data. One study overviews different data mining tools, including WEKA, RapidMiner and KNIME, presents the pros and cons of each tool, and compares their parameters; this comparative study makes it easy for researchers to make the best choice of tool [11]. Existing feature selection methods are generally grouped into three categories. The filter approach [1][6][12] has become crucial in many classification settings, especially object recognition, recently faced with feature learning strategies that generate thousands of cues. Filter methods analyze intrinsic properties of the data, ignoring the classifier. Most of these methods can perform two operations, ranking and subset selection: in the former, the importance of each individual feature is evaluated, usually by neglecting potential interactions among the elements of the joint set; in the latter, the final subset of features to be selected is provided.
In some cases, these two operations are performed sequentially (first the ranking, then the selection); in other cases, only the selection is carried out. Filter methods suppress the least interesting variables. These methods are particularly efficient in computation time and robust to overfitting [6][7]. The wrapper approach [1][9] evaluates subsets of variables, which allows it, unlike filter approaches, to detect possible interactions between variables. The two main disadvantages of these methods are the increased overfitting risk when the number of observations is insufficient and the significant computation time when the number of variables is large. The hybrid approach [1][9][13] is a combination of the filter and wrapper approaches. Researchers continue to work on feature selection methods in order to improve performance and reduce the number of features so as to obtain more accurate mining results. Building on this line of work, an improved algorithm using map-reduce is proposed here in order to increase performance and accuracy while reducing the number of features and the execution time. The novelty of our work lies in a hybrid approach that works on social media data with better performance and less execution time.

3. PROPOSED METHODOLOGY

Fig. 1 Framework of the proposed feature selection method

Fig. 1 above shows the basic framework for selecting features for event discovery in social media. For the purpose of our study we used Twitter and microblog datasets (http://www.kddcup2012.org/c/kddcup2012track1, http://snap.stanford.edu). This social media data is first pre-processed to remove noise and missing values, then processed with a machine learning feature selection algorithm to obtain the relevant features.
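The pre-processing step described above (removing noise and missing values from tweets) can be sketched as follows. The regular expressions, record layout and example text are illustrative assumptions, not the paper's actual pipeline.

```python
import re

def clean_tweet(text):
    """Remove common noise from a tweet: URLs, @mentions and
    punctuation/symbols, then normalize whitespace and case."""
    text = re.sub(r"http\S+", " ", text)          # strip URLs
    text = re.sub(r"@\w+", " ", text)             # strip @mentions
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # strip symbols
    return re.sub(r"\s+", " ", text).strip().lower()

def preprocess(records):
    """Drop records with a missing text field and clean the rest."""
    cleaned = []
    for rec in records:
        text = rec.get("text")
        if not text:          # skip missing values
            continue
        cleaned.append(clean_tweet(text))
    return cleaned

tweets = [{"text": "Big #event at 5pm! http://t.co/xyz @user"},
          {"text": None}]
print(preprocess(tweets))  # -> ['big event at 5pm']
```

After this step the cleaned texts would be tokenized into candidate features (terms) and handed to the feature selection stage.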
Nowadays, mining high-utility itemsets (HUIs) from large datasets is becoming an important task in data mining, in which itemsets with high utilities are discovered. However, existing methods present a huge number of HUIs to end users, which results in inefficient utility-mining performance.

Algorithm for the reduce step:
1: procedure REDUCE(t, [(k1, f1), (k2, f2), ...])
2:   INITIALIZE.LIST(P)
3:   for all (k, f) ∈ [(k1, f1), (k2, f2), ...] do
4:     APPEND(P, (k, f))
5:   SORT(P)
6:   EMIT(t, P)

Map reduction: MapReduce maps over the whole data file we want to use, then reduces it by splitting and sorting the mapped data to find the exact items being searched for; the sorted data is finally reduced to produce the features. Applying MapReduce concepts to our improved algorithm therefore helps to improve both performance and computation. Our contribution uses the map-reduce framework to find HUIs from large datasets faster than existing and recent methods [20][21].

Filter methods: The filter method provides a ranking of features rather than an unambiguous best feature subset. Features are selected using general characteristics of the training data, e.g. the distance between classes or statistical dependencies; this generalizes better but selects a larger number of features. The method is fast to compute and less computationally intensive than the wrapper method, but also gives lower prediction performance [1]. Many such algorithms have been proposed, the best known being Relief, correlation-based feature selection, the fast correlation-based filter, INTERACT and chi-square [6][9]. Our hybrid approach, however, gives more accurate results.

Fig. 2 Map reduction algorithm

To overcome these problems, we present a hybrid framework for HUIs with the goals of achieving high efficiency for the mining task and providing a concise mining result to users, using parallel data computing technology to process large datasets fast.
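The map and reduce procedures shown in the pseudocode can be sketched in plain Python as a single-machine illustration (this is a term-counting sketch of the map/shuffle/reduce contract, not the distributed implementation used in the paper):

```python
from collections import defaultdict

def map_phase(documents):
    """MAP: for each document, count term occurrences in an
    associative array H and emit (term, count) pairs."""
    pairs = []
    for doc in documents:
        H = defaultdict(int)
        for t in doc.split():
            H[t] += 1                     # H{t} <- H{t} + 1
        for t, count in H.items():
            pairs.append((t, count))      # EMIT(t, H{t})
    return pairs

def reduce_phase(pairs):
    """SHUFFLE + REDUCE: group counts by term, then for each term
    append the grouped values to a list P, sort it, and emit (t, P)."""
    groups = defaultdict(list)
    for t, count in pairs:
        groups[t].append(count)           # APPEND(P, (k, f))
    return {t: sorted(P) for t, P in groups.items()}  # SORT(P); EMIT(t, P)

docs = ["event in pune", "event discovery in social media"]
print(reduce_phase(map_phase(docs)))
```

In a real MapReduce framework the shuffle and sort are performed by the runtime between the two phases; here they are folded into `reduce_phase` for clarity.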
The proposed hybrid framework mines closed high-utility itemsets, which serve as a compact and lossless representation of HUIs.

Algorithm for the map step:
1: procedure MAP(k, d)
2:   INITIALIZE.ASSOCIATIVEARRAY(H)
3:   for all t ∈ d do
4:     H{t} ← H{t} + 1
5:   for all t ∈ H do
6:     EMIT(t, H{t})

Wrapper methods: In this method, a search algorithm is used to find the subset of features that maximizes classification performance. This model gives a better-performing feature set but is computationally expensive; some of the algorithms used are sequential selection, heuristics and genetic algorithms [6][9].

Hybrid methods: Our proposed hybrid method has the advantages of both the filter and wrapper models of feature selection [15][18]. The MI-GA algorithm uses a filter model that ranks features by the mutual information between each feature and the class and chooses the most relevant ones, after which a wrapper algorithm explores the search space and optimizes the feature subset; however, it has high computational time, so our work improves on this problem by using map reduction on social media data [13].

4. RESULTS AND DISCUSSION

The analysis is based on calculating the accuracy of feature annotation and comparing the existing method with our proposed method. In Fig. 3, the performance of the existing system is shown as the blue line and the performance of the proposed system as the red line; we can clearly see that the performance of the proposed system is greater than that of the existing system. Here performance is measured in terms of accuracy, precision and recall values. Selecting more relevant features increases the accuracy, and does so in less computational time. In event discovery on a social media dataset, feature selection is very important; performing feature selection before discovering events is necessary for social media event discovery and other research topics.
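The hybrid filter-then-wrapper scheme described above can be sketched as follows: a fast filter score prunes the feature set, then a greedy wrapper refines it by actually evaluating a classifier. To keep the sketch self-contained, the filter stage uses the difference of class means as a stand-in for mutual information, and the wrapper stage uses leave-one-out accuracy of a 1-nearest-neighbour classifier; both choices, the toy data, and all names are illustrative assumptions, not the paper's implementation.

```python
import math

def one_nn_accuracy(X, y, feats):
    """Leave-one-out accuracy of a 1-NN classifier restricted to the
    given feature indices (the wrapper's evaluation measure)."""
    correct = 0
    for i in range(len(X)):
        best_d, best_j = math.inf, -1
        for j in range(len(X)):
            if i == j:
                continue
            d = sum((X[i][f] - X[j][f]) ** 2 for f in feats)
            if d < best_d:
                best_d, best_j = d, j
        correct += (y[best_j] == y[i])
    return correct / len(X)

def hybrid_select(X, y, k_filter=3, k_final=2):
    n_feats = len(X[0])
    # Filter stage: rank features by a fast proxy score
    # (absolute difference of class means; a stand-in for mutual information).
    def score(f):
        a = [x[f] for x, c in zip(X, y) if c == 0]
        b = [x[f] for x, c in zip(X, y) if c == 1]
        return abs(sum(a) / len(a) - sum(b) / len(b))
    ranked = sorted(range(n_feats), key=score, reverse=True)[:k_filter]
    # Wrapper stage: greedy forward selection over the surviving features.
    selected = []
    while len(selected) < k_final:
        best_f = max((f for f in ranked if f not in selected),
                     key=lambda f: one_nn_accuracy(X, y, selected + [f]))
        selected.append(best_f)
    return selected

# Toy data: feature 0 separates the classes, features 1-3 are noise.
X = [[0.1, 5, 9, 2], [0.2, 7, 1, 8], [0.9, 6, 2, 7], [1.0, 5, 8, 3]]
y = [0, 0, 1, 1]
print(hybrid_select(X, y))
```

The filter stage keeps the wrapper's search space small, which is exactly the cost/accuracy trade-off the hybrid approach aims at: the expensive model evaluations are only spent on features that already passed the cheap proxy test.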
Fig. 3 Performance in terms of accuracy

Fig. 4 Time required for the process

The required time of the existing system is shown as the blue line and that of the proposed system as the red line; Fig. 4 indicates that the required time of the proposed system is lower than that of the existing system. Hence we can state that our approach outperforms the existing approach both in accuracy and in the time required to find relevant features.

5. CONCLUSION AND FUTURE WORK

Careful selection of features may change the analysis results drastically, so feature selection is very important for event discovery in social media. In this paper we have proposed an improved hybrid method for the feature selection task in machine learning. We conducted an analysis of different feature selection methods using social media (microblog, Twitter) based event discovery, and compared the existing filter, wrapper and hybrid approaches with our improved hybrid method. The experimental results on a real social media dataset showed improvements in both performance and execution time. In future work we intend to devise a new hybrid algorithm using different search strategies and to consider event discovery on additional social media platforms.

ACKNOWLEDGEMENT

First and foremost, I must acknowledge and thank the Almighty God for blessing, protecting and guiding me throughout this period. I express my sincere thanks to Prof. Shilpa Gite, Department of Computer Science / Information Technology, Symbiosis Institute of Technology, for her valuable guidance, support and motivation during the entire period of this work.

REFERENCES

[1] Hanen Mhamdi, Faouzi Mhamdi, "Feature Selection Methods for Biological Knowledge Discovery", IEEE, 2014.
[2] Newton Spolaor, Grigorios Tsoumakas, "Evaluating Feature Selection Methods for Multi-Label Text Classification", Laboratory of Computational Intelligence, Institute of Mathematics and Computer Science, University of Sao Paulo, Sao Carlos, Brazil, 2013.
[3] S. S. Baskar, L. Arockiam, S. Charles, "A Systematic Approach on Data Pre-processing in Data Mining", COMPUSOFT, An International Journal of Advanced Computer Technology, 2(11), November 2013 (Volume II, Issue XI).
[4] Jesse Read, Concha Bielza, Pedro Larrañaga, "Multi-Dimensional Classification with Super-Classes", 2014.
[5] Guo-Ping Liu, Jian-Jun Yan, Yi-Qin Wang, Jing-Jing Fu, Zhao-Xia Xu, Rui Guo, Peng Qian.
[6] Afef Ben Brahim, Mohamed Limam, "Robust Ensemble Feature Selection for High Dimensional Data Sets", IEEE, 2013.
[7] Taghi M. Khoshgoftaar, Alireza Fazelpour, Huanjing Wang, "A Survey of Stability Analysis of Feature Subset Selection", IEEE, 2013.
[8] Rashmi Dubey, Jiayu Zhou, Yalin Wang, Paul M. Thompson, Jieping Ye, for the Alzheimer's Disease Neuroimaging Initiative, "Analysis of Sampling Techniques for Imbalanced Data", 2013.
[9] Rattanawadee Panthong, Anongnart Srivihok, "Wrapper Feature Subset Selection for Dimension Reduction Based on Ensemble Learning Algorithm", 2015.
[10] J. Han and M. Kamber, Data Mining: Concepts and Techniques, second edition, 2006.
[11] Neha Chauhan, Nisha Gautam, "Parametric Comparison of Data Mining Tools", International Journal of Advanced Technology in Engineering and Science, Vol. 3, Issue 11, November 2015.
[12] Kajal Naidu, Aparna Dhenge, Kapil Wankhade, "Feature Selection Algorithm for Improving the Performance of Classification: A Survey", IEEE, 2014.
[13] Pan-shi Tang, Xiao-long Tang, Zhong-yu Tao, Jian-ping Li, "Research on Feature Selection Algorithm Based on Mutual Information and Genetic Algorithm", IEEE, 2014.
[14] M. Vijayakamal, Mulugu Narendhar, "A Novel Approach for WEKA & Study on Data Mining Tools", International Journal of Engineering and Innovative Technology (IJEIT), Volume 2, Issue 2, August 2012.
[15] Mehrdad Rostami, Parham Moradi, "A Clustering Based Genetic Algorithm for Feature Selection", IEEE, 2014.
[16] Shweta Srivastava, "Weka: A Tool for Data Preprocessing, Classification, Ensemble, Clustering and Association Rule Mining", International Journal of Computer Applications (0975-8887), Volume 88, No. 10, February 2014.
[17] Jie Zhao, Xueya Wang, Peiquan Jin, "Feature selection for event discovery in social media: A comparative study", ScienceDirect, 2016.
[18] Poonam Sharma, Abhisek Mathur, Sushil Chaturvedi, "An Improved Fast Clustering-Based Feature Subset Selection Algorithm for Multi-Featured Dataset", IEEE, 2014.
[19] Harshvardhan Solanki, "Comparative Study of Data Mining Tools and Analysis with Unified Data Mining Theory", International Journal of Computer Applications (0975-8887), Volume 75, No. 16, August 2013.
[20] X. Niyogi, Neural Information Processing Systems, 2004.
[21] Himanshu Kasliwal, Shatrughan Modi, "A Novel Approach for Reduction of Dynamic Range Based on Hybrid Tone Mapping Operator", ScienceDirect, 2015.