Feature Selection using Machine Learning for Event Discovery in Social Media

Dagmawet Tilahun Alemayhu, Prof. Shilpa Gite
Department of Computer Science/ Information Technology
Symbiosis Institute of Technology, Pune, India.
Abstract - This paper introduces the concept of feature selection, its general procedure, evaluation criteria, and characteristics. Feature selection is an effective technique for addressing dimensionality reduction. Hybrid approaches are typically an effective compromise that combines the strengths of the two approaches (filter and wrapper) while limiting the influence of their drawbacks. In this paper we propose a distinctive hybrid filter-wrapper approach that exploits the speed of the filter method followed by the accuracy of the wrapper. The paper analyzes feature selection methods for selecting essential features using machine learning algorithms for event discovery in social media with a filter stage; the wrapper stage then improves the efficiency of the hybrid method using a map-reduce algorithm.

Keywords: Machine learning, Feature selection algorithm, Filter, Wrapper, Map-reduce algorithm.
1. INTRODUCTION
The high dimensionality of today's real-world data poses a significant problem for traditional classifiers. Feature selection is therefore a standard pre-processing step in many data analysis algorithms. It prepares data for mining and machine learning, which aim to transform data into business intelligence or knowledge. Performing feature selection may have various motivations. In machine learning and statistics, feature selection, also referred to as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) to be used in model construction. Feature selection techniques are mainly used for three reasons: (a) simplification of models to make them easier to interpret by researchers and users, (b) shorter training times, and (c) enhanced generalization by reducing overfitting (formally, reduction of variance). The fundamental premise of using a feature selection technique is that the data contain many features that are either redundant or irrelevant and can therefore be removed. Redundant and irrelevant features are two distinct notions, since one relevant feature may be redundant in the presence of another relevant feature with which it is strongly correlated.
Feature selection techniques should be distinguished from feature extraction. Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features [8]. Feature selection techniques are often used in domains where there are many features and comparatively few samples (or data points). Prototypical cases for the application of feature selection include the analysis of written texts and DNA microarray data, where there are many thousands of features and only a few tens to hundreds of samples.
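As an illustration of this distinction, the following minimal sketch, assuming scikit-learn and a synthetic dataset (the sizes and parameter values are illustrative, not taken from this study), contrasts selection, which keeps a subset of the original columns, with extraction, which builds new features as combinations of all columns.

# A minimal sketch contrasting feature selection with feature extraction.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=5, random_state=42)

# Feature selection: keep 5 of the original 50 columns, unchanged.
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_selected = selector.fit_transform(X, y)
print("selected original columns:", selector.get_support(indices=True))

# Feature extraction: build 5 brand-new features from functions of all 50.
pca = PCA(n_components=5)
X_extracted = pca.fit_transform(X)
print("selected shape:", X_selected.shape, "extracted shape:", X_extracted.shape)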
A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets with an evaluation measure that scores the different feature subsets. The simplest algorithm is to test every possible subset of features and find the one that minimizes the error rate. This is an exhaustive search of the space and is computationally intractable for all but the smallest feature sets. The choice of evaluation metric heavily influences the algorithm, and it is these evaluation metrics that distinguish the three main categories of feature selection algorithms: the wrapper, filter and hybrid approaches.
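To make the cost of exhaustive search concrete, the following toy sketch (assuming scikit-learn; the classifier and the small eight-feature dataset are illustrative assumptions) evaluates every non-empty subset by cross-validation. The number of candidate subsets, 2^n - 1, is what makes this approach intractable for realistic feature counts.

# Exhaustive subset search on a deliberately tiny problem.
from itertools import combinations
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=150, n_features=8, n_informative=3,
                           random_state=0)
n_features = X.shape[1]

best_score, best_subset = -np.inf, None
for size in range(1, n_features + 1):
    for subset in combinations(range(n_features), size):
        score = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, list(subset)], y, cv=3).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print("best subset:", best_subset, "cv accuracy: %.3f" % best_score)
print("subsets evaluated:", 2 ** n_features - 1)   # already 255 for 8 features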
i. Wrapper methods use a predictive model to score feature subsets. Each new subset is used to train a model, which is then tested on a hold-out set [16]. Counting the number of mistakes made on that hold-out set (the error rate of the model) gives the score for that subset [11]. Because wrapper methods train a new model for each subset, they are very computationally intensive, but they usually provide the best-performing feature set for that particular type of model [10][12]. A sketch contrasting the two evaluation styles is given after this list.
ii. Filter methods use a proxy measure instead of the error rate to score a feature subset. This measure is chosen to be fast to compute while still capturing the usefulness of the feature set. Common measures include mutual information, pointwise mutual information, the Pearson product-moment correlation coefficient, inter/intra-class distance, and the scores of significance tests for each class/feature combination. Filters are usually less computationally intensive than wrappers, but they produce a feature set that is not tuned to a specific type of predictive model. This lack of tuning means that a feature set from a filter is more general than the set from a wrapper, usually giving lower prediction performance. However, the feature set does not embed the assumptions of a prediction model, and so it is more useful for exposing the relationships between the features. Many filters provide a feature ranking rather than an explicit best feature subset, and the cut-off point in the ranking is chosen via cross-validation [19]. Filter methods have also been used as a pre-processing step for wrapper methods, allowing a wrapper to be used on larger problems [10].
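The sketch below, assuming scikit-learn with a synthetic dataset and illustrative parameter values, contrasts the two evaluation styles: a wrapper that scores subsets by hold-out performance of a trained model (here recursive feature elimination, one of many possible wrapper searches), and a filter that ranks features by mutual information and picks the cut-off point by cross-validation.

# Wrapper vs. filter evaluation on the same data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, n_features=30, n_informative=6,
                           random_state=1)

# Wrapper: retrain the model for candidate subsets, judge on held-out data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
wrapper = RFE(LogisticRegression(max_iter=1000), n_features_to_select=6)
wrapper.fit(X_train, y_train)
print("wrapper hold-out accuracy: %.3f" % wrapper.score(X_test, y_test))

# Filter: one cheap proxy score ranks all features at once;
# cross-validation then chooses where to cut the ranking.
ranking = np.argsort(mutual_info_classif(X, y, random_state=1))[::-1]
for k in (2, 6, 12):
    cols = list(ranking[:k])
    cv = cross_val_score(LogisticRegression(max_iter=1000), X[:, cols], y, cv=5)
    print("filter top-%d: cv accuracy %.3f" % (k, cv.mean()))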
In traditional statistics, the most popular form of feature selection is stepwise regression, which is a wrapper technique. It is a greedy algorithm that adds the best feature (or deletes the worst feature) at each round. The main control issue is deciding when to stop the algorithm [5]. In machine learning, this is typically done by cross-validation; in statistics, some criterion is optimized. This leads to the inherent problem of nesting. More robust methods have been explored, such as branch and bound and piecewise linear networks.
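A minimal sketch of the greedy forward (stepwise) idea follows, with cross-validation used as the stopping rule; the classifier and synthetic data are illustrative assumptions rather than the configuration used in this study.

# Greedy forward selection: add the best feature until CV stops improving.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=20, n_informative=4,
                           random_state=0)

selected, remaining = [], list(range(X.shape[1]))
best_score = -np.inf
while remaining:
    # Try adding each remaining feature; keep the single best addition.
    scores = {f: cross_val_score(LogisticRegression(max_iter=1000),
                                 X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    candidate, score = max(scores.items(), key=lambda kv: kv[1])
    if score <= best_score:        # cross-validation decides when to stop
        break
    selected.append(candidate)
    remaining.remove(candidate)
    best_score = score

print("selected features:", selected, "cv accuracy: %.3f" % best_score)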
Feature selection methods such as the filter, wrapper and hybrid methods are well known and have been proposed in [12][13][16], and many of them follow the principle of combining methods, for example [1],[9],[13]. Existing systems, however, have not considered (i) the computational cost and (ii) which features are actually selected, both of which have to be taken into account in a feature selection method.
Problems in Social Media Analysis

Social media are computer-mediated tools that enable people or companies to create, share, or exchange information, career interests, ideas, and pictures/videos in virtual communities and networks. The variety of stand-alone and built-in social media services currently available introduces challenges of definition, but there are some common features: (1) social media are Web 2.0 internet-based applications, (2) user-generated content (UGC) is the lifeblood of the social media organism, (3) users create service-specific profiles for the site or app that are designed and maintained by the social media organization, and (4) social media facilitate the development of online social networks by connecting a user's profile with those of other individuals and/or groups. Social media depend upon mobile and web-based technologies to create highly interactive platforms through which people and communities share, co-create, discuss, and modify user-generated content. In this work our focus is to improve the existing feature selection algorithms, i.e. filter, wrapper and hybrid, and to propose a new approach for feature selection [17][18].
Section II discusses the related work conducted to date. Section III explains the proposed methodology. Section IV presents the results and discussion. Section V discusses the conclusion and future work based on the results achieved.
2. RELATED WORK
Our work is closely related to the study of feature selection methods, and we reviewed the important related works in this area.

The Mutual Information Genetic Algorithm (MI-GA) considers two feature selection methods together and studies a combined feature selection algorithm that follows mutual information filtering, streamlined population initialization and individual fitness calculation. The MI-GA algorithm fails on our criteria because of its high complexity and long execution time. In contrast, we consider selecting relevant features for the event discovery problem on social media, where the selector is only allowed to access a small and fixed number of features; this is a significant and challenging problem in most of the studies.
Many efforts have been undertaken by researchers to develop efficient and practical tools for the various pre-processing tasks in social media, and at present most social media based event discovery is built on some classification model and ends up with poor performance [2][16].
Multi-label text classification deals with problems in which every document is related to a set of categories. These documents typically contain a large number of words, which can hinder the performance of learning algorithms. Feature selection is a standard task to find representative words and remove unimportant ones, which can speed up learning and even improve learning performance. That work evaluates eight feature selection algorithms on text benchmark datasets [2].
In [3] and [11], data pre-processing is described as a crucial and demanding step in the mining process that has a large impact on the success of data mining for soil classification. Data pre-processing is an initial step of the knowledge discovery in databases (KDD) process that reduces the complexity of the data and offers better analysis; with ANN (artificial neural network) training based on the data collected from the field as well as the soil testing laboratory, data analysis is performed more accurately and efficiently [11]. Data pre-processing is a troublesome and tedious task because it involves exhaustive manual effort and time in developing the data operation scripts [3].
The multi-dimensional classification problem is a generalization of the recently popularized task of multi-label classification, where each data instance is associated with multiple class variables. There has been relatively little research carried out specific to multi-dimensional classification and, although one of the core goals is similar (modeling dependencies among classes), there are important differences, notably a much larger number of possible classifications. In [4], the authors propose a methodology for multi-dimensional classification, drawing from the most relevant multi-label research and combining it with a novel method. Using a fast method to model the conditional dependence between class variables, they form super-class partitions and use them to build multi-dimensional learners, learning each super-class as a regular class and then explicitly modeling class dependencies [4].
An analysis needs to collect data from various sources and analyze those data with some techniques for prediction or decision making. After collecting the various data, the main task is to maintain the data and apply transformation and pre-processing to the large data sets, for which data mining tools are needed. The data mining tools currently on the market are offered either as open source or commercial software [10].
Presently, storage technology and data collection have made it possible for any organisation to assemble large volumes of data at low cost. In order to obtain useful and convenient information, it is necessary to utilize the stored data for further use, which leads to the idea of data mining. Today, data mining is a new and important area in human life, and it plays an important role in various fields such as business, education, finance, the health sector, etc. The main motive of data mining is to examine the data from different perspectives, then label and summarize it so as to obtain useful knowledge by exploiting various new techniques and tools. Many data mining tools that researchers need for evaluating their data are now available. The study in [11] gives an overview of different data mining tools, including WEKA, RapidMiner and KNIME, presents the pros and cons of each tool and also compares their parameters; this comparative study makes it easy for researchers to make the best choice of tool [11].
Existing feature selection methods are generally grouped into three categories. The filter approach [1][6][12] has become crucial in many classification settings, especially object recognition, which is now faced with feature learning strategies that generate thousands of cues. Filter methods analyze intrinsic properties of the data, ignoring the classifier. Most of these methods can perform two operations, ranking and subset selection: in the former, the importance of each individual feature is evaluated, usually by neglecting potential interactions among the elements of the joint set; in the latter, the final subset of features to be selected is provided. In some cases these two operations are performed sequentially (first the ranking, then the selection); in other cases only the selection is carried out. Filter methods suppress the least interesting variables. These methods are particularly efficient in computation time and robust to overfitting [6][7].
The wrapper approach [1][9] evaluates subsets of variables, which allows it, unlike filter approaches, to detect possible interactions between variables. The two main disadvantages of these methods are (i) the increased overfitting risk when the number of observations is insufficient, and (ii) the significant computation time when the number of variables is large. The hybrid approach [1][9][13] is a combination of the filter and wrapper approaches.
Researchers have so far worked on feature selection methods in order to improve performance efficiency and reduce the number of features, so as to obtain more accurate mining results. Building on this common strategy, an improved algorithm using map reduction is proposed here in order to increase performance and accuracy and to reduce the number of features and the execution time. The novelty of our work lies in proposing a hybrid approach that works on social media data with better performance and less execution time.
3. PROPOSED METHODOLOGY

Fig. 1 Framework of the proposed feature selection method
Fig. 1 above shows the basic framework for selecting features for event discovery in social media. For the purpose of our study we used Twitter and microblog datasets (http://www.kddcup2012.org/c/kddcup2012track1, http://snap.stanford.edu). This social media data is first pre-processed to remove noise and missing values and is then processed using a machine-learning feature selection algorithm to obtain the relevant features.
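The following hedged sketch illustrates the kind of pre-processing meant here for microblog text, assuming the pandas and scikit-learn libraries; the column name, example posts and regular expressions are illustrative assumptions, not the exact pipeline used on the datasets above.

# Drop missing rows, strip obvious noise (URLs, mentions), then vectorize.
import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

posts = pd.DataFrame({
    "text": ["Earthquake reported near the coast http://t.co/x1",
             "@user totally unrelated chatter!!!", None,
             "Flooding downtown, roads closed #storm"],
})

posts = posts.dropna(subset=["text"])                 # handle missing values

def clean(text):
    text = re.sub(r"http\S+|@\w+", " ", text)         # remove URLs and mentions
    text = re.sub(r"[^a-zA-Z0-9#\s]", " ", text)      # remove other noise
    return text.lower().strip()

posts["clean"] = posts["text"].apply(clean)

# TF-IDF turns the cleaned posts into a feature matrix on which the filter,
# wrapper, or hybrid selectors described below can operate.
vectorizer = TfidfVectorizer(max_features=1000, stop_words="english")
X = vectorizer.fit_transform(posts["clean"])
print(X.shape, vectorizer.get_feature_names_out()[:10])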
Nowadays, mining high-utility itemsets (HUIs) from large datasets is becoming a very important data mining task, in which itemsets with high utilities are discovered. However, the existing methods present a huge number of HUIs to the end users, which results in inefficient performance of the utility mining process.
To overcome these problems, we present a hybrid framework for HUI mining with the goals of achieving high efficiency for the mining task and providing a concise mining result to users, using parallel data computing technology to process large datasets quickly. The proposed hybrid framework mines closed high-utility itemsets, which serve as a compact and lossless representation of the HUIs.

Algorithm for MapReduce

1: procedure MAP(k, d)
2:   INITIALIZE.ASSOCIATIVEARRAY(H)
3:   for all t ∈ d do
4:     H{t} ← H{t} + 1
5:   for all t ∈ H do
6:     EMIT(t, (k, H{t}))

1: procedure REDUCE(t, [(k1, f1), (k2, f2), ...])
2:   INITIALIZE.LIST(P)
3:   for all (k, f) ∈ [(k1, f1), (k2, f2), ...] do
4:     APPEND(P, (k, f))
5:   SORT(P)
6:   EMIT(t, P)

Fig. 2 MapReduce algorithm

Map reduction

MapReduce maps the whole data file we want to use and then reduces it: the mapped data is split and sorted to find the exact items being searched for, and the sorted data is finally reduced to obtain the features. Applying MapReduce concepts to our improved algorithm therefore helps to improve both performance and computation. Our contribution is to use the MapReduce framework to find HUIs from large datasets faster in comparison with the existing and recent methods [20][21].
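A single-machine sketch of the map/reduce pattern in the pseudocode above is given below: MAP emits per-document term counts, a shuffle step groups them by term, and REDUCE collects and sorts the grouped (document, count) pairs. This is only an illustrative simulation in plain Python; a real deployment would run these functions on a framework such as Hadoop or Spark.

# Minimal map/reduce simulation mirroring the MAP and REDUCE procedures.
from collections import Counter, defaultdict

def map_phase(doc_id, text):
    counts = Counter(text.lower().split())       # H{t} <- H{t} + 1
    for term, freq in counts.items():
        yield term, (doc_id, freq)                # EMIT(t, (k, H{t}))

def reduce_phase(term, pairs):
    postings = sorted(pairs)                      # APPEND then SORT(P)
    return term, postings                         # EMIT(t, P)

documents = {"d1": "flood flood rescue", "d2": "rescue teams deployed"}

# Shuffle: group the intermediate (doc, count) pairs by term.
grouped = defaultdict(list)
for doc_id, text in documents.items():
    for term, pair in map_phase(doc_id, text):
        grouped[term].append(pair)

index = dict(reduce_phase(t, p) for t, p in grouped.items())
print(index)   # e.g. {'flood': [('d1', 2)], 'rescue': [('d1', 1), ('d2', 1)], ...}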
Filter methods

Filter methods provide a ranking of features rather than an unambiguous best feature subset. Features are selected using general characteristics of the training data, e.g. the distance between classes or statistical dependencies; this gives better generalization but selects a larger number of features. These methods are fast to compute and less computationally intensive than wrapper methods, but they also give lower prediction performance [1]. Many such algorithms have been proposed; the best known include Relief, correlation-based feature selection, the fast correlation-based filter, INTERACT and chi-square [6][9]. Our hybrid approach, however, gives more accurate results.
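As one concrete instance of scoring features by "distance between classes", the sketch below implements a simple Fisher-style ratio of between-class distance to within-class spread per feature, computed without consulting any classifier; the synthetic data and the top-k cut-off are illustrative assumptions.

# A class-separation filter score computed per feature.
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=15, n_informative=4,
                           random_state=3)

def fisher_score(X, y):
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / (within + 1e-12)

scores = fisher_score(X, y)
ranking = np.argsort(scores)[::-1]
print("top 4 features by class-distance score:", ranking[:4])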
Wrapper methods
In this method, a search algorithm is used to find a subset of features that maximizes classification performance. This model gives a better-performing feature set, but it is computationally expensive; typical search algorithms include sequential selection, heuristic search and genetic algorithms [6][9].
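For sequential selection specifically, a hedged sketch using scikit-learn's sequential forward selector is shown below; the classifier, subset size and synthetic data are illustrative choices and not the exact configuration of our experiments.

# Wrapper search via sequential forward selection.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=25, n_informative=5,
                           random_state=2)

sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=5,
                                direction="forward", cv=5)
sfs.fit(X, y)
print("wrapper-selected features:", sfs.get_support(indices=True))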
Hybrid methods

Our proposed hybrid method has the advantages of both the filter and wrapper models of feature selection [15][18]. The MI-GA algorithm uses a filter model that ranks the features by the mutual information between each feature and the class and then chooses the most relevant features, while its wrapper (genetic) stage explores the search space and optimizes the feature subset; however, it suffers from high computational time. Our work therefore improves on this problem by using map reduction on the social media data [13].
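The filter-then-wrapper idea can be sketched as a simple pipeline, as below: a cheap mutual-information filter first shrinks the candidate pool, and a wrapper then refines the subset on the reduced pool. This is a minimal sketch assuming scikit-learn; the pool sizes, the RFE wrapper standing in for the genetic/MapReduce stage, and the classifier are illustrative assumptions, not our exact implementation.

# Hybrid pipeline: mutual-information filter followed by a wrapper refinement.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=400, n_features=100, n_informative=8,
                           random_state=4)

hybrid = Pipeline([
    ("filter", SelectKBest(mutual_info_classif, k=30)),    # fast pre-selection
    ("wrapper", RFE(LogisticRegression(max_iter=1000),     # accurate refinement
                    n_features_to_select=8)),
    ("clf", LogisticRegression(max_iter=1000)),
])

print("hybrid cv accuracy: %.3f" % cross_val_score(hybrid, X, y, cv=5).mean())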
4. RESULTS AND DISCUSSION
The analysis is based on calculating the accuracy of feature annotation and comparing the existing method with our proposed method. In Fig. 3, the performance of the existing system is shown as the blue line and the performance of the proposed system as the red line. We can clearly see that the performance of the proposed system is higher than that of the existing system. Here performance is measured in terms of accuracy, precision and recall values. A larger number of features increases the accuracy, and does so in less computational time.

Fig. 3 Performance in terms of accuracy

Fig. 4 Time required for the process

The required time of the existing system is shown as the blue line and the required time of the proposed system as the red line in Fig. 4, which indicates that the required time of the proposed system is lower than that of the existing system. Hence we can state that our approach outperforms the existing approach in terms of both accuracy and the time required to find the relevant features.
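For reference, the reported metrics can be computed along the following lines once a classifier has been trained on the selected features; this is a hedged sketch assuming scikit-learn, with an illustrative synthetic dataset and model rather than our actual data and classifier.

# Computing accuracy, precision and recall on a held-out test split.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, n_informative=6,
                           random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=5)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))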
5. CONCLUSION AND FUTURE WORK

Careful selection of features may change the resulting analysis drastically, so in event discovery on social media datasets feature selection is very important. Performing a feature selection procedure before discovering events is necessary for social media event discovery and other research topics.
In this paper we have proposed an improved hybrid method for the feature selection task in machine learning. We conducted an analysis of different feature selection methods using social media (microblog, Twitter) based event discovery. We compared the previous existing methods, namely the filter, wrapper and hybrid approaches, with our improved hybrid method. The experimental results on a real social media dataset showed that the improved method gives better performance (a larger number of selected features) and execution time.

In future work we intend to focus on devising a new hybrid algorithm using different search strategies and to consider event discovery on different social media platforms.
ACKNOWLEDGEMENT
First and foremost, I acknowledge and thank the Almighty God for blessing, protecting and guiding me throughout this period. I express my sincere thanks to Prof. Shilpa Gite, Department of Computer Science/Information Technology, Symbiosis Institute of Technology, for her valuable guidance, support and motivation during the entire period of this work.
REFERENCES
[1] Hanen Mhamdi, Faouzi Mhamdi, "Feature Selection Methods for Biological Knowledge Discovery", IEEE, 2014.
[2] Newton Spolaor and Grigorios Tsoumakas, "Evaluating Feature Selection Methods for Multi-Label Text Classification", Laboratory of Computational Intelligence, Institute of Mathematics and Computer Science, University of Sao Paulo, Sao Carlos, Brazil, 2013.
[3] S. S. Baskar, L. Arockiam, S. Charles, "A Systematic Approach on Data Pre-processing in Data Mining", COMPUSOFT, An International Journal of Advanced Computer Technology, 2(11), November 2013 (Volume-II, Issue-XI).
[4] Jesse Read, Concha Bielza, and Pedro Larrañaga, "Multi-Dimensional Classification with Super-Classes", 2014.
[5] Guo-Ping Liu, Jian-Jun Yan, Yi-Qin Wang, Jing-Jing Fu, Zhao-Xia Xu, Rui Guo, and Peng Qian.
[6] Afef Ben Brahim, Mohamed Limam, "Robust Ensemble Feature Selection for High Dimensional Data Sets", IEEE, 2013.
[7] Taghi M. Khoshgoftaar, Alireza Fazelpour, Huanjing Wang, "A Survey of Stability Analysis of Feature Subset Selection", IEEE, 2013.
[8] Rashmi Dubey, Jiayu Zhou, Yalin Wang, Paul M. Thompson, Jieping Ye, for the Alzheimer's Disease Neuroimaging Initiative, "Analysis of Sampling Techniques for Imbalanced Data", 2013.
[9] Rattanawadee Panthong, Anongnart Srivihok, "Wrapper Feature Subset Selection for Dimension Reduction Based on Ensemble Learning Algorithm", 2015.
[10] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Second Edition, 2006.
[11] Neha Chauhan, Nisha Gautam, "Parametric Comparison of Data Mining Tools", International Journal of Advanced Technology in Engineering and Science, Vol. 3, Issue 11, November 2015.
[12] Kajal Naidu, Aparna Dhenge, Kapil Wankhade, "Feature Selection Algorithm for Improving the Performance of Classification: A Survey", IEEE, 2014.
[13] Pan-shi Tang, Xiao-long Tang, Zhong-yu Tao, Jian-ping Li, "Research on Feature Selection Algorithm Based on Mutual Information and Genetic Algorithm", IEEE, 2014.
[14] M. Vijayakamal, Mulugu Narendhar, "A Novel Approach for WEKA & Study on Data Mining Tools", International Journal of Engineering and Innovative Technology (IJEIT), Volume 2, Issue 2, August 2012.
[15] Mehrdad Rostami, Parham Moradi, "A Clustering Based Genetic Algorithm for Feature Selection", IEEE, 2014.
[16] Shweta Srivastava, "Weka: A Tool for Data Preprocessing, Classification, Ensemble, Clustering and Association Rule Mining", International Journal of Computer Applications (0975-8887), Volume 88, No. 10, February 2014.
[17] Jie Zhao, Xueya Wang, Peiquan Jin, "Feature Selection for Event Discovery in Social Media: A Comparative Study", ScienceDirect, 2016.
[18] Poonam Sharma, Abhisek Mathur, Sushil Chaturvedi, "An Improved Fast Clustering-Based Feature Subset Selection Algorithm for Multi Featured Dataset", IEEE, 2014.
[19] Harshvardhan Solanki, "Comparative Study of Data Mining Tools and Analysis with Unified Data Mining Theory", International Journal of Computer Applications (0975-8887), Volume 75, No. 16, August 2013.
[20] X. Niyogi, Neural Information Processing Systems, 2004.
[21] Himanshu Kasliwal, Shatrughan Modi, "A Novel Approach for Reduction of Dynamic Range Based on Hybrid Tone Mapping Operator", ScienceDirect, 2015.