
International Journal of Research In Science & Engineering
Volume: 1 Issue: 1
e-ISSN: 2394-8299
p-ISSN: 2394-8280
SOCIAL IMPACT POLL GENERATION SYSTEM FROM TWEETS
Gaurav Kodande1, Tejal Gandhi2, Mrunmayi Shiveshwarkar3, Anuradha Soni4
1
Student, Computer Department, Pune Institute of Computer Technology, gauravkodande1528@gmail.com
2
Student, Computer Department, Pune Institute of Computer Technology, tejalgandhi15@gmail.com
3
Student, Computer Department, Pune Institute of Computer Technology, mws1694@gmail.com
4
Student, Computer Department, Pune Institute of Computer Technology, anuradhasoni1426@gmail.com
ABSTRACT
Over the past decade, the use of online resources has grown exponentially, in particular social media and
microblogging websites such as Twitter. Twitter is a microblogging social networking website with a large and
rapidly growing user base, where users write updates of at most 140 characters called ‘tweets’. Twitter allows
people to express their opinions and thoughts openly about any topic, social issue or product that interests them.
Twitter is therefore a good medium in which to search for potentially interesting trends regarding prominent topics
in the news or popular culture. Sentiment analysis (or opinion mining) refers to the use of natural language
processing, text analysis and computational linguistics to identify and extract subjective information from source
material. This project focuses on implementing a machine learning algorithm to extract audience sentiment about
current affairs or a trending topic. We collect data from Twitter using its public API and store it in a MongoDB
database. We then preprocess and clean the data to obtain a dataset, which is divided into training and test sets.
Using the random forest algorithm trained on the training set, we classify the data and analyze the results to
generate an opinion poll about a trending topic or social issue.
Keywords: Sentiment Analysis, Twitter, Random Forest Algorithm, Data Mining.
----------------------------------------------------------------------------------------------------------------------------
1. INTRODUCTION
Twitter is a highly sought-after social media platform whose user base is growing faster than that of any other
microblogging site. Twitter accounts for 25% of the world’s online population, which amounts to 700 million monthly
users. About 340 million tweets are sent each day on topics ranging from international politics, entertainment, media
and sports to social issues. Social media plays a significant role in shaping public opinion across the globe; in
recent years Twitter has helped shape two important social revolutions, in Egypt and Libya. Public opinion on social
issues can be critical when a government decides future policy, so sentiment analysis of tweets can provide valuable
insight into citizens’ opinions on an issue or topic. Sentiment analysis can be approached in several ways, including
supervised learning, unsupervised learning, structured prediction and reinforcement learning. The proposed system uses
the random forest algorithm to classify data. Random forest is a supervised learning method and is efficient compared
with other classification algorithms in that category.
1.2 Supervised Learning
Supervised learning is the machine learning task of inferring a function from labelled training data. The random
forest algorithm is both a supervised and an ensemble learning method. The term ‘ensemble method’ refers to a method
that builds multiple hypotheses from the same base learner and combines them.
IJRISE| www.ijrise.org|editor@ijrise.org
To solve a supervised learning problem, the following steps should be followed.
1. Decide on the type of training examples. The user must decide what kind of data will be used as the training data set.
E.g. in the case of text data, an example could be a single character, a single line or multiple lines of text.
2. Gather a training data set that is representative of the real-world use of the function. That is, collect a set of
input objects and their corresponding outputs.
E.g. tweets from Twitter, product reviews from an e-retail site.
3. Determine the input feature representation of the learned function. The accuracy of the learned function depends on
how an input object is represented. Typically, an input object is transformed into a feature vector containing a
number of features that describe the object.
E.g. in the analysis of product reviews, the reviews should be cleaned before classification so that only
feature words such as good, bad and excellent remain.
4. Select the structure of the learned function and the corresponding learning algorithm.
E.g. random forest, support vector machines, relevance vector machines.
5. Run the selected learning algorithm on the gathered training data set. Some supervised learning algorithms require
the user to set control parameters, which can be tuned by optimizing performance on a subset of the training data set.
6. Evaluate the accuracy of the learned function. Its performance should be measured on a separate test set, not on
the training data set.
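The steps above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the implementation used in this work: the six example texts, the feature word list, the helper names (`featurize`, `split`, `train`, `predict`, `accuracy`) and the deliberately trivial word-weight learner are all hypothetical.

```python
import random

# Step 2: a tiny labelled data set (hypothetical examples, not real tweets).
DATA = [
    ("good phone, excellent camera", "positive"),
    ("bad battery and bad screen", "negative"),
    ("excellent service, good staff", "positive"),
    ("worst delivery, bad packaging", "negative"),
    ("good value and amazing sound", "positive"),
    ("worst purchase, bad quality", "negative"),
]

# Step 3: each text becomes a feature vector of counts of these words.
FEATURE_WORDS = ["good", "bad", "excellent", "worst", "amazing"]

def featurize(text):
    words = text.lower().replace(",", " ").split()
    return [words.count(w) for w in FEATURE_WORDS]

def split(data, test_fraction=0.33, seed=0):
    """Shuffle the data and divide it into training and test sets."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

def train(train_rows):
    """Steps 4-5: a trivial learner (an assumption for this sketch) that
    weights each feature by positive minus negative occurrences."""
    weights = [0] * len(FEATURE_WORDS)
    for text, label in train_rows:
        sign = 1 if label == "positive" else -1
        for i, count in enumerate(featurize(text)):
            weights[i] += sign * count
    return weights

def predict(weights, text):
    score = sum(w * x for w, x in zip(weights, featurize(text)))
    return "positive" if score >= 0 else "negative"

def accuracy(weights, rows):
    """Step 6: fraction of rows classified correctly."""
    hits = sum(predict(weights, text) == label for text, label in rows)
    return hits / len(rows)

train_rows, test_rows = split(DATA)
model = train(train_rows)
print("test accuracy:", accuracy(model, test_rows))
```

The point of the sketch is the separation in step 6: the model is fitted on `train_rows` only and scored on `test_rows` only.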
Supervised learning has multiple approaches and algorithms:
• Analytical learning
• Artificial neural networks
• Random forests
• Bayesian statistics
• Naive Bayes classifier
• Support vector machines
2. PROPOSED WORK
2.1 ARCHITECTURE
Fig-1: Architecture of the system
In the proposed system, tweets related to a particular keyword or set of keywords are collected using the Twitter API
and stored in a MongoDB database. The data is then preprocessed and cleaned by filtering out inactive links and user
activity. A training data set is built from a subset of the cleaned data, and the cleaned data is stored in the
database again. A test data set is then created for the classifier, and the classifier's output is analyzed. The
analyzed results are displayed graphically to the user.
Preprocessing and cleaning the data involves eliminating incomplete, noisy and inconsistent data. It includes tasks
such as removing URLs, because a URL does not contribute to the sentiment or opinion being analysed. For example,
consider the sentence “I’ve logged on to www.behappy.com, as I’m sad”. The sentence is negative, but it may be
treated as neutral because the URL contains the word ‘happy’; this is why URLs must be eliminated. After
preprocessing, features are extracted. The random forest algorithm is then trained on the training data set, which is
labelled with positive or negative sentiments, and used to classify the data.
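The URL-removal step can be sketched with Python's standard `re` module. The pattern below is a simplified assumption (it matches `http://`, `https://` and bare `www.` links up to the next whitespace), and `strip_urls` is a hypothetical helper name; a production cleaner would likely need a more careful pattern.

```python
import re

# Assumed, simplified pattern: a link starts with http://, https:// or www.
# and runs until the next whitespace character.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

def strip_urls(text):
    """Remove URLs and collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_urls).strip()

tweet = "I've logged on to www.behappy.com, as I'm sad"
print(strip_urls(tweet))
```

After cleaning, the word ‘happy’ no longer appears in the example sentence, so only ‘sad’ remains to drive the classification. Note that this pattern also removes punctuation attached directly to the link.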
2.2 RANDOM FOREST
• Random forest is an ensemble classifier that consists of many decision trees and outputs the class that is
the mode of the classes output by the individual trees.
• Decision trees are the individual learners that are combined. They are among the most popular learning methods
and are commonly used for data exploration.
• One type of decision tree is CART, which stands for classification and regression tree.
• CART is a greedy, top-down, recursive partitioning method that divides the feature space into sets of disjoint
rectangular regions.
• Random forest is one of the most accurate learning algorithms available; it runs efficiently on large databases
and can handle thousands of input variables.

Each tree is constructed using the following algorithm:
1. Let the number of training cases be N, and the number of variables in the classifier be M.
2. The number m of input variables used to determine the decision at a node of the tree should be fixed
in advance, and m should be much less than M.
3. Choose a training set for this tree by sampling N times with replacement from all N available training cases
(i.e. take a bootstrap sample). Use the remaining cases to estimate the error of the tree by predicting their
classes.
4. For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate
the best split based on these m variables in the training set.
5. Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier).
For prediction, a new sample is pushed down the tree and assigned the label of the training samples in the
terminal node it ends up in. This procedure is repeated over all trees in the ensemble, and the majority vote of all
trees is reported as the random forest prediction.
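The tree-building and voting procedure above can be sketched in plain Python. This is a minimal illustration under several assumptions, not the system's classifier: splits are scored with Gini impurity (the text only says "best split"), candidate thresholds are midpoints between observed values, a node becomes a leaf when none of its m sampled variables improves impurity, the out-of-bag error estimate of step 3 is omitted, and class labels are assumed to be strings. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (assumed split criterion)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, feature_ids):
    """Best (feature, threshold) among the candidate features, or None."""
    best, best_score = None, gini(labels)
    for f in feature_ids:
        values = sorted({row[f] for row in rows})
        for low, high in zip(values, values[1:]):
            t = (low + high) / 2  # midpoint threshold
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def grow_tree(rows, labels, m, rng):
    """Grow an unpruned tree; a leaf is the majority label of its samples."""
    feature_ids = rng.sample(range(len(rows[0])), m)  # step 4: m random variables
    split = best_split(rows, labels, feature_ids)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [i for i, row in enumerate(rows) if row[f] <= t]
    right = [i for i, row in enumerate(rows) if row[f] > t]
    return (f, t,
            grow_tree([rows[i] for i in left], [labels[i] for i in left], m, rng),
            grow_tree([rows[i] for i in right], [labels[i] for i in right], m, rng))

def tree_predict(tree, row):
    """Push a sample down the tree until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def train_forest(rows, labels, n_trees=25, m=1, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(n) for _ in range(n)]  # step 3: bootstrap sample
        forest.append(grow_tree([rows[i] for i in picks],
                                [labels[i] for i in picks], m, rng))
    return forest

def forest_predict(forest, row):
    """Majority vote over all trees in the ensemble."""
    votes = Counter(tree_predict(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]

# Toy data with two predictors (hypothetical values).
X = [[1, 1], [2, 1], [1, 2], [2, 2], [8, 8], [9, 8], [8, 9], [9, 9]]
y = ["negative"] * 4 + ["positive"] * 4
forest = train_forest(X, y)
print(forest_predict(forest, [2, 3]), forest_predict(forest, [7, 8]))
```

Here M = 2 and m = 1, so every node sees one randomly chosen predictor. The randomness of the bootstrap samples and of the variable choice is what makes the individual trees differ.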
Simple data set with two predictors.
Greedy Recursive partitioning.
The following flowchart explains the working mechanism of the random forest algorithm.
Fig-2: Mechanism of random forest algorithm.
3. ISSUES WITH SENTIMENT ANALYSIS
Object Identification: The object is the entity or topic about which sentiment is to be analysed; for an election, for
example, it could be ‘#indianelection’. This problem is significant because sentiment analysis cannot proceed without
knowing the object it concerns, and a wrongly identified object affects the results in undesired ways. Typically a
user wants to perform sentiment analysis on a competing issue, so the system has to separate relevant objects from
irrelevant ones.
E.g. for a tweet about the failure of the latest HTC smartphone, HTC is the desired keyword or object, but the same
tweet may also mention objects such as Apple, Samsung or Sony. This is where the system has to separate relevant and
irrelevant objects.
Feature Extraction and Synonym Matching: In the above example the desired keyword was HTC, but the current system
mainly finds nouns and noun phrases; it cannot distinguish between a verb feature and a noun feature. We also need to
find, match and group synonyms, as people often use different words or phrases to describe the same entity.
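Synonym grouping can be sketched as a lookup from each variant to a canonical name. The synonym table and the helper names `canonical` and `group_features` are hypothetical; how the table would be built in practice (by hand, from a thesaurus, or learned) is not specified here.

```python
# Hypothetical synonym table: each variant maps to one canonical name.
SYNONYMS = {
    "cellphone": "phone",
    "mobile": "phone",
    "smartphone": "phone",
    "pic": "photo",
    "picture": "photo",
}

def canonical(word):
    """Return the canonical form of a word, or the lowercased word if unknown."""
    key = word.lower()
    return SYNONYMS.get(key, key)

def group_features(words):
    """Group words that refer to the same entity under one canonical key."""
    groups = {}
    for word in words:
        groups.setdefault(canonical(word), []).append(word)
    return groups

print(group_features(["Smartphone", "mobile", "picture", "battery"]))
```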
Opinion Orientation: This part of the system should determine whether a sentence expresses an opinion on a feature
and, if so, classify that opinion as positive, negative or neutral. The core issue is to identify sentiment words
such as good, bad, worst, amazing, excellent, ok and not bad. This is the key area in which a sentiment analysis
system can be improved. For example, a tweet saying ‘this transport service stands nowhere close to being good’
actually means that the service is bad and carries a negative sense, but a traditional sentiment analysis system
would take it as a positive opinion.
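One simple way to handle such cases is to flip the polarity of a sentiment word when a negating word appears shortly before it. The sketch below is a lexicon-based illustration only, not the random forest classifier proposed in this paper; the word lists, the window size of four words and the function name `orientation` are assumptions.

```python
# Hypothetical, very small lexicons for illustration.
POSITIVE = {"good", "amazing", "excellent"}
NEGATIVE = {"bad", "worst"}
NEGATORS = {"not", "no", "never", "nowhere"}
WINDOW = 4  # assumed number of preceding words a negator can affect

def orientation(text):
    """Classify text as positive, negative or neutral with negation flipping."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    score = 0
    for i, word in enumerate(words):
        if word in POSITIVE:
            polarity = 1
        elif word in NEGATIVE:
            polarity = -1
        else:
            continue
        # Flip the polarity if a negator occurs within the preceding window.
        if any(w in NEGATORS for w in words[max(0, i - WINDOW):i]):
            polarity = -polarity
        score += polarity
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(orientation("this transport service stands nowhere close to being good"))
print(orientation("not bad"))
```

A fixed window is a crude heuristic: it treats ‘nowhere close to being good’ as negative, but it can also flip words that the negator does not actually govern.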
Sr.No | Module               | Test id | Case Description | Output
------+----------------------+---------+------------------+-------
1     | Data Collection      | 1       | Tweets in cross language. | Tweets ignored.
      |                      | 2       | Posts on other social media linked with user’s twitter account. | Such posts will not be considered.
      |                      | 3       | Post shared with twitter from another social media linked/connected with user’s twitter account. | This type of posts will be considered as regular tweets.
2     | Preprocessing        | 1       | Abbreviations and acronyms used in tweet. | Abbreviation and acronyms will be converted to proper words.
3     | Classification       | 1       | A single tweet contains multiple sentimental words like ‘unhappy’, ‘not’, ‘sad’. | Group of words will be considered to avoid wrong analysis.
      |                      | 2       | A single tweet expresses opinion about multiple issues. | Keywords will be separated and tweet will be considered for multiple categories.
4     | Analysis and Display | 1       | Single user writes multiple tweets on same issues. | All the tweets will be aggregated as a single tweet.
Table no.1: Unit Testing
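Two of the cases in Table 1, abbreviation expansion and aggregation of multiple tweets by one user, together with the final poll tally, can be sketched as follows. The abbreviation table, the `(user, text)` tuple layout, the percentage-based poll format and all function names are assumptions made for this illustration.

```python
from collections import Counter

# Hypothetical abbreviation table; a real one would be much larger.
ABBREVIATIONS = {"gr8": "great", "u": "you", "govt": "government"}

def expand_abbreviations(text):
    """Replace known abbreviations with proper words, word by word."""
    return " ".join(ABBREVIATIONS.get(w.lower(), w) for w in text.split())

def aggregate_by_user(tweets):
    """Merge all tweets by the same user into a single text per user."""
    merged = {}
    for user, text in tweets:
        merged.setdefault(user, []).append(text)
    return {user: " ".join(texts) for user, texts in merged.items()}

def poll(labels):
    """Percentage of each sentiment label among the classified items."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

tweets = [("a", "govt is gr8"), ("b", "bad roads"), ("a", "thank u")]
merged = aggregate_by_user(tweets)
cleaned = {user: expand_abbreviations(text) for user, text in merged.items()}
print(cleaned)
print(poll(["positive", "positive", "negative", "positive"]))
```

Aggregating per user before classification means each user contributes one vote to the poll, which matches the intent of the last test case in the table.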
4. CONCLUSION
In this paper we proposed a set of machine learning and data mining techniques to classify tweets and analyse
people's sentiment about an issue using the random forest classification algorithm. The accuracy of this model can be
increased considerably by improving opinion orientation, i.e. its capability to identify the correct object and the
correct features.
5. ACKNOWLEDGEMENT
We are thankful to Prof. A. R. Deshpande and Prof. Snehal Shintre for making their resources available to us and for
their valuable recommendations and guidance. We are also thankful to the college authorities for providing us with the
necessary facilities and support.