International Journal of Research In Science & Engineering Volume: 1 Issue: 1 e-ISSN: 2394-8299 p-ISSN: 2394-8280

SOCIAL IMPACT POLL GENERATION SYSTEM FROM TWEETS

Gaurav Kodande1, Tejal Gandhi2, Mrunmayi Shiveshwarkar3, Anuradha Soni4
1 Student, Computer Department, Pune Institute of Computer Technology, gauravkodande1528@gmail.com
2 Student, Computer Department, Pune Institute of Computer Technology, tejalgandhi15@gmail.com
3 Student, Computer Department, Pune Institute of Computer Technology, mws1694@gmail.com
4 Student, Computer Department, Pune Institute of Computer Technology, anuradhasoni1426@gmail.com

ABSTRACT

Over the past decade there has been exponential growth in the use of online resources, in particular social media and microblogging websites such as Twitter. Twitter is a microblogging social networking website with a large and rapidly growing user base, where users write updates of at most 140 characters called ‘tweets’. Twitter allows people to express their opinions and thoughts openly about any topic, social issue or product that interests them. Twitter is therefore a good medium in which to search for potentially interesting trends regarding prominent topics in the news or popular culture. Sentiment analysis (or opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information from source material. This project focuses on implementing a machine learning algorithm to extract audience sentiment relating to current affairs or a trending topic. We collect data from Twitter using its public API and store it in a MongoDB database. We then preprocess and clean the data to obtain a dataset, which is divided into training and test data sets. Using the random forest algorithm trained on the training data set, we classify the data and analyze it to generate an opinion poll about a social issue.
In this way, using sentiment analysis, we generate an opinion poll regarding a trending topic or a social issue from Twitter.

Keywords: Sentiment Analysis, Twitter, Random Forest Algorithm, Data Mining.

1. Introduction

Twitter is a highly sought-after social medium whose user base is growing faster than that of any other microblogging site. Twitter accounts for 25% of the world’s online population, which amounts to 700 million monthly users. 340 million tweets are sent each day on topics ranging from international politics, entertainment, media and sports to social issues. Social media plays a significant role in shaping public opinion across the globe; in recent years Twitter has helped shape two important social revolutions, in Egypt and Libya. Public opinion on social issues can be critical input when a government decides future policy, and sentiment analysis of tweets can provide valuable insight into citizens’ opinions on an issue or topic. Sentiment analysis can be approached in various ways, such as supervised learning, unsupervised learning, structured prediction and reinforcement learning. The proposed system uses the random forest algorithm to classify data; this is a supervised learning method that we consider more efficient than other classification algorithms in this category.

1.2 Supervised Learning

Supervised learning refers to the machine learning task of inferring a function from labelled training data. The random forest algorithm is a supervised, ensemble learning method. The term ‘ensemble method’ refers to a method that builds multiple hypotheses from the same base learner.

IJRISE| www.ijrise.org|editor@ijrise.org

To solve a supervised learning problem, the following steps should be followed.
1. Determine the type of training examples. The user must decide what kind of data is to be used as the training data set. E.g. in the case of text data, an example could be a single character, a single line or multiple lines of text.
2. Gather a training data set that is representative of the real-world use of the function. Thus, a set of input objects and corresponding outputs is collected. E.g. tweets from Twitter, or product reviews from an e-retail site.
3. Determine the input feature representation of the learned function. The accuracy of the learned function depends on how an input object is represented. For greater accuracy, an input object should be transformed into a feature vector containing a number of features that describe the object. E.g. when analysing product reviews, the reviews should be cleaned before classification so that only feature words such as good, bad, excellent etc. remain.
4. Select the structure of the learned function and the corresponding learning algorithm. E.g. Random Forest, Support Vector Machines, Relevance Vector Machines.
5. Run the selected learning algorithm on the gathered training data set. Some supervised learning methods require the user to tune certain parameters; these can be adjusted by optimizing performance on a subset of the training data set.
6. Evaluate the accuracy of the learned function. The performance of the resulting function should be measured on a separate test set, not on the training data set.

Supervised learning has multiple approaches and algorithms, including analytical learning, artificial neural networks, random forests, Bayesian statistics, the naive Bayes classifier and support vector machines.

2. PROPOSED WORK

2.1 ARCHITECTURE

Fig-1: Architecture of the system

In the proposed system, tweets related to a particular keyword or set of keywords are collected using the Twitter API and stored in a MongoDB database. Data preprocessing and cleaning is performed by filtering inactive links and user activity. The training data set is built from a subset of the cleaned data, and the cleaned data is stored in the database again. A test data set is then created for the classifier, and the output of the classifier is analyzed. The analyzed results are displayed graphically to the user.

Preprocessing and cleaning the data involves eliminating incomplete, noisy and inconsistent data. It includes tasks such as removing URLs, since a URL does not contribute to the sentiment or opinion being analysed. For example, consider the sentence “I’ve logged on to www.behappy.com, as I’m sad”. The sentence is negative, but it would be scored as neutral because of the word ‘happy’ inside the URL; this is why URLs need to be eliminated. After preprocessing, features are extracted. A random forest classifier is then trained on the training data set, in which each tweet is labelled with positive or negative sentiment.

2.2 RANDOM FOREST

Random forest is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees. Decision trees are the individual learners that are combined. They are among the most popular learning methods and are commonly used for data exploration. One type of decision tree is CART, which stands for classification and regression tree. CART is a greedy, top-down, recursive partitioning method that divides the feature space into disjoint rectangular regions.
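The greedy step that CART repeats at every node can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the proposed system: the text does not name a split criterion, so Gini impurity is assumed here, and the function names `gini` and `best_split` as well as the toy data are hypothetical. Each candidate split is a threshold on a single feature, which is why the resulting regions are rectangular.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Greedily pick the (feature, threshold) with the lowest weighted Gini.

    rows is a list of equal-length numeric feature vectors. A split sends
    a row left when row[feature] <= threshold and right otherwise.
    Returns (feature_index, threshold, weighted_impurity), or None when
    no split separates the rows.
    """
    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best


# Toy data set with two predictors (hypothetical values, assumed for illustration).
rows = [[1.0, 5.0], [2.0, 4.0], [3.0, 6.0], [6.0, 5.0], [7.0, 4.0], [8.0, 6.0]]
labels = ["neg", "neg", "neg", "pos", "pos", "pos"]
split = best_split(rows, labels)
print(split)  # (0, 4.5, 0.0)
```

On this toy data the search selects feature 0 with threshold 4.5, which separates the two classes completely. A full tree would apply `best_split` recursively to the rows on each side of the chosen threshold until the nodes are pure.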
Random forest is regarded as one of the most accurate general-purpose learning algorithms; it runs efficiently on large data sets and can handle thousands of input variables.

Each tree is constructed using the following algorithm:
1. Let the number of training cases be N, and the number of variables in the classifier be M.
2. The number m of input variables used to determine the decision at a node of the tree must be fixed in advance; m should be much less than M.
3. Choose a training set for this tree by sampling N times with replacement from all N available training cases (i.e. take a bootstrap sample). Use the remaining cases to estimate the error of the tree by predicting their classes.
4. For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set.
5. Each tree is fully grown and not pruned (as may be done when constructing a normal tree classifier).

For prediction, a new sample is pushed down the tree and assigned the label of the training samples in the terminal node it ends up in. This procedure is repeated over all trees in the ensemble, and the majority vote of all trees (the mode of their predictions) is reported as the random forest prediction.

Figure panels: simple data set with two predictors; greedy recursive partitioning.

The following flowchart explains the working mechanism of the random forest algorithm.

Fig-2: Mechanism of random forest algorithm.

3. ISSUES WITH SENTIMENT ANALYSIS

Object Identification: The object is the entity about which sentiment is to be analysed; for an election, for example, the object could be ‘#indianelection’. This problem is significant because the analysis cannot proceed without knowing which object it concerns, and a wrongly identified object can distort the results.
In a typical case the user wants sentiment analysis on a contested issue, so the system has to separate relevant objects from irrelevant ones. E.g. a tweet about the failure of the latest HTC smartphone makes HTC the desired keyword or object, but the same tweet may also mention objects such as Apple, Samsung or Sony. This is where the system has to separate relevant and irrelevant objects.

Feature Extraction and Synonym Matching: In the above example the desired keyword was HTC, but the current system mainly finds nouns and noun phrases; it cannot distinguish between a verb feature and a noun feature. We also need to find, match and group synonyms, as people often use different words or phrases to describe the same entity.

Opinion Orientation: This part of the system should determine whether a sentence contains an opinion on a feature, and classify that opinion as positive, negative or neutral. The core issue is to identify sentiment words such as good, bad, worst, amazing, excellent, ok, not bad etc. This is the key area in which a sentiment analysis system can be improved. E.g. a tweet saying ‘this transport service stands nowhere close to being good’ actually means that the service is bad and carries a negative sense, but a traditional sentiment analysis system would take it as a positive opinion.

Sr. No. | Module | Test ID | Case description | Output
1 | Data Collection | 1 | Tweets in cross language. | Tweets are ignored.
1 | Data Collection | 2 | Posts on other social media linked with the user’s Twitter account. | Such posts will not be considered.
1 | Data Collection | 3 | Post shared with Twitter from another social medium linked/connected with the user’s Twitter account. | This type of post will be considered as a regular tweet.
2 | Preprocessing | 1 | Abbreviations and acronyms used in a tweet. | Abbreviations and acronyms will be converted to proper words.
3 | Classification | 1 | A single tweet contains multiple sentiment words like ‘unhappy’, ‘not’, ‘sad’. | The group of words will be considered together to avoid wrong analysis.
3 | Classification | 2 | A single tweet expresses an opinion about multiple issues. | Keywords will be separated and the tweet will be considered for multiple categories.
4 | Analysis and Display | 1 | A single user writes multiple tweets on the same issue. | All the tweets will be aggregated as a single tweet.

Table no.1: Unit Testing

4. CONCLUSION

In this paper we proposed a set of machine learning and data mining techniques to classify tweets and analyse people’s sentiment about an issue using the random forest classification algorithm. The accuracy of this model can be increased substantially by improving opinion orientation, i.e. its capability to identify the correct object and the correct features.

5. ACKNOWLEDGEMENT

We are thankful to Prof. A. R. Deshpande and Prof. Snehal Shintre for making their resources available to us and for their valuable recommendations and guidance. We are also thankful to the college authorities for providing the necessary infrastructure and support.