Stage-II Project Report

A PRELIMINARY PROJECT REPORT ON
“EVENT BASED SENTIMENT ANALYSIS ON TWITTER USING MACHINE LEARNING ALGORITHMS”

SUBMITTED TO THE SAVITRIBAI PHULE PUNE UNIVERSITY, PUNE
IN THE PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE AWARD OF THE DEGREE OF
BACHELOR OF ENGINEERING (COMPUTER ENGINEERING)

SUBMITTED BY
PIYUSH KULKARNI B150534305
OMKAR YELPALE B150534418
PIYUSH JADHAV B150534357
ASHWINI BAGADE B150534211

DEPARTMENT OF COMPUTER ENGINEERING
ZEAL COLLEGE OF ENGINEERING AND RESEARCH
NARHE, PUNE - 411041
ZCOER, Department of Computer

SAVITRIBAI PHULE PUNE UNIVERSITY
2020-2021

CERTIFICATE

This is to certify that the project report entitled “EVENT BASED SENTIMENT ANALYSIS ON TWITTER USING MACHINE LEARNING ALGORITHMS”, submitted by

PIYUSH KULKARNI B150534305
OMKAR YELPALE B150534418
PIYUSH JADHAV B150534357
ASHWINI BAGADE B150534211

who are bona fide students of this institute, describes work carried out by them under the supervision of Prof. R. T. Waghmode, and that it is approved for the partial fulfilment of the requirement of Savitribai Phule Pune University for the award of the degree of Bachelor of Engineering (Computer Science Engineering).

Prof. R. T. Waghmode, Project Guide, Department of Computer Engineering
External Examiner
Prof. A. V. Mote, Head, Department of Computer Engineering
Dr. Ajit M. Kate, Principal, Zeal College of Engineering & Research

Place: Pune
Date: / /

ACKNOWLEDGEMENT

We wholeheartedly express our deepest gratitude to Prof. Rupali Waghmode for her constant help and guidance during the whole process of project development. We are also thankful to our project guide for the encouragement given towards the completion of the project.

We would like to express our deepest appreciation to Dr. Ajit M. Kate, Principal, Zeal College of Engineering and Research, Narhe, Pune, and to Prof. A. V. Mote, Head of the Computer Engineering Department, whose invaluable guidance supported us in completing this project.
Lastly, we are also thankful to all the faculty and staff of the Computer Department for their kind co-operation and help.

NAME OF THE STUDENTS:
PIYUSH KULKARNI
OMKAR YELPALE
PIYUSH JADHAV
ASHWINI BAGADE

ABSTRACT

This project addresses the problem of sentiment analysis on Twitter, that is, classifying tweets according to the sentiment expressed in them: positive, negative or neutral. Twitter is an online micro-blogging and social-networking platform which allows users to write short status updates with a maximum length of 140 characters. It is a rapidly expanding service with over 200 million registered users, of which 100 million are active users and half of them log on to Twitter on a daily basis, generating nearly 250 million tweets per day. Given this volume of usage, we hope to obtain a reflection of public sentiment by analysing the sentiments expressed in the tweets. Analysing public sentiment is important for many applications, such as firms trying to find out the response to their products in the market, predicting political elections and predicting socioeconomic phenomena like the stock exchange. The aim of this project is to develop a functional classifier for accurate and automatic sentiment classification of an unknown tweet stream. In this project we have used a Python library to stream live raw data from Twitter, and we combine it with a standardised dataset. Then, using Machine Learning algorithms, we carry out the sentiment analysis. This project targets tweets about a specific event, so it can be used to analyse the impact of that event on social media.

TABLE OF CONTENTS

LIST OF ABBREVIATIONS
LIST OF FIGURES

1. Introduction
   1.1. Overview
   1.2. Motivation
   1.3. Problem Definition
   1.4. Objectives
   1.5. Project Scope
   1.6. Limitations
   1.7. Methodologies of Problem Solving
2. Literature Survey
3. Software Requirements Specification
   3.1. Assumptions and Dependencies
   3.2. Functional Requirements
   3.3. Non-Functional Requirements
        3.3.1. Performance Requirements
        3.3.2. Safety Requirements
        3.3.3. Security Requirements
        3.3.4. Software Quality Attributes
   3.4. System Requirements
        3.4.1. Database Requirements
        3.4.2. Software Requirements
        3.4.3. Hardware Requirements
   3.5. Analysis Models: SDLC Model to be Applied
4. System Design
   4.1. System Architecture
   4.2. Data Flow Diagram
5. Project Plan
   5.1. Project Estimate
   5.2. Risk Management
   5.3. Project Schedule
   5.4. Team Organizations
   5.5. Team Structures
   5.6. Management Reporting and Communication
   5.7. Timeline Chart
6. Project Implementations
   6.1. Overview of Project Modules
   6.2. Tools and Technologies Used
   6.3. Algorithms Details
7. Results
8. Other Specifications
   8.1. Advantages
   8.2. Limitations
   8.3. Applications
9. Conclusion & Future Work
10. References

LIST OF ABBREVIATIONS

ABBREVIATION    ILLUSTRATION
SVM             Support Vector Machine
ML              Machine Learning
SDLC            Software Development Lifecycle
UML             Unified Modelling Language

LIST OF FIGURES

FIGURE    ILLUSTRATION
1         Development Process
2         System Architecture
3         System Flow Diagram

INTRODUCTION

1.1 Overview
We have chosen to work with Twitter since we feel it is a better approximation of public sentiment than conventional internet articles and web blogs. The reason is that the amount of relevant data is much larger for Twitter than for traditional blogging sites. Moreover, the response on Twitter is more prompt and also more general (since the number of users who tweet is substantially larger than the number who write web blogs on a daily basis).
Sentiment analysis of the public is highly relevant to macro-scale socioeconomic phenomena, such as predicting the stock market value of a particular firm. This could be done by analysing overall public sentiment towards that firm over time and using economic tools to find the correlation between public sentiment and the firm’s stock market value. Firms can also estimate how well their product is being received in the market, and in which areas of the market the response is favourable or negative (since Twitter allows us to download a stream of geo-tagged tweets for particular locations). If firms can get this information, they can analyse the reasons behind the geographically differentiated response and market their product in a more optimized manner, for example by creating suitable market segments. Predicting the results of popular political elections and polls is also an emerging application of sentiment analysis. One such study was conducted by Tumasjan et al. in Germany for predicting the outcome of federal elections, and it concluded that Twitter is a good reflection of offline sentiment.

1.2 Motivation
Nowadays, social networking sites like Twitter, Facebook, Instagram and YouTube are highly popular. The main aim of the present work is sentiment analysis: extracting a person’s behaviour, mood, opinion and experience from text data. Huge volumes of text on social sites are not written in a proper manner, and collecting information manually from such unstructured data is a very difficult task. The main motivation for choosing this domain is that we felt it necessary to analyse the behaviour of people during certain events and how it is reflected on social media. We chose Twitter for the analysis because it provides large amounts of data about people’s tweets in a very raw format.
By using this system, we intend to analyse the sentiments of a specific person’s tweets or of the overall tweets posted during a particular event. In the recent past India has faced many problems, such as the Covid-19 pandemic, the Farmer’s protest and the Student’s protest, and these topics were widely discussed on social media. We felt that analysing such events collectively would help in understanding people’s feelings and in finding solutions to the underlying problems.

1.3 Problem Definition
We live in an era of social media, in which people use different social media platforms to express their thoughts and opinions. This system performs sentiment analysis of those posts and tweets using machine learning algorithms and Natural Language Processing.

1.4 Objectives
Following are the three main objectives of the project:
1. Analysis of a particular event or situation.
2. Identifying the polarity.
3. Usage.

1.5 Project Scope
This project will be helpful to companies and political parties as well as to common people. Political parties can use it to review public reaction to a programme they are planning or have already carried out. Similarly, companies can get reviews of a newly released hardware or software product, and movie makers can get reviews of a movie currently in theatres. Using the Twitter analyser, we can find out how positive, negative or neutral people are about a topic.

1.6 Limitations
● The precision of the prediction varies according to the dataset provided and the algorithms used.
● The system is limited to only one social media platform.

1.7 Methodologies of Problem Solving
The system for sentiment analysis is divided into four parts, as follows:
1. Training Data: Collecting a dataset of tweets posted by users.
Here we have a collection of 40,000 samples for training, gathered from GitHub and Kaggle.
2. Pre-processing: To improve the performance of the analysis, the data must be pre-processed before it is explored. First, tokenization (splitting a string into a sequence of tokens) is applied, converting the input stream into separate words.
3. Feature Extraction: To build the sentiment analysis model, we extract features from the text data, which are broadly categorized into morphological features and word N-gram features.
4. Classification Model: In our sentiment analysis model, machine learning algorithms are used as classifiers and are trained on the training sample.

LITERATURE SURVEY

Social Media Sentiment Analysis on Twitter Datasets
Authors: Shikha Tiwari, Anshika Verma, Peeyush Garg, Deepika Bansal
Methodology:
1. Retrieve tweets
2. Pre-Processing
3. Sentiment Analysis

Sentiment Analysis of Twitter Data: A Survey of Techniques
Authors: Vishal A. Kharde, S.S. Sonawane
Methodology:
1. Supervised Learning
2. Unsupervised Learning
3. Naïve Bayes

Twitter Text Mining for Sentiment Analysis on People’s Feedback about Oman Tourism
Authors: Vallikannu Ramanathan, T. Meyyappan
Methodology:
1. Domain specific Ontology Entity
2. Specific Opinion Extraction
3. Lexicon based Approach

Sentiment Analysis of Twitter Data
Author: Radhi D. Desai
Methodology:
1. Retrieve tweets
2. Pre-Processing
3. Sentiment Score
4. Review of sentiments

Natural Language Processing for Sentiment Analysis
Authors: Wei Yen Chong, Bhawani Selvaretnam, Lay-Ki Soon
Methodology:
1. Subjectivity Classification
2. Semantic Association
3. Polarity Classification

Short Survey on Naive Bayes Algorithm
Authors: Pouria Kaviani, Mrs. Sunita Dhotre
Methodology:
1. Retrieve tweets
2. Pre-Processing
3. Semantic Association
4. Polarity Classification

SOFTWARE REQUIREMENTS SPECIFICATION

3.1 Assumptions and Dependencies:
The software depends heavily on the availability and quality of the dataset and on the ML algorithm used. The better the dataset and the algorithm, the more accurate the sentiment analysis will be.

3.2 Functional Requirements:
Functional requirements are statements of the services the system should provide, how the system should react to particular inputs and how the system should behave in particular situations.
● Machine learning algorithms are trained on the dataset.
● The system learns a classification model from the training data.
● The system classifies the text input provided according to its polarity.

3.3 Non-Functional Requirements:
Non-functional requirements define system properties and constraints. They arise through user needs, because of budget constraints or organizational policies, or due to external factors such as safety regulations and privacy legislation.

3.3.1 Performance Requirements:
The software should produce the expected output quickly for the given input parameters; in this case, the output is the sentiment analysis of the input.

3.3.2 Safety Requirements:
No safety requirements are needed.

3.3.3 Security Requirements:
There are no security requirements.

3.3.4 Software Quality Attributes:
The software should produce the expected output quickly for the given input parameters; in this case, the output is the sentiment analysis of the input. The output should also be clearly categorised, without any ambiguity.
3.4 System Requirements:

3.4.1 Database Requirements
● Database: MySQL

3.4.2 Software Requirements (Platform choice)
● Operating System: Windows 10
● Technology: Python 3.5.4
● Front End: Mobile Application

3.4.3 Hardware Requirements
● Processor: At least Android version 2.0
● RAM: 4 GB
● Hard Disk: 8 GB

3.5 Analysis Models: SDLC Model to be applied:
• Requirement gathering and analysis: In this step of the waterfall model, we identify the requirements of our project, such as the software and hardware required, the database and the interfaces.
• System Design: In the system design phase, we design a system that is easily understandable for the end user, i.e. user friendly. We prepare UML diagrams and a data flow diagram to understand the system flow, the system modules and the sequence of execution.
• Implementation: In the implementation phase, we implement the modules required to obtain the expected outcome at each module level. With inputs from system design, the system is first developed in small programs called units, which are integrated in the next phase. Each unit is developed and tested for its functionality, which is referred to as Unit Testing.
• Testing: Test cases are executed to check whether the project modules give the expected outcome within the expected time. All the units developed in the implementation phase are integrated into a system after each unit has been tested. After integration, the entire system is tested for faults and failures.
• Deployment of System: Once the functional and non-functional testing is done, the product is deployed in the customer environment or released into the market.
• Maintenance: After successful deployment, the system is maintained so that all of its functionality continues to work.
● We have used the agile SDLC model for the project development process.

Figure 1: Development Process

SYSTEM DESIGN

4.1 System Architecture:
The process of designing a functional classifier for sentiment analysis can be broken down into three basic categories. They are as follows:
I. Data Acquisition
II. Feature Extraction
III. Classification

Data Acquisition: Data in the form of raw tweets is acquired by using the Python library “tweetstream”, which provides a package for the simple Twitter streaming API. This API allows two modes of accessing tweets: SampleStream and FilterStream. SampleStream simply delivers a small, random sample of all the tweets streaming in real time. FilterStream delivers tweets which match a certain criterion. It can filter the delivered tweets according to three criteria:
• Specific keyword(s) to track/search for in the tweets
• Specific Twitter user(s) according to their user-ids
• Tweets originating from specific location(s) (only for geo-tagged tweets)

Feature Extraction: Now that we have arrived at our training set, we need to extract useful features from it which can be used in the process of classification. But first we will discuss some text formatting techniques which will aid us in feature extraction:
• Tokenization: It is the process of breaking a stream of text up into words, symbols and other meaningful elements called “tokens”. Tokens can be separated by whitespace characters and/or punctuation characters. It is done so that we can look at tokens as the individual components that make up a tweet.
• Punctuation marks and digits/numerals may be removed if, for example, we wish to compare the tweet to a list of English words.
• Stemming: It is the text-normalizing process of reducing a derived word to its root or stem. For example, a stemmer would reduce the words “stemmer”, “stemmed” and “stemming” to the root word “stem”.
The advantage of stemming is that it makes comparison between words simpler, as we do not need to deal with complex grammatical transformations of the word.
• Stop-words removal: Stop words are a class of extremely common words which hold no additional information when used in a text and are thus claimed to be useless [19]. Examples include “a”, “an”, “the”, “he”, “she”, “by”, “on”, etc. It is sometimes convenient to remove these words because they are used almost equally in all classes of text.
• Parts-of-Speech Tagging: POS-Tagging is the process of assigning a tag to each word in the sentence according to the grammatical part of speech that word belongs to, i.e., noun, verb, adjective, adverb, coordinating conjunction, etc.

Classification: Pattern classification is the process through which data is divided into different classes according to common patterns which are found in one class and which differ to some degree from the patterns found in the other classes. The ultimate aim of our project is to design a classifier which accurately classifies tweets into the following four sentiment classes: positive, negative, neutral and ambiguous.

Figure 2: System Architecture

4.2 Data Flow Diagram:
Describing and documenting data is essential in ensuring that the researcher, and others who may need to use the data, can make sense of the data and understand the processes that have been followed in the collection, processing and analysis of the data. Research data are any physical and/or digital materials that are collected, observed or created in research activity for purposes of analysis to produce original research results or creative works.
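The text formatting steps described in Section 4.1 (tokenization, removal of punctuation and digits, stop-word removal and stemming) can be sketched in Python as follows. This is only a minimal illustration and not the project's actual code: the function names `tokenize`, `stem` and `preprocess` are hypothetical, the stop-word list contains only the examples quoted in the text, and the stemmer is a crude suffix-stripping rule rather than an established algorithm such as Porter's. POS tagging is not covered.

```python
import re

# Illustrative stop-word list, limited to the examples given in the text.
STOP_WORDS = {"a", "an", "the", "he", "she", "by", "on"}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only.

    Punctuation marks and digits are dropped as a side effect.
    """
    return re.findall(r"[a-z]+", text.lower())


def stem(word):
    """Toy stemmer (an assumption for illustration): strip one common suffix."""
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "stemm" -> "stem".
            if word[-1] == word[-2] and word[-1] not in "aeiou":
                word = word[:-1]
            break
    return word


def preprocess(tweet):
    """Tokenize a tweet, remove stop words, then stem the remaining tokens."""
    return [stem(tok) for tok in tokenize(tweet) if tok not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("The farmers protested on 26 January!"))
```

A real system would normally also handle URLs, user mentions and hashtags before tokenizing, and would use a tested stemmer from an NLP library instead of the rule above.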
Figure 3: System Flow Diagram

PROJECT PLAN

5.1 Project Estimate

Function                          Estimated LOC
User Interface Methods            600
Data Pre-processing               211
Developing the algorithm          69
Connecting the Modules            420
Application Program Interface     87

5.2 Risk Management
• Risk Identification: Risk identification is the core task within the risk management process; it describes and classifies the risks in the project. All the information gathered and analysed during the identification of risks, with the help of risk identification software tools, serves as a foundation for further risk analysis, evaluation and estimation. In this project, the risk is predicting the wrong sentiment for the given tweets.
• Risk Analysis: Risk analysis in project management is intended to provide project leadership with contingency information for scheduling, budgeting and project control purposes, as well as to provide tools to support decision making and risk management as the project progresses through planning and implementation. In this project, the risk of a wrong prediction is reduced by combining the prediction results of three algorithms.
• Overview of Risk Mitigation, Monitoring, Management

Risk Mitigation: To mitigate this risk, project management must develop a strategy for predicting the correct result. The possible steps to be taken are:
1. Select multiple algorithms for prediction (naïve Bayes, random forest, decision tree).
2. Mitigate those causes that are under our control before the project starts.
3. Once the project commences, assume that prediction errors will occur and develop techniques to ensure maximum accuracy of prediction.
4. Organize project teams so that information about each development activity is widely dispersed.
5. Define documentation standards and establish mechanisms to ensure that documents are developed in a timely manner.
Risk Monitoring: As the project proceeds, risk monitoring activities commence. The project manager monitors factors that may indicate whether the risk is becoming more or less likely. In the case of project management, the following factors can be monitored:
1. General attitude of team members based on project pressures.
2. Interpersonal relationships among team members.
3. Problem solving with the algorithms selected.

Risk Management: Risk management and contingency planning assume that mitigation efforts have failed and that the risk has become a reality. Continuing the example, the project is well underway and there is a problem in predicting the result as accurately as possible. If the mitigation strategy has been followed, the result will be predicted accurately, information is documented and knowledge has been dispersed across the team.

5.3 PROJECT SCHEDULE

Phase      Task                Description
Phase-1    Analysis            Analyse the information given in the IEEE paper.
Phase-2    Literature survey   Collect raw data and elaborate on literature surveys.
Phase-3    Design              Assign the modules and design the process flow control.
Phase-4    Implementation      Implement the code for all the modules and integrate all the modules.
Phase-5    Testing             Test the code and the overall process to check whether it works properly.
Phase-6    Documentation       Prepare the document for this project with conclusion and future enhancement.

5.4 TEAM ORGANIZATIONS
A team of 4 students has been formed to design and develop the proposed system.

5.5 TEAM STRUCTURES
The team structure for the project is identified. Roles are defined.

5.6 MANAGEMENT REPORTING AND COMMUNICATIONS
Mechanisms for progress reporting and inter/intra team communication are identified as per the assessment sheet and the lab time table.
5.7 TIMELINE CHART
Activity (Year 2020 to Year 2021):
• Literature survey and review, synopsis
• Project architecture, design, selection of algorithm
• Designing and developing the user interface for the user
• Designing features for project
• Designing and developing the Django application
• Integrating Twitter API & fetching tweets
• Connecting the tweets with algorithm and integration with UI
• Preparation of research paper and publishing
• Report writing

PROJECT IMPLEMENTATIONS

6.1 OVERVIEW OF PROJECT MODULES
This project will be helpful to companies and political parties as well as to common people. Political parties can use it to review public reaction to a programme they are planning or have already carried out. Similarly, companies can get reviews of a newly released hardware or software product, and movie makers can get reviews of a movie currently in theatres. Using the Twitter analyser, we can find out how positive, negative or neutral people are about a topic.
1. User Interface Module: The user enters a search query for a particular event (e.g., Covid19), and the overall sentiment analysis of the most recent tweets is displayed as a graphical representation (pie chart).
2. Prediction Module: In this module, a pre-defined labelled dataset is provided to the Python script, which pre-processes the data and trains the model using Machine Learning algorithms. The model is then used to classify the sentiment of a given dataset into three categories, i.e., Positive, Negative and Neutral.
3. Twitter Module: This module communicates with the Twitter API and fetches tweets depending on the search query.
4. Django Module: The prediction module and the Twitter module are integrated with the Django framework, along with the API which interacts with the User Interface.
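As an illustration of the Prediction Module, the sketch below trains a small classifier on labelled text and tallies the predicted sentiments, which is the kind of summary the User Interface Module displays as a pie chart. It is a minimal sketch under stated assumptions, not the project's implementation: it uses a multinomial Naïve Bayes model with add-one smoothing written in plain Python, the class name `NaiveBayesSentiment` and the helper `summarise` are hypothetical, and the six training tweets are made up for the example rather than taken from the project's dataset.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSentiment:
    """Multinomial Naive Bayes over unigram counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # number of documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # log P(class) + sum over tokens of log P(token | class)
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def summarise(model, tweets):
    """Count predicted sentiments, e.g. as input for a pie chart."""
    return Counter(model.predict(t) for t in tweets)


# Hypothetical labelled examples, invented for this sketch.
TRAIN_TEXTS = [
    "great support for the farmers",
    "love this peaceful protest",
    "terrible handling of the crisis",
    "hate the new rules",
    "protest scheduled for monday",
    "cyclone expected on friday",
]
TRAIN_LABELS = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

model = NaiveBayesSentiment().fit(TRAIN_TEXTS, TRAIN_LABELS)

if __name__ == "__main__":
    print(summarise(model, ["love the farmers", "hate this terrible crisis"]))
```

In practice the texts would first pass through the pre-processing steps of Section 4.1, and a library implementation would usually be preferred over a hand-written classifier.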
6.2 TOOLS AND TECHNOLOGIES
I) TOOLS USED:
• Android Studio
• VS Code
• Google Colab
II) TECHNOLOGIES:
• Django
• Flutter

6.3 ALGORITHM DETAILS
i) Random Forest:
● It is an ensemble classifier built from many decision tree models; it can be used for regression as well as classification.
● A random forest is a classifier consisting of a collection of tree-structured classifiers built from independent, identically distributed random vectors, where each tree casts a unit vote for the class of the input.
● Random forest uses the Gini index for choosing splits and determining the final class in each tree.
● The final classes of the individual trees are aggregated and voted on, using weighted values, to construct the final classifier.
● Random forest works as follows: a random seed is chosen, which draws at random a collection of samples from the training dataset while maintaining the class distribution.

ii) Naïve Bayes:
● It is used to predict categorical class labels.
● It builds a model from the training set and the values of the classifying attribute, and uses that model to classify new data.
● It is a two-step process: Model Construction and Model Usage.
● Bayes' theorem is named after Thomas Bayes; Naïve Bayes is a statistical, supervised learning method for classification based on it.
● It can handle both categorical and continuous-valued attributes.
● Bayes' theorem gives the probability of an event occurring given that another event has already occurred. It is stated mathematically as the following equation:
P(A|B) = P(B|A) P(A) / P(B)

iii) Logistic Regression:
• It is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable.
• In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) as a function of X.
• Multinomial logistic regression is an extension of logistic regression that adds native support for multi-class classification problems.
• The extension involves changing the loss function to cross-entropy loss and changing the predicted probability distribution to a multinomial probability distribution.

RESULTS

1. Django:
Sentiment analysis of top 50 tweets regarding the “Django” query
Sentiment analysis of top 50 tweets regarding “Farmer’s Protest”

2. Flutter Screens:
User input to fetch tweets
Sentiment analysis for event “Farmer’s Protest”
Sentiment analysis for event “covid19”
Sentiment analysis for event “cyclone”

OTHER SPECIFICATIONS

8.1 Advantages
i) This information can be used to make wiser decisions about the use of resources, to make improvements in organizations, to provide better products and services, and ultimately to improve citizens' lifestyles and human relations in order to achieve a better society.
ii) Social media is the current environment for collecting data and analysing people's sentiments. People can share and comment on everything, from personal thoughts to common events or topics in society. Access to social media can also provide more information in the form of hidden metadata, for instance operating system language, device type, capture time and geographical location.
8.2 Limitations
i) Despite the possible positive outcomes shown, automatic analysis has some disadvantages: it is difficult to implement because of the ambiguity of natural language and the characteristics of the posted content.
ii) Sentiment analysis tools can identify and analyse many pieces of text automatically and quickly. But computer programs have problems recognizing things like sarcasm and irony, negations, jokes and exaggerations, the sorts of things a person would have little trouble identifying. Failing to recognize these can skew the results.

8.3 Applications
i) The results from sentiment analysis help businesses understand the conversations and discussions taking place about them, and help them react and take action accordingly. They can quickly identify any negative sentiments being expressed and turn poor customer experiences into very good ones.
ii) By listening to and analysing comments on Facebook and Twitter, local government departments can gauge public sentiment towards their department and the services they provide, and use the results to improve services such as parking and leisure facilities, local policing and the condition of roads.
iii) Universities can use sentiment analysis to analyse student feedback and comments garnered either from their own surveys or from online sources such as social media. They can then use the results to identify and address any areas of student dissatisfaction, as well as identify and build on those areas where students are expressing positive sentiments.

CONCLUSION AND FUTURE WORK

The task of sentiment analysis, especially in the domain of micro-blogging, is still in the developing stage and far from complete. So, we propose a couple of ideas which we feel are worth exploring in the future and may result in further improved performance.
Right now, we have worked with only the very simplest unigram models; we can improve those models by adding extra information, like the closeness of the word to a negation word. We could specify a window prior to the word under consideration (a window could, for example, be 2 or 3 words), and the effect of negation may be incorporated into the model if it lies within that window. The closer the negation word is to the unigram word whose prior polarity is to be calculated, the more it should affect the polarity. For example, if the negation is right next to the word, it may simply reverse the polarity of that word, and the farther the negation is from the word, the smaller its effect should be.

Apart from this, we are currently focusing only on unigrams, and the effect of bigrams and trigrams may be explored. As reported in the literature review section, using bigrams along with unigrams usually enhances performance. However, for bigrams and trigrams to be an effective feature we need a much larger labelled data set than our meagre 9,000 tweets.

Right now, we are exploring Parts of Speech separately from the unigram models; we can try to incorporate POS information within our unigram models in future. So, instead of calculating a single probability for each word, like P(word | obj), we could have multiple probabilities for each word according to the Part of Speech the word belongs to. For example, we may have P(word | obj, verb), P(word | obj, noun) and P(word | obj, adjective). Pang et al. used a somewhat similar approach and claim that appending POS information to every unigram results in no significant change in performance (with Naive Bayes performing slightly better and SVM having a slight decrease in performance), while there is a significant decrease in accuracy if only adjective unigrams are used as features.
However, these results are for the classification of reviews and would need to be verified for sentiment analysis on micro-blogging websites like Twitter.

One more feature that is worth exploring is whether information about the relative position of a word in a tweet has any effect on the performance of the classifier. Although Pang et al. explored a similar feature and reported negative results, their results were based on reviews, which are very different from tweets, and they worked on an extremely simple model.

In this research we are focussing on general sentiment analysis. There is potential for work in the field of sentiment analysis with partially known context. For example, we noticed that users generally use our website for specific types of keywords, which can be divided into a couple of distinct classes, namely: politics/politicians, celebrities, products/brands, sports/sportsmen, media/movies/music. So, we can attempt to perform separate sentiment analysis on tweets that belong to only one of these classes (i.e., the training data would not be general but specific to one of these categories) and compare the results with those we get if we apply general sentiment analysis instead.

REFERENCES

[1]. Rosenthal, Sara, Noura Farra, and Preslav Nakov. "SemEval-2017 task 4: Sentiment analysis in Twitter." Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). 2017.
[2]. Pontiki, Maria, et al. "SemEval-2016 task 5: Aspect based sentiment analysis." Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). Association for Computational Linguistics, 2016.
[3]. Wikipedia.com/
[4]. https://www.sciencedirect.com/
[5]. https://ieeexplore.ieee.org/document/8951367
