Stage-II Project Report

A PRELIMINARY PROJECT REPORT ON
“EVENT BASED SENTIMENT ANALYSIS ON TWITTER USING
MACHINE LEARNING ALGORITHMS”
SUBMITTED TO THE SAVITRIBAI PHULE PUNE UNIVERSITY, PUNE
IN THE PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE AWARD OF THE DEGREE
OF
BACHELOR OF ENGINEERING (COMPUTER ENGINEERING)
SUBMITTED BY
PIYUSH KULKARNI
B150534305
OMKAR YELPALE
B150534418
PIYUSH JADHAV
B150534357
ASHWINI BAGADE
B150534211
DEPARTMENT OF COMPUTER ENGINEERING
ZEAL COLLEGE OF ENGINEERING AND RESEARCH
NARHE, PUNE – 411041
ZCOER, Department of Computer
1
SAVITRIBAI PHULE PUNE UNIVERSITY
2020 -2021
CERTIFICATE
This is to certify that the project report entitled
“EVENT BASED SENTIMENT ANALYSIS ON TWITTER USING
MACHINE LEARNING ALGORITHMS”
Submitted by
PIYUSH KULKARNI
B150534305
OMKAR YELPALE
B150534418
PIYUSH JADHAV
B150534357
ASHWINI BAGADE
B150534211
are bonafide students of this institute and the work has been carried out by
them under the supervision of Prof. R. T. Waghmode, and it is approved
for the partial fulfilment of the requirement of Savitribai Phule Pune
University, for the award of the degree of Bachelor of Engineering
(Computer Engineering).
Prof. R. T. Waghmode
Project Guide
Department of Computer Engineering
Prof.
External Examiner
Prof. A. V. Mote
Head
Department of Computer Engineering
Dr. Ajit M. Kate
Principal, Zeal College of Engineering & Research
Place: Pune
Date:    /    /
ACKNOWLEDGEMENT
We wholeheartedly express our deepest gratitude to our project guide,
Prof. Rupali Waghmode, for her constant help, guidance and encouragement
throughout the development and completion of this project.
We would also like to express our deepest appreciation to
Dr. Ajit M. Kate, Principal, Zeal College of Engineering and Research, Narhe,
Pune, and Prof. A. V. Mote, Head of the Computer Engineering Department, whose
invaluable guidance supported us in completing this project.
Lastly, we are thankful to all the faculty and staff of the Computer
Department for their kind co-operation and help.
NAME OF THE STUDENTS:
PIYUSH KULKARNI
OMKAR YELPALE
PIYUSH JADHAV
ASHWINI BAGADE
ABSTRACT
This project addresses the problem of sentiment analysis on Twitter, that is,
classifying tweets according to the sentiment expressed in them: positive,
negative or neutral. Twitter is an online micro-blogging and social-networking
platform which allows users to write short status updates (originally limited
to 140 characters, and to 280 characters since 2017). It is a rapidly expanding
service with over 200 million registered users, of which 100 million are active
users and half of them log on to Twitter on a daily basis, generating nearly
250 million tweets per day. Given this large amount of usage, we hope to obtain
a reflection of public sentiment by analysing the sentiments expressed in the
tweets. Analysing public sentiment is important for many applications, such as
firms trying to find out the response to their products in the market,
predicting political elections and predicting socioeconomic phenomena like
stock exchange movements. The aim of this project is to develop a functional
classifier for accurate and automatic sentiment classification of an unknown
tweet stream. In this project we use a Python library to stream live raw data
from Twitter and combine it with a standardised dataset. We then carry out the
sentiment analysis using Machine Learning algorithms. This project targets
tweets about a specific event, so it can be used to analyse the impact of that
event on social media.
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
LIST OF FIGURES
CHAPTER
1. Introduction
1.1. Overview
1.2. Motivation
1.3. Problem Definition
1.4. Objectives
1.5. Project Scope
1.6. Limitations
1.7. Methodologies of Problem Solving
2. Literature Survey
3. Software Requirements Specification
3.1. Assumptions and Dependencies
3.2. Functional Requirements
3.3. Non-Functional Requirements
3.3.1. Performance Requirements
3.3.2. Safety Requirements
3.3.3. Security Requirements
3.3.4. Software Quality Attributes
3.4. System Requirements
3.4.1. Database Requirements
3.4.2. Software Requirements
3.4.3. Hardware Requirements
3.5. Analysis Models: SDLC Model to be Applied
4. System Design
4.1. System Architecture
4.2. Data Flow Diagram
5. Project Plan
5.1. Project Estimate
5.2. Risk Management
5.3. Project Schedule
5.4. Team Organizations
5.5. Team Structures
5.6. Management Reporting and Communication
5.7. Timeline Chart
6. Project Implementations
6.1. Overview of Project Modules
6.2. Tools and Technologies Used
6.3. Algorithms Details
7. Results
8. Other Specifications
8.1. Advantages
8.2. Limitations
8.3. Applications
9. Conclusion & Future Work
10. References
LIST OF ABBREVIATIONS
ABBREVIATION    ILLUSTRATION
SVM             Support Vector Machine
ML              Machine Learning
SDLC            Software Development Lifecycle
UML             Unified Modelling Language
LIST OF FIGURES
FIGURE    ILLUSTRATION
1         Development Process
2         System Architecture
3         System Flow Diagram
INTRODUCTION
1.1 Overview
We have chosen to work with Twitter since we feel it is a better approximation
of public sentiment than conventional internet articles and web blogs.
The reason is that the amount of relevant data is much larger for Twitter than
for traditional blogging sites. Moreover, the response on Twitter is
more prompt and also more general (since the number of users who tweet is
substantially larger than the number who write web blogs on a daily basis).
Sentiment analysis of the public is highly relevant to macro-scale
socioeconomic phenomena such as predicting the stock market value of a
particular firm. This could be done by analysing overall public sentiment
towards that firm over time and using economic tools to find the correlation
between public sentiment and the firm’s stock market value. Firms can also
estimate how well their product is being received in the market, and in which
areas of the market the response is favourable or negative (since Twitter
allows us to download a stream of geo-tagged tweets for particular locations).
If firms can get this information, they can analyse the reasons behind a
geographically differentiated response, and so market their product in a more
optimized manner by looking for appropriate solutions such as creating suitable
market segments. Predicting the results of popular political elections and
polls is also an emerging application of sentiment analysis. One such study was
conducted by Tumasjan et al. in Germany for predicting the outcome of federal
elections, and it concluded that Twitter is a good reflection of offline
sentiment.
1.2 Motivation
Nowadays, popular social networking sites like Twitter, Facebook,
Instagram and YouTube are in trend. The main aim of the present work is
Sentiment Analysis: uncovering a person’s behaviour, mood, opinion and
experience from text data. A huge amount of text on social sites is not written
in a proper manner, so collecting information manually from that unstructured
data is a very difficult task.
The main motivation for choosing this domain for the project is that we felt
it is immensely necessary to analyse the behaviour of people during certain
events and how it gets reflected on social media. We chose Twitter for the
analysis because we can get a large amount of data about people’s tweets in a
very raw format. With this system, we intend to analyse the sentiment of a
specific person’s tweets or of the overall set of tweets posted during a
particular event.
In the recent past India has faced many problems, such as the Covid-19
pandemic, the Farmers’ protest and Students’ protests. These were topics
which were widely discussed on social media. We felt that analysing the tweets
about such events collectively would help in understanding the feelings of the
people and in finding solutions to the problems.
1.3 Problem Definition
We live in an era of social media, in which people use different social media
platforms to express their thoughts and opinions. This system performs
sentiment analysis of those posts and tweets using machine learning algorithms
and Natural Language Processing.
1.4 Objectives
Following are the three main Objectives of the project:
1. Analysis of a particular event or situation.
2. Identifying the polarity
3. Usage
1.5 Project Scope
This project will be helpful to companies and political parties as well as to
common people. Political parties can use it to review the response to a
programme that they are planning or have already carried out; similarly,
companies can get reviews of newly released hardware or software products.
Movie makers can also gauge the response to a movie that is currently running.
By using the Twitter analyser, we can find out how positive, negative or
neutral people are about a topic.
1.6 Limitations
● The precision of the prediction varies according to the
dataset provided and the algorithms used.
● Limited to only one social media platform.
1.7 Methodologies of Problem Solving
The system for Sentiment Analysis is divided into 4 parts as follows:
1. Training Data: Collecting a dataset of tweets posted by users. Here we have
a collection of 40,000 tweets for training purposes, gathered from
GitHub and Kaggle.
2. Pre-processing: To improve the performance of the
analysis, the data must be pre-processed before it is explored.
First the text is tokenized, that is, the input string is split into
separate words.
3. Feature Extraction: To build the sentiment analysis model, we
have to extract features from the text data. These are broadly
categorized into morphological features and word N-gram features.
4. Classification Model: In our Sentiment Analysis model,
machine learning algorithms are used as classifiers and are trained
on the training sample.
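The tokenization and word N-gram steps above can be illustrated with a minimal sketch. It uses only the Python standard library; the function names and the sample tweet are hypothetical and are not taken from the project code.

```python
import re
from collections import Counter

def tokenize(tweet):
    """Lower-case a tweet and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", tweet.lower())

def word_ngrams(tokens, n):
    """Return the word n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_features(tweet, max_n=2):
    """Count every word n-gram of length 1..max_n in a tweet."""
    tokens = tokenize(tweet)
    features = Counter()
    for n in range(1, max_n + 1):
        features.update(word_ngrams(tokens, n))
    return features

# Hypothetical sample tweet, used only to show the shape of the output.
features = extract_features("Vaccines are safe and vaccines work")
print(features[("vaccines",)])         # unigram count
print(features[("vaccines", "work")])  # bigram count
```

The resulting counts can be passed to any of the classifiers described in section 6.3 as a bag-of-N-grams feature vector.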
LITERATURE SURVEY
Social Media Sentiment Analysis on Twitter Datasets
Authors: Shikha Tiwari, Anshika Verma, Peeyush Garg, Deepika Bansal
Methodology: 1. Retrieve tweets
2. Pre-Processing
3. Sentiment Analysis
Sentiment Analysis of Twitter Data: A Survey of Techniques
Authors: Vishal A. Kharde, S.S. Sonawane
Methodology: 1. Supervised Learning
2. Unsupervised Learning
3. Naïve Bayes
Twitter Text Mining for Sentiment Analysis on People’s Feedback
about Oman Tourism
Authors: Vallikannu Ramanathan, T.Meyyappan
Methodology: 1. Domain specific Ontology Entity
2. Specific Opinion Extraction
3. Lexicon based Approach
Sentiment Analysis of Twitter Data
Author: Radhi D. Desai
Methodology: 1. Retrieve tweets
2. Pre-Processing
3. Sentiment Score
4. Review of sentiments
Natural Language Processing for Sentiment Analysis
Authors: Wei Yen Chong, Bhawani Selvaretnam, Lay-Ki Soon
Methodology: 1. Subjectivity Classification
2. Semantic Association
3. Polarity Classification
Short Survey on Naive Bayes Algorithm
Authors: Pouria Kaviani, Mrs. Sunita Dhotre
Methodology: 1. Retrieve tweets
2. Pre-Processing
3. Semantic Association
4. Polarity Classification
SOFTWARE REQUIREMENTS SPECIFICATION
3.1 Assumptions and Dependencies:
The software depends heavily on the availability and quality of the
dataset and on the ML algorithm used. The better the dataset and algorithm,
the more accurate the Sentiment Analysis will be.
3.2 Functional Requirements:
Functional requirements are statements of the services the system should
provide, how the system should react to particular inputs and how the
system should behave in particular situations.
● The machine learning algorithms are trained on the dataset.
● The machine learns the classification from this training.
● The system classifies the text input provided according to its polarity.
3.3 Non-Functional Requirements:
Non-functional requirements define system properties and constraints. They
arise through user needs, because of budget constraints or organizational
policies, or due to external factors such as safety regulations, privacy
legislation and so on.
3.3.1 Performance Requirements:
The software should give the expected output quickly for the given
input parameters; in this case the expected output is an accurate sentiment
analysis.
3.3.2 Safety Requirements:
No safety requirements needed.
3.3.3 Security Requirements:
No requirements for security.
3.3.4 Software Quality Attributes:
The software should give the expected output quickly for the given input
parameters; in this case the expected output is an accurate sentiment analysis.
Also, the output should be well categorised without any ambiguity.
3.4 System Requirements:
3.4.1 Database Requirements
● Database: MySQL
3.4.2 Software Requirements (Platform choice)
● Operating System: Windows 10
● Technology: Python 3.5.4
● Front End: Mobile Application
3.4.3 Hardware Requirements
● Mobile device: Android version 2.0 or later
● RAM: 4 GB
● Hard Disk: 8 GB
3.5 Analysis Models: SDLC Model to be applied:
• Requirement gathering and analysis:
In this step of the waterfall model, we identify the various
requirements of our project, such as the software and hardware
required, the database, and the interfaces.
• System Design:
In the system design phase, we design a system which is easily
understandable for the end user, i.e. user friendly. We prepare UML
diagrams and a data flow diagram to understand the system flow, the
system modules and the sequence of execution.
• Implementation:
In implementation phase of our project, we have implemented the
various modules required for getting expected outcome at the different
module levels. With inputs from system design, the system is first
developed in small programs called units, which are integrated in the
next phase. Each unit is developed and tested for its functionality
which is referred to as Unit Testing.
• Testing:
The different test cases are performed to test whether the project
modules are giving expected outcome in assured time or not. All the
units developed in the implementation phase are integrated into a
system after testing of each unit. Post integration, the entire system is
tested for any faults and failures.
• Deployment of System:
Once the functional and non-functional testing is done, the product
is deployed in the customer environment or released into the market.
• Maintenance:
After successful deployment of the system, all maintenance
activities are carried out to keep the system functioning.
● We have used the agile SDLC model to follow the project development
process.
Figure 1: Development Process
SYSTEM DESIGN
4.1 System Architecture:
The process of designing a functional classifier for sentiment analysis can be
broken down into three basic categories. They are as follows:
I. Data Acquisition
II. Feature Extraction
III. Classification
Data Acquisition:
Data in the form of raw tweets is acquired by using the Python library
“tweetstream”, which provides a package for the simple Twitter streaming API.
This API allows two modes of accessing tweets: SampleStream and FilterStream.
SampleStream simply delivers a small, random sample of all the tweets
being streamed in real time. FilterStream delivers tweets which match certain
criteria. It can filter the delivered tweets according to three criteria:
• Specific keyword(s) to track/search for in the tweets
• Specific Twitter user(s) according to their user-id’s
• Tweets originating from specific location(s) (only for geo-tagged tweets).
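The three filter criteria can be mimicked on tweets that have already been collected. The sketch below does not call the tweetstream library or the Twitter API; the tweet dictionaries, their field names (`text`, `user_id`, `coordinates`) and the bounding box are hypothetical, and combining the criteria with a logical OR is an assumption of this sketch.

```python
def matches_filter(tweet, keywords=(), user_ids=(), bounding_box=None):
    """Return True if the tweet satisfies at least one supplied criterion."""
    text = tweet["text"].lower()
    # Criterion 1: any tracked keyword appears in the tweet text.
    if any(keyword.lower() in text for keyword in keywords):
        return True
    # Criterion 2: the tweet was posted by one of the followed user ids.
    if tweet["user_id"] in user_ids:
        return True
    # Criterion 3: a geo-tagged tweet lies inside the bounding box
    # given as (west, south, east, north) in degrees.
    coordinates = tweet.get("coordinates")
    if bounding_box is not None and coordinates is not None:
        west, south, east, north = bounding_box
        longitude, latitude = coordinates
        return west <= longitude <= east and south <= latitude <= north
    return False

# Hypothetical tweets; only the second one is geo-tagged.
tweets = [
    {"text": "Covid19 cases are falling", "user_id": 1, "coordinates": None},
    {"text": "Lovely weather today", "user_id": 2, "coordinates": (73.85, 18.52)},
    {"text": "Nothing to see here", "user_id": 3, "coordinates": None},
]
pune_box = (73.7, 18.4, 74.0, 18.7)  # illustrative box around Pune
selected = [t for t in tweets
            if matches_filter(t, keywords=["covid19"], bounding_box=pune_box)]
print([t["user_id"] for t in selected])
```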
Feature Extraction:
Now that we have arrived at our training set, we need to extract useful
features from it which can be used in the process of classification. But first we
will discuss some text formatting techniques which will aid us in feature
extraction:
• Tokenization: It is the process of breaking a stream of text up into words,
symbols and other meaningful elements called “tokens”. Tokens can be
separated by whitespace characters and/or punctuation characters. It is done so
that we can look at tokens as individual components that make up a tweet.
• Punctuation marks and digits/numerals may be removed if for example we
wish to compare the tweet to a list of English words.
• Stemming: It is the text normalizing process of reducing a derived word to its
root or stem. For example, a stemmer would reduce the words “stemmer”,
“stemmed” and “stemming” to the root word “stem”. The advantage of stemming is
that it makes comparison between words simpler, as we do not need to deal
with complex grammatical transformations of the word.
• Stop-words removal: Stop words are a class of extremely common
words which hold no additional information when used in a text and are thus
claimed to be useless [19]. Examples include “a”, “an”, “the”, “he”, “she”,
“by”, “on”, etc. It is sometimes convenient to remove these words because
they are used almost equally in all classes of text and so do not help to
distinguish between the classes.
• Parts-of-Speech Tagging: POS-Tagging is the process of assigning a tag to
each word in the sentence as to which grammatical part of speech that word
belongs to, i.e., noun, verb, adjective, adverb, coordinating conjunction etc.
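The tokenization, punctuation and digit removal, stop-word removal and stemming steps described above can be sketched in a few lines of standard-library Python. The stop-word list and the suffix-stripping rule are deliberately tiny, hypothetical stand-ins: a real system would use a full stop-word list and an established stemmer such as the Porter stemmer. POS tagging is omitted because it needs a trained tagger.

```python
import re

# Illustrative stop-word list: the examples from the text plus a few more.
STOP_WORDS = {"a", "an", "the", "he", "she", "by", "on", "is", "and"}
SUFFIXES = ("ing", "ed", "er", "s")

def tokenize(text):
    """Split text into lower-case word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def crude_stem(word):
    """Strip one common suffix; a toy stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            # Collapse a doubled final consonant left behind ("stemm" -> "stem").
            if word[-1] == word[-2] and word[-1] not in "aeiou":
                word = word[:-1]
            break
    return word

def preprocess(text):
    """Tokenize, drop punctuation/digits and stop words, then stem."""
    tokens = tokenize(text)
    words = [t for t in tokens if t.isalpha()]
    content = [w for w in words if w not in STOP_WORDS]
    return [crude_stem(w) for w in content]

print(preprocess("The stemmer stemmed 3 words, and she is stemming on!"))
# prints ['stem', 'stem', 'word', 'stem']
```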
Classification:
Pattern classification is the process through which data is divided into
different classes according to common patterns which are found in one class
and which differ to some degree from the patterns found in the other classes.
The ultimate aim of our project is to design a classifier which accurately
classifies tweets into the following three sentiment classes: positive,
negative and neutral.
Figure 2: System Architecture
4.2 Data Flow Diagram:
Describing and documenting data is essential in ensuring that the researcher,
and others who may need to use the data, can make sense of the data and
understand the processes that have been followed in the collection, processing,
and analysis of the data. Research data are any physical and/or digital materials
that are collected, observed, or created in research activity for purposes of
analysis to produce original research results or creative works.
Figure 3: System Flow Diagram
PROJECT PLAN
5.1 Project Estimate
Function                          Estimated LOC
User Interface Methods            600
Data Pre-processing               211
Developing the algorithm          69
Connecting the Modules            420
Application Program Interface     87
5.2 Risk Management
• Risk Identification:
Risk identification is the core task within the risk management
process: it describes and classifies the risks in the project.
All the information gathered and analysed during the identification
of risks serves as a foundation for further risk analysis, evaluation
and estimation. In this project the main risk is predicting the wrong
sentiment for the tweets fetched for an event.
• Risk Analysis:
Risk analysis is intended to provide project leadership with
contingency information for scheduling, budgeting and project
control purposes, as well as tools to support decision making
and risk management as the project progresses through planning
and implementation. In this project the risk of predicting the
wrong sentiment is reduced by using the prediction results of
three algorithms.
• Overview of Risk Mitigation, Monitoring, Management
Risk Mitigation:
To mitigate this risk, project management must develop a strategy for
predicting the correct result. The possible steps to be taken are:
1. Selecting multiple algorithms for prediction (naïve Bayes, random
forest, decision tree).
2. Mitigate those causes that are under our control before the project
starts.
3. Once the project commences, assume that wrong predictions will occur
and develop techniques to ensure maximum accuracy of prediction.
4. Organize project teams so that information about each development
activity is widely dispersed.
5. Define documentation standards and establish mechanisms to
ensure that documents are developed in a timely manner.
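One simple way to combine the results of several algorithms is a majority vote over their predicted labels. The report does not state how the individual predictions are combined, so the voting rule below, including its tie-breaking behaviour, is an assumption made for illustration.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of three classifiers for the same tweet.
print(majority_vote(["positive", "negative", "positive"]))  # prints positive
```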
Risk Monitoring:
As the project proceeds, risk monitoring activities commence. The
project manager monitors factors that may provide an indication of
whether the risk is becoming more or less likely. In the case of
project management, the following factors can be monitored:
1. General attitude of team members based on project pressures.
2. Interpersonal relationships among team members.
3. Problem solving with the algorithms selected.
Risk Management:
Risk management and contingency planning assume that
mitigation efforts have failed and that the risk has become a reality.
Continuing the example, the project is well underway, and the
predictions are not as accurate as required. If the
mitigation strategy has been followed, the causes can be traced
quickly, because information is documented and knowledge has been
dispersed across the team.
5.3 PROJECT SCHEDULE
Phase     Task                Description
Phase-1   Analysis            Analyse the information given in the IEEE paper.
Phase-2   Literature survey   Collect raw data and elaborate on literature surveys.
Phase-3   Design              Assign the modules and design the process flow control.
Phase-4   Implementation      Implement the code for all the modules and integrate all the modules.
Phase-5   Testing             Test the code and the overall process to check whether it works properly.
Phase-6   Documentation       Prepare the document for this project with conclusion and future enhancement.
5.4 TEAM ORGANIZATIONS
A team of four students has been formed to design and develop the
proposed system.
5.5 TEAM STRUCTURES
The team structure for the project is identified. Roles are defined.
5.6 MANAGEMENT REPORTING AND COMMUNICATIONS
Mechanisms for progress reporting and inter/intra team communication
are identified as per assessment sheet and lab time table.
5.7 TIMELINE CHART
Activities tracked across Year 2020 and Year 2021:
• Literature survey and review, synopsis
• Project architecture, design, selection of algorithm
• Designing and developing the user interface for the user
• Designing features for project
• Designing and developing the Django application
• Integrating Twitter API & fetching tweets
• Connecting the tweets with algorithm and integration with UI
• Preparation of research paper and publishing
• Report writing
PROJECT IMPLEMENTATIONS
6.1 OVERVIEW OF PROJECT MODULES
1. User Interface Module:
The user enters a search query for a particular event (e.g., Covid19) and the
overall sentiment analysis of the most recent tweets is displayed as a
graphical representation (Pie Chart).
2. Prediction Module:
In this module, a pre-defined labelled dataset is provided to the Python
script, which pre-processes the data and trains the model using Machine
Learning algorithms.
The model is then used to classify the sentiment of a given dataset into
three categories, i.e., Positive, Negative and Neutral.
3. Twitter Module:
This module communicates with the Twitter API and fetches the tweets
that match the search query.
4. Django Module:
The prediction module and the Twitter module are then integrated with the
Django framework, along with the API which interacts with the User
Interface.
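The pie chart shown by the User Interface Module needs only the share of each sentiment class among the fetched tweets. A minimal sketch of that aggregation step is given below; the function name and the sample labels are hypothetical and do not come from the project code.

```python
from collections import Counter

CLASSES = ("positive", "negative", "neutral")

def sentiment_shares(labels):
    """Return the percentage of tweets in each sentiment class."""
    counts = Counter(labels)
    total = len(labels)
    return {name: (100.0 * counts[name] / total if total else 0.0)
            for name in CLASSES}

# Hypothetical predictions for ten tweets about one event.
predicted = ["positive"] * 5 + ["negative"] * 3 + ["neutral"] * 2
print(sentiment_shares(predicted))
```

Each value in the returned dictionary corresponds to one slice of the pie chart.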
6.2 TOOLS AND TECHNOLOGIES
I) TOOLS USED:
• Android Studio
• VS Code
• Google Colab
II) TECHNOLOGIES:
• Django
• Flutter
6.3 ALGORITHM DETAILS
i) Random Forest:
● It is an ensemble classifier built from many decision tree models; it can
be used for regression as well as classification.
● A random forest is a classifier consisting of a collection of
tree-structured classifiers, where the trees are grown from independent,
identically distributed random vectors and each tree casts a unit
vote for the classification of the input.
● Each tree uses the Gini index to choose its splits and to arrive at
its own class prediction.
● The class predictions of all the trees are aggregated by voting
to produce the final classification.
● To build each tree, a random seed is chosen which draws, at random,
a collection of samples from the training dataset while broadly
maintaining the class distribution.
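Two of the ingredients mentioned above, the Gini index and the random drawing of training samples, can be sketched with the standard library. This is an illustration of the concepts only, not the project's implementation: the function names are hypothetical, and the sampling shown is plain sampling with replacement (a bootstrap sample), which is an assumption about how the samples are drawn.

```python
import random
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted Gini impurity of a candidate split into two groups."""
    total = len(left) + len(right)
    return (len(left) * gini_impurity(left)
            + len(right) * gini_impurity(right)) / total

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as done once per tree."""
    return [rng.choice(rows) for _ in rows]

labels = ["positive", "positive", "negative", "negative"]
mixed = gini_impurity(labels)                  # 0.5 for a 50/50 mix
pure = split_impurity(labels[:2], labels[2:])  # 0.0 for a perfect split
sample = bootstrap_sample(labels, random.Random(0))
print(mixed, pure, len(sample))
```

A tree prefers the split with the lowest weighted impurity, so the perfect split above would be chosen over leaving the four labels mixed.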
ii) Naïve Bayes:
● It is used to predict categorical class labels.
● It builds a model from the training set and the values of the
classifying attribute, and uses that model to classify new data.
● It is a two-step process: Model Construction and Model Usage.
● It is based on Bayes’ theorem, which is named after Thomas Bayes; it is
a statistical, supervised learning method for classification.
● It can handle both categorical and continuous-valued attributes.
● Bayes’ theorem gives the probability of an event A occurring given
that another event B has already occurred. It is stated
mathematically as the following equation:
P(A|B) = P(B|A) P(A) / P(B)
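Applied to text, A is a sentiment class and B is the set of words in a tweet, and the "naïve" assumption is that words are independent given the class. The sketch below is a small multinomial Naïve Bayes classifier written with the standard library. The class name, the add-one (Laplace) smoothing and the toy training tweets are assumptions made for illustration and are not taken from the project code.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, documents, labels):
        # Model Construction: count documents per class and words per class.
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {word for counts in self.word_counts.values()
                           for word in counts}
        return self

    def predict(self, tokens):
        # Model Usage: pick the class maximising
        # log P(class) + sum of log P(word | class).
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokens:
                count = self.word_counts[label][word]
                score += math.log((count + 1)
                                  / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy, hand-made training data (hypothetical).
documents = [["good", "great", "vaccine"], ["love", "great", "support"],
             ["bad", "terrible", "lockdown"], ["hate", "bad", "delay"]]
labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayes().fit(documents, labels)
print(model.predict(["great", "support"]))  # prints positive
print(model.predict(["bad", "delay"]))      # prints negative
```

P(B) is the same for every class, so it can be dropped when only the most probable class is needed; the code compares the numerators in log space to avoid numerical underflow.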
iii) Logistic Regression:
• It is a Machine Learning classification algorithm that is used to
predict the probability of a categorical dependent variable.
• In logistic regression, the dependent variable is a binary variable
that contains data coded as 1 (yes, success, etc.) or 0 (no, failure,
etc.). In other words, the logistic regression model predicts P(Y=1)
as a function of X.
• Multinomial logistic regression is an extension of logistic
regression that adds native support for multi-class classification
problems.
• The extension involves changing the loss function to cross-entropy
loss and changing the predicted probability distribution to a
multinomial probability distribution, so that problems with more than
two classes (here positive, negative and neutral) are supported
natively.
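The two changes that turn binary logistic regression into the multinomial form, a softmax over per-class scores and a cross-entropy loss, can be sketched as follows. The weights, biases and the two-element feature vector are hypothetical values chosen by hand for illustration; in practice they would be learned from the training data.

```python
import math

def softmax(scores):
    """Turn raw class scores into a probability distribution."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

def cross_entropy(probabilities, true_index):
    """Cross-entropy loss for one example with a known true class."""
    return -math.log(probabilities[true_index])

def class_scores(weights, biases, features):
    """Linear score w . x + b for each class."""
    return [sum(w * x for w, x in zip(row, features)) + bias
            for row, bias in zip(weights, biases)]

CLASSES = ["negative", "neutral", "positive"]
# Hypothetical parameters: one weight row per class.
weights = [[-1.0, 1.5], [0.0, 0.0], [1.5, -1.0]]
biases = [0.0, 0.2, 0.0]
# Hypothetical features: [count of positive words, count of negative words].
features = [2.0, 0.0]

probabilities = softmax(class_scores(weights, biases, features))
predicted = CLASSES[probabilities.index(max(probabilities))]
loss = cross_entropy(probabilities, CLASSES.index("positive"))
print(predicted, round(loss, 4))
```

Training adjusts the weights and biases to reduce the average cross-entropy loss over the labelled tweets.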
RESULTS
1. Django:
Figure: Sentiment analysis of top 50 tweets regarding “Django” query
Figure: Sentiment analysis of top 50 tweets regarding “Farmer’s Protest”
2. Flutter Screens:
Figure: User input to fetch tweets
Figure: Sentiment analysis for event: “Farmer’s Protest”
Figure: Sentiment analysis for event: “covid19”
Figure: Sentiment analysis for event: “cyclone”
OTHER SPECIFICATIONS
8.1 Advantages
i) The use of this information can be applied to make wiser decisions
related to the use of resources, to make improvements in organizations,
to provide better products/services, and ultimately to improve the lifestyle
of citizens and human relations in order to achieve a better society.
ii) Social media is the current environment for data collection and analysis
of the sentiments of people. People can share and comment on everything, from
personal thoughts to common events or topics in society. Access to social
media can also provide more information in the form of hidden metadata,
for instance operating system language, device type, capture time and
geographical location.
8.2 Limitations
i) Despite the possible positive outcomes shown, there are some
disadvantages in applying automatic analysis: it is difficult to implement
because of the ambiguity of natural language and the characteristics of
the posted content.
ii) Sentiment analysis tools can identify and analyse many pieces of text
automatically and quickly. But computer programs have problems recognizing
things like sarcasm and irony, negations, jokes, and exaggerations, the sorts
of things a person would have little trouble identifying. Failing to recognize
these can skew the results.
8.3 Applications
i) The results from sentiment analysis help businesses understand the
conversations and discussions taking place about them, and help them react
and take action accordingly.
They can quickly identify any negative sentiments being expressed, and turn
poor customer experiences into very good ones.
ii) By listening to and analysing comments on Facebook and Twitter, local
government departments can gauge public sentiment towards their department
and the services they provide, and use the results to improve services such as
parking and leisure facilities, local policing, and the condition of roads.
iii) Universities can use sentiment analysis to analyse student feedback and
comments garnered either from their own surveys, or from online sources such
as social media. They can then use the results to identify and address any areas
of student dissatisfaction, as well as identify and build on those areas where
students are expressing positive sentiments.
CONCLUSION AND FUTURE WORK
The task of sentiment analysis, especially in the domain of micro-blogging, is
still in the developing stage and far from complete. So, we propose a couple of
ideas which we feel are worth exploring in the future and may result in further
improved performance.
Right now, we have worked with only the very simplest unigram models; we
can improve those models by adding extra information such as the closeness of
the word to a negation word. We could specify a window prior to the word under
consideration (a window could, for example, be of 2 or 3 words), and the
effect of negation may be incorporated into the model if it lies within that
window. The closer the negation word is to the unigram word whose prior
polarity is to be calculated, the more it should affect the polarity. For
example, if the negation is right next to the word, it may simply reverse the
polarity of that word, and the farther the negation is from the word, the
smaller its effect should be.
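The negation window idea can be sketched as follows. The negation word list, the default window of three words and the linear decay of the effect with distance are all assumptions made for illustration; the text above only requires that an adjacent negation reverses the polarity and that a more distant one has a weaker effect.

```python
# Illustrative list of negation words.
NEGATIONS = {"not", "no", "never", "n't", "cannot"}

def negation_adjusted_polarity(tokens, index, prior_polarity, window=3):
    """Adjust a word's prior polarity using the nearest preceding negation.

    Only the `window` tokens before position `index` are inspected.
    """
    for distance in range(1, window + 1):
        position = index - distance
        if position < 0:
            break
        if tokens[position] in NEGATIONS:
            # strength is 1.0 for an adjacent negation and shrinks linearly
            # with distance, so the factor runs from -1 (full reversal)
            # towards +1 (no effect).
            strength = (window - distance + 1) / window
            return prior_polarity * (1 - 2 * strength)
    return prior_polarity

adjacent = negation_adjusted_polarity(["this", "is", "not", "good"], 3, 1.0)
distant = negation_adjusted_polarity(["not", "a", "very", "good", "idea"], 3, 1.0)
print(adjacent)  # prints -1.0
print(distant)
```

In the second example the negation is three words away, so the positive prior polarity is weakened rather than reversed.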
Apart from this, we are currently focusing only on unigrams, and the effect of
bigrams and trigrams may be explored. As reported in the literature review
section, using bigrams along with unigrams usually enhances
performance. However, for bigrams and trigrams to be effective features we
need a much larger labelled data set than our meagre 9,000 tweets.
Right now, we are exploring Parts of Speech separately from the unigram models;
we can try to incorporate POS information within our unigram models in future.
So, instead of calculating a single probability for each word like P(word |
obj), we could have multiple probabilities for each word according to the Part
of Speech the word belongs to. For example, we may have P(word | obj, verb),
P(word | obj, noun) and P(word | obj, adjective). Pang et al. used a somewhat
similar approach and claim that appending POS information to every unigram
results in no significant change in performance (with Naive Bayes performing
slightly better and SVM having a slight decrease in performance), while there is
a significant decrease in accuracy if only adjective unigrams are used as
features. However, these results are for the classification of reviews and
remain to be verified for sentiment analysis on micro-blogging websites like
Twitter. One more feature that is worth exploring is whether information about
the relative position of a word in a tweet has any effect on the performance of
the classifier. Although Pang et al. explored a similar feature and reported
negative results, their results were based on reviews, which are very different
from tweets, and they worked with an extremely simple model.
In this research we are focussing on general sentiment analysis. There is
potential for work in the field of sentiment analysis with partially known
context. For example, we noticed that users generally use our website for
specific types of keywords which can be divided into a few distinct classes,
namely: politics/politicians, celebrities, products/brands, sports/sportsmen,
media/movies/music. So, we can attempt to perform separate sentiment analysis
on tweets that belong to only one of these classes (i.e., the training data
would not be general but specific to one of these categories) and compare the
results with those we get if we apply general sentiment analysis instead.
REFERENCES
[1]. Rosenthal, Sara, Noura Farra, and Preslav Nakov. "SemEval-2017 task 4:
Sentiment analysis in Twitter." Proceedings of the 11th International Workshop
on Semantic Evaluation (SemEval-2017). 2017.
[2]. Pontiki, Maria, et al. "SemEval-2016 task 5: Aspect based sentiment
analysis." Proceedings of the 10th International Workshop on Semantic
Evaluation (SemEval-2016). Association for Computational Linguistics, 2016.
[3]. Wikipedia.com/
[4]. https://www.sciencedirect.com/
[5]. https://ieeexplore.ieee.org/document/8951367