Project: Sentiment Analysis Using Machine Learning

Sentiment analysis, also known as opinion mining, is the use of natural language processing, text analysis, and
computational linguistics to identify and extract subjective information from source materials. It is used to
determine the attitudes, opinions, and emotions of a speaker or writer with respect to some topic or the overall
contextual polarity of a document. The output of sentiment analysis is often a score or a class label indicating the
sentiment of the text (e.g., positive, negative, or neutral). Sentiment analysis is widely used in applications
such as customer service, marketing, and political analysis. Sentiment analysis techniques fall into two broad
categories: rule-based and statistical. Rule-based approaches use a set of predefined rules to identify
sentiment in text, while statistical approaches use machine learning techniques to learn sentiment from data.
Sentiment analysis is a multidisciplinary field, drawing on psychology, sociology, natural
language processing, and machine learning.
Types of sentiment analysis
1. Binary Sentiment Analysis: This type of sentiment analysis classifies text into two categories, positive or
negative.
2. Multi-class Sentiment Analysis: This type of sentiment analysis classifies text into more than two
categories, such as positive, neutral, and negative.
3. Fine-grained Sentiment Analysis: This type of sentiment analysis classifies text into more specific
categories, such as very positive, positive, neutral, negative, and very negative.
4. Opinion Mining: This type of sentiment analysis goes beyond just identifying positive or negative
sentiment, and aims to extract subjective information from text, such as the opinion holder, target, and
context.
5. Emotion Detection: This type of sentiment analysis focuses on identifying and extracting emotions from
text, such as joy, anger, fear, and sadness.
6. Stance Detection: This type of sentiment analysis aims to classify text based on the stance or position it
takes on a particular topic, such as for, against, or neutral.
7. Irony Detection: This type of sentiment analysis aims to identify irony or sarcasm in text, which can be
difficult to detect due to the use of figurative language.
8. Subjectivity Detection: This type of sentiment analysis aims to classify text as subjective or objective.
Techniques used in rule-based sentiment analysis
In a rule-based approach, sentiment analysis techniques rely on predefined rules to identify the sentiment in text.
Some of the techniques commonly used in rule-based sentiment analysis include:
Dictionary-based: This approach uses a predefined list of words (a dictionary) with associated sentiment scores
(positive, negative, neutral) to identify the sentiment in text. Words that are not in the dictionary are assigned a
neutral sentiment.
Regular Expression-based: This approach uses regular expressions to identify patterns in text that indicate
sentiment. For example, using regular expressions to identify emoticons such as :-) and :-(, which indicate
positive and negative sentiment respectively.
Syntactic-based: This approach uses the grammatical structure of the text to identify sentiment. For example,
looking for negations such as "not" or "never" to invert the sentiment of a word or phrase.
Hybrid-based: This approach combines multiple rule-based techniques to improve sentiment analysis results.
Note that rule-based approaches have limitations: they may have limited coverage of sentiment
words and idioms and cannot capture every nuance of sentiment. Additionally, the performance of rule-based approaches depends on the quality and completeness of the predefined rules.
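For illustration, a minimal dictionary-based scorer with a simple negation rule might look like the following sketch; the word list and scores are illustrative assumptions, not a published lexicon:

# A minimal dictionary-based scorer with a simple negation rule.
# The word list and scores below are illustrative, not a standard lexicon.
SENTIMENT_DICT = {"good": 1, "great": 2, "excellent": 2,
                  "bad": -1, "poor": -1, "terrible": -2}
NEGATIONS = {"not", "never", "no"}

def score_text(text):
    tokens = text.lower().split()
    score = 0
    for i, token in enumerate(tokens):
        word_score = SENTIMENT_DICT.get(token, 0)   # unknown words are neutral
        if i > 0 and tokens[i - 1] in NEGATIONS:    # a preceding negation inverts the sentiment
            word_score = -word_score
        score += word_score
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(score_text("the food was not good"))  # -> negative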
1.2 Applications of sentiment analysis
Sentiment analysis is the process of determining the emotional tone or attitude of a piece of text. Some
popular applications of sentiment analysis include:
Social Media Monitoring: Sentiment analysis is used to monitor social media platforms, such as Twitter
and Facebook, to gauge public opinion on a particular topic or brand.
Marketing and Public Relations: Sentiment analysis is used to track and analyze customer sentiment
towards a brand, product, or service. It can be used to identify areas where a company is excelling and
areas where it needs to improve.
Customer Service: Sentiment analysis is used to monitor customer feedback and reviews to identify areas
where customers are unhappy and to respond promptly.
Politics: Sentiment analysis is used to track and analyze public opinion on political issues,
candidates, and parties.
Stock Market: Sentiment analysis is used to track and analyze news and social media sentiment
towards publicly traded companies and to predict stock market movements.
Healthcare: Sentiment analysis is used to analyze patient feedback on healthcare services and to
improve the quality of care.
Online reviews: Sentiment analysis is used to analyze customer feedback from online reviews
and ratings to improve the product or service.
Disaster response: Sentiment analysis is used to analyze public opinion and reactions to natural
disasters, such as hurricanes and earthquakes, to identify areas where aid is needed.
Machine learning algorithms suitable for Sentiment analysis
There are several machine learning algorithms that are commonly used in sentiment analysis; some of the
popular ones include:

Logistic Regression: This is a supervised learning algorithm that is used for classification. Logistic
regression is often used in sentiment analysis to classify text as positive, negative, or neutral.

Naive Bayes: This is a simple probabilistic classifier that is often used in sentiment analysis. It is based on
Bayes' theorem, which states that the probability of a hypothesis (in this case, the sentiment of a given
text) is proportional to the probability of the evidence (in this case, the text) given the hypothesis,
multiplied by the prior probability of the hypothesis.

Support Vector Machines (SVMs): This is a supervised learning algorithm that can be used for
classification and regression. SVMs are often used in sentiment analysis to classify text as positive,
negative, or neutral.

Decision Trees: This is a supervised learning algorithm that creates a tree-like model of decisions and
their possible consequences. Decision Trees are often used in sentiment analysis to classify text as
positive, negative, or neutral.

Random Forest: This is an ensemble method that is used for classification and regression. Random Forest
is made up of multiple decision trees, and it's often used in sentiment analysis to classify text as positive,
negative, or neutral.

Gradient Boosting: This is an ensemble method that is used for classification and regression. Gradient
Boosting is made up of multiple decision trees, and it's often used in sentiment analysis to classify text as
positive, negative, or neutral.

Neural Networks: This is a set of algorithms, modeled loosely after the human brain, that are designed to
recognize patterns. Neural networks such as Recurrent Neural Networks (RNNs), Long Short-term
Memory (LSTM), and Convolutional Neural Networks (CNNs) are often used in sentiment analysis to
identify the sentiment of a given text.

Deep Learning: This is a class of machine learning techniques inspired by the structure and
function of the brain. Deep learning techniques such as Recurrent Neural Networks (RNNs), Long Short-Term
Memory (LSTM) networks, and Convolutional Neural Networks (CNNs) are often used in sentiment analysis
to identify the sentiment of a given text.
These algorithms are widely used and considered to be effective in sentiment analysis, but it's important to
note that the choice of algorithm depends on the specific use case and the dataset used for training the
model.
1.2 Overview of different models used for sentiment analysis:
Sentiment analysis employs various models, from traditional statistical methods to state-of-the-art
deep learning architectures. Traditional models, such as Logistic Regression, Naive
Bayes, and Support Vector Machines (SVM), rely on feature engineering and statistical techniques,
using bag-of-words or TF-IDF representations to classify sentiments. Additionally, Decision Trees
and their ensemble variants like Random Forests offer interpretable models for sentiment
classification. Moving towards embedding-based approaches, word embedding models like
Word2Vec, GloVe, and FastText map words into dense vectors, capturing semantic
relationships. Meanwhile, recurrent neural networks (RNNs) and their specialized form, Long
Short-Term Memory (LSTM) networks, have demonstrated effectiveness in sequence modeling
and sentiment classification by retaining sequential information. Advanced transformer-based
models, including BERT (Bidirectional Encoder Representations from Transformers), GPT
(Generative Pre-trained Transformer), and their derivatives, employ attention mechanisms and
contextual embeddings to capture intricate language patterns and nuances in sentiment-laden text,
often achieving state-of-the-art performance in sentiment analysis tasks. These models showcase
a spectrum of techniques used in sentiment analysis, ranging from traditional statistical approaches
to sophisticated deep learning architectures, each with its advantages in handling different aspects
of sentiment classification tasks.
Word Embedding Variants and Contributors

Reference   First Author and Year   Approach
[12]        Mikolov T, 2013         Word2vec
[13]        Le Q, Mikolov T, 2014   Doc2vec
[20]        Pennington J, 2014      GloVe
[11]        Joulin A, 2016          FastText
[14]        Devlin J, 2018          BERT
[19]        Matthew E, 2018         ELMo
[15]        Liu Y, 2019             RoBERTa
[16]        Lan Z, 2019             ALBERT
[18]        Sanh V, 2019            DistilBERT
[17]        Clark K, 2020           ELECTRA
1.3 Purpose and scope of the Project:
The purpose of this project is to investigate the pivotal influence of dataset
size on the performance and accuracy of diverse sentiment analysis models, particularly in the
context of customer review data. This study aims to delve into the comparative analysis of
traditional models (such as Logistic Regression with Count Vectorizer and TF-IDF Vectorizer)
against more advanced models (like Word2Vec and BERT Tokenizer) under varying dataset scales.
By systematically evaluating these models on both small-scale and large-scale customer review
datasets, the research aims to discern the impact of dataset size on the efficacy and accuracy of
sentiment analysis outcomes. Additionally, the study seeks to elucidate the challenges faced by
advanced models when operating on smaller datasets and highlight the potential improvements
witnessed as dataset sizes increase. The research intends to contribute insights into the significance
of dataset scale in shaping the performance of sentiment analysis models, offering valuable
guidance for practitioners and researchers in leveraging these models effectively based on dataset
sizes in practical applications.
LITERATURE REVIEW
2.1 Conventional and automated approaches
Pang and Lee emphasized the significance of unigrams as the most influential feature for
effectiveness [1]. Additionally, the subjectivity extractor demonstrated superior performance [2].
Furthermore, the combination of unigrams with an emoticon classifier yielded favorable results
[3]. Unigrams, when combined with bigrams and trigrams, proved effective for lengthy corpora
[4]. Features with assigned polarity outperformed unigrams [5]. Despite its strong performance
with an extensive vocabulary, the unigram feature space's high dimensionality and sparsity can
potentially affect both computational efficiency and model performance. Bhoi et al. (2018)
reported that TF-IDF performs best when paired with a deep neural network [6]. Fouad et al.
(2018) combined bag-of-words features with lexicons, emoticons, and part-of-speech (PoS) tags,
obtaining better results with an ensemble classifier [7]. The Continuous Bag-of-Words (CBOW)
and Skip-gram models are two popular variants of the Word2Vec algorithm [8]. Pennington et al.
(2014) introduced GloVe, whose extraction method outperforms CBOW and Skip-gram with a
93.2 F1 score [9].
2.2 Hybrid approaches
Parveen et al. (2023) proposed LTF-MICF with a GARN (RNN and attention) architecture [10].
Kaur et al. (2023) combined review features and aspect features with an LSTM classifier [11].
Lin et al. (2023) combined BERT and DistilBERT with BiLSTM and TCN classifiers [12]. Wang
et al. (2020) combined textual information with sentiment diffusion patterns [13]. A stochastic
word embedding technique [14] was introduced by Hao et al. (2020). O2SR and C2OR vector
representations [15] were introduced by Zhu et al. (2021). Wang et al. (2021) proposed Refined
GloVe and Refined Word2Vec [16].
2.3 Pre-trained model approaches
DistilBERT is a more compact and efficient variant of the popular BERT (Bidirectional Encoder
Representations from Transformers) model [18]. MobileBERT was created using a novel
technique called progressive knowledge transfer, in which knowledge distillation transfers
information from larger models to smaller ones, helping MobileBERT retain important linguistic
knowledge [19]. "A Lite BERT for Self-supervised Learning of Language Representations"
introduces ALBERT, a more lightweight variant of BERT for self-supervised learning of language
representations [20]. "Distilling BERT for Natural Language Understanding" introduces
TinyBERT, a compressed and distilled version of BERT tailored for Natural Language
Understanding (NLU) tasks [21]. "Pre-training Text Encoders as Discriminators Rather than
Generators" introduces the ELECTRA model, a novel approach to pre-training text encoders for
natural language processing (NLP) tasks that, unlike traditional pre-training methods, does not
use a masked language model (MLM) objective to predict missing tokens [22]. "I-BERT:
Integer-only BERT Quantization" (Kim et al., 2021) proposes a quantization scheme for
Transformer-based models that performs the entire inference with integer-only arithmetic [23].
2.4 ER Diagram:
METHODOLOGY
Dataset creation
In the process of crafting this research project, the initial step involved the creation of a dataset,
wherein YouTube comments were pinpointed as the primary data source, accessed through the
YouTube Data API and retrieved using a Python script. The collected data was stored in CSV
format, and a preliminary data cleaning phase ensued to address missing values, duplicates, and
outliers. Following this, the dataset underwent essential preprocessing steps, which encompassed
text tokenization, text normalization, and the removal of stop words. Furthermore, the project
delved into feature engineering, which involved the extraction of pertinent features from the
textual data, including n-grams, TF-IDF, and word embeddings.
Manual labeling was then conducted to annotate the dataset appropriately. To mitigate any class
imbalance issues, a balancing process was implemented. In the realm of word embeddings, two
prominent techniques, namely Word2vec and BERT (Bidirectional Encoder Representations from
Transformers), were adopted. The machine learning classifier Logistic Regression was employed
for the analysis. Classifier performance was assessed using a battery of established evaluation
metrics, encompassing accuracy, precision, recall, F1-score, and a comprehensive examination via
confusion matrix analysis.
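For illustration, a minimal sketch of using BERT as a feature extractor for Logistic Regression follows; the checkpoint name (bert-base-uncased), the Hugging Face transformers library, and pooling via the [CLS] vector are assumptions, since the report does not specify these details:

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    # Tokenize a batch of reviews and use the [CLS] vector as a sentence embedding
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0, :].numpy()

X = embed(["The food was wonderful", "Terrible, slow service"])
y = [1, 0]  # toy labels; in practice, fit on the full labelled dataset
clf = LogisticRegression().fit(X, y)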
Data Preprocessing:
Data preprocessing serves as the initial stage in natural language processing tasks. Its primary
objective is to transform unstructured text data into a structured and clean format, simplifying the
processing and understanding for machine learning algorithms. This step holds significant value
as the quality of the preprocessed data directly influences the results of subsequent classification
tasks.
Text preprocessing aims to eliminate noise, discrepancies, and extraneous information that could
potentially impede the performance of subsequent analyses. Preprocessing techniques used in this
study were case normalization to lower case, stop-word removal, and lemmatization. These
techniques were applied after the reviews were tokenized with the nltk library, and each word was
preprocessed individually.
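A minimal sketch of this pipeline with nltk follows; dropping non-alphabetic tokens is an assumed step, and newer NLTK versions may also require the 'punkt_tab' resource:

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# one-time downloads
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(review):
    tokens = word_tokenize(review.lower())               # case normalization + tokenization
    tokens = [t for t in tokens if t.isalpha()]          # keep alphabetic tokens (assumed step)
    tokens = [t for t in tokens if t not in stop_words]  # stop-word removal
    return [lemmatizer.lemmatize(t) for t in tokens]     # lemmatization

print(preprocess("The waiters were not attentive and the food arrived cold."))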
Labeling
As noted above, the dataset was manually annotated with sentiment labels (see Figure 3 in the Appendix).
Evaluation metrics
There are several evaluation metrics that are commonly used for evaluating the performance of sentiment analysis
models, including:
1. Accuracy: This is the most commonly used metric for evaluating sentiment analysis models. It is the ratio
of the number of correctly classified instances to the total number of instances.
2. Precision: This is the ratio of true positive instances to the sum of true positive and false positive
instances. It is a measure of the model's ability to correctly identify positive instances.
3. Recall: This is the ratio of true positive instances to the sum of true positive and false negative instances.
It is a measure of the model's ability to find all the positive instances.
4. F1-Score: This is the harmonic mean of precision and recall, which balances the trade-off between
precision and recall.
5. Confusion Matrix: This matrix compares the predicted sentiment of a given text against the true
sentiment. The matrix contains four elements: true positive, true negative, false positive, and false
negative.
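As a sketch, all five metrics can be computed with scikit-learn; the label vectors below are illustrative:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1]  # illustrative gold labels (1 = positive, 0 = negative)
y_pred = [1, 0, 0, 1, 0, 1]  # illustrative model predictions

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))  # rows are true classes, columns are predictions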
Table 1: Dataset description

Nature of dataset             Positive labels   Negative labels
Restaurant Reviews (Custom)   500               500
IMDB                          3457              19981
3.1 Machine learning algorithms:
3.1.1 Logistic Regression :

Logistic regression is one of the most popular Machine Learning algorithms, which comes under
the Supervised Learning technique. It is used for predicting the categorical dependent variable
using a given set of independent variables.

Logistic regression predicts the output of a categorical dependent variable, so the outcome
must be a categorical or discrete value such as Yes or No, 0 or 1, or True or False. However,
rather than returning the exact values 0 and 1, it gives probabilistic values that lie between 0
and 1.
Logistic Function:
• The sigmoid function is a mathematical function used to map predicted values to
probabilities: σ(z) = 1 / (1 + e^(−z)).
• The output of logistic regression must lie between 0 and 1 and cannot go beyond these
limits, so it forms an "S"-shaped curve. This S-shaped curve is called the sigmoid function or
the logistic function.
Assumptions for Logistic Regression:
• The dependent variable must be categorical in nature.
• The independent variables should not exhibit multicollinearity.
Types of Logistic Regression:
• Binomial: In binomial logistic regression, there can be only two possible types of the
dependent variable, such as 0 or 1, or Pass or Fail.
• Multinomial: In multinomial logistic regression, there can be three or more possible
unordered types of the dependent variable, such as "cat", "dog", or "sheep".
• Ordinal: In ordinal logistic regression, there can be three or more possible ordered types
of the dependent variable, such as "low", "medium", or "high".
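As a brief sketch of the probabilistic output described above (toy reviews and labels assumed):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great food", "awful service", "really great place", "awful, never again"]
labels = [1, 0, 1, 0]  # toy labels: 1 = positive, 0 = negative

vec = CountVectorizer()
X = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

# predict_proba returns sigmoid-mapped probabilities between 0 and 1
print(clf.predict_proba(vec.transform(["great service"])))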
3.1.2 Support Vector Machine Algorithm:
• Support Vector Machine (SVM) is one of the most popular Supervised Learning
algorithms and is used for Classification as well as Regression problems. However,
it is primarily used for Classification problems in Machine Learning.
• The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes, so that a new data point can easily be placed in
the correct category in the future. This best decision boundary is called a hyperplane.
SVM can be used for face detection, image classification, text categorization, etc.
Types of SVM
Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified
into two classes using a single straight line, the data is termed linearly separable, and the
classifier used is called a Linear SVM classifier.
Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot
be classified using a straight line, the data is termed non-linear, and the classifier used is
called a Non-linear SVM classifier.
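A minimal text-classification sketch with a linear SVM follows (toy reviews and labels assumed):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

texts = ["loved the movie", "hated the plot", "wonderful acting", "boring and dull"]
labels = [1, 0, 1, 0]  # toy labels

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

clf = SVC(kernel="linear")  # a linear hyperplane in TF-IDF space
clf.fit(X, labels)
print(clf.predict(vec.transform(["wonderful movie"])))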
3.1.3 Naive Bayes Classifier Algorithm:
• The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes'
theorem and used for solving classification problems.
• It is mainly used in text classification with high-dimensional training datasets.
• Some popular examples of the Naïve Bayes algorithm are spam filtering,
sentiment analysis, and classifying articles.
Types of Naive Bayes Model:
• Gaussian
• Multinomial
• Bernoulli
Applications of Naive Bayes Classifier:
• It is used for credit scoring.
• It is used in medical data classification.
• It can be used for real-time predictions because the Naïve Bayes classifier is an eager
learner.
• It is used in text classification, such as spam filtering and sentiment analysis.
Advantages of Naive Bayes Classifier:
• Naïve Bayes is one of the fastest and easiest ML algorithms for predicting a class of datasets.
• It can be used for binary as well as multi-class classification.
• It performs well in multi-class predictions compared to other algorithms.
• It is the most popular choice for text classification problems.
Disadvantages of Naive Bayes Classifier:
• Naive Bayes assumes that all features are independent or unrelated, so it cannot learn
relationships between features.
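A minimal sketch of the multinomial variant, which fits word-count features (toy reviews and labels assumed):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["fresh and tasty", "stale and bland", "tasty desserts", "bland soup"]
labels = [1, 0, 1, 0]  # toy labels

vec = CountVectorizer()
X = vec.fit_transform(texts)

clf = MultinomialNB()  # word counts fit the multinomial event model
clf.fit(X, labels)
print(clf.predict(vec.transform(["tasty soup"])))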
3.1.4 K-Nearest Neighbor (KNN) Algorithm:
• The K-NN algorithm assumes similarity between the new case and the available cases
and puts the new case into the category most similar to the available categories.
• K-NN is a non-parametric algorithm, which means it makes no assumptions
about the underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead, it stores the dataset and performs an action on it at classification
time.
Advantages of KNN Algorithm:
• It is simple to implement.
• It is robust to noisy training data.
• It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
• The value of K always needs to be determined, which can be complex.
• The computation cost is high because the distance between the data points must be
calculated for all training samples.
Basic algorithm (see the sketch below):
Step-1: Select the number K of neighbors.
Step-2: Calculate the Euclidean distance from the new data point to the training points.
Step-3: Take the K nearest neighbors as per the calculated Euclidean distances.
Step-4: Among these K neighbors, count the number of data points in each category.
Step-5: Assign the new data point to the category with the maximum number of neighbors.
Step-6: Our model is ready.
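A minimal sketch of these steps using scikit-learn (toy feature vectors; K = 3 is an assumed choice):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy 2-D feature vectors (e.g., two document features) and their labels
X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
y = [1, 1, 0, 0]

# Steps 1-5: choose K, compute Euclidean distances, vote among the K nearest neighbors
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)
print(knn.predict([[0.8, 0.3]]))  # assigned the majority class among its 3 neighbors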
3.1.5 Decision Tree Classification Algorithm:
• A Decision Tree is a supervised learning technique that can be used for both classification
and regression problems, but it is mostly preferred for solving classification problems. It is
a tree-structured classifier in which internal nodes represent the features of a dataset,
branches represent the decision rules, and each leaf node represents the outcome.
• A decision tree has two types of nodes: decision nodes and leaf nodes. Decision nodes are
used to make decisions and have multiple branches, whereas leaf nodes are the outputs of
those decisions and do not contain any further branches.
• It is called a decision tree because, like a tree, it starts at the root node and expands into
further branches, constructing a tree-like structure.
Decision Tree Terminologies
• Root Node
• Leaf Node
• Splitting
• Branch/Sub-Tree
• Parent/Child Node
Basic algorithm:
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node that contains the best attribute.
Step-5: Recursively make new decision trees using the subsets created in Step 3. Continue
until a stage is reached where the nodes cannot be classified further; the final nodes are the
leaf nodes.
Advantages of the Decision Tree
• It is simple to understand, as it follows the same process a human follows when making a
decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to consider all the possible outcomes of a problem.
• It requires less data cleaning compared to other algorithms.
Disadvantages of the Decision Tree
• A decision tree can contain many layers, which makes it complex.
• It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
• With more class labels, the computational complexity of the decision tree may increase.
Attribute Selection Measures
• Information Gain
• Gini Index
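For reference, the two measures can be computed as in the sketch below (standard definitions; the toy node is illustrative):

import numpy as np

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini(S) = 1 - sum(p_i^2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

node = [1, 1, 1, 0, 0]            # toy node: 3 positive, 2 negative samples
print(entropy(node), gini(node))  # ~0.971, 0.48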
3.2 Vectorization Methods:
• Bag of Words
• TF-IDF
• Word2Vec
3.2.1 Bag of Words:
• Bag of Words (BoW) is a Natural Language Processing strategy for converting a text
document into numbers that can be used by a computer program. The method converts
text into a vector based on the frequency of words in the text, without considering the
order or context of the words.
• To see how BoW works in practice, imagine a social media platform that aims to analyze
customer reviews and understand the popularity of its services among users. Such a
platform could employ the Bag of Words method to process customer reviews, as in the
sketch below.
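A minimal sketch (toy reviews assumed):

from sklearn.feature_extraction.text import CountVectorizer

reviews = ["great service and great food", "slow service"]  # toy reviews

vec = CountVectorizer()
X = vec.fit_transform(reviews)

print(vec.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                  # per-review word counts; order and context are ignored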
3.2.2 TF-IDF:
• Similar to count vectorization, another text-handling option is TF-IDF vectorization.
• Term frequency-inverse document frequency scales word occurrences by the inverse of
their frequency in the entire dataset, instead of building the occurrence matrix on raw
counts alone.
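A sketch of the weighting using the textbook formula (scikit-learn's exact smoothing differs slightly; the corpus is illustrative):

import math

docs = [["good", "food"], ["good", "service"], ["bad", "food"]]  # toy tokenized corpus
N = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)         # term frequency within the document
    df = sum(1 for d in docs if term in d)  # number of documents containing the term
    return tf * math.log(N / df)            # scaled by inverse document frequency

print(tf_idf("good", docs[0]))  # frequent across the corpus -> lower weight
print(tf_idf("bad", docs[2]))   # rare across the corpus -> higher weight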
3.2.3 Word2Vec:
• The Word2Vec model can be implemented using two architectural designs: the
Continuous Bag of Words (CBOW) model and the Continuous Skip-Gram model. Both
models aim to reduce the dimensionality of the data and create dense word vectors, but
they approach the problem differently.
• Word2Vec is a group of machine learning architectures that can find words with similar
contexts and group them together. Here, 'context' refers to the group of words that
surrounds a given word in a sentence or a document.
• This grouping, also known as word embeddings, results in clusters of vectors with similar
meanings: vectors representing words with similar meanings are positioned close to each
other in this high-dimensional space.
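As a sketch, both architectures can be trained with the gensim library (an assumed choice, since the report does not name its implementation; the corpus and hyperparameters are illustrative):

from gensim.models import Word2Vec

# toy tokenized corpus; a real run would use the preprocessed review tokens
sentences = [["food", "was", "great"], ["service", "was", "great"],
             ["food", "was", "terrible"], ["service", "was", "slow"]]

# sg=0 selects CBOW and sg=1 selects Skip-Gram; vector_size and window are assumed settings
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(model.wv["food"].shape)         # a 50-dimensional dense vector
print(model.wv.most_similar("food"))  # words whose vectors lie nearby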
3.3 Environment and Execution
The experiments were conducted on virtual machines (VMs) hosted on the Azure and Google
Colab platforms. The Azure platform provided a virtual machine with CPU resources, well suited
for running resource-efficient tasks, and its 28 GB of RAM facilitated handling larger datasets
and models. These platforms provided cloud-based computing resources without the need for
local hardware. The programming environment employed for the experiments was Python.
RESULTS AND DISCUSSIONS
Table 2: Performance metrics for Restaurant review dataset

Vectorizer                     Accuracy   Precision   Recall   F1 score
Count                          82         86          77       82
Count (cross-validation)       78         79          77       78
TFIDF                          82         88          75       81
TFIDF (cross-validation)       80         83          74       78
Word2vec                       53         54          65       59
Word2vec (cross-validation)    52         52          40       45
BERT                           53         55          52       53
BERT (cross-validation)        53         53          52       52
Table 3: Performance metrics for Movie review dataset

Vectorizer   Accuracy   Precision   Recall   F1 score
Count        89         88          90       88
TFIDF        90         88          91       90
Word2vec     82         82          84       83
BERT         51         51          51       51
Table 4: Pretrained Models as Feature Extractors

Classifier               Feature Extractor   Accuracy   Precision   Recall   Dataset
Naive Bayes              BERT                0.86       0.94        0.90     Restaurant Reviews
Naive Bayes              BERT                0.69       0.67        0.68     Finance Reviews
Support Vector Machine   DistilBERT          0.88       0.94        0.82     Restaurant Reviews
Random Forest            DistilBERT          0.90       0.86        0.97     Restaurant Reviews
Table 5: Classification reports by model and word embedding method

Naive Bayes + TF-IDF:
Class          Precision   Recall   F1-score
0              0.76        0.78     0.77
1              0.79        0.77     0.78
macro avg      0.77        0.78     0.77
weighted avg   0.78        0.78     0.78

Naive Bayes + Count:
Class          Precision   Recall   F1-score
0              0.73        0.77     0.75
1              0.78        0.74     0.76
macro avg      0.76        0.76     0.75
weighted avg   0.76        0.76     0.76

Logistic Regression + GloVe (accuracy 0.73):
Class          Precision   Recall   F1-score
0              0.70        0.78     0.74
1              0.77        0.69     0.73
macro avg      0.74        0.74     0.73
weighted avg   0.74        0.73     0.73

Support Vector Machine + TF-IDF and GloVe combined (accuracy 0.72):
Class          Precision   Recall   F1-score
0              0.68        0.79     0.73
1              0.77        0.65     0.71
macro avg      0.73        0.72     0.72
weighted avg   0.73        0.72     0.72

Random Forest + Count (accuracy 0.73):
Class          Precision   Recall   F1-score
0              0.69        0.84     0.76
1              0.71        0.61     0.69
macro avg      0.74        0.73     0.73
weighted avg   0.74        0.73     0.73

Random Forest + TF-IDF (accuracy 0.74):
Class          Precision   Recall   F1-score
0              0.71        0.84     0.77
1              0.80        0.65     0.72
macro avg      0.75        0.74     0.74
weighted avg   0.75        0.74     0.74
4.1 Restaurant Review Dataset
Count Vectorizer and TF-IDF: Both Count Vectorizer and TF-IDF achieved relatively high
accuracy scores around 82%. They displayed good precision and recall scores, with TF-IDF
showing slightly better precision but lower recall.
Word2Vec and BERT: Word2Vec and BERT models exhibited lower performance on the
restaurant review dataset compared to the traditional vectorizers. These models yielded accuracy
scores around 53%, with varying precision, recall, and F1 scores. Cross-validation results also
showed consistent performance trends, with Word2Vec and BERT lagging behind CountVectorizer
and TF-IDF.
4.2 Movie Review Dataset:
Count Vectorizer and TF-IDF: Both vectorizers performed quite well on the larger movie review
dataset. Count Vectorizer achieved an accuracy of 89%, while TF-IDF slightly outperformed it
with 90% accuracy. Both showed good precision, recall, and F1 scores, indicating robust
performance in sentiment analysis for this dataset.
Word2Vec: Word2Vec demonstrated respectable performance with an accuracy of 82%, showing
balanced precision, recall, and F1 scores, albeit slightly lower than the traditional vectorizers.
Figure 1: Comparison of Accuracy scores
CONCLUSION
5.1 Dataset Size Impact:
The results showcase a notable impact of dataset size on model performance. Traditional
vectorizers, particularly Count Vectorizer and TF-IDF, maintained consistent and strong
performance across both datasets. Their effectiveness remained relatively stable despite the dataset
variations.
5.2 Complex Models on Small Dataset:
Advanced models like Word2Vec and BERT struggled on the smaller restaurant review dataset,
showing lower accuracy and less stable performance metrics. This highlights their dependence on
larger datasets for optimal performance, as evidenced by their better performance on the larger
movie review dataset.
5.3 Model Generalization:
The traditional vectorizers demonstrated better generalization, achieving robust performance
across different datasets. In contrast, complex models like Word2Vec and BERT might require
fine-tuning or larger datasets to exhibit their full potential and generalize effectively across diverse
domains.
FUTURE ENHANCEMENT
6.1 Improved accuracy through machine learning:
One major area of focus in the future of sentiment analysis will be improving the accuracy of the
technology. While sentiment analysis has come a long way in recent years, there is still room for
improvement. Machine learning algorithms can help to improve accuracy by continuously
learning from new data and refining their models over time. For example, a machine learning
model could be trained on a large dataset of customer reviews, allowing it to better understand
the nuances of human language and sentiment.
6.2 Multilingual sentiment analysis:
As businesses become increasingly global, the ability to perform sentiment analysis in multiple
languages will become more important. Multilingual sentiment analysis is already being used in
some industries, such as hospitality and travel, where it is important to understand the sentiment
of customers from different countries. In the future, we can expect to see more advanced
multilingual sentiment analysis tools that can accurately analyze sentiment across multiple
languages.
6.3 Real-time sentiment analysis:
Another area of focus for the future of sentiment analysis is real-time analysis. Real-time
sentiment analysis can be useful in a variety of contexts, such as monitoring social media during
a crisis or analyzing customer feedback in real-time. Real-time sentiment analysis can provide
businesses with valuable insights that they can use to make decisions quickly and respond to
changing circumstances.
6.4 Combining sentiment analysis with other technologies:
Finally, we can expect to see more businesses combining sentiment analysis with other
technologies to gain even deeper insights into their customers. For example, sentiment
analysis could be combined with natural language processing (NLP) to better understand the
meaning behind customer feedback. By combining these technologies, businesses can gain a
more complete picture of their customers and make more informed decisions. The future of
sentiment analysis is bright, with many exciting developments on the horizon. As businesses
continue to generate more data and look for ways to gain a competitive edge, sentiment analysis
will become an increasingly important tool for understanding customer sentiment and making
data-driven decisions. By embracing new technologies and techniques, businesses can stay
ahead of the curve and continue to deliver value to their customers.
REFERENCES
[1] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment
Classification using Machine Learning Techniques. In Proceedings of the 2002 Conference on
Empirical Methods in Natural Language Processing (EMNLP 2002), pages 79–86. Association
for Computational Linguistics.
[2] Dave K, Lawrence S, Pennock DM. Mining the peanut gallery: Opinion extraction and
semantic classification of product reviews. In Proceedings of the 12th International Conference
on World Wide Web 2003 May 20 (pp. 519–528).
[3] Go A, Bhayani R, Huang L. Twitter sentiment classification using distant supervision.
CS224N project report, Stanford. 2009 Dec;1(12):2009.
[4] Bermingham A, Smeaton AF. Classifying sentiment in microblogs: is brevity an advantage?
In Proceedings of the 19th ACM International Conference on Information and Knowledge
Management 2010 Oct 26 (pp. 1833–1836).
[5] Agarwal A, Xie B, Vovsha I, Rambow O, Passonneau RJ. Sentiment analysis of Twitter data.
In Proceedings of the Workshop on Language in Social Media (LSM 2011) 2011 Jun (pp. 30–38).
[6] Bhoi A, Joshi S. Various approaches to aspect-based sentiment analysis. arXiv preprint
arXiv:1805.01984. 2018 May 5.
[7] Fouad MM, Gharib TF, Mashat AS. Efficient Twitter sentiment analysis system with feature
selection and classifier ensemble. In The International Conference on Advanced Machine
Learning Technologies and Applications (AMLTA2018) 2018 (pp. 516–527). Springer
International Publishing.
[8] Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector
space. arXiv preprint arXiv:1301.3781. 2013 Jan 16.
[9] Pennington J, Socher R, Manning CD. GloVe: Global vectors for word representation. In
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing
(EMNLP) 2014 Oct (pp. 1532–1543).
[10] Parveen N, Chakrabarti P, Hung BT, et al. Twitter sentiment analysis using hybrid gated
attention recurrent network. J Big Data 10, 50 (2023). https://doi.org/10.1186/s40537-023-00726-3
[11] Kaur G, Sharma A. A deep learning-based model using hybrid feature extraction approach
for consumer sentiment analysis. J Big Data 10, 5 (2023). https://doi.org/10.1186/s40537-022-00680-6
[12] Lin CH, Nuha U. Sentiment analysis of Indonesian datasets based on a hybrid deep-learning
strategy. J Big Data 10, 88 (2023). https://doi.org/10.1186/s40537-023-00782-9
[13] Wang L, Niu J, Yu S. SentiDiff: combining textual information and sentiment diffusion
patterns for Twitter sentiment analysis. IEEE Trans Knowl Data Eng. 2020;32(10):2026–39.
https://doi.org/10.1109/tkde.2019.2913641
[14] Hao Y, Mu T, Hong R, Wang M, Liu X, Goulermas JY. Cross-domain sentiment encoding
through stochastic word embedding. IEEE Trans Knowl Data Eng. 2020;32(10):1909–22.
https://doi.org/10.1109/tkde.2019.2913379
[15] Zhu L, Li W, Shi Y, Guo K. SentiVec: learning sentiment-context vector via kernel
optimization function for sentiment analysis. IEEE Trans Neural Netw Learn Syst.
2021;32(6):2561–72. https://doi.org/10.1109/tnnls.2020.3006531
[16] Dhakal S. Combining sentiment lexicons and content-based features for depression detection.
IEEE Intell Syst. 2021;36:99–105. https://doi.org/10.1109/MIS.2021.3093660
[17] Eklund M. Comparing Feature Extraction Methods and Effects of Pre-Processing Methods
for Multi-Label Classification of Textual Data.
[18] Sanh V, Debut L, Chaumond J, Wolf T. DistilBERT, a distilled version of BERT: smaller,
faster, cheaper, and lighter. arXiv preprint arXiv:1910.01108. 2019 Oct 2.
[19] Sun Z, Yu H, Song X, Liu R, Yang Y, Zhou D. MobileBERT: Task-agnostic compression of
BERT by progressive knowledge transfer.
[20] Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R. ALBERT: A lite BERT for
self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. 2019 Sep
26.
APPENDIX
API key –Youtube generation steps
1. Log in to Google Developers Console.
2. Create a new project.
3. On the new project dashboard, click Explore & Enable APIs.
4. In the library, navigate to YouTube Data API v3 under YouTube APIs.
5. Enable the API.
6. Create a credential.
7. A screen will appear with the API key.
API Key: AIzaSyChzk3uMX8lo_dyhbE62PqtMfT5LRMT5II
8.1 Dataset Creation Code:
import pandas as pd
import numpy as np
from googleapiclient.discovery import build

video_id = "GswSg2ohqmA"
api_key = 'AIzaSyChzk3uMX8lo_dyhbE62PqtMfT5LRMT5II'

# recursive function to get all replies in a comment thread
def get_replies(comment_id, token):
    replies_response = yt_object.comments().list(part='snippet', maxResults=100,
                                                 parentId=comment_id, pageToken=token).execute()
    for reply in replies_response['items']:
        all_comments.append(reply['snippet']['textDisplay'])
    if replies_response.get("nextPageToken"):
        return get_replies(comment_id, replies_response['nextPageToken'])
    else:
        return []

# recursive function to get all comments
def get_comments(youtube, video_id, next_view_token):
    global all_comments
    # check for token; an empty token marks the first call, so reset the list
    if len(next_view_token.strip()) == 0:
        all_comments = []
    if next_view_token == '':
        # get the initial response
        comment_list = youtube.commentThreads().list(part='snippet', maxResults=100,
                                                     videoId=video_id,
                                                     order='relevance').execute()
    else:
        # get the next page response
        comment_list = youtube.commentThreads().list(part='snippet', maxResults=100,
                                                     videoId=video_id, order='relevance',
                                                     pageToken=next_view_token).execute()
    # loop through all top-level comments
    for comment in comment_list['items']:
        # add comment to list
        all_comments.append([comment['snippet']['topLevelComment']['snippet']['textDisplay']])
        # get number of replies
        reply_count = comment['snippet']['totalReplyCount']
        all_replies = []
        if reply_count > 0:
            # get first 100 replies
            replies_list = youtube.comments().list(part='snippet', maxResults=100,
                                                   parentId=comment['id']).execute()
            for reply in replies_list['items']:
                # add reply to list
                all_replies.append(reply['snippet']['textDisplay'])
            # check for more replies
            while "nextPageToken" in replies_list:
                token_reply = replies_list['nextPageToken']
                # get next set of 100 replies
                replies_list = youtube.comments().list(part='snippet', maxResults=100,
                                                       parentId=comment['id'],
                                                       pageToken=token_reply).execute()
                for reply in replies_list['items']:
                    # add reply to list
                    all_replies.append(reply['snippet']['textDisplay'])
            # add all replies to the comment
            all_comments[-1].append(all_replies)
    if "nextPageToken" in comment_list:
        return get_comments(youtube, video_id, comment_list['nextPageToken'])
    else:
        return []

all_comments = []
# build a YouTube client object using the API key
yt_object = build('youtube', 'v3', developerKey=api_key)
# get all comments and replies
comments = get_comments(yt_object, video_id, '')
print(all_comments)

a = np.array(all_comments, dtype=object)  # dtype=object handles ragged reply lists
b = np.reshape(a, -1)
result = pd.DataFrame(b)
result.to_excel('comments.xlsx')
Figure 2: Dataset (before and after cleaning)
Figure 3: Labeling
8.2 Import Statement code:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
data=pd.read_csv('/content/imdb.csv')
data.head()
x1=data['Review']
y=data['Liked']
print(data['Review'])
8.3 Logistic Regression With Count Vectorization:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vectorizer = CountVectorizer()
# Fit and transform the data
xcv = vectorizer.fit_transform(x1)
x_train, x_test, y_train, y_test = train_test_split(xcv, y, test_size=0.3, random_state=23)
clf = LogisticRegression(random_state=0)
clf.fit(x_train, y_train)
# Prediction
y_pred = clf.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
8.4 Logistic Regression With TF-IDF Vectorization:
vectorizer = TfidfVectorizer()
xtfidf = vectorizer.fit_transform(x1)
x_train, x_test, y_train, y_test = train_test_split(xtfidf, y, test_size=0.3, random_state=23)
clf.fit(x_train, y_train)
# Prediction
predictions = clf.predict(x_test)
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
8.5 Testing Code:
from sklearn.model_selection import cross_validate  # needed for the cross-validation below

x_train, x_test, y_train, y_test = train_test_split(xtfidf, y, test_size=0.3, random_state=23)
scoring = ['accuracy', 'precision', 'recall', 'f1']
# Perform cross-validation
cv_results = cross_validate(clf, x_train, y_train, cv=5, scoring=scoring)
# Print the results
print("Accuracy: {:.2f}".format(cv_results['test_accuracy'].mean()))
print("Precision: {:.2f}".format(cv_results['test_precision'].mean()))
print("Recall: {:.2f}".format(cv_results['test_recall'].mean()))
print("F1 Score: {:.2f}".format(cv_results['test_f1'].mean()))
ABBREVIATIONS

Word         Expansion
NB           Naïve Bayes
SVM          Support Vector Machine
Max Ent      Maximum Entropy
CNN          Convolutional Neural Network
MNB          Multinomial Naïve Bayes
DT           Decision Tree
RFDT         Random Forest Decision Tree
ET           Extra Trees
LR           Logistic Regression
KNN          K-Nearest Neighbour
BoW          Bag of Words
TF-IDF       Term Frequency-Inverse Document Frequency
Lex          Lexicons
C2OR         Context-to-Object-word Reward model
Emo          Emoticons
PoS          Part of Speech
W2V          Word to Vector
GloVe        Global Vectors for Word Representation
BERT         Bidirectional Encoder Representations from Transformers
ELMo         Embeddings from Language Models
RoBERTa      Robustly Optimized BERT
ALBERT       A Lite BERT
DistilBERT   Distilled BERT
ELECTRA      Efficiently Learning an Encoder that Classifies Token Replacements Accurately
SST          Stanford Sentiment Treebank
LTF-MICF     Log Term Frequency-based Modified Inverse Class Frequency
TCN          Temporal Convolutional Networks
O2SR         Object-to-Surrounding word Reward model