Sentiment analysis, also known as opinion mining, is the use of natural language processing, text analysis, and computational linguistics to identify and extract subjective information from source materials. It is used to determine the attitudes, opinions, and emotions of a speaker or writer with respect to some topic, or the overall contextual polarity of a document. The output of sentiment analysis is often a score or a categorical label indicating the sentiment of the text (e.g. positive, negative, neutral). Sentiment analysis is widely used in applications such as customer service, marketing, and political analysis. Sentiment analysis techniques fall into two broad categories: rule-based and statistical. Rule-based approaches use a set of predefined rules to identify the sentiment in text, while statistical approaches use machine learning techniques to do so. Sentiment analysis is a multidisciplinary field, drawing on psychology, sociology, natural language processing, and machine learning.

Types of sentiment analysis

1. Binary Sentiment Analysis: classifies text into two categories, positive or negative.
2. Multi-class Sentiment Analysis: classifies text into more than two categories, such as positive, neutral, and negative.
3. Fine-grained Sentiment Analysis: classifies text into more specific categories, such as very positive, positive, neutral, negative, and very negative.
4. Opinion Mining: goes beyond identifying positive or negative sentiment and aims to extract subjective information from text, such as the opinion holder, target, and context.
5. Emotion Detection: focuses on identifying and extracting emotions from text, such as joy, anger, fear, and sadness.
6. Stance Detection: classifies text based on the stance or position it takes on a particular topic, such as for, against, or neutral.
7. Irony Detection: aims to identify irony or sarcasm in text, which can be difficult to detect because of figurative language.
8. Subjectivity Detection: classifies text as subjective or objective.

Techniques used in rule-based sentiment analysis

In a rule-based approach, sentiment analysis relies on predefined rules to identify the sentiment in text. Techniques commonly used in rule-based sentiment analysis include:

Dictionary-based: uses a predefined list of words (a dictionary) with associated sentiment scores (positive, negative, neutral) to identify the sentiment in text. Words that are not in the dictionary are assigned a neutral sentiment.

Regular expression-based: uses regular expressions to identify patterns in text that indicate sentiment, for example emoticons such as :-) and :-(, which indicate positive and negative sentiment respectively.

Syntactic-based: uses the grammatical structure of the text to identify sentiment, for example looking for negations such as "not" or "never" to invert the sentiment of a word or phrase.

Hybrid: combines multiple rule-based techniques to improve sentiment analysis results.
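To make these rule-based techniques concrete, here is a minimal sketch that combines a dictionary rule, an emoticon regular expression, and negation handling in one scoring function. The word lists and scores below are invented for illustration; they are not a published lexicon, and a real system would use a much larger dictionary.

import re

# Toy sentiment dictionary (illustrative scores, not a real lexicon)
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "awful": -2, "hate": -2}
NEGATIONS = {"not", "never", "no"}
# Regular-expression rule: emoticons carry their own polarity
POSITIVE_EMOTICON = re.compile(r":-?\)")
NEGATIVE_EMOTICON = re.compile(r":-?\(")

def rule_based_score(text):
    score = len(POSITIVE_EMOTICON.findall(text)) - len(NEGATIVE_EMOTICON.findall(text))
    tokens = re.findall(r"[a-z']+", text.lower())
    for i, tok in enumerate(tokens):
        word_score = LEXICON.get(tok, 0)  # words not in the dictionary stay neutral
        # Syntactic rule: a negation immediately before a word inverts its polarity
        if i > 0 and tokens[i - 1] in NEGATIONS:
            word_score = -word_score
        score += word_score
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(rule_based_score("The food was not good :-("))  # negative
print(rule_based_score("I love this place :-)"))      # positive

As the next paragraph notes, the quality of such a scorer is bounded by the coverage and quality of its rules.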
Note that rule-based approaches have some limitations: they may have limited coverage of sentiment words and idioms, and they cannot capture every nuance of sentiment. Additionally, the performance of rule-based approaches depends on the quality and completeness of the predefined rules.

1.2 Applications of sentiment analysis

Sentiment analysis is the process of determining the emotional tone or attitude of a piece of text. Popular applications include:

Social media monitoring: monitoring platforms such as Twitter and Facebook to gauge public opinion on a particular topic or brand.

Marketing and public relations: tracking and analyzing customer sentiment towards a brand, product, or service, to identify areas where a company is excelling and areas where it needs to improve.

Customer service: monitoring customer feedback and reviews to identify areas where customers are unhappy and to respond promptly.

Politics: tracking and analyzing public opinion on political issues, candidates, and parties.

Stock market: tracking and analyzing news and social media sentiment towards publicly traded companies to predict stock market movements.

Healthcare: analyzing patient feedback on healthcare services to improve the quality of care.

Online reviews: analyzing customer feedback from online reviews and ratings to improve the product or service.

Disaster response: analyzing public opinion and reactions to natural disasters, such as hurricanes and earthquakes, to identify areas where aid is needed.

Machine learning algorithms suitable for sentiment analysis

Several machine learning algorithms are commonly used in sentiment analysis; some of the popular ones include:

Logistic Regression: a supervised learning algorithm used for classification, often applied in sentiment analysis to classify text as positive, negative, or neutral.

Naive Bayes: a simple probabilistic classifier based on Bayes' theorem, which states that the probability of a hypothesis (here, the sentiment of a given text) given the evidence (the text) is proportional to the probability of the evidence given the hypothesis, multiplied by the prior probability of the hypothesis.

Support Vector Machines (SVMs): a supervised learning algorithm usable for classification and regression, often applied in sentiment analysis to classify text as positive, negative, or neutral.

Decision Trees: a supervised learning algorithm that creates a tree-like model of decisions and their possible consequences, often used for the same sentiment classification task.

Random Forest: an ensemble method for classification and regression, made up of multiple decision trees.

Gradient Boosting: an ensemble method for classification and regression that builds multiple decision trees sequentially, each correcting the errors of the previous ones.

Neural Networks: a set of algorithms, modeled loosely after the human brain, designed to recognize patterns. Architectures such as Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Convolutional Neural Networks (CNNs) are often used in sentiment analysis to identify the sentiment of a given text.

Deep Learning: the broader class of machine learning techniques, inspired by the structure and function of the brain, to which the RNN, LSTM, and CNN architectures above belong.

These algorithms are widely used and considered effective in sentiment analysis, but the choice of algorithm depends on the specific use case and the dataset used for training the model.
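Because these classifiers all consume the same kind of feature matrix, they are easy to compare in a single loop. The sketch below does so with scikit-learn on a toy labeled dataset; the texts and labels are invented for demonstration and stand in for a real corpus.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

# Toy labeled data (1 = positive, 0 = negative)
texts = ["great food", "awful service", "loved it", "hated the wait",
         "really tasty", "terrible experience"]
labels = [1, 0, 1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(texts)

# The same features can be fed to any of the classifiers discussed above
for clf in (LogisticRegression(), MultinomialNB(), LinearSVC(), RandomForestClassifier()):
    clf.fit(X, labels)
    print(type(clf).__name__, clf.predict(X))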
Overview of different models used for sentiment analysis

Sentiment analysis employs a range of models, from traditional statistical methods to state-of-the-art deep learning architectures. Traditional models, such as Logistic Regression, Naive Bayes, and Support Vector Machines (SVM), rely on feature engineering and statistical techniques, using bag-of-words or TF-IDF representations to classify sentiments. Decision Trees and their ensemble variants, like Random Forests, offer interpretable models for sentiment classification. Moving towards embedding-based approaches, word embedding models such as Word2Vec, GloVe, and FastText map words into dense vectors, capturing semantic relationships. Meanwhile, recurrent neural networks (RNNs) and their specialized form, Long Short-Term Memory (LSTM) networks, have demonstrated effectiveness in sequence modeling and sentiment classification by retaining sequential information. Advanced transformer-based models, including BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), and their derivatives, employ attention mechanisms and contextual embeddings to capture intricate language patterns and nuances in sentiment-laden text, often achieving state-of-the-art performance in sentiment analysis tasks. These models showcase a spectrum of techniques, from traditional statistical approaches to sophisticated deep learning architectures, each with its advantages in handling different aspects of sentiment classification tasks.

Word Embedding Variants and Contributors

Reference   First Author and Year    Approach
[12]        Mikolov T, 2013          Word2vec
[13]        Le Q and Mikolov T, 2014 Doc2vec
[20]        Pennington J, 2014       GloVe
[11]        Joulin A, 2016           FastText
[14]        Devlin J, 2018           BERT
[19]        Matthew E, 2018          ELMo
[15]        Liu Y, 2019              RoBERTa
[16]        Lan Z, 2019              ALBERT
[18]        Sanh V, 2019             DistilBERT
[17]        Clark K, 2020            ELECTRA
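As a quick illustration of what these embedding models provide, the sketch below loads a small set of pre-trained GloVe vectors through gensim's downloader and queries nearest neighbours. The model name is one of gensim's published downloads; this is an illustrative example, not the setup used later in this project.

import gensim.downloader as api

# Load a small pre-trained GloVe model from gensim's model hub
vectors = api.load("glove-wiki-gigaword-50")

# Words used in similar contexts end up close together in the vector space
print(vectors.most_similar("excellent", topn=3))
print(vectors.similarity("good", "bad"))  # cosine similarity of two words
print(vectors["good"].shape)              # each word maps to a 50-dimensional vector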
1.3 Purpose and scope of the Project

The purpose of this research study is to investigate the pivotal influence of dataset size on the performance and accuracy of diverse sentiment analysis models, particularly in the context of customer review data. The study undertakes a comparative analysis of traditional models (such as Logistic Regression with Count Vectorizer and TF-IDF Vectorizer) against more advanced models (such as Word2Vec and BERT Tokenizer) under varying dataset scales. By systematically evaluating these models on both small-scale and large-scale customer review datasets, the research aims to discern the impact of dataset size on the efficacy and accuracy of sentiment analysis outcomes. Additionally, the study seeks to elucidate the challenges faced by advanced models when operating on smaller datasets and to highlight the improvements observed as dataset sizes increase. The research intends to contribute insights into the significance of dataset scale in shaping the performance of sentiment analysis models, offering practical guidance for practitioners and researchers in choosing models according to dataset size.

LITERATURE REVIEW

2.1 Conventional and automated approaches

Pang and Lee emphasized the significance of unigrams as the most influential feature for effectiveness [1]. Additionally, a subjectivity extractor demonstrated superior performance [2]. Furthermore, the combination of unigrams with an emoticon classifier yielded favorable results [3]. Unigrams, when combined with bigrams and trigrams, proved effective for lengthy corpora [4]. Features with assigned polarity outperformed unigrams [5]. Despite its strong performance with an extensive vocabulary, the unigram feature space's high dimensionality and sparsity can affect both computational efficiency and model performance. Bhoi et al. (2018) showed that TF-IDF features outperform alternatives when used with a deep neural network [6]. Fouad et al. (2018) combined bag-of-words features with lexicons, emoticons, and part-of-speech (PoS) tags, obtaining better results with an ensemble classifier [7]. The Continuous Bag-of-Words (CBOW) and Skip-gram models are two popular variants of the Word2Vec algorithm [8]. Pennington et al. (2014) introduced GloVe, whose extraction method outperforms CBOW and Skip-gram with a 93.2 F1 score [9].

2.2 Hybrid approaches

Parveen et al. (2023) proposed LTF-MICF with a GARN (RNN and attention) architecture [10]. Kaur et al. (2023) combined review features and aspect features with an LSTM classifier [11]. Lin et al. (2023) combined BERT and DistilBERT with BiLSTM and TCN classifiers [12]. Wang et al. (2020) combined textual information with sentiment diffusion [13]. A stochastic word embedding technique was introduced by Hao et al. (2020) [14]. The O2SR and C2OR vector representations were introduced by Zhu et al. (2021) [15]. Wang et al. (2021) proposed Refined GloVe and Refined Word2Vec [16].

2.3 Pre-trained model approaches

DistilBERT is a more compact and efficient variant of the popular BERT (Bidirectional Encoder Representations from Transformers) model [18]. MobileBERT was created using a novel technique called progressive knowledge transfer: knowledge distillation transfers information from larger models to smaller ones, helping MobileBERT retain important linguistic knowledge [19]. "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations" introduces ALBERT, a more lightweight variant of BERT for self-supervised learning of language representations [20]. "TinyBERT: Distilling BERT for Natural Language Understanding" introduces TinyBERT, a compressed and distilled version of BERT tailored for Natural Language Understanding (NLU) tasks [21]. "ELECTRA: Pre-training Text Encoders as Discriminators Rather than Generators" introduces the ELECTRA model, a novel approach to pre-training text encoders for natural language processing (NLP) tasks: unlike traditional pre-training methods that use a masked language model (MLM) objective to predict missing tokens, ELECTRA trains the encoder to discriminate real tokens from replaced ones [22]. "I-BERT: Integer-only BERT Quantization" (Kim et al., 2021) proposes a novel quantization scheme for Transformer-based models that performs the entire inference with integer-only arithmetic [23].
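To show how compact pre-trained models of this family are applied to sentiment analysis in practice, here is a minimal sketch using the Hugging Face transformers pipeline. The checkpoint named below is the library's publicly available DistilBERT model fine-tuned on SST-2; it is an illustrative choice, not a model trained in this project.

from transformers import pipeline

# Load a DistilBERT checkpoint fine-tuned for binary sentiment (SST-2)
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

print(classifier("The food was wonderful and the staff were friendly."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]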
2.4 ER Diagram

METHODOLOGY

Dataset creation

The initial step of this research project was the creation of a dataset. YouTube comments were identified as the primary data source, accessed through the YouTube API(s), and retrieved with a Python script. The collected data was stored in CSV format, and a preliminary data-cleaning phase addressed missing values, duplicates, and outliers. The dataset then underwent essential preprocessing steps, including text tokenization, text normalization, and the removal of stop words. The study also performed feature engineering, extracting pertinent features from the textual data using n-grams, TF-IDF, and word embeddings. Manual labeling was then conducted to annotate the dataset, and a balancing process was implemented to mitigate class imbalance. For word embeddings, two prominent techniques were adopted: Word2Vec and BERT (Bidirectional Encoder Representations from Transformers). The machine learning classifier Logistic Regression was employed for the analysis. Classifier performance was assessed using a battery of established evaluation metrics, encompassing accuracy, precision, recall, F1-score, and a comprehensive examination via confusion matrix analysis.

Data Preprocessing

Data preprocessing is the initial stage of natural language processing tasks. Its primary objective is to transform unstructured text data into a structured, clean format that machine learning algorithms can process and understand more easily. This step is significant because the quality of the preprocessed data directly influences the results of subsequent classification tasks. Text preprocessing aims to eliminate noise, inconsistencies, and extraneous information that could impede the following analyses. The preprocessing techniques used in this study were case normalization to lower case, stop-word removal, and lemmatization. These techniques were applied after the reviews were tokenized with the nltk library's tokenize() method, with each word preprocessed individually.
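A minimal sketch of this preprocessing chain is shown below, using nltk's word_tokenize, English stop-word list, and WordNet lemmatizer. The specific function choices are illustrative assumptions about the pipeline described above, not a verbatim copy of the project code.

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required nltk resources
# ("punkt_tab" is only needed on newer nltk releases)
for resource in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(review):
    tokens = word_tokenize(review.lower())               # case normalization + tokenization
    tokens = [t for t in tokens if t.isalpha()]          # drop punctuation and numbers
    tokens = [t for t in tokens if t not in stop_words]  # stop-word removal
    return [lemmatizer.lemmatize(t) for t in tokens]     # lemmatization

print(preprocess("The dishes were amazing, but the waiters were really slow!"))
# e.g. ['dish', 'amazing', 'waiter', 'really', 'slow']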
Labeling

Evaluation metrics

Several evaluation metrics are commonly used for evaluating the performance of sentiment analysis models, including:

1. Accuracy: the ratio of correctly classified instances to the total number of instances, i.e. Accuracy = (TP + TN) / (TP + TN + FP + FN). This is the most commonly used metric for evaluating sentiment analysis models.
2. Precision: the ratio of true positives to the sum of true positives and false positives, i.e. Precision = TP / (TP + FP). It measures the model's ability to correctly identify positive instances.
3. Recall: the ratio of true positives to the sum of true positives and false negatives, i.e. Recall = TP / (TP + FN). It measures the model's ability to find all the positive instances.
4. F1-Score: the harmonic mean of precision and recall, F1 = 2 x Precision x Recall / (Precision + Recall), which balances the trade-off between the two.
5. Confusion Matrix: a matrix that compares the predicted sentiment of a given text against the true sentiment. It contains four elements: true positives, true negatives, false positives, and false negatives.

Table 1: Dataset description

Nature of dataset             Positive labels   Negative labels
Restaurant Reviews (Custom)   500               500
IMDB                          3457              19981

3.1 Machine learning algorithms

3.1.1 Logistic Regression

Logistic regression is one of the most popular machine learning algorithms and comes under the supervised learning technique. It is used for predicting a categorical dependent variable from a given set of independent variables. Because logistic regression predicts a categorical dependent variable, the outcome must be a categorical or discrete value: Yes or No, 0 or 1, true or false, and so on. Instead of outputting exactly 0 or 1, however, it gives probabilistic values that lie between 0 and 1.

Logistic function: the sigmoid function, sigmoid(z) = 1 / (1 + e^(-z)), maps predicted values to probabilities. Because the output of logistic regression must stay between 0 and 1, the mapping forms an "S"-shaped curve, called the sigmoid or logistic function.

Assumptions for logistic regression:
The dependent variable must be categorical in nature.
The independent variables should not exhibit multicollinearity.

Types of logistic regression:
Binomial: only two possible types of dependent variable, such as 0 or 1, Pass or Fail.
Multinomial: three or more possible unordered types of dependent variable, such as "cat", "dog", or "sheep".
Ordinal: three or more possible ordered types of dependent variable, such as "low", "medium", or "high".

3.1.2 Support Vector Machine Algorithm

Support Vector Machine (SVM) is one of the most popular supervised learning algorithms and is used for both classification and regression problems, though primarily for classification. The goal of the SVM algorithm is to create the best line or decision boundary that segregates n-dimensional space into classes, so that new data points can easily be placed in the correct category in the future. This best decision boundary is called a hyperplane. SVM can be used for face detection, image classification, text categorization, and more.

Types of SVM:
Linear SVM: used for linearly separable data, i.e. data that can be divided into two classes by a single straight line; the classifier used is called a linear SVM classifier.
Non-linear SVM: used for non-linearly separable data, i.e. data that cannot be classified by a straight line; the classifier used is called a non-linear SVM classifier.

3.1.3 Naive Bayes Classifier Algorithm

The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes' theorem and used for solving classification problems. It is mainly used for text classification with high-dimensional training datasets. Popular applications of the Naïve Bayes algorithm include spam filtering, sentiment analysis, and article classification.

Types of Naive Bayes model: Gaussian, Multinomial, and Bernoulli.

Applications of the Naive Bayes classifier:
It is used for credit scoring.
It is used in medical data classification.
It can be used for real-time predictions because the Naïve Bayes classifier is an eager learner.
It is used in text classification tasks such as spam filtering and sentiment analysis.

Advantages of the Naive Bayes classifier:
Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
It can be used for binary as well as multi-class classification.
It performs well on multi-class predictions compared to other algorithms.
It is a popular choice for text classification problems.

Disadvantages of the Naive Bayes classifier:
Naive Bayes assumes that all features are independent or unrelated, so it cannot learn relationships between features.
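Since the Multinomial variant is the one most commonly applied to word-count features, here is a minimal sketch of it on toy review data; the texts and labels are invented for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["loved the biryani", "service was terrible", "great ambience",
         "food was cold and stale", "really friendly staff", "worst dinner ever"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)  # bag-of-words counts

model = MultinomialNB().fit(X, labels)
test = vectorizer.transform(["the staff was friendly but the food was cold"])
print(model.predict(test), model.predict_proba(test))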
3.1.4 K-Nearest Neighbor (KNN) Algorithm

The K-NN algorithm assumes similarity between the new case and the available cases, and puts the new case into the category most similar to the available categories. K-NN is a non-parametric algorithm, meaning it makes no assumptions about the underlying data. It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at classification time, performs an action on it.

Advantages of the KNN algorithm:
It is simple to implement.
It is robust to noisy training data.
It can be more effective if the training data is large.

Disadvantages of the KNN algorithm:
The value of K always needs to be determined, which can be complex.
The computation cost is high, because the distance to all the training samples must be calculated.

Basic algorithm:
Step 1: Select the number K of neighbors.
Step 2: Calculate the Euclidean distance from the new point to the training samples.
Step 3: Take the K nearest neighbors according to the calculated Euclidean distance.
Step 4: Among these K neighbors, count the number of data points in each category.
Step 5: Assign the new data point to the category with the maximum number of neighbors.
Step 6: The model is ready.

3.1.5 Decision Tree Classification Algorithm

A Decision Tree is a supervised learning technique that can be used for both classification and regression problems, though it is mostly preferred for classification. It is a tree-structured classifier in which internal nodes represent dataset features, branches represent decision rules, and each leaf node represents an outcome. A decision tree has two kinds of nodes: decision nodes and leaf nodes. Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the outputs of those decisions and contain no further branches. It is called a decision tree because, like a tree, it starts at a root node and expands along branches into a tree-like structure.

Decision tree terminologies: root node, leaf node, splitting, branch/sub-tree, parent/child node.

Basic algorithm:
Step 1: Begin the tree with the root node, S, which contains the complete dataset.
Step 2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step 3: Divide S into subsets containing the possible values of the best attribute.
Step 4: Generate a decision tree node containing the best attribute.
Step 5: Recursively make new decision trees using the subsets created in Step 3.
Step 6: Continue until a stage is reached where the nodes cannot be classified further; the final nodes are leaf nodes.
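The sketch below runs these steps with scikit-learn's DecisionTreeClassifier on toy vectorized reviews; the criterion parameter chooses between the two Attribute Selection Measures discussed in the next section (Gini index or entropy/information gain). The data is invented for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

texts = ["good food", "bad food", "good service", "bad service",
         "good value", "bad experience"]
labels = [1, 0, 1, 0, 1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# criterion="gini" uses the Gini index; criterion="entropy" uses information gain
tree = DecisionTreeClassifier(criterion="gini").fit(X.toarray(), labels)
print(export_text(tree, feature_names=vectorizer.get_feature_names_out().tolist()))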
Advantages of the decision tree:
It is simple to understand, as it follows the same process a human follows when making a real-life decision.
It can be very useful for solving decision-related problems.
It helps in considering all the possible outcomes of a problem.
It requires less data cleaning compared to other algorithms.

Disadvantages of the decision tree:
A decision tree can contain many layers, which makes it complex.
It may have an overfitting issue, which can be mitigated using the Random Forest algorithm.
With more class labels, the computational complexity of the decision tree may increase.

Attribute Selection Measures:
Information Gain: the reduction in entropy achieved by splitting on an attribute.
Gini Index: a measure of the impurity of a node; lower values indicate purer splits.

3.2 Vectorization Methods

Bag of Words
TF-IDF
Word2Vec

3.2.1 Bag of Words

Bag of Words (BoW) is a natural language processing strategy for converting a text document into numbers that a computer program can use. It converts text into a vector based on the frequency of the words in the text, without considering their order or context. As an example, imagine a social media platform that aims to analyze customer reviews and understand the popularity of its services among users; such a platform could employ the Bag of Words method to process customer reviews.

3.2.2 TF-IDF

Similar to count vectorization, another text-handling option is TF-IDF vectorization. Term frequency-inverse document frequency scales word occurrences by the inverse of their frequency in the entire dataset, instead of building the occurrence matrix on raw counts alone. A common form is tfidf(t, d) = tf(t, d) x log(N / df(t)), where tf(t, d) is the frequency of term t in document d, N is the number of documents, and df(t) is the number of documents containing t.

3.2.3 Word2Vec

The Word2Vec model can be implemented using two architectural designs: the Continuous Bag of Words (CBOW) model and the continuous Skip-gram model. Both aim to reduce the dimensionality of the data and create dense word vectors, but they approach the problem differently. Word2Vec is a group of machine learning architectures that finds words with similar contexts and groups them together; here, 'context' refers to the group of words that surrounds a given word in a sentence or document. This grouping, also known as word embeddings, results in clusters of vectors with similar meanings: the vectors representing words with similar meanings are positioned close to each other in this high-dimensional space.
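The following sketch trains a tiny Word2Vec model with gensim on pre-tokenized reviews and averages word vectors into a single document vector, which is one common way (assumed here, not necessarily this project's exact pipeline) to feed such embeddings to a classifier like Logistic Regression. The sg parameter switches between the CBOW (sg=0) and Skip-gram (sg=1) architectures.

import numpy as np
from gensim.models import Word2Vec

# Toy pre-tokenized corpus; a real run would use the preprocessed reviews
sentences = [["great", "food", "great", "service"],
             ["terrible", "food", "slow", "service"],
             ["friendly", "staff", "tasty", "dishes"]]

# sg=0 selects CBOW, sg=1 selects Skip-gram
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0, epochs=50)

print(model.wv.most_similar("food", topn=2))

def doc_vector(tokens):
    # Average the vectors of in-vocabulary words to represent a whole review
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

print(doc_vector(["great", "service"]).shape)  # (50,)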
3.3 Environment and Execution

The experiments were conducted on virtual machines (VMs) hosted on the Azure and Google Colab platforms. The Azure platform provided a virtual machine with CPU resources, well suited to resource-efficient tasks; its 28 GB of RAM facilitated handling larger datasets and models. These platforms provided cloud-based computing resources without the need for local hardware. The programming environment employed for the experiments was Python.

RESULTS AND DISCUSSIONS

Table 2: Performance metrics for Restaurant review dataset

Vectorizer                    Accuracy   Precision   Recall   F1 score
Count                         82         86          77       82
Count (cross-validation)      78         79          77       78
TF-IDF                        82         88          75       81
TF-IDF (cross-validation)     80         83          74       78
Word2vec                      53         54          65       59
Word2vec (cross-validation)   52         52          40       45
BERT                          53         55          52       53
BERT (cross-validation)       53         53          52       52

Table 3: Performance metrics for Movie review dataset

Vectorizer   Accuracy   Precision   Recall   F1 score
Count        89         88          90       88
TF-IDF       90         88          91       90
Word2vec     82         82          84       83
BERT         51         51          51       51

Table 4: Pretrained models as feature extractors

Classifier               Feature Extractor   Accuracy   Precision   Recall   Dataset
Naive Bayes              BERT                0.86       0.94        0.90     Restaurant Reviews
Naive Bayes              BERT                0.69       0.67        0.68     Finance Reviews
Support Vector Machine   DistilBERT          0.88       0.94        0.82     Restaurant Reviews
Random Forest            DistilBERT          0.90       0.86        0.97     Restaurant Reviews

Table 5: Per-class metrics for models with different word embedding methods (precision / recall / F1-score, with overall accuracy where reported)

Naive Bayes with TF-IDF:
  class 0: 0.76 / 0.78 / 0.77
  class 1: 0.79 / 0.77 / 0.78
  macro avg: 0.77 / 0.78 / 0.77; weighted avg: 0.78 / 0.78 / 0.78

Naive Bayes with Count:
  class 0: 0.73 / 0.77 / 0.75
  class 1: 0.78 / 0.74 / 0.76
  macro avg: 0.76 / 0.76 / 0.75; weighted avg: 0.76 / 0.76 / 0.76

Logistic Regression with GloVe (accuracy 0.73):
  class 0: 0.70 / 0.78 / 0.74
  class 1: 0.77 / 0.69 / 0.73
  macro avg: 0.74 / 0.74 / 0.73; weighted avg: 0.74 / 0.73 / 0.73

Support Vector Machine with TF-IDF and GloVe combined (accuracy 0.72):
  class 0: 0.68 / 0.79 / 0.73
  class 1: 0.77 / 0.65 / 0.71
  macro avg: 0.73 / 0.72 / 0.72; weighted avg: 0.73 / 0.72 / 0.72

Random Forest with Count (accuracy 0.73):
  class 0: 0.69 / 0.84 / 0.76
  class 1: 0.71 / 0.61 / 0.69
  macro avg: 0.74 / 0.73 / 0.73; weighted avg: 0.74 / 0.73 / 0.73

Random Forest with TF-IDF (accuracy 0.74):
  class 0: 0.71 / 0.84 / 0.77
  class 1: 0.80 / 0.65 / 0.72
  macro avg: 0.75 / 0.74 / 0.74; weighted avg: 0.75 / 0.74 / 0.74

4.1 Restaurant Review Dataset

Count Vectorizer and TF-IDF: both achieved relatively high accuracy scores of around 82%. They displayed good precision and recall, with TF-IDF showing slightly better precision but lower recall.

Word2Vec and BERT: these models exhibited lower performance on the restaurant review dataset than the traditional vectorizers, yielding accuracy scores of around 53% with varying precision, recall, and F1 scores. Cross-validation results showed consistent trends, with Word2Vec and BERT lagging behind Count Vectorizer and TF-IDF.

4.2 Movie Review Dataset

Count Vectorizer and TF-IDF: both vectorizers performed well on the larger movie review dataset. Count Vectorizer achieved an accuracy of 89%, while TF-IDF slightly outperformed it with 90% accuracy. Both showed good precision, recall, and F1 scores, indicating robust sentiment analysis performance on this dataset.

Word2Vec: Word2Vec demonstrated respectable performance with an accuracy of 82%, showing balanced precision, recall, and F1 scores, albeit slightly lower than the traditional vectorizers.

Figure 1: Comparison of Accuracy scores

CONCLUSION

5.1 Dataset Size Impact: The results show a notable impact of dataset size on model performance. Traditional vectorizers, particularly Count Vectorizer and TF-IDF, maintained consistent and strong performance across both datasets; their effectiveness remained relatively stable despite the dataset variations.

5.2 Complex Models on Small Datasets: Advanced models like Word2Vec and BERT struggled on the smaller restaurant review dataset, showing lower accuracy and less stable performance metrics. This highlights their dependence on larger datasets for optimal performance, as evidenced by their better performance on the larger movie review dataset.
5.3 Model Generalization: The traditional vectorizers demonstrated better generalization, achieving robust performance across different datasets. In contrast, complex models like Word2Vec and BERT may require fine-tuning or larger datasets to exhibit their full potential and generalize effectively across diverse domains.

FUTURE ENHANCEMENT

6.1 Improved accuracy through machine learning: A major area of focus in the future of sentiment analysis will be improving the accuracy of the technology. While sentiment analysis has come a long way in recent years, there is still room for improvement. Machine learning algorithms can help improve accuracy by continuously learning from new data and refining their models over time. For example, a machine learning model could be trained on a large dataset of customer reviews, allowing it to better understand the nuances of human language and sentiment.

6.2 Multilingual sentiment analysis: As businesses become increasingly global, the ability to perform sentiment analysis in multiple languages will become more important. Multilingual sentiment analysis is already being used in some industries, such as hospitality and travel, where it is important to understand the sentiment of customers from different countries. In the future, we can expect more advanced multilingual sentiment analysis tools that accurately analyze sentiment across multiple languages.

6.3 Real-time sentiment analysis: Another area of focus is real-time analysis, which can be useful in a variety of contexts, such as monitoring social media during a crisis or analyzing customer feedback as it arrives. Real-time sentiment analysis can provide businesses with valuable insights that allow them to make decisions quickly and respond to changing circumstances.

6.4 Combining sentiment analysis with other technologies: Finally, we can expect more businesses to combine sentiment analysis with other technologies to gain even deeper insights into their customers, for example pairing it with other natural language processing (NLP) techniques to better understand the meaning behind customer feedback. By combining these technologies, businesses can gain a more complete picture of their customers and make more informed decisions. The future of sentiment analysis is bright, with many exciting developments on the horizon. As businesses continue to generate more data and look for ways to gain a competitive edge, sentiment analysis will become an increasingly important tool for understanding customer sentiment and making data-driven decisions. By embracing new technologies and techniques, businesses can stay ahead of the curve and continue to deliver value to their customers.

REFERENCES

[1] Pang B, Lee L, Vaithyanathan S. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), 2002, pp. 79-86. Association for Computational Linguistics.
[2] Dave K, Lawrence S, Pennock DM. Mining the peanut gallery: opinion extraction and semantic classification of product reviews. In Proceedings of the 12th International Conference on World Wide Web, 2003, pp. 519-528.
[3] Go A, Bhayani R, Huang L. Twitter sentiment classification using distant supervision. CS224N project report, Stanford, 2009.
[4] Bermingham A, Smeaton AF. Classifying sentiment in microblogs: is brevity an advantage? In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, 2010, pp. 1833-1836.
[5] Agarwal A, Xie B, Vovsha I, Rambow O, Passonneau RJ. Sentiment analysis of Twitter data. In Proceedings of the Workshop on Language in Social Media (LSM 2011), 2011, pp. 30-38.
[6] Bhoi A, Joshi S. Various approaches to aspect-based sentiment analysis. arXiv preprint arXiv:1805.01984, 2018.
[7] Fouad MM, Gharib TF, Mashat AS. Efficient Twitter sentiment analysis system with feature selection and classifier ensemble. In The International Conference on Advanced Machine Learning Technologies and Applications (AMLTA 2018), 2018, pp. 516-527. Springer International Publishing.
[8] Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[9] Pennington J, Socher R, Manning CD. GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532-1543.
[10] Parveen N, Chakrabarti P, Hung BT, et al. Twitter sentiment analysis using hybrid gated attention recurrent network. Journal of Big Data 10, 50 (2023). https://doi.org/10.1186/s40537-023-00726-3
[11] Kaur G, Sharma A. A deep learning-based model using hybrid feature extraction approach for consumer sentiment analysis. Journal of Big Data 10, 5 (2023). https://doi.org/10.1186/s40537-022-00680-6
[12] Lin CH, Nuha U. Sentiment analysis of Indonesian datasets based on a hybrid deep-learning strategy. Journal of Big Data 10, 88 (2023). https://doi.org/10.1186/s40537-023-00782-9
[13] Wang L, Niu J, Yu S. SentiDiff: combining textual information and sentiment diffusion patterns for Twitter sentiment analysis. IEEE Transactions on Knowledge and Data Engineering. 2020;32(10):2026-39. https://doi.org/10.1109/tkde.2019.2913641
[14] Hao Y, Mu T, Hong R, Wang M, Liu X, Goulermas JY. Cross-domain sentiment encoding through stochastic word embedding. IEEE Transactions on Knowledge and Data Engineering. 2020;32(10):1909-22. https://doi.org/10.1109/tkde.2019.2913379
[15] Zhu L, Li W, Shi Y, Guo K. SentiVec: learning sentiment-context vector via kernel optimization function for sentiment analysis. IEEE Transactions on Neural Networks and Learning Systems. 2021;32(6):2561-72. https://doi.org/10.1109/tnnls.2020.3006531
[16] Dhakal S. Combining sentiment lexicons and content-based features for depression detection. IEEE Intelligent Systems. 2021;36:99-105. https://doi.org/10.1109/MIS.2021.3093660
[17] Eklund M. Comparing feature extraction methods and effects of pre-processing methods for multi-label classification of textual data.
[18] Sanh V, Debut L, Chaumond J, Wolf T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
[19] Sun Z, Yu H, Song X, Liu R, Yang Y, Zhou D. MobileBERT: task-agnostic compression of BERT by progressive knowledge transfer.
[20] Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R. ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.

APPENDIX

API key (YouTube) generation steps:
1. Log in to the Google Developers Console.
2. Create a new project.
3. On the new project dashboard, click Explore & Enable APIs.
4. In the library, navigate to YouTube Data API v3 under YouTube APIs.
5. Enable the API.
6. Create a credential.
7. A screen will appear with the API key.
API Key: AIzaSyChzk3uMX8lo_dyhbE62PqtMfT5LRMT5II

8.1 Dataset Creation Code:

import pandas as pd
import numpy as np
from googleapiclient.discovery import build

video_id = "GswSg2ohqmA"
api_key = 'AIzaSyChzk3uMX8lo_dyhbE62PqtMfT5LRMT5II'

# recursive function to get all replies in a comment thread
def get_replies(comment_id, token):
    replies_response = yt_object.comments().list(part='snippet', maxResults=100,
                                                 parentId=comment_id,
                                                 pageToken=token).execute()
    for reply in replies_response['items']:
        all_comments.append(reply['snippet']['textDisplay'])
    if replies_response.get("nextPageToken"):
        return get_replies(comment_id, replies_response['nextPageToken'])
    else:
        return []

# recursive function to get all comments
def get_comments(youtube, video_id, next_view_token):
    global all_comments
    # check for token
    if len(next_view_token.strip()) == 0:
        all_comments = []
    if next_view_token == '':
        # get the initial response
        comment_list = youtube.commentThreads().list(part='snippet', maxResults=100,
                                                     videoId=video_id,
                                                     order='relevance').execute()
    else:
        # get the next page response
        comment_list = youtube.commentThreads().list(part='snippet', maxResults=100,
                                                     videoId=video_id, order='relevance',
                                                     pageToken=next_view_token).execute()
    # loop through all top-level comments
    for comment in comment_list['items']:
        # add comment to list
        all_comments.append([comment['snippet']['topLevelComment']['snippet']['textDisplay']])
        # get number of replies
        reply_count = comment['snippet']['totalReplyCount']
        all_replies = []
        if reply_count > 0:
            # get first 100 replies
            replies_list = youtube.comments().list(part='snippet', maxResults=100,
                                                   parentId=comment['id']).execute()
            for reply in replies_list['items']:
                # add reply to list
                all_replies.append(reply['snippet']['textDisplay'])
            # check for more replies
            while "nextPageToken" in replies_list:
                token_reply = replies_list['nextPageToken']
                # get next set of 100 replies
                replies_list = youtube.comments().list(part='snippet', maxResults=100,
                                                       parentId=comment['id'],
                                                       pageToken=token_reply).execute()
                for reply in replies_list['items']:
                    all_replies.append(reply['snippet']['textDisplay'])
        # add all replies to the comment
        all_comments[-1].append(all_replies)

    if "nextPageToken" in comment_list:
        return get_comments(youtube, video_id, comment_list['nextPageToken'])
    else:
        return []

all_comments = []
# build a youtube service object using the api key
yt_object = build('youtube', 'v3', developerKey=api_key)
# get all comments and replies
comments = get_comments(yt_object, video_id, '')
print(all_comments)

# dtype=object is needed because the comment/reply lists are ragged
a = np.array(all_comments, dtype=object)
b = np.reshape(a, -1)
result = pd.DataFrame(b)
result.to_excel('comments.xlsx')

Figure 2: Dataset before cleaning

Data Preprocessing: dataset after cleaning

Figure 3: Labeling

8.2 Import Statement Code:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

data = pd.read_csv('/content/imdb.csv')
data.head()
x1 = data['Review']
y = data['Liked']
print(data['Review'])
8.3 Logistic Regression with Count Vectorization:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vectorizer = CountVectorizer()
# Fit and transform the data
xcv = vectorizer.fit_transform(x1)
x_train, x_test, y_train, y_test = train_test_split(xcv, y, test_size=0.3, random_state=23)

clf = LogisticRegression(random_state=0)
clf.fit(x_train, y_train)

# Prediction
y_pred = clf.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

8.4 Logistic Regression with TF-IDF Vectorization:

vectorizer = TfidfVectorizer()
xtfidf = vectorizer.fit_transform(x1)
x_train, x_test, y_train, y_test = train_test_split(xtfidf, y, test_size=0.3, random_state=23)
clf.fit(x_train, y_train)

# Prediction
predictions = clf.predict(x_test)
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

8.5 Testing Code:

from sklearn.model_selection import cross_validate  # import missing from the original listing

x_train, x_test, y_train, y_test = train_test_split(xtfidf, y, test_size=0.3, random_state=23)
scoring = ['accuracy', 'precision', 'recall', 'f1']
# Perform cross-validation
cv_results = cross_validate(clf, x_train, y_train, cv=5, scoring=scoring)
# Print the results
print("Accuracy: {:.2f}".format(cv_results['test_accuracy'].mean()))
print("Precision: {:.2f}".format(cv_results['test_precision'].mean()))
print("Recall: {:.2f}".format(cv_results['test_recall'].mean()))
print("F1 Score: {:.2f}".format(cv_results['test_f1'].mean()))

ABBREVIATIONS

Word         Expansion
NB           Naïve Bayes
SVM          Support Vector Machine
Max Ent      Maximum Entropy
CNN          Convolutional Neural Network
MNB          Multinomial Naïve Bayes
DT           Decision Tree
RFDT         Random Forest Decision Tree
ET           Extra Trees
LR           Logistic Regression
KNN          K-Nearest Neighbour
BoW          Bag of Words
TF-IDF       Term Frequency-Inverse Document Frequency
Lex          Lexicons
C2OR         Context-to-Object-word Reward model
Emo          Emoticons
PoS          Part of Speech
W2V          Word to Vector
GloVe        Global Vectors for Word Representation
BERT         Bidirectional Encoder Representations from Transformers
ELMo         Embeddings from Language Models
RoBERTa      Robustly Optimized BERT
ALBERT       A Lite BERT
DistilBERT   Distilled BERT
ELECTRA      Efficiently Learning an Encoder that Classifies Token Replacements Accurately
SST          Stanford Sentiment Treebank
LTF-MICF     Log Term Frequency-based Modified Inverse Class Frequency
TCN          Temporal Convolutional Network
O2SR         Object-to-Surrounding word Reward model