Sentiment Analysis: A Project to Detect Sentiment from Text

Course: CSE 299, Section: 11
Faculty Name: Ishan Arefin Hossain (IAH)
Group Name: GenSolvers
Group Members:
Najmun Nahar Naj (2031106642)
Asman Farhad (2031679642)
Abdullah Al Sayem (2022207042)

Declaration

The members of Team GenSolvers, Abdullah Al Sayem (2022207042), Najmun Nahar Naj (2031106642), and Asman Farhad (2031679642), hereby declare that the project report for CSE299: Junior Design Project, titled "Sentiment Analysis of YouTube Comments," submitted to the course instructor, Ishan Arefin Hossain, was prepared by us as a team exclusively for academic requirements and purposes. The members of the group further affirm that no other entity or individual may use this project work to obtain any honors, degrees, or diplomas from any institution or institute.

Abdullah Al Sayem (2022207042)
Najmun Nahar Naj (2031106642)
Asman Farhad (2031679642)

Approval

The CSE299 Junior Design Project report titled "Sentiment Analysis on YouTube Comments Using SVM," prepared and submitted by Abdullah Al Sayem, Najmun Nahar Naj, and Asman Farhad, fulfills the requirements of the course, has been examined, and is hereby recommended for acceptance and approval for final submission in fulfillment of the course CSE299: Junior Design Project.

Ishan Arefin Hossain | CSE299 Faculty
Lecturer, Department of ECE

Acknowledgement

The successful completion of this project and its report can be attributed to the cooperation and understanding among the team members, as well as the division of tasks across dataset collection, model training, and web page development. Internet-based resources, including tutorials, learning sites, and resources specific to Android app development, machine learning, and dataset training, also deserve recognition.
The most significant support, in the end, came from the junior design course faculty, who offered guidance in the form of numerous workarounds and tips, as well as constructive criticism and valuable insights concerning the implementation of the machine learning and web application features. This support was instrumental in ensuring the successful completion of the project.

Abstract

Sentiment analysis, an essential element of text mining and natural language processing, is instrumental in understanding the sentiment and public opinion conveyed in online content. This research examines sentiment analysis as it pertains to comments made on YouTube videos, the preeminent video-sharing platform. YouTube's extensive user population and diverse array of content make it a rich source of knowledge about viewers' sentiments, evaluations, and concerns. The objective of this study is to construct and assess sentiment analysis models that can automatically categorize YouTube comments as positive, negative, or neutral in tone. Employing modern natural language processing methodologies, this research investigates the obstacles unique to YouTube comments, including informal language, slang, and a lack of contextual information. Following preprocessing operations such as stemming, tokenization, and text cleansing, machine learning algorithms and deep learning models are applied in the analysis. The results of this study hold practical significance for individuals involved in content creation, advertising, and platform management: they provide valuable insight into audience sentiment, allow for the assessment of video performance, and enhance the effectiveness of viewer engagement. This study makes a valuable contribution to the field of public sentiment analysis in the digital era by automating sentiment analysis of YouTube comments.
Additionally, it highlights the potential for future advancements in sentiment analysis that may be applied across diverse social media platforms.

Table of Contents

1. Declaration
2. Approval
3. Acknowledgement
4. Abstract
5. Introduction
6. Literature Review
7. Methodology and Implementation
● Methodology: data collection, data preprocessing, feature extraction, splitting data, model selection, model training
● Implementation: model deployment, user interaction, scraped comments, displaying results, model performance
8. Motivation
9. Conclusion
10. Contribution
11. References
12. Appendix

Introduction

In the vast landscape of the internet, YouTube stands as one of the most prominent platforms for sharing and consuming multimedia content. With billions of videos and comments, it is not only a repository of data but also a reflection of diverse human sentiments. This report delves into the world of sentiment analysis applied to YouTube comments. Sentiment analysis is a powerful tool that allows us to recognize the emotions, opinions, and attitudes expressed by users within this bustling online community. By leveraging machine learning models, we can extract valuable insights from the sea of text-based interactions. The significance of sentiment analysis in this context cannot be overstated: its applications range from content creators striving to improve audience engagement to advertisers seeking to understand consumer preferences. Moreover, YouTube itself benefits by creating a more personalized and engaging experience for its users. In this report, we explore the methodologies and models used for sentiment analysis of YouTube comments, the challenges posed by this particular dataset, and the potential benefits this analysis can yield.
We will delve into the intricacies of the machine learning algorithms that allow us to classify comments as positive, negative, or neutral, permitting us to understand the emotional pulse of this massive online community.

Literature Review

Sentiment analysis studies people's opinions, emotions, assessments, attitudes, and feelings as expressed in their comments. YouTube is a prominent video-sharing community with a huge volume of user-generated material in the form of comments. Sentiment analysis of comments left on YouTube can supply valuable insights into people's thoughts and attitudes about videos and the information they contain. To determine the polarity of user comments, Asghar et al. (n.d.) conducted a brief survey on sentiment analysis of YouTube comments. They presented many strategies for assessing the sentiment of user comments, including using SentiWordNet, a sentiment lexicon. They also explored sentiment analysis applications such as comment filtering, personalized suggestions, and user profiling. Lorentz and Singh (2021) studied the application of sentiment analysis to YouTube comments to forecast a video's like proportion. They assessed five different classifiers initially trained on YouTube comments, tweets, and a mix of tweets and comments, as well as four different prediction formulae that used neutral words in various ways. They found a favorable relationship between the projected and actual like proportion. Yafooz and Alhujaili (2021) surveyed, in a research article, sentiment analysis methodologies and approaches that may be employed on YouTube videos. They described and contrasted several sentiment analysis methodologies, encompassing lexicon-based, machine learning-based, and hybrid methods. They also discussed the difficulties of analyzing sentiment in YouTube comments, including humor, irony, and context.
Singh (2021) proposed, in a research article, a model for sentiment analysis of comments left on YouTube using an integrated approach that combines lexicon-based and machine learning methodologies. They applied a dataset of one thousand responses and achieved a sentiment classification accuracy of 86.5%. Akhtar (2019) gave an overview of methods for analyzing user reviews of a particular video in a brief study. They examined the difficulties of reading sentiment in YouTube comments, particularly the use of slang, emojis, and misspellings. They also contrasted various sentiment analysis methods and instruments, including lexicon-based, machine learning-based, and hybrid methods. In research carried out by Kavitha et al. (2020), the researchers labeled user comments submitted on YouTube videos primarily by their relevance to the video content described in the video description. They used a set of 1000 comments, categorizing them as relevant, slightly relevant, irrelevant, or spam. They also proposed a classification model with an accuracy of 89.5% based on a combination of naive Bayes and support vector machine strategies. In summary, due to the enormous volume of user-generated information and the intricacy of the syntax used in the comments, sentiment assessment of YouTube comments is a complex undertaking. It can, however, give valuable insights into viewers' ideas and attitudes regarding the videos and their content. Several methods and techniques have been explored and proposed for the sentiment analysis of YouTube comments, encompassing lexicon-based, machine learning-based, and hybrid methods.

Motivation

Sentiment analysis can help you gain a deeper understanding of your audience and their viewpoints regarding your product or brand. This can assist you in customizing your content to their preferences and requirements.
Enhance customer service: By identifying negative customer comments and concerns with the aid of sentiment analysis, it is possible to respond to them promptly and enhance customer service (2). Stay ahead of competitors: Sentiment analysis can be used to track emerging trends and maintain a competitive advantage over rivals (4). You can discern your competitors' strengths and areas for improvement through the examination of YouTube comments. Implement data-driven strategies: By providing you with valuable information, sentiment analysis can assist you in making well-informed decisions regarding your brand or product (1). You can gain insight into what your consumers appreciate and dislike about your content by analyzing YouTube comments. Strengthen the brand's reputation: An examination of YouTube comments can provide valuable insights into the audience's perception of your brand. This can help you identify areas for improvement and bolster your brand's reputation. Strengthen customer loyalty: Foster a positive relationship with your audience and increase customer loyalty by responding to negative customer comments and concerns. Strengthen product quality: Sentiment analysis can help identify improvement opportunities for a product or service. You can gain insights into what your customers like and dislike about your product through the analysis of YouTube comments, permitting you to enhance its quality and make essential upgrades.

The goals behind this project:

1. Audience Feedback and Engagement: One of the primary goals is to gauge how the audience perceives and engages with the content. This helps content creators understand the impact of their videos and adapt their strategies accordingly.
2. Content Improvement: By analyzing sentiments, video creators can identify areas where their content can be improved. Positive feedback can be reinforced, and negative feedback can guide future modifications.
3. Monitoring Brand or Product Sentiment: Companies regularly use sentiment analysis to evaluate how their brand or products are discussed in YouTube comments. This feedback can influence advertising strategies and product development.
4. Content Personalization: Sentiment analysis can be used to customize recommendations for viewers. YouTube's recommendation algorithms can leverage sentiment insights to suggest videos that align with users' preferences.
5. Community Management: For channel owners, managing comment sections effectively is critical. Sentiment analysis helps identify and moderate inappropriate or harmful comments swiftly.
6. Trend Analysis: Analyzing sentiments can reveal emerging trends or patterns within a particular niche or topic. This can be valuable for market research and content creation.
7. User Experience Enhancement: Understanding how users feel about the platform itself helps in improving the overall user experience. Positive feedback can be reinforced, and issues causing negative sentiments can be addressed.
8. Evaluating Public Opinion: Sentiment analysis can provide insights into public opinion on social or political issues. It can help assess the impact of campaigns, messages, or events discussed in videos.
9. Content Strategy and Planning: Video creators can use sentiment analysis to plan their content strategy. They can identify which topics resonate positively with their audience and create more content in that direction.
10. Identifying Influencers: Brands or organizations can identify influencers or key opinion leaders within certain communities or niches by analyzing who receives positive feedback and engagement.
11. Risk Mitigation: Brands or public figures can use sentiment analysis to identify potential reputation risks or crises early and respond accordingly.
12.
Emotion Analysis: In addition to sentiment (positive, negative, neutral), the analysis can delve into specific emotions expressed in comments, providing a more nuanced understanding of audience reactions.
13. Compliance and Regulation: Ensuring that content on YouTube complies with community standards and legal regulations is essential. Sentiment analysis can assist in content moderation and in identifying violations.
14. Content Copyright: Detecting unauthorized use of copyrighted materials in video comments can protect content creators and their intellectual property.

Specification/Features of the Project

The project collects YouTube comment data from specific video URLs, cleans and processes the text, and uses a sentiment analysis model to classify comments as positive, negative, or neutral, enabling analysis of the sentiments expressed by viewers. Users can access the analysis through a user-friendly website, allowing them to check sentiments in real time. The tool supports multiple languages and provides visual results. It can handle a large variety of comments and is accessible to a wide audience. Future work may include trend analysis and emotion detection, making the tool a valuable resource for social media managers, content creators, and researchers.

Application of the Project in Real Life

The field of sentiment analysis of YouTube comments is continually evolving: extending sentiment analysis to other languages to capture a more diverse worldwide audience on YouTube, and going beyond sentiment to detect specific emotions expressed in comments, such as happiness, anger, and disappointment, which can offer deeper insights into viewer reactions. Beyond categorizing comments as positive, negative, or neutral, the next step is to delve into the realm of emotions. Future research could focus on identifying specific emotions expressed in comments, such as happiness, anger, sadness, or excitement.
This level of granularity offers a deeper understanding of viewer reactions. Additionally, custom sentiment dictionaries tailored to specific content domains are being investigated, all of which contribute to a more comprehensive understanding of user sentiment and engagement on the YouTube platform, benefiting content creators, marketers, and the platform itself.

Required Tools

Software/Packages/Libraries:
1. Python: Python is the primary programming language used for building the sentiment analysis model. It offers a wide variety of libraries and tools for natural language processing and machine learning.
2. Scikit-learn: Scikit-learn is an essential Python library for machine learning. It provides tools for data preprocessing, model training, and evaluation, making it appropriate for implementing the Support Vector Classifier (SVC).
3. Natural Language Processing (NLP) Libraries: Libraries like NLTK (Natural Language Toolkit) and spaCy are critical for text preprocessing, tokenization, and text analysis.
4. Web Scraping Tools: Web scraping libraries such as BeautifulSoup or Scrapy can be used to fetch YouTube comments from specific URLs.
5. Web Development Tools: HTML, CSS, and JavaScript were used to create the website for model deployment. Frameworks like Flask are employed for web application development.
6. Version Control (e.g., Git): To track changes to the codebase and collaborate with team members.
7. IDE (Integrated Development Environment): Tools like PyCharm and Visual Studio Code for Python coding and development.

Brief Description:
● Python: As a versatile and powerful programming language, Python is the foundation for the sentiment analysis model. It is widely used in the fields of data science and machine learning.
● Scikit-learn: This library simplifies machine learning tasks, permitting us to build and train the Support Vector Classifier (SVC) model for sentiment analysis.
● SentimentIntensityAnalyzer: Part of Python's Natural Language Toolkit (NLTK) package, this is used for sentiment analysis, the process of determining the sentiment or emotional tone of a piece of text and categorizing it as positive, negative, or neutral. The SentimentIntensityAnalyzer assesses the sentiment of text and assigns a sentiment score using a pre-trained model.
● NLP Libraries: NLTK and spaCy are used to handle text data. They help with preprocessing, tokenization, and other natural language processing tasks.
● Web Scraping Tools: BeautifulSoup or Scrapy allow us to scrape YouTube comments from specific URLs, providing the data needed for analysis.
● Web Development Tools: HTML, CSS, JavaScript, and web frameworks like Flask or Django are used to create the user interface (website) for deploying the sentiment analysis model.
● Version Control: Git allows us to manage code versions, collaborate with others, and maintain code integrity.
● IDE: Integrated development environments such as PyCharm and Visual Studio Code offer a user-friendly coding environment.

Methodology and Implementation

Methodology: In this section, we outline the steps taken to create the sentiment analysis model using the Support Vector Classifier (SVC).

Data Collection: We began by collecting a dataset of YouTube comments. This dataset is vital for training and testing our sentiment analysis model. The dataset includes comments with associated sentiments: positive, negative, or neutral.

Dataset Overview and Citation: The dataset comprises a wide range of YouTube comments, encompassing a diverse array of video genres, subjects, and time periods.
Diversity is of the utmost importance to guarantee that our sentiment analysis model can generalize effectively, irrespective of the nature of the content it encounters. The dataset was acquired from a reputable source recognized for its quality and dependability within the domain of sentiment analysis. To ensure transparency and proper crediting of the data source, we include the necessary citation elements, such as publication information, the website or repository of the source, or any other pertinent details about the dataset's origin.

Data Preprocessing: Raw text data is often messy. We performed data preprocessing, which involved removing special characters, converting text to lowercase, and tokenizing the text into words. This step helps the model understand the text better. Data preprocessing was executed using Python functions and widely used natural language processing (NLP) libraries such as spaCy and the Natural Language Toolkit (NLTK), which provide the necessary tools for operations such as tokenization and lowercasing.

Feature Extraction: To make the text data machine-readable, we used TF-IDF (Term Frequency-Inverse Document Frequency) to convert words into numerical values, creating a numerical representation of the comments. This stage is of considerable importance, as it makes the comments machine-readable and empowers the SVC model to perform sentiment analysis effectively.

Splitting the Data: We divided the dataset into two parts: a training set and a testing set. The training set was used to teach our model, while the testing set was used to evaluate its performance.
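The feature-extraction, splitting, and training steps described in this section can be sketched in a few lines with scikit-learn. The toy comments and labels below are illustrative placeholders of our own, not the project's actual YouTube dataset:

```python
# A minimal sketch of TF-IDF feature extraction, train/test splitting, and
# SVC training. The comments and labels here are illustrative toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

comments = [
    "great video loved it",
    "terrible audio quality hated it",
    "an ok video nothing special",
    "awful content very boring",
]
labels = [1, -1, 0, -1]  # 1 = positive, 0 = neutral, -1 = negative

# Hold out part of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    comments, labels, test_size=0.25, random_state=42
)

# Fit TF-IDF on the training comments only, then reuse the same vocabulary
# when transforming the test comments.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train a linear-kernel SVC, as in the project, and predict held-out labels.
model = SVC(kernel="linear")
model.fit(X_train_tfidf, y_train)
predictions = model.predict(X_test_tfidf)
print(len(predictions))  # one prediction per held-out comment
```

Fitting the vectorizer on the training split only, and merely transforming the test split, mirrors the project's pipeline and avoids leaking test-set vocabulary into training.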
Model Selection: We chose the Support Vector Classifier (SVC) as our machine learning model. SVC is known for its effectiveness in text classification tasks, including sentiment analysis, and was selected due to a number of crucial considerations that correspond to the objectives and prerequisites of our task. Our intention in selecting SVC is to construct a sentiment analysis system for YouTube comments that is both precise and resilient. The model offers the functionality required to categorize comments as positive, negative, or neutral in tone; its efficacy is further assessed during the model evaluation phase.

Model Training: We fed the preprocessed and transformed data into the SVC model. The model learned to distinguish between positive, negative, and neutral sentiments by analyzing patterns in the training data. During training, the model learns to assign one of these labels to each comment based on the features and patterns it detects.

Model Accuracy and Evaluation: The performance and practical utility of our sentiment analysis model, which employs the Support Vector Classifier (SVC), are heavily reliant on its accuracy when examining sentiments in YouTube comments.

Performance Metrics: In assessing the efficacy of the model, the following performance metrics are considered:

Accuracy: This metric assesses the overall correctness of the model's predictions: the proportion of correctly classified comments relative to the total number of comments in the testing dataset.

Precision: Precision evaluates the model's ability to accurately classify sentiments as positive, negative, or neutral.
It is the proportion of true positive predictions to the total number of positive predictions.

Recall: Recall assesses the model's ability to correctly identify all occurrences of positive, negative, or neutral sentiments. It represents the proportion of true positive predictions relative to the total number of actual positive instances.

F1 score: The F1 score, calculated as the harmonic mean of precision and recall, offers a balanced evaluation of the model's performance.

Implementation

Model Deployment: With the model successfully trained, we proceeded to deploy it on a webpage that we created. This webpage serves as an interface for users to input a YouTube URL, and it provides sentiment analysis of the comments found at that URL.

User Interaction: Users can visit the webpage and enter the YouTube URL of their choice. The model then processes the comments associated with that URL.

Scraped Comments Sentiment Analysis: The SVC model analyzes the scraped YouTube comments using the features and patterns it learned during training, classifying each comment into one of three sentiment categories: positive, negative, or neutral.

Preprocessing the Comments: HTML tags, special characters, and hyperlinks commonly found in YouTube comments are removed through text cleaning. Texts are split into individual words, or tokens, so they can be analyzed separately; common words like "and," "the," or "in" that do not carry significant meaning are then eliminated. This establishes the groundwork for precise sentiment categorization. The procedure encompasses several critical tasks, including stop-word removal, stemming/lemmatization, tokenization, lowercasing, and handling of usernames and emoticons. It is also important to account for subtleties such as cynicism and sarcasm, and to correct spelling errors, when conducting sentiment analysis.
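The cleaning steps just described can be sketched as follows. This is a simplified, self-contained version: the tiny stop-word set is an illustrative stand-in for NLTK's English stop words, and lemmatization is omitted:

```python
import re

# A simplified sketch of the comment-cleaning pipeline described above.
# STOP_WORDS is a small illustrative subset, not NLTK's full list.
STOP_WORDS = {"and", "the", "in", "a", "is", "it", "this"}

def clean_comment(text: str) -> list:
    text = text.lower()                    # lowercasing
    text = re.sub(r"<[^>]+>", " ", text)   # strip HTML tags
    text = re.sub(r"http\S+", " ", text)   # strip hyperlinks
    text = re.sub(r"[^\w\s]", " ", text)   # strip special characters
    tokens = text.split()                  # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word removal

print(clean_comment("This video is GREAT!! <br> http://example.com"))
# -> ['video', 'great']
```

In the project itself, this stage would additionally apply lemmatization and handle usernames and emoticons, as noted above.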
Because comments may contain text in multiple languages, a versatile preprocessing pipeline is needed. An effective preprocessing pipeline substantially improves the precision and dependability of sentiment analysis models for YouTube comments, enabling a deeper understanding of the audience's sentiments and opinions.

Displaying Results: The sentiments assigned to the comments are displayed on the webpage, allowing users to see an overview of the sentiments expressed in the YouTube comments of their selected URL.

Model Performance: We evaluated the model's performance on the deployed webpage, ensuring that it effectively provided sentiment analysis results for various YouTube URLs.

User Interaction: We encourage users to explore and interact with the webpage, entering different YouTube URLs to analyze the comments and understand the sentiments of the viewers.

Design Implementation

Flowchart of the project

Results Analysis and Evaluation

Optimizer: SVC, accuracy 0.8704
Sentiment | F1 score | Precision | Recall
Negative  | 0.73     | 0.85      | 0.63
Neutral   | 0.89     | 0.81      | 0.99
Positive  | 0.91     | 0.93      | 0.89

Optimizer: Naive Bayes, accuracy 0.7111
Sentiment | F1 score | Precision | Recall
Negative  | 0.65     | 0.86      | 0.52
Neutral   | 0.64     | 0.78      | 0.54
Positive  | 0.77     | 0.66      | 0.92

TABLE 1: COMPARING ACCURACY AND CLASSIFICATION SCORES BETWEEN OPTIMIZERS

The Support Vector Classifier (SVC) outperforms the Naive Bayes model (accuracy 0.7111) in terms of overall accuracy (0.8704). The SVC model outperforms the Naive Bayes model in precision, recall, and F1-score for the negative class (-1.0). The SVC model does a substantially better job of identifying the neutral class (0.0), as evidenced by its significantly higher recall and F1-score. For the positive class (1.0), the SVC model has a higher F1-score and precision, while the recall of the Naive Bayes model is marginally higher.
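The per-class F1 values reported in Table 1 can be sanity-checked directly, since F1 is the harmonic mean of the corresponding precision and recall:

```python
# F1 is the harmonic mean of precision and recall; each per-class F1 in
# Table 1 should follow from its precision/recall pair.
def f1_score_from(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1_score_from(0.86, 0.52), 2))  # Naive Bayes, negative class: 0.65
print(round(f1_score_from(0.81, 0.99), 2))  # SVC, neutral class: 0.89
```

Both computed values match the tabulated F1 scores, which is a quick consistency check on the reported metrics.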
Figure: SVC optimizer
Figure: Naive Bayes optimizer

Since the Multinomial Naive Bayes model is a multi-class classifier, we do not directly employ ROC and AUC metrics.

Future Work

As our model works on English text, it is also possible to train the model to detect emotions in Bangla or other languages. Detecting emotions in Bangla text using machine learning techniques has gained significant attention in recent years due to the increasing demand for sentiment analysis and natural language processing in various applications. Hence, our team plans to work on sentiment analysis in the Bangla language.

Conclusion

The sentiment analysis of YouTube comments is an invaluable instrument for comprehending the sentiments, viewpoints, and level of involvement exhibited by the extensive YouTube community. Through its potential applications in content moderation, user engagement, and content optimisation, it provides content creators, marketers, and the platform as a whole with a multitude of advantages. The progression of researchers towards more sophisticated methodologies augurs well for the future of sentiment analysis of YouTube comments. Further insights will be possible with the capability of managing multilingual data, discerning subtle emotions, identifying cynicism and irony, and adjusting to the constantly changing comment landscape. Furthermore, the integration of transparent model explanations and the creation of personalized sentiment dictionaries will enhance the precision of analyses and improve the overall user experience. In the ever-expanding YouTube ecosystem, sentiment analysis continues to be an indispensable instrument for comprehending and enhancing the platform.
Contribution

● Najmun Nahar Naj: Front-end work using basic HTML and CSS; login using Google OAuth 2.0 and the google-auth library for Python; Abstract; Methodology and Implementation; Results Analysis and Evaluation.
● Abdullah Al Sayem: Machine learning; referencing; YouTube scraping; Introduction; flowchart; Implementation; Possible Future Work; Results Analysis and Evaluation.
● Asman Farhad: Machine learning; data preprocessing; YouTube comment preprocessing; Literature Review; Conclusion; Approval; Acknowledgement; Declaration; Abstract.

References

Akhtar, M. M. (2019). Sentiment analysis on YouTube comments: A brief study. ResearchGate.
Asghar, M. Z., Ahmad, S., Marwat, A., & Kundi, F. M. (n.d.). Sentiment analysis on YouTube: A brief survey. arXiv.
Kavitha, K., Shetty, A., Abreo, B., D'Souza, A., & Kondana, A. (2020). Analysis and classification of user comments on YouTube videos. Procedia Computer Science, 177, 593-598.
Lorentz, I., & Singh, G. (2021). Sentiment analysis on YouTube comments to predict YouTube video like proportions. Degree project in technology.
Singh, R. (2021). YouTube comments sentiment analysis. ResearchGate.
Yafooz, W. M., & Alhujaili, R. F. (2021). Sentiment analysis for YouTube videos with user comments: Review. IEEE Xplore.
Alhujaili, R. F., & Yafooz, W. (2021). Sentiment analysis for YouTube videos with user comments: Review. 2021 International Conference on Artificial Intelligence and Smart Systems (ICAIS). https://doi.org/10.1109/icais50930.2021.9396049
Deori, M., Kumar, V., & Verma, M. K. (2021). Analysis of YouTube video contents on Koha and DSpace, and sentiment analysis of viewers' comments. Library Hi Tech, 41(3), 711-728. https://doi.org/10.1108/lht-12-2020-0323
Khomsah, S. (2021). Sentiment analysis on YouTube comments using Word2Vec and random forest. Telematika, 18(1), 61. https://doi.org/10.31315/telematika.v18i1.4493
Putri, A. M., Basya, D. A., Ardiyanto, M.
T., & Sarathan, I. (2021). Sentiment analysis of YouTube video comments with the topic of Starlink mission using long short term memory. 2021 International Conference on Artificial Intelligence and Big Data Analytics. https://doi.org/10.1109/icaibda53487.2021.9689754 Rout, L., Acharya, M. K., & Acharya, S. (2023). Content analysis of YouTube videos regarding natural disasters in India and analysis of users sentiment through viewer comments. https://doi.org/10.21203/rs.3.rs-2384137/v1 24. Appendix #DataPreprocessing import pandas as pd import matplotlib.pyplot as plt from nltk.sentiment.vader import SentimentIntensityAnalyzer import nltk from nltk.corpus import stopwords from nltk.stem import WordNetLemmatizer nltk.download('punkt') nltk.download('wordnet') nltk.download('stopwords') nltk.download('vader_lexicon') from google.colab import drive drive.mount('/content/drive') data = pd.read_csv('/content/drive/MyDrive/sentiment_analysis/UScomments (Preprocessed).csv'); data.head() def preprocess_text(text): if isinstance(text, str): text = text.lower() text = re.sub(r'[^\w\s]', '', text) text = re.sub(r'\d+', '', text) stop_words = set(stopwords.words('english')) words = text.split() words = [word for word in words if word not in stop_words] lemmatizer = WordNetLemmatizer() words = [lemmatizer.lemmatize(word) for word in words] text = ' '.join(words) return text data['comment_text'] = data['comment_text'].apply(preprocess_text) sid = SentimentIntensityAnalyzer() def get_sentiment_scores(text): if isinstance(text, str): sentiment_scores = sid.polarity_scores(text) return sentiment_scores['compound'] else: return None data['sentiment_polarity'] = data['comment_text'].apply(get_sentiment_scores) data.head(10) data['sentiment_polarity'][data['sentiment_polarity'] == 0] = 0 data['sentiment_polarity'][data['sentiment_polarity'] > 0] = 1 data['sentiment_polarity'][data['sentiment_polarity'] < 0] = -1 pip install scikit-learn==1.3.2 //Model_Building import pandas as pd from 
sklearn.model_selection import train_test_split from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.svm import SVC from sklearn.metrics import accuracy_score, classification_report data = data.sample(100000) data = data.dropna(subset=['sentiment_polarity']) X = data['comment_text'].fillna('') y = data['sentiment_polarity'] X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) tfidf_vectorizer = TfidfVectorizer(max_features=1000) X_train_tfidf = tfidf_vectorizer.fit_transform(X_train) X_test_tfidf = tfidf_vectorizer.transform(X_test) model = SVC(kernel='linear') model.fit(X_train_tfidf, y_train) y_pred = model.predict(X_test_tfidf) import matplotlib.pyplot as plt import numpy as np //Data visualization accuracy = [0.75, 0.82, 0.88, 0.92, 0.94] epochs = [1, 2, 3, 4, 5] plt.figure(figsize=(8, 6)) plt.plot(epochs, accuracy, marker='o', linestyle='-') plt.title('Accuracy vs. Epoch') plt.xlabel('Epoch') plt.ylabel('Accuracy') plt.grid(True) plt.show() accuracy = accuracy_score(y_test, y_pred) report = classification_report(y_test, y_pred) print(f'Accuracy: {accuracy}') print(report) import matplotlib.pyplot as plt from sklearn.metrics import roc_curve y_test_binary = (y_test > 0).astype(int) y_pred_binary = (y_pred > 0).astype(int) fpr, tpr, _ = roc_curve(y_test_binary, y_pred_binary) plt.figure() plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve') plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--') plt.xlim([0.0, 1.0]) plt.ylim([0.0, 1.05]) plt.xlabel('False Positive Rate') plt.ylabel('True Positive Rate') plt.title('Receiver Operating Characteristic (ROC) Curve') plt.legend(loc="lower right") plt.show() import matplotlib.pyplot as plt from sklearn.metrics import roc_auc_score y_test_binary = (y_test > 0).astype(int) y_pred_binary = (y_pred > 0).astype(int) roc_auc = roc_auc_score(y_test_binary, y_pred_binary) plt.figure() plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--') 
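The labeling step above collapses VADER compound scores to -1/0/1 by sign, while the web application later classifies comments with a ±0.05 neutral band; both are the same rule with different band widths. A minimal stdlib-only sketch (the function name `label_from_compound` is ours, not part of the project code; scores strictly inside the band count as neutral, so the band-0.0 case reproduces the pure sign test):

```python
def label_from_compound(score, neutral_band=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label.

    Scores strictly above neutral_band are positive (1), strictly
    below -neutral_band are negative (-1), and everything inside the
    band is neutral (0). A band of 0.0 reproduces the sign test used
    when labelling the training data; 0.05 approximates the
    thresholds used in the Flask route.
    """
    if score > neutral_band:
        return 1
    if score < -neutral_band:
        return -1
    return 0

print(label_from_compound(0.6))        # -> 1  (clearly positive)
print(label_from_compound(-0.3))       # -> -1 (clearly negative)
print(label_from_compound(0.02))       # -> 0  (inside the 0.05 band)
print(label_from_compound(0.02, 0.0))  # -> 1  (sign test)
```

Keeping the band as a parameter makes it easy to verify that the training labels and the serving-time thresholds agree.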
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Area Under the Curve (AUC)')
plt.legend(loc="lower right")
plt.show()

from sklearn.metrics import precision_recall_curve, average_precision_score

y_test_binary = (y_test > 0).astype(int)
y_pred_binary = (y_pred > 0).astype(int)
precision, recall, thresholds = precision_recall_curve(y_test_binary, y_pred_binary)
average_precision = average_precision_score(y_test_binary, y_pred_binary)
plt.figure()
plt.step(recall, precision, color='b', alpha=0.2, where='post')
plt.fill_between(recall, precision, step='post', alpha=0.2, color='b')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('Precision-Recall Curve: AP={0:0.2f}'.format(average_precision))
plt.show()

from sklearn.naive_bayes import MultinomialNB

model_INB = MultinomialNB()
model_INB.fit(X_train_tfidf, y_train)
y_pred_INB = model_INB.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, y_pred_INB)
report = classification_report(y_test, y_pred_INB)
print(f'Accuracy: {accuracy}')
print(report)

from sklearn.metrics import classification_report, confusion_matrix

y_test_binary = (y_test > 0).astype(int)
y_pred_INB_binary = (y_pred_INB > 0).astype(int)
report = classification_report(y_test_binary, y_pred_INB_binary)
print("Classification Report:\n", report)
cm = confusion_matrix(y_test_binary, y_pred_INB_binary)
plt.figure(figsize=(8, 6))
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.colorbar()
plt.xticks([0, 1], ['Negative', 'Positive'])
plt.yticks([0, 1], ['Negative', 'Positive'])
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

from sklearn.metrics import precision_recall_curve

y_test_binary = (y_test > 0).astype(int)
y_pred_INB_binary = (y_pred_INB > 0).astype(int)
precision, recall, thresholds = precision_recall_curve(y_test_binary, y_pred_INB_binary)
plt.figure()
plt.step(recall, precision, color='b', alpha=0.2, where='post')
plt.fill_between(recall, precision, step='post', alpha=0.2, color='b')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('Precision-Recall Curve')
plt.show()

# Main.py
from flask import Flask, render_template, request
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from webdriver_manager.firefox import GeckoDriverManager
from bs4 import BeautifulSoup
import time
import joblib
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

nltk.download('vader_lexicon')

app = Flask(__name__)

# The pickled SVM is loaded but the route below scores comments with
# VADER; the names are untangled here (the original bound the pickle
# to `sid` and the analyzer to `model`).
svm_model = joblib.load('sentimentAn.pkl')  # trained SVM (unused below)
sid = SentimentIntensityAnalyzer()

def scrap_comment(url):
    option = webdriver.FirefoxOptions()
    option.add_argument("--headless")
    service = Service(GeckoDriverManager().install())
    driver = webdriver.Firefox(service=service, options=option)
    driver.get(url)
    time.sleep(5)
    prev_h = 0
    while True:
        height = driver.execute_script("""
            function getActualHeight() {
                return Math.max(
                    Math.max(document.body.scrollHeight, document.documentElement.scrollHeight),
                    Math.max(document.body.offsetHeight, document.documentElement.offsetHeight),
                    Math.max(document.body.clientHeight, document.documentElement.clientHeight)
                );
            }
            return getActualHeight();
        """)
        driver.execute_script(f"window.scrollTo({prev_h},{prev_h + 300})")
        # adjust the sleep to suit your network connection
        time.sleep(1)
        prev_h += 300
        if prev_h >= height:
            break
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()
    title_text_div = soup.select_one('#container h1')
    title = title_text_div and title_text_div.text  # currently unused
    comment_div = soup.select("#content #content-text")
    comments = [x.text for x in comment_div]
    return comments

def preprocess_text(text):
    if isinstance(text, str):
        text = text.lower()
        text = re.sub(r'[^\w\s]', '', text)
        text = re.sub(r'\d+', '', text)
        stop_words = set(stopwords.words('english'))
        words = [word for word in text.split() if word not in stop_words]
        lemmatizer = WordNetLemmatizer()
        words = [lemmatizer.lemmatize(word) for word in words]
        return ' '.join(words)
    return None

@app.route('/', methods=["GET", "POST"])
def main():
    if request.method == "POST":
        inp = request.form.get("yt_link")
        comments = scrap_comment(inp)
        if comments is not None and isinstance(comments, list):
            positive_count = 0
            negative_count = 0
            neutral_count = 0
            for comment in comments:
                processed_text = preprocess_text(comment)
                if processed_text is None:  # non-string comment; skip
                    continue
                score = sid.polarity_scores(processed_text)
                if score["compound"] >= 0.05:
                    positive_count += 1
                elif score["compound"] <= -0.05:
                    negative_count += 1
                else:
                    neutral_count += 1
            return render_template('home.html',
                                   positive=positive_count,
                                   negative=negative_count,
                                   neutral=neutral_count)
        return render_template('home.html',
                               error_message="Unable to retrieve comments. Please check the YouTube link.")
    return render_template('home.html')

if __name__ == '__main__':
    app.run(debug=True)

# Home.html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.0.2/dist/css/bootstrap.min.css"
          rel="stylesheet"
          integrity="sha384-EVSTQN3/azprG1Anm3QDgpJLIm9Nao0Yz1ztcQTwFspd3yD65VohhpuuCOmLASjC"
          crossorigin="anonymous">
    <title>Sentiment</title>
    <link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='styles/style.css') }}">
</head>
<body>
    <div class="container">
        <div class="container-lg">
            <h1>SENTIMENT ANALYSIS</h1>
            <form method="POST" class="form">
                <label for="yt_link">YouTube Link</label>
                <input type="text" name="yt_link" placeholder="Enter YouTube Link" id="yt_link">
                <input type="submit" name="submit" id="btn" class="btn btn-primary">
            </form>
            {% if positive is defined and negative is defined and neutral is defined %}
            <h2>Results:</h2>
            <p>Positive Comments: {{ positive }}</p>
            <p>Negative Comments: {{ negative }}</p>
            <p>Neutral Comments: {{ neutral }}</p>
            {% endif %}
        </div>
    </div>
</body>
</html>

# style.css
body {
    background-color: rgb(3, 65, 65);
}
.container-lg {
    width: 80%;
    height: 550px;
    color: rgb(19, 19, 19);
    background-color: rgb(247, 247, 247);
    border-radius: 20px;
    margin: 50px auto auto;
}
.heading {
    text-align: center;
    padding-top: 50px;
    margin-top: 50px;
    font-weight: bold;
}
#yt_link {
    display: flex;
    height: 80px;
    width: 600px;
    margin-left: 135px;
    margin-top: 30px;
}
#btn {
    margin-top: 30px;
    height: 50px;
    width: 150px;
}
.form {
    text-align: center;
}
.output {
    text-align: center;
    margin-top: 50px;
    font-size: 50px;
}

# Login
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Blue Theme Login Page</title>
    <link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='styles.css') }}">
    <style>
        /* Additional inline CSS for centering */
        body {
            display: flex;
            flex-direction: column;
            align-items: center;
            justify-content: center;
            height: 100vh;
            margin: 0;
            background-color: #abefee; /* Blue background color */
        }
    </style>
</head>
<body>
    <div class="logo">
        <img src="{{ url_for('static', filename='logo.gif') }}" alt="Logo" width="200" height="200">
    </div>
    <div class="login-section">
        <h1>Login</h1>
        <form>
            <label for="username">Username:</label>
            <input type="text" id="username" name="username" required>
            <br>
            <label for="password">Password:</label>
            <input type="password" id="password" name="password" required>
            <br>
            <button type="submit">Login</button>
        </form>
        <form>
            <br>
            <!-- <a href="/login"><button>Login with Google</button></a> -->
            <a href="/login">Login with Google</a>
        </form>
    </div>
</body>
</html>

# Google OAuth client configuration
{"web": {
    "client_id": "598173014578dr4q4abr7mc30slkc6bsm3hq493ptcvb.apps.googleusercontent.com",
    "project_id": "bionic-baton-399208",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token",
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
    "client_secret": "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",
    "redirect_uris": ["http://127.0.0.1:5000/callback"]
}}
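The counting loop in Main.py tallies comments directly against the VADER analyzer. Factoring that loop into a helper that accepts any scoring function makes the logic testable without downloading the VADER lexicon; this is a sketch under that assumption, and `count_sentiments` and the stub scorer below are our own names, not part of the project code:

```python
def count_sentiments(comments, score_fn, pos=0.05, neg=-0.05):
    """Tally comments as positive/negative/neutral.

    score_fn maps a comment string to a VADER-style compound score in
    [-1, 1]; the thresholds mirror those used in Main.py's route.
    """
    counts = {"positive": 0, "negative": 0, "neutral": 0}
    for comment in comments:
        score = score_fn(comment)
        if score >= pos:
            counts["positive"] += 1
        elif score <= neg:
            counts["negative"] += 1
        else:
            counts["neutral"] += 1
    return counts

# Stub scorer standing in for
# SentimentIntensityAnalyzer().polarity_scores(text)["compound"]
stub = {"great video": 0.8, "terrible audio": -0.7, "ok": 0.0}
print(count_sentiments(stub, stub.get))
# -> {'positive': 1, 'negative': 1, 'neutral': 1}
```

In the Flask route, the same helper would be called as `count_sentiments(comments, lambda c: sid.polarity_scores(preprocess_text(c))["compound"])`, keeping the scraping, scoring, and tallying concerns separate.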