Department of Computer Science Lok Nayak Jai Prakash Institute of Technology Chapra-84130 Internship Report On “Offensive Multimodal Post Detection Using Machine Learning” Done at Indian Institute of Technology Patna Submitted in partial fulfillment of the requirements for the award of the B.Tech degree in Computer Science & Engineering. By Rohit Gupta Registration no. – 19105117030 DECLARATION I, Rohit Gupta, Registration No. 19105117030, a student of B.Tech CSE 2019-2023, Lok Nayak Jai Prakash Institute of Technology, Chapra, hereby declare that this report is a bona fide record of the study carried out under the supervision of Dr. Krishanu Maity of IIT Patna from 12th December 2022 to 12th February 2023 on the topic of “Offensive Multimodal Post Detection Using Machine Learning”. This work has not been undertaken or submitted elsewhere in connection with any other academic course. Place: Chapra Date: Rohit Gupta Registration no – 19105117030 Computer Science Engineering Acknowledgement I take this opportunity to express my sincere gratitude to each and every person who helped me in the completion of this work. I am very glad to express my gratitude to Dr. Krishanu Maity for his timely help all through my study and for his valuable suggestions, advice, and encouragement throughout the internship. I owe my gratitude to Dr. Chanchal Suman for her constant support and encouragement during the internship. I am also grateful to my parents and friends for their help throughout the process. About The Internship During my internship at IIT Patna from December 12th to February 12th, I worked under the guidance of Dr. Krishanu Maity on the topic of "Offensive Multimodal Post Detection Using Machine Learning." The objective of the internship was to develop an automated system capable of detecting offensive content in various forms, such as text, images, and videos, using machine learning techniques.
Throughout the internship, I was exposed to a wide range of tasks and responsibilities related to the project. I began by familiarising myself with the existing literature and research in the field of offensive content detection. This involved studying relevant papers, articles, and existing frameworks to gain a comprehensive understanding of the subject matter. After the initial research phase, I collaborated closely with Dr. Krishanu Maity and other team members to design and implement a multimodal deep learning model for offensive content detection. This involved collecting and preprocessing a large dataset of offensive and non-offensive content, which was crucial for training and evaluating the model's performance. I worked extensively with various machine learning techniques, such as natural language processing (NLP), computer vision, and deep learning algorithms. I employed tools like moviepy, pytube, and Scikit-Learn to develop and fine-tune the model architecture. This included creating text embedding models and convolutional neural networks (CNNs) for image analysis. To ensure the model's accuracy, I conducted a performance analysis and compared the results with existing state-of-the-art methods to validate the effectiveness of our proposed approach. In addition to the technical aspects, I actively participated in team meetings, presenting my progress, discussing challenges, and brainstorming solutions with the guidance of Dr. Krishanu Maity and other researchers. This collaborative environment allowed me to enhance my communication and teamwork skills while also gaining insights into the broader research community. INDEX 1. Introduction o Background and motivation o Objectives and scope o Methodology overview 2. Literature Review o Overview of offensive content detection o Previous research and approaches o Existing multimodal detection frameworks 3. 
Data Collection and Preprocessing o Data collection process o Data preprocessing techniques o Dataset description 4. Model Design and Implementation o Decision Tree o Naive Bayes classifiers o Support Vector Machine o Multi-Layer Perceptron (MLP) o Model Optimisation 5. Challenges and Solutions o Challenges encountered during the internship o Strategies and solutions employed 6. Discussion and Conclusion o Summary of the internship experience 1. INTRODUCTION 1.1 Background and Motivation The field of machine learning has experienced significant advancements in recent years, revolutionising various industries and domains. With its ability to analyse vast amounts of data and extract valuable insights, machine learning has become a powerful tool for solving complex problems. Recognising the potential of machine learning, I undertook an internship to gain practical experience in this field and explore its applications in offensive content detection. The proliferation of social media platforms and online communities has led to an increase in offensive and harmful content, posing challenges for content moderation. Offensive content, such as hate speech, cyberbullying, and harassment, can have detrimental effects on individuals and communities. Therefore, there is a growing need for automated systems capable of detecting and mitigating such content. 1.2 Objectives and Scope The primary objective of my internship was to develop an offensive content detection system using machine learning techniques. The goal was to create an automated solution capable of analysing various forms of content, including text, images, and videos, to accurately identify and flag offensive posts. By achieving this objective, the internship aimed to contribute to the broader field of content moderation and facilitate the creation of safer online environments. 
The scope of the internship involved conducting research on existing approaches and algorithms for offensive content detection. It also included designing and implementing a machine learning model that leverages multimodal data to improve the accuracy of detection. The internship further encompassed collecting and preprocessing a suitable dataset, training and evaluating the model, and analysing its performance against existing state-of-the-art methods. 1.3 Methodology Overview The methodology adopted for the internship involved a systematic and iterative approach to achieving the project objectives. It started with an extensive literature review to gain a comprehensive understanding of offensive content detection and the relevant machine learning techniques. 2. LITERATURE REVIEW 2.1 Overview of Offensive Content Detection Offensive content detection is a critical task in the field of content moderation, aimed at identifying and mitigating harmful or offensive posts across various online platforms. This section provides an overview of offensive content detection techniques and their significance in creating safer online environments. Offensive content can take different forms, including hate speech, cyberbullying, harassment, and explicit or violent imagery. Traditional approaches to content moderation relied on manual review and user reporting, which proved to be time-consuming and inefficient. The advent of machine learning techniques has enabled the development of automated systems capable of detecting offensive content at scale. 2.2 Previous Research and Approaches Numerous research studies have been conducted on offensive content detection, focusing on both unimodal (text, image, or video-based) and multimodal approaches. This subsection explores the key findings and methodologies employed in previous research efforts. In text-based offensive content detection, approaches often involve natural language processing (NLP) techniques. 
Various algorithms, such as recurrent neural networks (RNNs), long short-term memory (LSTM), and transformers like BERT (Bidirectional Encoder Representations from Transformers), have been used for sentiment analysis, profanity detection, and context-based classification of offensive text. Image-based offensive content detection primarily relies on computer vision techniques. Convolutional neural networks (CNNs) have been widely used for feature extraction and classification of offensive or explicit imagery. Transfer learning approaches, utilising pre-trained models like VGGNet, ResNet, or InceptionNet, have shown promising results in image-based offensive content detection. Video-based offensive content detection poses additional challenges due to the temporal nature of the data. Approaches in this domain often involve frame-level analysis and sequential modelling using recurrent neural networks (RNNs) or 3D convolutional neural networks (3D CNNs). Spatio-temporal feature extraction and fusion techniques have also been explored to capture both visual and temporal cues in video content. 2.3 Existing Multimodal Detection Frameworks Multimodal offensive content detection leverages information from multiple modalities, such as text, images, and videos, to improve the accuracy and robustness of the detection system. This subsection discusses existing multimodal detection frameworks and their contributions. Multimodal approaches typically combine unimodal models, such as text classifiers, image classifiers, and video classifiers, into a unified framework. These models are designed to capture complementary features from different modalities and make joint predictions. Fusion techniques, such as late fusion (combining predictions at the decision level) or early fusion (combining features at the input level), are commonly used to integrate information from multiple modalities. 
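As a concrete illustration of decision-level (late) fusion, the Python sketch below averages the offensive-class probabilities produced by independent unimodal classifiers; the equal weights, the 0.5 threshold, and the example probabilities are illustrative assumptions, not values taken from the project.

```python
# Late-fusion sketch: combine offensive-class probabilities from
# independent text, image, and video classifiers at the decision level.
# Weights and the 0.5 decision threshold are illustrative assumptions.
def late_fusion(prob_text, prob_image, prob_video, weights=(1/3, 1/3, 1/3)):
    fused = sum(w * p for w, p in zip(weights, (prob_text, prob_image, prob_video)))
    return "offensive" if fused >= 0.5 else "non-offensive"

# The text and image models flag the post strongly; the fused score is 0.6.
label = late_fusion(0.9, 0.6, 0.3)  # -> "offensive"
```

Early fusion, by contrast, would concatenate the feature vectors of the three modalities before a single classifier sees them.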
Additionally, multimodal frameworks often explore the use of attention mechanisms to dynamically weigh the importance of different modalities or specific regions within modalities. Attention mechanisms enable the model to focus on relevant information and improve the overall performance of the offensive content detection system. Several studies have demonstrated the effectiveness of multimodal detection frameworks, showcasing improved performance compared to unimodal approaches. These frameworks have the potential to provide a comprehensive and nuanced understanding of offensive content, leveraging the strengths of different modalities. By reviewing the literature on offensive content detection, previous research efforts, and existing multimodal detection frameworks, valuable insights can be gained to inform the design and implementation of the offensive multimodal post detection system in the internship. These findings contribute to the development of an effective and state-of-the-art solution in the field of content moderation. 3. Data Collection and Pre-processing 3.1 Data Collection Process The success of offensive multimodal post detection relies heavily on the availability of a diverse and representative dataset. This section discusses the data collection process undertaken for the internship project. The data collection process involved several steps. Firstly, a thorough review of existing datasets related to offensive content detection was conducted. Various publicly available datasets, such as Hate Speech and Offensive Language (HSOL), Multimodal Offensive Content Dataset (MOC), and YouTube dataset, were considered. Additionally, specific platforms or social media websites that are known for offensive content were explored to collect data. After identifying potential datasets, the necessary permissions and ethical considerations were addressed to ensure compliance with data usage policies and user privacy. 
In some cases, collaboration with platform administrators or obtaining consent from users was necessary. Next, a systematic approach was employed to gather offensive and non-offensive posts across multiple modalities. This involved leveraging specific keywords, hashtags, or user-defined labels to retrieve relevant content. Additionally, the posts were collected from diverse sources to capture variations in content types, contexts, and user demographics. It's important to note that during the data collection process, careful consideration was given to ethical guidelines and the potential impact of handling offensive content. Measures were taken to maintain the privacy and anonymity of individuals while preserving the integrity of the data. 3.2 Data Preprocessing Techniques Once the data was collected, it underwent preprocessing to ensure data quality, consistency, and compatibility for subsequent analysis and model training. This subsection highlights the data preprocessing techniques employed. For text data, common preprocessing steps included removing special characters, punctuation, and URLs. Text normalisation techniques, such as lowercasing, stemming, or lemmatization, were applied to reduce vocabulary size and improve generalisation. Stop words and irrelevant terms specific to the dataset were also eliminated to focus on meaningful content. In the case of image data, pre-processing involved resizing images to a uniform size to ensure compatibility across the dataset. Techniques like centre cropping or resizing with aspect ratio preservation were employed to maintain the integrity of the visual content. Additionally, normalisation techniques, such as scaling pixel values to a specific range or using a precomputed mean and standard deviation, were applied to facilitate effective model training. Video data underwent pre-processing steps to extract relevant frames or keyframes from the videos. 
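The text-cleaning steps described above can be sketched in a few lines of Python; the stop-word list here is an illustrative subset, not the one used in the project.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "to"}  # illustrative subset only

def preprocess_text(text):
    """Lowercase, strip URLs and special characters, and drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)       # remove punctuation and specials
    return " ".join(t for t in text.split() if t not in STOP_WORDS)

preprocess_text("Check THIS out: https://example.com !!!")  # -> "check this out"
```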
Keyframes are representative frames that capture the important visual information in a video. The keyframes were extracted using techniques such as uniform sampling or motion analysis. Similar to image data, resizing and normalisation techniques were applied to ensure consistency and compatibility. 3.3 Dataset Description The dataset used for offensive multimodal post detection comprised offensive and non-offensive posts across multiple modalities, including text, images, and videos. The dataset was carefully curated and labelled to include a wide range of offensive content types, such as hate speech, cyberbullying, or explicit imagery. The dataset consisted of a significant number of posts from diverse sources, including social media platforms, online forums, and multimedia sharing websites. The dataset aimed to capture variations in offensive content based on language, cultural context, and geographical locations. To maintain data integrity and ensure accurate labelling, a rigorous annotation process was undertaken. Human annotators with expertise in content moderation were involved in labelling the posts as offensive or non-offensive. In cases where the classification of offensive content required further granularity, additional labels or severity levels were assigned. The dataset was divided into appropriate subsets for training, validation, and testing, adhering to standard practices to evaluate the model's performance accurately. Stratified sampling or random sampling techniques were employed to ensure a balanced representation of offensive and non-offensive posts across the subsets. 4. Data Modeling Data modeling in machine learning refers to the process of creating a mathematical representation or model that captures the underlying patterns, relationships, and characteristics of a dataset. This model is then used to make predictions, classify new data points, or gain insights from the data. 
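The stratified partitioning mentioned above can be sketched with Scikit-Learn (already used in the project); the toy posts and labels below are illustrative, not the actual dataset.

```python
from sklearn.model_selection import train_test_split

# Toy data: 1 = offensive, 0 = non-offensive (illustrative labels only).
posts = [f"post_{i}" for i in range(10)]
labels = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]

# stratify=labels preserves the offensive/non-offensive ratio in both subsets.
train_x, test_x, train_y, test_y = train_test_split(
    posts, labels, test_size=0.2, stratify=labels, random_state=42
)
```

The same call can be applied twice to carve out a validation subset from the training portion.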
4.1 Decision Tree Modeling A decision tree is one of the most powerful supervised learning algorithms, used for both classification and regression tasks. It builds a flowchart-like tree structure where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label. It is constructed by recursively splitting the training data into subsets based on the values of the attributes until a stopping criterion is met, such as the maximum depth of the tree or the minimum number of samples required to split a node. During training, the Decision Tree algorithm selects the best attribute to split the data based on a metric such as entropy or Gini impurity, which measures the level of impurity or randomness in the subsets. The goal is to find the attribute that maximizes the information gain or the reduction in impurity after the split. A tree can be “learned” by splitting the source set into subsets based on Attribute Selection Measures. An attribute selection measure (ASM) is a criterion used in decision tree algorithms to evaluate the usefulness of different attributes for splitting a dataset. The goal of ASM is to identify the attribute that will create the most homogeneous subsets of data after the split, thereby maximizing the information gain. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion terminates when all samples in the subset at a node have the same value of the target variable, or when splitting no longer adds value to the predictions. The construction of a decision tree classifier does not require any domain knowledge or parameter setting and is therefore appropriate for exploratory knowledge discovery. Decision trees can handle high-dimensional data. 
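A minimal Scikit-Learn sketch of the ideas above, trained on hypothetical two-feature posts; the feature names and data points are assumptions for illustration only.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features per post: [profanity_count, user_report_count].
X = [[0, 0], [1, 0], [3, 2], [5, 4], [0, 1], [4, 3]]
y = [0, 0, 1, 1, 0, 1]   # 1 = offensive, 0 = non-offensive

# criterion="gini" uses Gini impurity; "entropy" would select splits by
# information gain, as discussed above. max_depth is a stopping criterion.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X, y)
clf.predict([[5, 5]])  # a post with heavy profanity and many reports
```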
Figure: accuracy and precision of the Decision Tree model. 4.2 Naive Bayes Classifiers Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. Bayes’ Theorem finds the probability of an event occurring given the probability of another event that has already occurred. Bayes’ theorem is stated mathematically as the following equation: P(A|B) = P(B|A) · P(A) / P(B), where A and B are events and P(B) ≠ 0. Basically, we are trying to find the probability of event A, given that event B is true. Event B is also termed the evidence. P(A) is the prior probability of A, i.e. the probability of the event before the evidence is seen. The evidence is an attribute value of an unknown instance (here, it is event B). P(A|B) is the posterior probability of A, i.e. the probability of the event after the evidence is seen. Figure: accuracy and precision of the Naïve Bayes classification model. 4.3 Support Vector Machine Support Vector Machine, or SVM, is one of the most popular supervised learning algorithms, used for both classification and regression problems. However, it is primarily used for classification problems in machine learning. The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put a new data point in the correct category in the future. This best decision boundary is called a hyperplane. SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Figure: accuracy and precision of the Support Vector Machine model. 4.4 Neural Network Model - Multi-Layer Perceptron (MLP) A multi-layer perceptron has one input layer with one neuron (or node) for each input, one output layer with a single node for each output, and any number of hidden layers, each of which can have any number of nodes. 
A schematic diagram of a Multi-Layer Perceptron (MLP) is depicted below. In the multi-layer perceptron diagram, we can see that there are three inputs and thus three input nodes, and the hidden layer has three nodes. The output layer gives two outputs, therefore there are two output nodes. The nodes in the input layer take the input and forward it for further processing: each node in the input layer forwards its output to each of the three nodes in the hidden layer, and in the same way the hidden layer processes the information and passes it to the output layer. Figure: accuracy and precision of the Neural Network (Multi-Layer Perceptron) model. 4.5 Model Optimisation 5. Challenges and Solutions 5.1 Challenges Encountered During the Internship During the internship focused on offensive multimodal post detection using machine learning, several challenges arose. These included: 1. Data Collection: Gathering a diverse and representative dataset of offensive and non-offensive posts across multiple modalities can be a challenging task. Finding suitable sources, obtaining permissions, and ensuring data privacy can pose difficulties. 2. Labelling and Annotation: Accurately labelling offensive and non-offensive posts in the dataset requires human annotators with expertise in content moderation. Maintaining consistency and dealing with subjective content can be challenging. 3. Modality Integration: Integrating multiple modalities, such as text, images, and videos, to create an effective multimodal detection system can be complex. Combining different types of data and capturing their interactions in a unified model requires careful design. 4. Model Training and Optimisation: Training deep learning models on multimodal data requires significant computational resources and careful hyperparameter tuning. Managing training times, avoiding overfitting, and achieving a balance between modalities can be challenging. 
5.2 Strategies and Solutions Employed To address these challenges, various strategies and solutions can be employed: 1. Collaborative Data Collection: Collaborating with platform administrators, researchers, or data providers can help overcome data collection challenges. Establishing partnerships and leveraging existing datasets or data-sharing initiatives can ensure a more diverse and comprehensive dataset. 2. Annotator Training and Guidelines: Providing detailed guidelines and conducting training sessions for annotators can help improve consistency and accuracy in labelling offensive content. Regular communication and feedback with annotators can address challenges and ensure high-quality annotations. 3. Transfer Learning and Pretrained Models: Utilising pretrained models, especially in computer vision, can alleviate the challenges of training deep learning models from scratch. Fine-tuning these models on the offensive content dataset can help achieve better performance with limited computational resources. 4. Regular Evaluation and Iteration: Continuously evaluating the performance of the offensive multimodal post-detection system is crucial. Identifying weaknesses, analysing misclassifications, and fine-tuning the model based on feedback and evaluation results can lead to continuous improvement. 6. Discussion and Conclusion 6.1 Summary of the Internship Experience The internship focused on developing an offensive multimodal post-detection system using machine learning techniques. Throughout the internship, several key aspects were addressed, including data collection, pre-processing, model design, and implementation. The challenges faced during the internship were also tackled through various strategies and solutions. The internship began with a thorough literature review, exploring existing research on offensive content detection and multimodal frameworks. 
This provided valuable insights into the state of the field and informed the design of the offensive multimodal post-detection system. Data collection was a critical step in the internship, involving the identification and collection of diverse offensive and non-offensive posts across multiple modalities. Ethical considerations and data usage policies were adhered to throughout the data collection process. Pre-processing techniques were employed to ensure data quality and consistency across modalities. Text data underwent normalisation and feature extraction, while images and videos were resized and normalised for effective model training. The architecture design for multimodal detection involved the integration of separate branches for each modality, fusion techniques, and attention mechanisms to capture the interactions between modalities and enhance the system's performance. Text embedding techniques, such as word embeddings and pretrained language models, were used to represent and analyse textual content. Computer vision techniques, including convolutional neural networks and transfer learning, were employed for image and video analysis. Throughout the internship, challenges related to data collection, labelling, modality integration, and model training were encountered. Strategies such as collaborative data collection, annotator training, fusion techniques, and transfer learning were employed to overcome these challenges. 6.2 Conclusion The internship on offensive multimodal post detection using machine learning was a valuable learning experience. The internship successfully addressed the objectives and scope outlined in the introduction, including data collection, pre-processing, model design, and implementation. By conducting a literature review, valuable insights from previous research and existing multimodal frameworks were incorporated into the offensive multimodal post-detection system. 
The challenges encountered during the internship were addressed through collaborative efforts, careful design, and continuous evaluation and iteration. These challenges, including data collection, labelling, modality integration, and model training, were overcome through various strategies and solutions. Overall, the internship contributed to the development of an effective offensive multimodal post-detection system. The system leveraged machine learning techniques, multimodal fusion, and attention mechanisms to accurately identify offensive content across text, images, and videos. By combining the knowledge and experience gained during the internship, it is expected that the offensive multimodal post detection system will make a significant contribution to the field of content moderation, fostering safer online environments and enhancing user experiences.