
Department of Computer Science
Lok Nayak Jai Prakash Institute of Technology Chapra-84130
Internship Report
On
“Offensive Multimodal Post Detection Using Machine Learning”
Done at
Indian Institute of Technology Patna
Submitted in partial fulfillment of the requirements of awarding B-Tech degree in
Computer Science & Engineering.
By
Rohit Gupta
Registration no. – 19105117030
DECLARATION
I, Rohit Gupta, Registration No. 19105117030, a student of B.Tech CSE 2019-2023, Lok
Nayak Jai Prakash Institute of Technology, Chapra, hereby declare that this report is
a bonafide record of the study done under the supervision of Dr. Krishanu
Maity of IIT Patna from 12th December 2022 to 12th February 2023 on the topic of
“Offensive Multimodal Post Detection Using Machine Learning”.
This work has not been undertaken or submitted elsewhere in connection
with any other academic course.
Place: Chapra
Date:
Rohit Gupta
Registration no – 19105117030
Computer Science Engineering
Acknowledgement
I kindly take this opportunity to express my sincere expression of gratitude to each
and every person who helped me in the completion of this work. I am very glad to
express my gratitude to Dr. Krishanu Maity for his timely help all through my study
and for his valuable suggestions, advice, and encouragement throughout the
Internship. I owe my gratitude to Dr. Chanchal Suman for her constant support and
encouragement during the internship. I am also grateful to my parents and friends for
their help throughout the process.
About The Internship
During my internship at IIT Patna from December 12th to February 12th, I worked
under the guidance of Krishanu Maity on the topic of "Offensive Multimodal Post
Detection Using Machine Learning." The objective of the internship was to develop
an automated system capable of detecting offensive content in various forms, such
as text, images, and videos, using machine learning techniques.
Throughout the internship, I was exposed to a wide range of tasks and
responsibilities related to the project. I began by familiarising myself with the existing
literature and research in the field of offensive content detection. This involved
studying relevant papers, articles, and existing frameworks to gain a comprehensive
understanding of the subject matter.
After the initial research phase, I collaborated closely with Krishanu Maity and other
team members to design and implement a multimodal deep learning model for
offensive content detection. This involved collecting and preprocessing a large
dataset of offensive and non-offensive content, which was crucial for training and
evaluating the model's performance.
I worked extensively with various machine learning techniques, such as natural
language processing (NLP), computer vision, and deep learning algorithms. I
employed tools like moviepy, pytube, and Scikit-Learn to develop and fine-tune the
model architecture. This included creating text embedding models, and convolutional
neural networks (CNNs) for image analysis.
To ensure the model's accuracy, I conducted a performance analysis and compared
the results with existing state-of-the-art methods to validate the effectiveness of our
proposed approach.
In addition to the technical aspects, I actively participated in team meetings,
presenting my progress, discussing challenges, and brainstorming solutions with the
guidance of Krishanu Maity and other researchers. This collaborative environment
allowed me to enhance my communication and teamwork skills while also gaining
insights into the broader research community.
INDEX
1. Introduction
   o Background and motivation
   o Objectives and scope
   o Methodology overview
2. Literature Review
   o Overview of offensive content detection
   o Previous research and approaches
   o Existing multimodal detection frameworks
3. Data Collection and Preprocessing
   o Data collection process
   o Data preprocessing techniques
   o Dataset description
4. Model Design and Implementation
   o Decision Tree
   o Naive Bayes classifiers
   o Support Vector Machine
   o Multi-Layer Perceptron (MLP)
   o Model Optimisation
5. Challenges and Solutions
   o Challenges encountered during the internship
   o Strategies and solutions employed
6. Discussion and Conclusion
   o Summary of the internship experience
1. INTRODUCTION
1.1 Background and Motivation
The field of machine learning has experienced significant advancements in recent
years, revolutionising various industries and domains. With its ability to analyse vast
amounts of data and extract valuable insights, machine learning has become a
powerful tool for solving complex problems. Recognising the potential of machine
learning, I undertook an internship to gain practical experience in this field and
explore its applications in offensive content detection.
The proliferation of social media platforms and online communities has led to an
increase in offensive and harmful content, posing challenges for content moderation.
Offensive content, such as hate speech, cyberbullying, and harassment, can have
detrimental effects on individuals and communities. Therefore, there is a growing
need for automated systems capable of detecting and mitigating such content.
1.2 Objectives and Scope
The primary objective of my internship was to develop an offensive content detection
system using machine learning techniques. The goal was to create an automated
solution capable of analysing various forms of content, including text, images, and
videos, to accurately identify and flag offensive posts. By achieving this objective,
the internship aimed to contribute to the broader field of content moderation and
facilitate the creation of safer online environments.
The scope of the internship involved conducting research on existing approaches and
algorithms for offensive content detection. It also included designing and
implementing a machine learning model that leverages multimodal data to improve
the accuracy of detection. The internship further encompassed collecting and preprocessing a suitable dataset, training and evaluating the model, and analysing its
performance against existing state-of-the-art methods.
1.3 Methodology Overview
The methodology adopted for the internship involved a systematic and iterative
approach to achieving the project objectives. It started with an extensive literature
review to gain a comprehensive understanding of offensive content detection and the
relevant machine learning techniques.
2. LITERATURE REVIEW
2.1 Overview of Offensive Content Detection
Offensive content detection is a critical task in the field of content moderation, aimed
at identifying and mitigating harmful or offensive posts across various online
platforms. This section provides an overview of offensive content detection
techniques and their significance in creating safer online environments.
Offensive content can take different forms, including hate speech, cyberbullying,
harassment, and explicit or violent imagery. Traditional approaches to content
moderation relied on manual review and user reporting, which proved to be time-consuming and inefficient. The advent of machine learning techniques has enabled
the development of automated systems capable of detecting offensive content at
scale.
2.2 Previous Research and Approaches
Numerous research studies have been conducted on offensive content detection,
focusing on both unimodal (text, image, or video-based) and multimodal approaches.
This subsection explores the key findings and methodologies employed in previous
research efforts.
In text-based offensive content detection, approaches often involve natural language
processing (NLP) techniques. Various algorithms, such as recurrent neural networks
(RNNs), long short-term memory (LSTM), and transformers like BERT (Bidirectional
Encoder Representations from Transformers), have been used for sentiment
analysis, profanity detection, and context-based classification of offensive text.
Image-based offensive content detection primarily relies on computer vision
techniques. Convolutional neural networks (CNNs) have been widely used for feature
extraction and classification of offensive or explicit imagery. Transfer learning
approaches, utilising pre-trained models like VGGNet, ResNet, or InceptionNet, have
shown promising results in image-based offensive content detection.
Video-based offensive content detection poses additional challenges due to the
temporal nature of the data. Approaches in this domain often involve frame-level
analysis and sequential modelling using recurrent neural networks (RNNs) or 3D
convolutional neural networks (3D CNNs). Spatio-temporal feature extraction and
fusion techniques have also been explored to capture both visual and temporal cues
in video content.
2.3 Existing Multimodal Detection Frameworks
Multimodal offensive content detection leverages information from multiple
modalities, such as text, images, and videos, to improve the accuracy and
robustness of the detection system. This subsection discusses existing multimodal
detection frameworks and their contributions.
Multimodal approaches typically combine unimodal models, such as text classifiers,
image classifiers, and video classifiers, into a unified framework. These models are
designed to capture complementary features from different modalities and make joint
predictions. Fusion techniques, such as late fusion (combining predictions at the
decision level) or early fusion (combining features at the input level), are commonly
used to integrate information from multiple modalities.
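The late-fusion idea described above can be sketched in a few lines. This is a minimal illustrative example, not a framework from the report: it assumes each unimodal classifier has already produced a probability of the offensive class, and combines them with a hypothetical weighted average at the decision level.

```python
import numpy as np

def late_fusion(prob_text, prob_image, weights=(0.5, 0.5)):
    """Combine per-modality offensive-class probabilities at the decision level."""
    fused = weights[0] * np.asarray(prob_text) + weights[1] * np.asarray(prob_image)
    return (fused >= 0.5).astype(int)  # 1 = offensive, 0 = non-offensive

# Hypothetical per-post probabilities from a text and an image classifier
text_probs = [0.9, 0.2, 0.6]
image_probs = [0.8, 0.1, 0.3]
print(late_fusion(text_probs, image_probs))  # [1 0 0]
```

Early fusion would instead concatenate the feature vectors of the two modalities before a single classifier; the weights here are a design choice that could also be learned.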
Additionally, multimodal frameworks often explore the use of attention mechanisms to
dynamically weigh the importance of different modalities or specific regions within
modalities. Attention mechanisms enable the model to focus on relevant information
and improve the overall performance of the offensive content detection system.
Several studies have demonstrated the effectiveness of multimodal detection
frameworks, showcasing improved performance compared to unimodal approaches.
These frameworks have the potential to provide a comprehensive and nuanced
understanding of offensive content, leveraging the strengths of different modalities.
By reviewing the literature on offensive content detection, previous research efforts,
and existing multimodal detection frameworks, valuable insights can be gained to
inform the design and implementation of the offensive multimodal post detection
system in the internship. These findings contribute to the development of an effective
and state-of-the-art solution in the field of content moderation.
3. Data Collection and Pre-processing
3.1 Data Collection Process
The success of offensive multimodal post detection relies heavily on the availability of
a diverse and representative dataset. This section discusses the data collection
process undertaken for the internship project.
The data collection process involved several steps. Firstly, a thorough review of
existing datasets related to offensive content detection was conducted. Various
publicly available datasets, such as Hate Speech and Offensive Language (HSOL),
Multimodal Offensive Content Dataset (MOC), and YouTube dataset, were
considered. Additionally, specific platforms or social media websites that are known
for offensive content were explored to collect data.
After identifying potential datasets, the necessary permissions and ethical
considerations were addressed to ensure compliance with data usage policies and
user privacy. In some cases, collaboration with platform administrators or obtaining
consent from users was necessary.
Next, a systematic approach was employed to gather offensive and non-offensive
posts across multiple modalities. This involved leveraging specific keywords,
hashtags, or user-defined labels to retrieve relevant content. Additionally, the posts
were collected from diverse sources to capture variations in content types, contexts,
and user demographics.
It's important to note that during the data collection process, careful consideration
was given to ethical guidelines and the potential impact of handling offensive content.
Measures were taken to maintain the privacy and anonymity of individuals while
preserving the integrity of the data.
3.2 Data Preprocessing Techniques
Once the data was collected, it underwent preprocessing to ensure data quality,
consistency, and compatibility for subsequent analysis and model training. This
subsection highlights the data preprocessing techniques employed.
For text data, common preprocessing steps included removing special characters,
punctuation, and URLs. Text normalisation techniques, such as lowercasing,
stemming, or lemmatization, were applied to reduce vocabulary size and improve
generalisation. Stop words and irrelevant terms specific to the dataset were also
eliminated to focus on meaningful content.
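The text-cleaning steps listed above can be sketched as follows. The stop-word list and the example sentence are illustrative only; the actual dataset-specific term list is not reproduced in this report.

```python
import re
import string

def clean_text(text, stop_words=frozenset({"the", "a", "an", "is"})):
    """Lowercase, strip URLs and punctuation, and drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)                       # remove URLs
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in stop_words]
    return " ".join(tokens)

print(clean_text("This is a TEST post!! https://example.com"))  # "this test post"
```

Stemming or lemmatization (e.g. via NLTK or spaCy) would be applied after this stage to further reduce the vocabulary.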
In the case of image data, pre-processing involved resizing images to a uniform size
to ensure compatibility across the dataset. Techniques like centre cropping or
resizing with aspect ratio preservation were employed to maintain the integrity of the
visual content. Additionally, normalisation techniques, such as scaling pixel values to
a specific range or using a precomputed mean and standard deviation, were applied
to facilitate effective model training.
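A minimal sketch of the resizing and normalisation steps, assuming Pillow and NumPy are available; the 224x224 target size and the mean/std values are common ImageNet-style conventions, not figures taken from the report.

```python
import numpy as np
from PIL import Image

def preprocess_image(img, size=(224, 224),
                     mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    """Resize to a uniform size, scale pixels to [0, 1], then normalise per channel."""
    img = img.convert("RGB").resize(size)
    arr = np.asarray(img, dtype=np.float32) / 255.0
    return (arr - mean) / std

# A dummy 2x2 image stands in for a collected post image
dummy = Image.new("RGB", (2, 2), color=(255, 0, 0))
print(preprocess_image(dummy).shape)  # (224, 224, 3)
```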
Video data underwent pre-processing steps to extract relevant frames or keyframes
from the videos. Keyframes represent representative frames that capture important
visual information. The keyframes were extracted using techniques such as uniform
sampling or motion analysis. Similar to image data, resizing and normalisation
techniques were applied to ensure consistency and compatibility.
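Uniform sampling of keyframes, as mentioned above, amounts to picking evenly spaced frame indices. A small sketch (the frame counts are illustrative; in practice a library such as moviepy or OpenCV would supply the frames themselves):

```python
def uniform_keyframe_indices(n_frames, n_keyframes):
    """Pick n_keyframes frame indices spaced evenly across a video of n_frames."""
    if n_keyframes >= n_frames:
        return list(range(n_frames))
    step = n_frames / n_keyframes
    # Centre each sample within its segment of the video
    return [int(i * step + step / 2) for i in range(n_keyframes)]

print(uniform_keyframe_indices(100, 5))  # [10, 30, 50, 70, 90]
```

Each selected frame would then go through the same resizing and normalisation pipeline as the still images.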
3.3 Dataset Description
The dataset used for offensive multimodal post detection comprised offensive and
non-offensive posts across multiple modalities, including text, images, and videos.
The dataset was carefully curated and labelled to include a wide range of offensive
content types, such as hate speech, cyberbullying, or explicit imagery.
The dataset consisted of a significant number of posts from diverse sources,
including social media platforms, online forums, and multimedia sharing websites.
The dataset aimed to capture variations in offensive content based on language,
cultural context, and geographical locations.
To maintain data integrity and ensure accurate labelling, a rigorous annotation
process was undertaken. Human annotators with expertise in content moderation
were involved in labelling the posts as offensive or non-offensive. In cases where the
classification of offensive content required further granularity, additional labels or
severity levels were assigned.
The dataset was divided into appropriate subsets for training, validation, and testing,
adhering to standard practices to evaluate the model's performance accurately.
Stratified sampling or random sampling techniques were employed to ensure a
balanced representation of offensive and non-offensive posts across the subsets.
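The stratified split described above can be done with scikit-learn. The posts and labels below are toy stand-ins for the curated dataset; stratifying on the labels keeps the offensive/non-offensive ratio the same in every subset.

```python
from sklearn.model_selection import train_test_split

posts = [f"post_{i}" for i in range(10)]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1 = offensive, 0 = non-offensive

train_x, test_x, train_y, test_y = train_test_split(
    posts, labels, test_size=0.2, stratify=labels, random_state=42)
print(len(train_x), len(test_x))  # 8 2
print(sorted(test_y))             # [0, 1] -- one post of each class
```

A validation subset would be carved out of the training portion with a second stratified split.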
4. Data Modeling
Data modeling in machine learning refers to the process of creating a mathematical
representation or model that captures the underlying patterns, relationships, and
characteristics of a dataset. This model is then used to make predictions, classify new data
points, or gain insights from the data.
4.1 Decision Tree Modeling
A decision tree is one of the most powerful tools of supervised learning algorithms
used for both classification and regression tasks. It builds a flowchart-like tree
structure where each internal node denotes a test on an attribute, each branch
represents an outcome of the test, and each leaf node (terminal node) holds a class
label. It is constructed by recursively splitting the training data into subsets based
on the values of the attributes until a stopping criterion is met, such as the
maximum depth of the tree or the minimum number of samples required to split a
node.
During training, the Decision Tree algorithm selects the best attribute to split the
data based on a metric such as entropy or Gini impurity, which measures the level
of impurity or randomness in the subsets. The goal is to find the attribute that
maximizes the information gain or the reduction in impurity after the split.
A tree can be “learned” by splitting the source set into subsets based on Attribute
Selection Measures. Attribute selection measure (ASM) is a criterion used in
decision tree algorithms to evaluate the usefulness of different attributes for splitting
a dataset. The goal of ASM is to identify the attribute that will create the most
homogeneous subsets of data after the split, thereby maximizing the information
gain. This process is repeated on each derived subset in a recursive manner
called recursive partitioning. The recursion is completed when the subset at a node
all has the same value of the target variable, or when splitting no longer adds value
to the predictions. The construction of a decision tree classifier does not require any
domain knowledge or parameter setting and therefore is appropriate for exploratory
knowledge discovery. Decision trees can handle high-dimensional data.

Decision Tree Model accuracy and Precision of the Model
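A minimal sketch of training and scoring a decision tree with scikit-learn, matching the Gini-impurity splitting and maximum-depth stopping criterion described above. The synthetic features stand in for the extracted post features, which are not reproduced in this report, so the printed scores are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score

# Synthetic stand-in for offensive/non-offensive post features
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Gini impurity as the attribute selection measure; depth cap as stopping criterion
clf = DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=0)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(f"accuracy={accuracy_score(y_te, pred):.2f}",
      f"precision={precision_score(y_te, pred):.2f}")
```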
4.2 Naive Bayes Classifiers
Naive Bayes classifiers are a collection of classification algorithms based on Bayes’
Theorem.
Bayes’ Theorem
Bayes’ Theorem finds the probability of an event occurring given the probability of
another event that has already occurred. Bayes’ theorem is stated mathematically
as the following equation:

P(A|B) = P(B|A) · P(A) / P(B)

where A and B are events and P(B) ≠ 0.

Basically, we are trying to find the probability of event A, given that event B is true.
Event B is also termed the evidence.

P(A) is the prior probability of A, i.e. the probability of the event before the evidence is
seen. The evidence is an attribute value of an unknown instance (here, event B).

P(A|B) is the posterior probability of A, i.e. the probability of the event after the
evidence is seen.

Naïve Bayes Classification Model accuracy and Precision of the Model
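A minimal Naive Bayes text-classification sketch using scikit-learn. The four-post corpus and its labels are purely illustrative; the real classifier was trained on the curated dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus: 1 = offensive, 0 = non-offensive (illustrative labels)
texts = ["you are awful trash", "have a great day",
         "awful hateful trash", "great kind day"]
labels = [1, 0, 1, 0]

# Word counts as evidence; MultinomialNB applies Bayes' theorem per class
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["awful trash comment"]))  # [1]
```

The "naive" assumption is that word occurrences are conditionally independent given the class, which keeps P(evidence|class) cheap to estimate.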
4.3 Support Vector Machine
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which
is used for Classification as well as Regression problems. However, primarily, it is used for
Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in
the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed a Support
Vector Machine.

Support Vector Machine Model accuracy and Precision of the Model
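A small sketch of a linear SVM on toy 2-D points (made up for illustration): the fitted hyperplane separates the two classes, and `support_vectors_` exposes the extreme points described above.

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy points: class 1 above the line y = x, class 0 below
X = np.array([[0, 1], [1, 2], [2, 3], [1, 0], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 0, 0, 0])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)
print(clf.support_vectors_)           # extreme points defining the hyperplane
print(clf.predict([[0, 2], [2, 0]]))  # [1 0]
```

For non-linearly separable data, a kernel such as `rbf` maps the points into a higher-dimensional space where a separating hyperplane exists.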
4.4 Neural Network Model-Multi-layer Perceptron (MLP)
A multi-layer perceptron has one input layer and for each input, there is one
neuron(or node), it has one output layer with a single node for each output and it
can have any number of hidden layers and each hidden layer can have any number
of nodes. A schematic diagram of a Multi-Layer Perceptron (MLP) is depicted
below.
In the multi-layer perceptron diagram above, we can see that there are three inputs
and thus three input nodes and the hidden layer has three nodes. The output layer
gives two outputs, therefore there are two output nodes. The nodes in the input
layer take input and forward it for further process, in the diagram above the nodes
in the input layer forwards their output to each of the three nodes in the hidden
layer, and in the same way, the hidden layer processes the information and passes
it to the output layer.

Neural Network Model-Multi-layer Perceptron(MLP) Model accuracy and
Precision of the Model
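A hedged sketch of an MLP classifier in scikit-learn. The single 32-node hidden layer and the synthetic data are illustrative choices, not the exact architecture from the internship; the input and output layer sizes follow the data automatically.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in features: 10 inputs -> one input node each
X, y = make_classification(n_samples=300, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

# One hidden layer of 32 nodes between the input and output layers
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=1)
mlp.fit(X_tr, y_tr)
print(f"accuracy={accuracy_score(y_te, mlp.predict(X_te)):.2f}")
```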
4.5 Model Optimisation
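One common optimisation step, sketched here as an illustrative example rather than the exact procedure used, is a cross-validated grid search over hyperparameters; the grid below for a decision tree is hypothetical.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Try every combination in a small hypothetical grid, scored by 5-fold CV
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, None], "criterion": ["gini", "entropy"]},
    cv=5)
grid.fit(X, y)
print(grid.best_params_)
```

The same pattern applies to the other models (e.g. `C` and `kernel` for the SVM, or hidden-layer sizes for the MLP).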
5. Challenges and Solutions
5.1 Challenges Encountered During the Internship
During the internship focused on offensive multimodal post detection using machine
learning, several challenges may have arisen. These challenges could include:
1. Data Collection: Gathering a diverse and representative dataset of offensive and non-offensive posts across multiple modalities can be a challenging task. Finding suitable
sources, obtaining permissions, and ensuring data privacy can pose difficulties.
2. Labelling and Annotation: Accurately labelling offensive and non-offensive posts in
the dataset requires human annotators with expertise in content moderation.
Maintaining consistency and dealing with subjective content can be challenging.
3. Modality Integration: Integrating multiple modalities, such as text, images, and
videos, to create an effective multimodal detection system can be complex.
Combining different types of data and capturing their interactions in a unified model
requires careful design.
4. Model Training and Optimisation: Training deep learning models on multimodal data
requires significant computational resources and careful hyperparameter tuning.
Managing training times, avoiding overfitting, and achieving a balance between
modalities can be challenging.
5.2 Strategies and Solutions Employed
To address these challenges, various strategies and solutions can be employed:
1. Collaborative Data Collection: Collaborating with platform administrators,
researchers, or data providers can help overcome data collection challenges.
Establishing partnerships and leveraging existing datasets or data-sharing initiatives
can ensure a more diverse and comprehensive dataset.
2. Annotator Training and Guidelines: Providing detailed guidelines and conducting
training sessions for annotators can help improve consistency and accuracy in
labelling offensive content. Regular communication and feedback with annotators can
address challenges and ensure high-quality annotations.
3. Transfer Learning and Pretrained Models: Utilising pretrained models, especially in
computer vision, can alleviate the challenges of training deep learning models from
scratch. Fine-tuning these models on the offensive content dataset can help achieve
better performance with limited computational resources.
4. Regular Evaluation and Iteration: Continuously evaluating the performance of the
offensive multimodal post-detection system is crucial. Identifying weaknesses,
analysing misclassifications, and fine-tuning the model based on feedback and
evaluation results can lead to continuous improvement.
6. Discussion and Conclusion
6.1 Summary of the Internship Experience
The internship focused on developing an offensive multimodal post-detection system
using machine learning techniques. Throughout the internship, several key aspects
were addressed, including data collection, pre-processing, model design, and
implementation. The challenges faced during the internship were also tackled
through various strategies and solutions.
The internship began with a thorough literature review, exploring existing research on
offensive content detection and multimodal frameworks. This provided valuable
insights into the state of the field and informed the design of the offensive multimodal
post-detection system.
Data collection was a critical step in the internship, involving the identification and
collection of diverse offensive and non-offensive posts across multiple modalities.
Ethical considerations and data usage policies were adhered to throughout the data
collection process.
Pre-processing techniques were employed to ensure data quality and consistency
across modalities. Text data underwent normalisation and feature extraction, while
images and videos were resized and normalised for effective model training.
The architecture design for multimodal detection involved the integration of separate
branches for each modality, fusion techniques, and attention mechanisms to capture
the interactions between modalities and enhance the system's performance.
Text embedding techniques, such as word embeddings and pretrained language
models, were used to represent and analyse textual content. Computer vision
techniques, including convolutional neural networks and transfer learning, were
employed for image and video analysis.
Throughout the internship, challenges related to data collection, labelling, modality
integration, and model training were encountered. Strategies such as collaborative
data collection, annotator training, fusion techniques, and transfer learning were
employed to overcome these challenges.
6.2 Conclusion
The internship on offensive multimodal post-detection using machine learning was a
valuable learning experience. The internship successfully addressed the objectives
and scope outlined in the introduction, including data collection, pre-processing,
model design, and implementation.
By conducting a literature review, valuable insights from previous research and
existing multimodal frameworks were incorporated into the offensive multimodal post-detection system.
The challenges encountered during the internship were addressed through
collaborative efforts, careful design, and continuous evaluation and iteration. These
challenges, including data collection, labelling, modality integration, and model
training, were overcome through various strategies and solutions.
Overall, the internship contributed to the development of an effective offensive
multimodal post-detection system. The system leveraged machine learning
techniques, multimodal fusion, and attention mechanisms to accurately identify
offensive content across text, images, and videos.
By combining the knowledge and experience gained during the internship, it is
expected that the offensive multimodal post detection system will make a significant
contribution to the field of content moderation, fostering safer online environments
and enhancing user experiences.