Cyberbullying Detection NLP Thesis-Proposal

CYBERBULLYING DETECTION WITH
NATURAL LANGUAGE PROCESSING ON
SOCIAL MEDIA
by
Sonwabile Balite
0109285168083
A research proposal submitted in partial fulfilment of the requirements for the
bachelor's degree in
Computing
The Belgium Campus
Supervisor: Anila Joy
25/05/2022
Table of Contents
1 INTRODUCTION AND BACKGROUND
1.1 Introduction
1.2 Background
2 RESEARCH PROBLEM
3 AIM, OBJECTIVES and RESEARCH QUESTION
3.1 AIM
3.2 OBJECTIVE
3.3 RESEARCH QUESTIONS
3.3.1 Main Question
3.3.2 Sub Question
4 PRELIMINARY LITERATURE REVIEW
4.1 Theoretical background/conceptual framework
4.2 Related work/studies
5 DESIGN, METHODOLOGY AND ETHICS
5.1 RESEARCH APPROACH
5.2 METHODOLOGICAL CHOICE
5.3 RESEARCH STRATEGY
5.4 RESEARCH DESIGN
5.4.1 Data collection
5.4.2 Population and sample
5.4.3 Experimentation, Evaluation and Interpretation of the results
5.5 TIME HORIZON
5.6 ETHICS
6 DELINEATIONS, TIMELINE, BUDGET AND LIMITATIONS
7 ASSUMPTIONS
8 OUTCOMES, CONTRIBUTION AND SIGNIFICANCE
9 CONCLUSION
References
Research Proposal
1 INTRODUCTION AND BACKGROUND
1.1 INTRODUCTION
The abundance of internet-connected electronic devices today has made it possible for bullying to
move into the realm of technology, where it is known as cyberbullying. Cyberbullying negatively
affects many people every day, especially children and adolescents. Thus, concerns about
cyberbullying are rising.
Much research is under way to reduce cyberbullying and the effects that it has on its victims.
Earlier research delved into the impact of cyberbullying on mental health, and more recent work has
investigated how to recognize cyberbullying in many different languages and in colloquial forms.
There are many ways to identify cyberbullying, but the most effective methods include machine
learning techniques, particularly natural language processing (NLP). This research proposal sets out
the research that needs to be done on identifying cyberbullying on social media with NLP. It details
the research problem, the aim and objectives of the research, existing research on this issue, the
design and methodology of the research that will be carried out and, finally, the contribution and
significance of this research.
1.2 BACKGROUND
Technology advances and changes rapidly, and it is now changing the interactions and relationships
we have with people, with communication being a mere mouse click or swipe of a screen away (Weber &
William V. Pelfrey, 2014). The platforms and social networks used for such communication include
Skype, email, Facebook, Twitter, and Instagram, amongst many others. Platforms like these allow us
to share pictures, recorded videos, and aspects of our daily lives, not only with the people in our
own world but with people all over the world. This sharing has become a part of everyday life (Sue
North & Snyder, 2008).
People, teenagers in particular, use social networks to connect with others as well as to pass time
with different entertainment venues (Boyd, 2009). Cyberbullying and social network use do not affect
only one demographic; however, research suggests that adolescents and young adults are more likely
to experience cyberbullying because their exposure to social media is often greater than that of
other age groups (Ito, et al., 2008). This extended exposure can unfortunately make social networks
a space for spreading rumours, social sabotage, and humiliation (cyberbullying) (Hinduja & Patchin,
2008).
In previous years, bullying in South African schools has attracted a great deal of media attention
(Smit, 2015). However, studies on bullying in the workplace in South Africa are scarce (Wet, 2011).
Nonetheless, bullying often results in emotions of “incompetence, alienation and depression” (Roux,
et al., 2010); and in schools, the research suggests that cyberbullying might bring about “low
confidence, family issues, scholarly issues, school brutality and suicidal thoughts” (Goodno, 2011).
This shows that a solution that aims to combat cyberbullying is imperative. One way to achieve this
is to implement machine learning techniques, such as natural language processing, to identify
cyberbullying on social media, helping to reduce or eliminate it (Dadvar, et al., 2014) (Hee, et
al., 2018); hence research is necessary. This research proposal aims to plan the necessary research.
2 RESEARCH PROBLEM
Cyberbullying can be defined as harassment or bullying that takes place over digital devices such as
cell phones, computers, and tablets. The media in which cyberbullying is usually found include SMS,
email, text messages/direct messages over the internet, forums, and gaming conference rooms such as
Reddit, and social media sites such as Facebook, Instagram, Snapchat and TikTok (Stop Bullying.Gov,
2021). Cyberbullying can include behaviours such as sending, posting, or sharing destructive and
harmful content about someone else (Stop Bullying.Gov, 2021). Behaviours that fall under
cyberbullying include (Gordon, 2022):
• Harassment
• Impersonation
• Inappropriate photographs
• Video shaming
• Website creation
For this research proposal, the words “teenager(s)” and “child(ren)” will be used interchangeably.
Focus will be placed on teenagers, because they are the demographic most exposed to cyberbullying
(Amanda Lenhart, 2007).
In South Africa, on average, a staggering 51.5% of teenagers have been cyberbullied, 54% have
accessed inappropriate content via digital platforms, 35% have been victims of cyberstalking and
36.5% have fallen victim to online shaming (Pheto, 2021) (News24, 2021). This indicates that more
than half of the teenage population in South Africa has experienced cyberbullying. This is a problem
because it affects teenagers negatively, leaving them distressed and humiliated and causing
depression, anxiety, and low self-esteem. They are also affected by the behaviours that usually
follow an occurrence of cyberbullying, which include self-harm, poor academic performance, skipping
school, substance abuse and suicidal thoughts (Gordon, 2021) (Lopez-Meneses, et al., 2020). A
cyberbullying study conducted in just one South African province, Limpopo, reported that 31% of
victimised students were very or extremely upset, 19% were very or extremely scared and 18% were
extremely embarrassed after a cyberbullying episode. This indicates that the effects of cyberbullying
are harmful and suggests that, if episodes of cyberbullying are not tackled as soon as possible,
children might feel desperate enough to resort to self-harm or suicide (Farhangpour, et al., 2019).
The effects of cyberbullying are extremely negative and will only continue, and even become worse,
should nothing be done to deal with it (Navarro, et al., 2016).
Problem Statement:
Since the advent of the internet and the distribution of digital devices amongst the youth,
traditional bullying has taken an online form called cyberbullying. Teenagers and children are more
likely to be exposed to cyberbullying because, compared with other age groups, they are more
frequently exposed to social media, texting applications and other technologies where cyberbullying
can take place. This has an extremely negative effect on these children’s and teenagers’
psychological, physical, emotional, and mental well-being.
3 AIM, OBJECTIVES and RESEARCH QUESTION
3.1 AIM
The aim of the proposed research is to determine if the detection of cyberbullying on social media
with the use of natural language processing is possible and, if so, how it can be implemented. The
research also aims to determine, if such detection can be implemented, how it would change the
effects of cyberbullying on the psychological, physical, emotional, and mental well-being of
teenagers and of victims of cyberbullying in general.
3.2 OBJECTIVE
The main objective of this research is to eventually develop a systematic approach to automatically
detect and classify cyberbullying entries on social media, which could help prevent, detect, and
ultimately solve the problem of cyberbullying.
Sub-objectives include:
• To contribute to the prevention of the cyberbullying problem by transferring the developed
  cyberbullying detection methods to mobile devices and testing them there.
• To identify cyberbullying text on social media using natural language processing and to understand
  its meaning and context.
• To determine the best implementation of natural language processing for detecting cyberbullying,
  to such an extent that it helps to prevent, reduce, or solve the cyberbullying problem.
• To investigate how the detection, prevention, and reduction of cyberbullying on social media
  affects the psychological, physical, emotional, and mental well-being of the youth and, in effect,
  of every other social media user, and to improve their overall satisfaction levels.
• To identify what qualifies as social media and on which platform an ‘automatic cyberbullying
  detection system’ would work best.
• To determine whether the chosen social media platform(s) on which the automatic cyberbullying
  detection system operates will be sufficient to gauge the effectiveness of the system.
3.3 RESEARCH QUESTIONS
3.3.1 MAIN QUESTION
How can a natural language processing system be successfully implemented to detect cyberbullying in
its context on social media, in order to prevent, stop or reduce further cyberbullying and its
negative effects on social media users going forward?
3.3.2 SUB QUESTION
Sub-questions:
• How can NLP techniques be implemented in a way that effectively detects and comprehends
  cyberbullying incidents found online in their context?
• Can the successful implementation of an NLP system that detects cyberbullying on social media be
  effective enough to reduce the high levels of psychological, physical, emotional, and mental
  health dissatisfaction seen amongst cyberbullying victims and regular social media users?
• Which social media platform(s) experience cyberbullying incidents regularly enough to be
  appropriate for testing the NLP system?
• Are the users of the chosen social media platform(s) accessible enough that the change in their
  psychological, physical, emotional, and mental health satisfaction levels can be used as a
  standard of measure to test the effectiveness of the NLP system?
• Is it ethical to use the change in psychological, physical, emotional, and mental health
  satisfaction levels of cyberbullying victims and regular social media users as a standard of
  measure to test the performance of the NLP system? If not, what is an accurate, stable, and
  precise standard of measure of performance for the NLP system?
• How will the NLP cyberbullying detection system handle ‘slang’ or colloquial terms found on social
  media that constitute cyberbullying but are not well known amongst other users?
• How can the NLP cyberbullying detection system still be effective when cyberbullying terms involve
  multiple languages or language varieties within a single sentence or conversation, also known as
  ‘code-switching’?
4 PRELIMINARY LITERATURE REVIEW
Cyberbullying research has concentrated on causes and on the social and psychological effects
cyberbullying can have on a person; it is only recently that more consideration has been given to
the automatic detection of cyberbullying statements and incidents on social media (Capua, et al.,
2016). The majority of the existing research on automatic cyberbullying detection on social media
makes use of conventional machine learning concepts and models. These machine learning models and
data modelling techniques include deep neural network-based models (Zhang, et al., 2016), support
vector machines (Yin, et al., 2009), language-based approaches (Kontostathis, et al., 2013), and
data modelling methods such as Random Forest, Naive Bayes and K-Nearest Neighbour, amongst others.
4.1 THEORETICAL BACKGROUND/CONCEPTUAL FRAMEWORK
The above section, “Chapter 2 – Research problem”, went into detail about the current effects and
consequences of cyberbullying on regular social media users. This suggests that any measure to stop,
prevent or reduce cyberbullying is necessary, including the automatic detection of cyberbullying
using machine learning techniques such as natural language processing; however, the existing
datasets and the accurate, efficient machine learning models are limited (Alotaibi, et al., 2021).
Before this is discussed further, it is necessary to first establish a conceptual framework or
theoretical background for the common machine learning techniques and data processing models.
4.1.1. Datasets and Data Gathering
A variety of strategies and techniques have been used to gather data for the datasets required by
cyberbullying detection models, the majority of which are centred on ‘textual’ data, that is, data
in the form of text (Gao & Huang, 2017). Some datasets include multimedia, such as ‘gifs’,
‘pictures’ and ‘voice recordings’; however, these are far fewer than datasets of text
(Hosseinmardi, et al., 2015). Gathering such data and processing it so that the cyberbullying model
is accurate is an extremely tedious and inefficient process (Nobata, et al., 2016).
4.1.2. Basic Features
This refers to the ‘baseline’ natural language processing methods such as ‘Bag of words’, which
represents a text by the frequency of the words it contains, so that known words that are usually
associated with cyberbullying incidents can be counted. These methods are not as basic as they seem.
They can be compared to n-gram and token n-gram frequency-based methods (Gao & Huang, 2017).
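The bag-of-words and n-gram ideas described above can be illustrated with a minimal sketch. The function names (tokenize, ngrams, bag_of_words) and the example post are assumptions made for illustration and do not come from the cited studies.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(text, n=1):
    """Count how often each n-gram occurs, ignoring where it occurs."""
    return Counter(ngrams(tokenize(text), n))


# Illustrative post (hypothetical, not taken from any dataset).
post = "You are so stupid, so so stupid"
unigrams = bag_of_words(post, n=1)  # single-word counts
bigrams = bag_of_words(post, n=2)   # counts of adjacent word pairs
```

Here the unigram counts discard word order entirely, while the bigram counts retain a little local context, which is why n-gram features are often used alongside a plain bag of words.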
4.1.3. Sentiment Analysis
Sentiment analysis is a natural language processing technique used to determine whether data is
positive, negative, or neutral. Negative sentiment and hate speech have been shown to be correlated
(Schmidt & Wiegand, 2017). Sentiment analysis can be used by the cyberbullying detection system to
classify whether text is negative, which could link to cyberbullying, or positive (Gitari, et al.,
2015).
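As a rough illustration of the positive/negative/neutral decision described above, the sketch below scores a text against a small polarity lexicon. The lexicon entries and the function name are hypothetical; the cited studies use far richer resources and models, so this is only a simplified stand-in.

```python
import re

# Hypothetical polarity lexicon, for illustration only.
POLARITY = {
    "hate": -1, "stupid": -1, "ugly": -1,
    "love": 1, "great": 1, "kind": 1,
}


def sentiment(text):
    """Label a text by summing the polarity of the words it contains."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(POLARITY.get(token, 0) for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


label = sentiment("I hate you, you are stupid")
```

A negative label on its own does not establish cyberbullying; in line with the text, it would serve as one signal that the detection system combines with others.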
4.1.4. Lexical Resources
A lexical resource is a compiled list of words or phrases, for example a list of known offensive
terms. According to previous research, there is an association between a specific list of words and
the existence of obscene/sacrilegious content in textual bodies (Schmidt & Wiegand, 2017). This
means that lexical resources can be used to reinforce cyberbullying detection. They are often used
in addition to other detection techniques.
4.1.5. Linguistic features
Linguistic features refer to the features related to how words are pronounced and arranged in a
sentence, as well as what words are used. The idea behind linguistic features, and behind the
research projects that have used them, is to model the deeper, higher-level semantic relationships
between words in sentences. Such functionality can be achieved by building dependency rule engines
or, more commonly, by using statistical approaches (Gitari, et al., 2015) (Burnap & Williams, 2015).
4.1.6. Beyond the Textual context
Rather than analysing only the textual posts of social media, researchers have taken an interest in
the complementary and meta-information around the textual body of a social media post. Some research
has shown that meta-features/multimedia can be extracted from social media content. Such techniques
make use of image recognition, with models such as Support Vector Machines (SVMs) and Convolutional
Neural Networks (CNNs) for classification and prediction (Huang & Sushkov, 2016). Some methods have
even combined textual, visual, and audio features of social media content for inappropriate content
detection (Soni & Singh, 2018). This is quite helpful in the detection of cyberbullying.
4.2 RELATED WORK/STUDIES
Traditional studies on cyberbullying focused on the act itself, including the statistics of
cyberbullying, definitions, and mainly the negative impacts of cyberbullying (Patchin & Hinduja,
2006). Comparatively little work has been done on the automation of cyberbullying detection on
social media. The work that has been done includes the following.
(Hee, et al., 2015) proposed a strategy for distinguishing the more subtle types of cyberbullying,
like insults and threats. The authors classified the probable subjects into three classes: harasser,
victim, and bystander. The bystander class was separated into two classes: the individuals who
defend the victim and the individuals who support the harasser. Support vector machines (SVMs) were
then used to classify the comments.
(Sanchez & Kumar, 2011) were among the first to propose a technique to identify cyberbullying on the
Twitter social media platform. The authors used the Naïve Bayes classifier to detect tweets that
contained harmful behaviour toward a particular gender. However, their strategy achieved an accuracy
of only 70%, and it should be noted that the dataset used was relatively small.
(Al-Garadi, et al., 2016) provided an approach for detecting cyberbullying on Twitter that uses many
of the unique properties of the Twitter platform. For categorization, these attributes were fed into
a machine learning algorithm along with their corresponding samples. The authors looked at four
machine learning algorithms, namely Random Forest (RF), Naïve Bayes (NB), Support Vector Machines
(SVMs), and K-Nearest Neighbours (KNNs), and found that RF was the top performer.
(Kowalski, et al., 2012) took labelled data from a source and applied two models, a language-based
model and a machine learning model, to create a very successful query application for efficient
detection of cyberbullying incidents. The phrases created by the machine learning model performed
better than the language model in terms of recall and precision.
Another related study has shown that initial research on cyberbullying detection algorithms focused
mostly on the context of conversations, rather than the characteristics of the cyberbullying actors.
It further revealed that men and women bully each other in different ways, stating that women are
more likely to use confrontational communication strategies, while men are more likely to use words
and phrases that threaten (Lee & Ma, 2012).
5 DESIGN, METHODOLOGY AND ETHICS
5.1 RESEARCH APPROACH
The research approach will be deductive in nature. The reasoning for this is that the research aims
to test an existing theory, which is that cyberbullying has such an impact on social media users
because of their exposure to it on social media (Amanda Lenhart, 2007).
The problem statement backing this research approach was given in “Chapter 2 – Research problem”,
and the ‘hypothesis’ backing this research approach is that decreased exposure to cyberbullying on
social media could positively impact the overall psychological, mental, emotional, and physical
well-being of cyberbullying victims and social media users. Deductive research aims at testing an
existing theory, and in this case the existing theory can be deduced from the problem statement and
hypothesis. This theory implies that decreased or no exposure to cyberbullying content on social
media will, statistically and in practice, improve the emotional, physical, mental, and
psychological well-being of regular social media users and cyberbullying victims, which is what this
research project aims to test. Hence the choice of the deductive research approach.
5.2 METHODOLOGICAL CHOICE
For this research project a mixed-methods methodology has been chosen. Mixed-methods research is an
approach that involves collecting both quantitative and qualitative data and integrating the two
forms of data. This choice was made because it is the best research approach to gauge the
effectiveness of the automatic cyberbullying detection system and to test whether the objectives of
the research were achieved.
For the quantitative part of the research, surveys and experiments will be conducted to test how
much cyberbullying content is censored from the regular social media user’s experience. This is
necessary to see if the cyberbullying detection system is effective enough in actively preventing or
completely stopping cyberbullying incidents from occurring, thus fulfilling one of the research
objectives.
For the qualitative part of the research, interviews, focus groups and participant observation
studies will be conducted, and the focus will be on the overall well-being of the social media users
where the cyberbullying detection system has been implemented. This is necessary to see if the
cyberbullying detection system is effective in improving the psychological, mental, physical, and
emotional well-being/satisfaction levels of the social media users, thus fulfilling one of the
research objectives.
Therefore, both quantitative and qualitative research approaches are necessary to fulfil the
objectives of the research and to test the effectiveness of the system. Hence the choice of the
mixed-methods methodology.
5.3 RESEARCH STRATEGY
The research strategy had to be chosen carefully to ensure that the research aligns with the aim of
the project, which is to determine if the detection of cyberbullying on social media with the use of
natural language processing is possible and, if so, how it can be implemented. Another part of the
aim of the project is to determine if the detection of cyberbullying on social media with the use of
natural language processing can be implemented in a way that positively changes the effects that
cyberbullying has on the psychological, physical, emotional, and mental well-being of regular social
media users and victims of cyberbullying in general.
Considering all of this, the first step in the research strategy will involve a pipeline that
extracts suitable data from social media sites and various online sources. The goal with this data
will be to classify whether a remark or status/post from a social media user falls into a
“cyberbullying” category or not.
After data collection and extraction, the second step will involve the pre-processing and
categorization of the data. This includes activities such as noise reduction, lowercasing,
lemmatization and discarding spam content. “Cyberbullying” and “non-cyberbullying” categories will
be made, and additional information about the data in these categories will be fed into the system,
because it improves the natural language processing learning model (Sharma, et al., 2018). To
increase the precision of the model, the next step of the research will be feature engineering,
which refers to the process of using existing knowledge to select and transform the relevant
variables from raw data when creating a predictive model using machine learning. Examples of
possible features for the NLP-based cyberbullying detection system are the number of offensive words
in a sentence, the number of positive words in a sentence, and so on.
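The pre-processing and feature engineering steps described above could look roughly like the following sketch. The word lists, the function names and the example post are hypothetical placeholders. Lemmatization and spam filtering are left out, since both would need an external language resource.

```python
import re

# Hypothetical word lists, for illustration only; the study would need vetted lexicons.
OFFENSIVE_WORDS = {"stupid", "ugly", "loser", "idiot"}
POSITIVE_WORDS = {"great", "kind", "love", "nice"}


def preprocess(text):
    """Lowercase the text and strip noise: links, user mentions, non-letters."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"@\w+", " ", text)          # remove user mentions
    text = re.sub(r"[^a-z\s]", " ", text)      # remove digits, punctuation, emoji
    return text.split()


def extract_features(text):
    """Turn one post into the count-based features named in the text."""
    tokens = preprocess(text)
    return {
        "n_tokens": len(tokens),
        "n_offensive": sum(token in OFFENSIVE_WORDS for token in tokens),
        "n_positive": sum(token in POSITIVE_WORDS for token in tokens),
    }


features = extract_features("@sam you are SO stupid!!! Ugly loser http://example.com/x")
```

Each post becomes a small numeric record, which is the form a classifier expects as input.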
The final steps of the research project will be to implement the system on social media sites and
evaluate its performance. The system will use its features for probabilistic classification and
detection of cyberbullying content on social media. For the quantitative part of the research,
accuracy metrics such as test accuracy score and cross-validation score will be used to evaluate the
performance of the system. For the qualitative part of the research, surveys, interviews and focus
groups will be used to evaluate the effect of the cyberbullying detection system on users’
psychological, emotional, mental, and physical well-being.
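To make "probabilistic classification" concrete, the sketch below trains a small logistic regression model with batch gradient descent on two count features (offensive words, positive words). The toy training data, the learning rate and the number of epochs are assumptions chosen for illustration; a real study would more likely rely on an established library implementation.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weights, bias):
    """Probability that feature vector x belongs to the cyberbullying class."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            error = predict_proba(xi, weights, bias) - yi
            for j, value in enumerate(xi):
                grad_w[j] += error * value
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy data (hypothetical): [n_offensive, n_positive] -> 1 = cyberbullying, 0 = not.
X = [[3, 0], [2, 0], [1, 0], [0, 2], [0, 1], [0, 3]]
y = [1, 1, 1, 0, 0, 0]
weights, bias = train_logistic_regression(X, y)
probability = predict_proba([4, 1], weights, bias)
```

The model outputs a probability rather than a hard label, so the threshold at which a post is flagged (0.5 is only the conventional default) remains a design decision for the study.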
5.4 RESEARCH DESIGN
5.4.1 DATA COLLECTION
For the initial data collection phase of the research project the system will need raw data sets.
Data sets for cyberbullying usually consist of user comments, posts, images, and videos found on
social media sites. There are multiple places to retrieve such data, such as the UCI Machine
Learning Repository, which holds a large collection of open-source datasets for data analysis
purposes (Sharma, et al., 2018). Other places to obtain this initial raw data are ‘Kaggle’, where
individuals and businesses contribute data for research purposes, FormSpring.Me, MySpace, the
Twitter APIs (the streaming API gives access to tweets as they are published on Twitter) and
comment threads extracted from YouTube videos that are suspected of potentially igniting hate
speech.
This data is expected to provide the initial information that the cyberbullying system will use and,
because it is actual data taken from its sources, it can be deemed reliable.
5.4.2 POPULATION AND SAMPLE
The population for this research will be all social media users. For the data collection and
pre-processing stage all gathered data will be used. For the testing and evaluation stage, where the
psychological, physical, emotional, and mental well-being of social media users will be evaluated
and compared before and after the implementation of the cyberbullying detection system, only a
sample of people in the adolescent age group will be used, as this seems to be the group most
affected by the negative effects of cyberbullying because of its extended exposure to social media
when compared to other age groups (Amanda Lenhart, 2007).
5.4.3 EXPERIMENTATION, EVALUATION AND INTERPRETATION OF THE RESULTS
All the raw data and information will then be used to build a machine learning model, specifically a
natural language processing model, which will be referred to as the cyberbullying detection system.
Four classifiers/models will be trained to detect cyberbullying content: Logistic Regression,
Support Vector Machine, Random Forest Classifier and Gradient Boosting Machine.
For the quantitative part of the research the models will be provided with the training and test
datasets from the data collection phase, and their performance will be measured with train accuracy,
test accuracy, AUC score and cross-validation scores. These test scores will be analysed to see if
the system has achieved its objectives.
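The evaluation measures named above can be sketched in plain Python as follows. The helper names and the small example labels are assumptions for illustration. In practice an established library would normally be used, and the folds would usually be shuffled and stratified, which this sketch leaves out.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


# Illustrative labels and scores (hypothetical).
test_accuracy = accuracy([1, 0, 1, 0], [1, 0, 0, 0])
auc_score = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
folds = list(k_fold_indices(5, 2))
```

Train accuracy and test accuracy use the same accuracy function on different splits; a cross-validation score is the average test metric over the folds, which gives a steadier estimate than a single split.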
The qualitative part of the research can be tested in real time, as the cyberbullying detection
system will have to be deployed on selected social media sites, which will require adoption by the
companies behind those sites, such as Twitter and YouTube. Alternatively, it can be deployed locally
as a ‘cyberbullying content blocker’ on the personal devices that participants use to browse social
media sites. The qualitative part of the research can then be tested by conducting surveys,
interviews and focus group meetings to gauge the overall satisfaction level that the sample group
has for their psychological, mental, emotional, and physical well-being prior to the system’s
deployment. The same surveys, interviews and focus group meetings will then be conducted after the
system’s deployment, and the satisfaction levels of the sample group’s psychological, physical,
emotional, and mental well-being will be compared to those from before to test if the system has
achieved its objectives.
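If the surveys yield numeric satisfaction scores, the before-and-after comparison could be summarised as in the sketch below. The proposal does not name a statistical test, so the paired t-statistic used here is an assumption for illustration, and the scores are invented.

```python
import math
from statistics import mean, stdev


def paired_change(before, after):
    """Mean change per participant and the paired t-statistic for that change."""
    diffs = [a - b for b, a in zip(before, after)]
    mean_diff = mean(diffs)
    # Standard error of the mean difference, using the sample standard deviation.
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, mean_diff / standard_error


# Hypothetical satisfaction scores for the same four participants.
before = [4, 5, 3, 6]
after = [6, 6, 5, 7]
mean_diff, t_stat = paired_change(before, after)
```

Pairing each participant's two scores removes differences between individuals from the comparison. The interview and focus group material would still need qualitative analysis alongside any such figure.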
5.5 TIME HORIZON
The nature of the research project does not require a lot of time for the data collection and data
detection phases; it is the testing phase that will take the longest. The research will take an
estimated 6 months to complete in its entirety. The first month will be used to gather enough data
and information to build the NLP model and its feature classification, and to conduct surveys,
interviews and focus group meetings with the adolescent sample group that is active on the social
media platforms where the cyberbullying detection and censoring system will be implemented. This
will be to gather data and information on their current psychological, emotional, mental, and
physical well-being satisfaction levels. The system will then be deployed in the 2nd month and left
to run for the following 4 months. In the last month of the research the same surveys, interviews
and focus group meetings will be held with the same adolescent sample group. Their psychological,
mental, physical, and emotional well-being satisfaction levels will be evaluated again and compared
to the data gathered the first time. These results will be measured against the objectives of the
system to see if the experiment was effective, and evaluation reports will be compiled to examine
the effectiveness and efficiency of the system and what can potentially be improved.
5.6 ETHICS
The testing method of the overall system presents potential ethical risks. The data collection
process for the cyberbullying detection system does not present any ethical challenges, as the initial
data used for feature selection and for training the model has been made publicly available for
purposes such as those of this research project: building, training, and testing machine learning/
natural language processing models.
The ethical issue arises where adolescents participate in surveys, interviews and focus groups to gauge
their psychological, physical, emotional, and mental well-being satisfaction levels. This presents an
ethical/ moral challenge because the psychological, emotional, mental, and physical well-being
satisfaction levels of human beings are being used as test data. In some respects, this can be deemed
inhumane. However, before adolescents participate in this study, they will be made aware of what
the testing process is and how it may affect them, and only then will their consent be obtained.
All participants' data and responses to the surveys, interviews and focus group questions will be kept
confidential, and participants can choose to remain anonymous, as participation is voluntary.
Participants can also withdraw from the study at any time. The collected raw data and final report of
this research project will be kept confidential by the Belgium Campus institution for 2 years after
its submission.
6
DELINEATIONS, TIMELINE, BUDGET AND LIMITATIONS
1. Delineations
This research project will not cover the exact information asked of the sample group who participate
in the surveys, interviews and focus groups, nor will their confidential data be shared. It will also
not go into details about the sample group, such as when they use social media most frequently or the
activities they participate in on social media sites.
2. Timeline
The research will run during 2023, beginning in February and ending in August.
3. Budget
The budget limitation for this research project is that all research and additional processes should
be carried out free of charge. All research, evaluation and testing efforts will be done using open
source/ free technologies.
4. Limitations
The limitations of this research project stem from the chosen research method. One limitation is the
workload. Because the approach uses both quantitative and qualitative research methods, the research
results take a lot of time and effort to produce, which could extend the project beyond the given
timeline.
Another limitation of the chosen research method is the possibility of differing or conflicting
results. Having both quantitative and qualitative research leaves room for the two sets of results to
conflict with each other. For example, the quantitative results in this research project could show
that a lot of cyberbullying content is being detected and successfully censored, while the qualitative
results show a further drop in the psychological, mental, emotional, and physical well-being levels of
the adolescent group or of regular social media users in general. This would present a challenge
because only some of the objectives would be fulfilled and it would be harder to figure out the cause
of the observed effect.
7
ASSUMPTIONS
The first assumption made for this research project is that the members of the sample group for the
psychological, mental, physical, and emotional well-being testing are all adolescents and that the vast
majority of them do not have other mental illnesses or disorders that may skew the data.
Another assumption is that the adolescents are regular social media users, meaning that they actively
use one or more social media platforms for the average of 10 hours per week (Chaffey, 2022). A further
assumption is that the adolescents in the sample group will be honest about the state of their
psychological, mental, physical, and emotional health throughout the research study, so as not to skew
the data.
8
OUTCOMES, CONTRIBUTION AND SIGNIFICANCE
1. The reason the work is worth doing.
This research is worth doing because of the positive effect it may have on the psychological, mental,
emotional, and physical well-being of the adolescents in the sample group and, by extension, of all
regular social media users.
2. Outcomes
If successful, the product of the research will be a cyberbullying detection system that operates on
social media and either censors offensive and potential cyberbullying content from other users or
prevents a social media user from posting an offensive comment/ media item because of its
cyberbullying content. This would positively impact the psychological, mental, emotional, and physical
well-being of social media users and previous cyberbullying victims.
3. Contribution
The research study will provide a way in which an NLP system can detect cyberbullying. The methods
used might be new, and the implementation will contribute a different methodology for addressing
cyberbullying detection, thus adding valid research to the fields of cyberbullying detection and
natural language processing techniques.
4. Significance
The significance of the research is that it will provide a way in which a natural language processing
system can be successfully implemented to detect cyberbullying in context on social media, in order to
prevent, stop or reduce further cyberbullying and its negative effects on social media users going
forward.
9
CONCLUSION
It can be deduced that since the advent of the internet and distribution of digital devices amongst the youth,
traditional bullying has taken an online form called cyberbullying. Adolescents are more likely to be exposed
to this cyberbullying due to their frequent exposure to social media and this has an extremely negative effect
on their psychological, physical, emotional, and mental well-being.
This research proposal thus sees it as necessary to conduct a research study on a system that can
automatically detect cyberbullying incidents on social media platforms and then censor such content
from other users or completely disallow it from being posted on these platforms. The research study's
approach will be deductive in nature and will use a mixed-method approach to fully gauge the
effectiveness and efficiency of the study and of the cyberbullying detection system.
In conclusion, this research study will be carried out according to the aforementioned plans. It can
be deemed necessary to help solve or alleviate the problem and the negative effects of cyberbullying
on social media, should its research objectives be achieved. These objectives include: contributing to
the prevention of the cyberbullying problem by transferring and testing the developed cyberbullying
detection methods on mobile devices, identifying cyberbullying text on social media using natural
language processing and understanding its meaning and context, and figuring out the best
implementation of NLP to detect cyberbullying to such an extent that the psychological, mental,
emotional, and physical well-being of cyberbullying victims and regular social media users improves.
References
Al-Garadi, M. A., Varathan, K. & Ravana, S. D., 2016. Cybercrime detection in online communications: The
experimental case of cyberbullying detection in the Twitter network. Computers in Human Behavior, Volume
63, pp. 433 - 443.
Alotaibi, M., Alotaibi, B. & Razaque, A., 2021. A Multichannel Deep Learning Framework for Cyberbullying
Detection on Social Media. Electronics, 10 (2664), pp. 1 - 14.
Lenhart, A., Madden, M., Macgill, A. & Smith, A., 2007. Teens and Social Media, Washington DC: Pew Internet and
American Life Project.
Boyd, D., 2009. Why Youth (Heart) Social Network Sites: The Role of Networked Publics in Teenage Social Life,
Cambridge: Berkman Center.
Burnap, P. & Williams, M. L., 2015. Cyber hate speech on Twitter: An application of machine classification and
statistical modelling for policy and decision making. Policy & Internet, 7(2), pp. 223 - 242.
Capua, M. D., Nardo, E. D. & Petrosino, A., 2016. Unsupervised Cyberbullying Detection in Social Networks.
Cancun, International Conference on Pattern Recognition(ICPR).
Chaffey, D., 2022. Global social media statistics research summary 2022. [Online]
Available at: https://www.smartinsights.com/social-media-marketing/social-media-strategy/new-global-socialmedia-research/
[Accessed 14 May 2022].
Dadvar, M., Trieschnigg, D. & Jong, F. d., 2014. Experts and Machines against Bullies: A hybrid approach to detect
cyberbullies. Springer International Publishing, 8436(1), pp. 275-281.
Farhangpour, P., Maluleke, C. & Mutshaeni, H. N., 2019. Emotional and academic effects of cyberbullying on
students in a rural high school in the Limpopo province, South Africa. South African Journal of Information
Management, 21(1), pp. 1 - 8.
Gao, L. & Huang, R., 2017. Detecting online hate speech using context aware models. arXiv preprint arXiv:1710.07395.
Gitari, N. D., Zuping, Z., Damien, H. & Long, J., 2015. A lexicon-based approach for hate speech detection.
International Journal of Multimedia and Ubiquitous Engineering, 10(4), pp. 215 - 230.
Goodno, N. H., 2011. How public schools can constitutionally halt cyberbullying: A model cyberbullying policy
that considers first amendment, due process, and fourth amendment challenges. Wake Forest L. Rev., 46(1), p.
641.
Gordon, S., 2021. The Real-Life Effects of Cyberbullying on Children. [Online]
Available at: https://www.verywellfamily.com/what-are-the-effects-of-cyberbullying-460558
[Accessed 14 May 2022].
Gordon, S., 2022. What Is Cyberbullying? [Online]
Available at: https://www.verywellfamily.com/types-of-cyberbullying-460549
[Accessed 14 May 2022].
Hee, C. V. et al., 2018. Automatic detection of cyberbullying in social media text. Plos One, 13(10).
Hee, C. V. et al., 2015. Automatic detection and prevention of cyberbullying. s.l., International Conference on
Human and Social Analytics.
Hinduja, S. & Patchin, J. W., 2008. Cyberbullying: An Exploratory Analysis of Factors Related to Offending and
Victimization. Deviant Behavior, 29(2), pp. 129 - 156.
Hosseinmardi, H. et al., 2015. Detection of cyberbullying incidents on the Instagram social network. arXiv preprint
arXiv:1503.03909.
Huang, C. & Sushkov, M., 2016. InstaNet: Object Classification Applied to Instagram Image Streams, California:
Stanford Computer Science.
Ito, M. et al., 2008. Living and Learning with New Media: Summary of Findings from the Digital Youth Project,
Chicago: The MacArthur Foundation.
Kontostathis, A., Reynolds, K., Garron, A. & Edwards, L., 2013. Detecting cyberbullying: query terms and
techniques. New York, Proceedings of the 5th annual ACM Web science conference.
Kowalski, R. M., Limber, S. P. & Agatston, P. W., 2012. Cyberbullying: Bullying in the Digital Age. s.l.: John Wiley
& Sons.
Lee, C. S. & Ma, L., 2012. News sharing in social media: The effect of gratifications and prior experience.
Computers in Human Behavior, 28(2), pp. 331 - 339.
Lopez-Meneses, E., Vazquez-Cano, E., Gonzalez-Zamar, M.-D. & Abad-Segura, E., 2020. Socioeconomic Effects in
Cyberbullying: Global Research Trends in the Educational Context. International Journal of Environmental
Research and Public Health, 17(12), p. 4369.
Navarro, R., Yubero, S. & Larranaga, E., 2016. Cyberbullying Across the Globe. 3rd ed. Cuenca, Spain: Springer
International Publishing Switzerland.
News24, 2021. Parents share their top five digital concerns in local survey. [Online]
Available at: https://www.news24.com/parent/family/parenting/parents-share-their-top-five-digital-concernsin-local-survey-20210325
[Accessed 14 May 2022].
Nobata, C. et al., 2016. Abusive language detection in online user content. s.l., Proceedings of the 25th
International Conference on World Wide Web.
Patchin, J. W. & Hinduja, S., 2006. Bullies move beyond the schoolyard: A preliminary look at cyberbullying. Youth
Violence and Juvenile Justice, 4(2), pp. 148 - 169.
Pheto, B., 2021. More than half of SA's children have been cyberbullied, survey finds. [Online]
Available at: https://www.timeslive.co.za/news/south-africa/2021-03-10-more-than-half-of-sas-children-havebeen-cyberbullied-survey-finds/
[Accessed 14 May 2022].
Roux, R. l., Rycroft, A. & Orleyn, T., 2010. Harassment in the workplace - Laws, policies and processes. 1 ed.
Johannesburg: LexisNexis South Africa.
Sanchez, H. & Kumar, S., 2011. Twitter bullying detection. ser. NSDI, 12(2011), pp. 1 - 15.
Schmidt, A. & Wiegand, M., 2017. A survey on hate speech detection using natural language processing. Valencia,
Spain, Proceedings of the 5th International Workshop on Natural language processing for social media.
Sharma, H. K., Kshitiz, K. & Shailendra, 2018. NLP and Machine Learning Techniques for Detecting Insulting
Comments on Social Networking Platforms. Paris, International Conference on Advances in Computing and
Communication Engineering.
Smit, D., 2015. Cyberbullying in South African and American schools: a legal comparative study. Sabinet African
Journals, 35(2), pp. 1 - 11.
Soni, D. & Singh, V. K., 2018. See no evil, hear no evil: Audio-visual-textual cyberbullying detection. Proceedings
of the ACM on Human-Computer Interaction, Volume 2, pp. 1 - 26.
Stop Bullying.Gov, 2021. What Is Cyberbullying. [Online]
Available at: https://www.stopbullying.gov/cyberbullying/what-is-it
[Accessed 14 May 2022].
North, S., Bulfin, S. & Snyder, I., 2008. Digital Tastes: Social Class and Young People's Technology Use. 1 ed. London:
Routledge.
Weber, N. L. & Pelfrey, W. V., Jr., 2014. Cyberbullying: Causes, Consequences, and Coping Strategies. 1 ed.
United States of America: LFB Scholarly Publishing LLC.
Wet, C. D., 2011. The professional lives of teacher victims of workplace bullying: A narrative analysis. Perspectives
in Education, 29(4), pp. 66 - 76.
Yin, D. et al., 2009. Detection of harassment on web 2.0. Proceedings of the Content Analysis in the WEB, Volume
2, pp. 1-7.
Zhang, X. et al., 2016. Cyberbullying Detection with a Pronunciation Based Convolutional Neural Network. s.l., 15th
IEEE International Conference on Machine Learning and Applications, pp. 740 - 745.