A FRAMEWORK FOR AUTOMATED DETECTION OF
OFFENSIVE MESSAGES IN SOCIAL NETWORKS IN
KISWAHILI
EVERYJUSTUS BARONGO
MASTER OF SCIENCE IN COMPUTER SCIENCE
THE UNIVERSITY OF DODOMA
OCTOBER, 2017
A FRAMEWORK FOR AUTOMATED DETECTION OF
OFFENSIVE MESSAGES IN SOCIAL NETWORKS IN
KISWAHILI
By
Everyjustus Barongo
A Dissertation submitted in partial fulfillment of the requirements for the degree of
Master of Science in Computer Science of the University of Dodoma
The University of Dodoma
October, 2017
CERTIFICATION
The undersigned certify that they have read and hereby recommend for
acceptance by the University of Dodoma, this dissertation entitled, "A Framework
for Automated Detection of Offensive Messages in Social Networks in Kiswahili",
in partial fulfillment of the requirements for the award of the Master's Degree of
Science in Computer Science of the University of Dodoma.
………………………………………
PROF. LEONARD MSELLE
(SUPERVISOR)
……………………………………
DR. MAJUTO MANYILIZU
(SUPERVISOR)
Date……………………….…………
DECLARATION AND COPYRIGHT
I, Everyjustus Barongo, declare that this dissertation is my own original work, and
that it has not been presented and will not be presented to any other university or
institution, for a similar or any other degree award.
Signature ……………………………………
No part of this dissertation may be reproduced, stored in any retrieval system, or
transmitted in any form or by any means without prior written permission of the
author or the University of Dodoma.
ACKNOWLEDGEMENTS
First, I thank God the Almighty for keeping me alive and healthy. Secondly, I owe the
deepest gratitude to my supervisors, Prof. L. Mselle and Dr. M. Manyilizu, for their
support. Their guidance helped me throughout the preparation of this dissertation.
My sincere thanks also go to Mr. T. Tesha and Ms. Basilisa for sharing with me
important ideas and resources such as a sample stop-list file and technological background
(Jsoup). Their support is invaluable and highly appreciated.
Finally, I would like to extend my appreciation to my father, brothers and sisters for
the moral support and encouragement they gave me from the beginning to the end
of my master's studies.
DEDICATION
Dedicated to the memory of my mother Ester T. Katebe
ABSTRACT
The diffusion of information generated in Social Network Sites is the result of more
people being connected. Connected users chat and comment by posting content such as
images, videos and messages. Social networks have been, and continue to be, useful to
communities in that they bring relatives together, especially in sharing experiences and
feelings. Although social networks have been beneficial to users, some of the shared
messages and comments contain sexual and political harassment. This is particularly the
case in Kiswahili-speaking countries such as Tanzania. In most, if not all, Kiswahili social
network sites, offensive messages have been and are still publicly posted. These messages
harass, embarrass, and even assault users and to some extent lead to psychological effects.
This study proposes a framework for automating the detection of offensive messages in
social networks in Kiswahili settings by applying selected machine learning techniques.
Specifically, the study created a Kiswahili dataset containing sexually and politically
offensive messages and normal messages1. All of these messages were collected from
Facebook, YouTube and JamiiForum, and they were used to evaluate the performance of
the selected text classification algorithms. The collected messages were preprocessed
using the Bag-of-Words (BoW) model, Term Frequency-Inverse Document Frequency
(TF-IDF) and N-gram techniques to generate feature vectors. The experimental findings
using the generated feature vectors showed that the Random Forest classifier was capable
of correctly assigning a message to the correct class label with an accuracy of 95.0259%,
an F1-measure of 0.950 (95.0%) and a false positive rate of 2.8% when applied to the
three-category dataset. On the other hand, SVM-Linear showed better results when
applied to the two-category data. The study suggests that the REST API-based framework,
with the Random Forest classifier and the Kiswahili dataset, be deployed in real social
network sites to facilitate the real-time detection of offensive messages.
1 Clean messages which do not contain any form of offensive content
TABLE OF CONTENTS
CERTIFICATION ........................................................................................................ i
DECLARATION AND COPYRIGHT ...................................................................... ii
ACKNOWLEDGEMENTS ........................................................................................ iii
DEDICATION ............................................................................................................ iv
ABSTRACT ................................................................................................................. v
LIST OF TABLES ....................................................................................................... x
LIST OF FIGURES..................................................................................................... xi
LIST OF APPENDICES ............................................................................................ xii
LIST OF ABBREVIATIONS ................................................................................... xiii
CHAPTER ONE: INTRODUCTION ...................................................................... 1
1.0 Introduction ............................................................................................................ 1
1.1 Background of the Study........................................................................................ 1
1.1.1 Web 2.0 Technologies and Social Media ............................................................ 2
1.1.2 Statistical Usage of Social Networks Sites ......................................................... 3
1.2 Statement of the Problem ....................................................................................... 6
1.3 Objective of the Study ............................................................................................ 7
1.3.1 General Objective................................................................................................ 7
1.3.2 Specific Objectives.............................................................................................. 7
1.4 Research Questions ................................................................................................ 8
1.5 Significance of the Research .................................................................................. 8
1.5.1 To other Researchers and Academicians ............................................................ 8
1.5.2 To the OSNs Providers and Developers.............................................................. 8
1.5.3 To the Policy Makers and Law Enforcers ........................................................... 9
1.5.4 To the OSNs Administrators and Normal Users ................................................. 9
1.6 Scope of the Study ................................................................................................. 9
1.7 Limitations of the study ......................................................................................... 9
1.8 Organization of the Study .................................................................................... 10
CHAPTER TWO: LITERATURE REVIEW ....................................................... 11
2.0 Introduction .......................................................................................................... 11
2.1 Conceptual definitions ......................................................................................... 11
2.1.1 Offensive Language .......................................................................................... 11
2.1.2 Web Crawling and Data Extraction .................................................................. 12
2.1.3 Machine Learning ............................................................................................. 12
2.1.3.1 The Categories of Machine Learning Methods .............................................. 13
2.1.3.2 Supervised Learning ...................................................................................... 13
2.1.3.3 Unsupervised Learning .................................................................................. 14
2.1.3.4 Reinforcement Learning ................................................................................ 14
2.2 Text Classification Algorithms ............................................................................ 14
2.2.1 Naïve Bayes ...................................................................................................... 15
2.2.2 Support Vector Machine (SVM) ....................................................................... 16
2.2.3 Artificial Neural Networks- Multilayer Perceptrons (ANN-MLP) ................... 17
2.2.4 Random Forest .................................................................................................. 18
2.2.5 Decision Tree Classifier (J48) ........................................................................... 20
2.2.6 Baseline Classifiers ........................................................................................... 20
2.3 Model Evaluation ................................................................................................. 21
2.3.1 Evaluation Metrics ............................................................................................ 21
2.3.2 Cross-validation ................................................................................................ 23
2.4 Empirical Studies ................................................................................................. 23
2.5 Research Gap ....................................................................................................... 26
2.6 Conceptual Framework ........................................................................................ 27
2.7 Conclusion ........................................................................................................... 28
CHAPTER THREE: METHODOLOGY .............................................................. 29
3.0 Introduction .......................................................................................................... 29
3.1 Research Design ................................................................................................... 29
3.2 Research Setting ................................................................................................... 29
3.3 Research Approach .............................................................................................. 29
3.4 Data Collection Method and Tool ........................................................................ 30
3.4.1 Primary Data ..................................................................................................... 30
3.4.2 Secondary Data ................................................................................................. 30
3.4.3 Experiment Tool and Environment ................................................................... 30
3.4.4 Experimentation Steps ...................................................................................... 31
3.4.5 Data Preprocessing ............................................................................................ 31
3.4.6 Feature Extraction ............................................................................................. 32
3.4.7 Training the Model ............................................................................................ 32
3.4.8 Evaluating the Models....................................................................................... 32
3.5 Data Analysis ....................................................................................................... 33
3.6 Ethical Issues ........................................................................................................ 33
3.7 Reliability and Validity ........................................................................................ 34
3.8 Conclusion ........................................................................................................... 34
CHAPTER FOUR: RESULTS AND DISCUSSION ............................................ 35
4.0 Introduction .......................................................................................................... 35
4.1 Creating Kiswahili Dataset of offensive messages from social networks ........... 35
4.1.1 Data Extraction from Social Networks ............................................................. 35
4.1.2 Messages Annotation ........................................................................................ 36
4.1.3 Data Preprocessing ............................................................................................ 38
4.1.4 Feature Representation, Extraction and Selection ............................................ 39
4.2 Build and evaluate model by applying some machine learning algorithms......... 40
4.2.1 Baseline classifiers performance ....................................................................... 41
4.2.2 Detecting Offensive Messages .......................................................................... 42
4.2.3 Size of Training Sample .................................................................................... 43
4.2.4 Feature Representations .................................................................................... 44
4.2.5 Categories .......................................................................................................... 47
4.2.6 Time Taken to Train and Test model ................................................................ 49
4.3 Proposed framework for detecting offensive Kiswahili messages ...................... 51
4.3.1 Proposed Framework Architecture ................................................................... 51
4.3.2 Components details of the proposed framework ............................................... 51
4.3.3 Framework Properties ....................................................................................... 52
4.3.4 Implementation Consideration .......................................................................... 53
4.4 Conclusion ........................................................................................................... 54
CHAPTER FIVE: SUMMARY, CONCLUSION AND RECOMMENDATION ...... 55
5.0 Introduction .......................................................................................................... 55
5.1 Summary of the Study .......................................................................................... 55
5.2 Conclusion ........................................................................................................... 56
5.3 Recommendations ................................................................................................ 57
5.4 Area for Further Research .................................................................................... 58
REFERENCES ......................................................................................................... 60
APPENDICES .......................................................................................................... 65
LIST OF TABLES
Table 1. 1: Sample categories of social media ............................................................. 2
Table 4. 1: Message distribution ................................................................................ 35
Table 4. 2: Training dataset distribution category-wise ............................................. 37
Table 4. 3: Testing dataset distribution ...................................................................... 38
Table 4. 4: Baseline performance............................................................................... 41
Table 4. 5: The performance of Classifiers on varying data size ............................... 43
Table 4. 6: Classifiers performance based on Features Representation ..................... 45
Table 4. 7: Performance on Dataset with 3-categories............................................... 47
Table 4. 8: Performance for Dataset with 2-categories .............................................. 48
LIST OF FIGURES
Figure 1. 1: Social Networks Usage (2010-2020) ........................................................ 3
Figure 2. 1: Support Vector Machine Margin ............................................................ 16
Figure 2. 2: Artificial Neural Network-Multi-Layer Perceptron ................................. 18
Figure 2. 3: Pseudo-code for Random Forests ............................................................ 19
Figure 2. 4: Study Conceptual Framework ................................................................ 27
Figure 3. 1: Text Classification Framework .............................................................. 31
Figure 4. 1: Distribution of Messages in Training Dataset ........................................ 37
Figure 4. 2: The most frequent words in dataset ........................................................ 39
Figure 4. 3: Performance Evaluation on Accuracy .................................................... 42
Figure 4. 4: Classifiers Learning Rate........................................................................ 44
Figure 4. 5: Classifiers performance based on Features Representation.................... 45
Figure 4. 6: Performance comparisons on n-gram feature ........................................ 46
Figure 4. 7: Performance for General purpose and Categorical models .................... 48
Figure 4. 8: Comparison of False Positive Rate ......................................................... 49
Figure 4. 9: Time taken to train and test model ......................................................... 50
Figure 4. 10: Study Proposed Framework .................................................................. 51
LIST OF APPENDICES
Appendix 1: Sample arff Message file ....................................................................... 65
Appendix 2: Sample list of stop words ...................................................................... 66
Appendix 3: Corrections Report as per External Supervisors Observation ............... 71
LIST OF ABBREVIATIONS
AI         Artificial Intelligence
ANN-MLP    Artificial Neural Network - Multi-Layer Perceptron
ANN        Artificial Neural Network
API        Application Programming Interface
BOW        Bag-of-Words
CIVE       College of Informatics and Virtual Education
DMT        Data Mining Technique
HTTP       Hypertext Transfer Protocol
IBK        Instance-Based K-nearest neighbor
JSON       JavaScript Object Notation
ML         Machine Learning
MLA        Machine Learning Algorithms
MNOs       Mobile Network Operators
NLP        Natural Language Processing
OSNs       Online Social Networks
POS        Part-of-Speech
REST       Representational State Transfer
SA         Sentiment Analysis
SL         Supervised Learning
SVM        Support Vector Machine
TC         Text Categorization
TCRA       Tanzania Communications Regulatory Authority
TF-IDF     Term Frequency-Inverse Document Frequency
TOS        Terms of Service
WEKA       Waikato Environment for Knowledge Analysis
CHAPTER ONE
INTRODUCTION
1.0 Introduction
This chapter discusses the key concepts that define the disciplinary subject matter
concerning the study whose design and findings are presented in this dissertation. A
critical review of some of the key ideas that were instrumental in the choice of the
research topic is presented. Background information to the research topic is followed
by the definition of the research problem, the study objectives as well as the
research questions that guided the study. The chapter is concluded by discussing the
significance of the study, its scope, limitations and the organization of the research
report.
1.1 Background of the Study
Online Social Networks (OSNs) sites are computer-driven social networks which
provide users with the flexibility to create online communities and to
share information, ideas and personal messages.
such as MySpace, Facebook, Twitter, YouTube, Google+, Cyworld and Bebo, have
attracted millions of users around the world. Many of these global users have
integrated these sites into their daily practices such as business (Ellison & Boyd,
2008). Among other factors, the prosperity of online social networks is due to Web
2.0 technologies, which are social in nature and provide
users with flexible collaboration. In addition, Web 2.0 technologies have resulted in
different varieties of social media (Vanhove et al., 2013).
1.1.1 Web 2.0 Technologies and Social Media
The Web 2.0 technologies are categorized depending on their main purposes.
According to Lutu (2015), Web 2.0 technologies enable users to create and share
social media. Table 1.1 provides a summary of some of the well-known categories of
social media.
Table 1. 1: Sample categories of social media (Lutu, 2015)

Media category | Purpose | Example of service
Blogs (web-logs) | Facilitate the expression of personal opinions by the public | Michuzi blog, JamiiForum
Micro blogs | Facilitate the expression of personal opinions about what is happening right now | Twitter
Social networks | Professional or social networking sites which facilitate meeting people and sharing content | Facebook, LinkedIn, Twitter
Collaborating | Collaborative reference works (e.g. Wikipedia) that are built using wiki-style software tools | Wikipedia
Media sharing | Facilitate the sharing of digital media, e.g. videos | YouTube
The advancement of Web 2.0 technologies and social media, and the reasons for
their widespread use, have been attributed to the rapid increase in mobile devices such as
smartphones, tablets, laptops and desktop computers, and to the availability of
affordable internet facilities provided by the Mobile Network Operators (MNOs). In
addition, the social messengers or chat apps installed on mobile phones and
computers have accelerated their growth. Furthermore, social networking sites have
gained popularity because they provide users with the opportunity to meet new
people and join groups of their own interest. Moreover, these sites are free to access
and do not require users to have design or publishing skills (Asur & Huberman, 2010;
Hee et al., 2015).
1.1.2 Statistical Usage of Social Networks Sites
A large number of users are connected on social network sites. According to
World Newsmedia Network (2015), Facebook is the most popular, followed by
WhatsApp, WeChat and Twitter. WhatsApp reportedly averages 600 million users per
month, while WeChat has about 500 million users and Twitter averages 300 million
users per month. Furthermore, according to Statista Inc (2017), social media penetration
worldwide is increasing, with 68.3% of internet users being social media users. Social
networking is becoming one of the most popular online activities, with a high rate of
user engagement that is expected to increase as indicated in Figure 1.1.
Figure 1. 1: Social Networks Usage (2010-2020) (Statista Inc, 2017)
A huge amount of information is generated as more people are connected on social
networking sites. While social media messages are important for enhancing
communication and business, they can also offend others. The explosion of
online social media has given rise to concerns about new forms of offensive
messages. Offensive messages contain violence, aggression, and volumes of
inappropriate content, including content likely to be assaulting, annoying, or
harassing to a recipient. Because of the aforementioned phenomenon, a new field of
study called Sentiment Analysis (SA) has been introduced. This field, also called
Opinion Mining, analyzes people's opinions, sentiments, evaluations, appraisals,
attitudes, and emotions towards entities such as products, services, organizations,
individuals, issues, events, topics, and their attributes (Laskari & Sanampudi, 2016).
Because of these emotions, opinions and attitudes towards entities, automated
offensive message detection techniques form a part of SA (Sood et al., 2012).
SA involves the application of Machine Learning (ML) techniques, Data Mining
Techniques (DMT), Natural Language Processing (NLP), Computational Linguistics
(CL) and Mathematics to build models which extract insights or knowledge from
social media and categorize data into different classes. Machine learning is the field
of Artificial Intelligence (AI) that provides computers with the ability to learn
without being explicitly programmed. Thus, it focuses on developing learning
algorithms that perform learning tasks automatically and exhibit intelligent
behaviors without human intervention (Muhammad & Yan, 2015). To build accurate
models, machine learning is categorized as supervised, unsupervised or reinforcement
learning, with each category applied depending on the task to be solved and the type of
data to be processed (Dasgupta & Nath, 2016).
The same benefits of Web 2.0 technologies are observed in developing countries,
particularly in Tanzania, where people are connected to these different social
networks. Their penetration and adoption rates have been attributed to the relative
availability and affordability of smartphones, laptops and desktop computers, as well
as to the robust internet facilities offered by Mobile Network Operators (MNOs)
(Msavange, 2015; TCRA, 2010). In addition, social network sites are language
independent, and Tanzanians use Kiswahili for social networking.
Despite the benefits observed in social networks, some people misuse the medium
by promoting offensive and hateful language such as "Mimi namchukia huyu baba
namungu anisamehe nitahira na nimwehu mpaka anaowachunga wote matahira
nawehu". The dimensions of these misconducts range from hate speech and
cyber-bullying to cyber-stalking, all targeting specific group characteristics such as race,
ethnic origin, gender, religion and sexual innuendo (Reynolds, 2012). Also, some
users send suspicious messages, insult or provoke other people. All of these
behaviors are contrary to social network Terms of Service (TOS) (LegalAid, 2010)
and Tanzania's Cybercrime Act of 2015 (Tanzanian government, 2015).
To some extent, it may be argued that the initiative made by the Tanzanian
government through the passing of the Cybercrime Act of 2015 has created some
basis for rescuing the situation, whereby people may now be sued or fined if they
behave abusively in OSNs. However, social networks' data are of high volume,
velocity and variety. For administrators or legal authorities, manually reviewing these
online messages and other posts to detect offensive content is an extremely
labor-intensive and time-consuming endeavor, and is neither suitable nor scalable in
reality (Chen, 2012). Moreover, people's messages might be misclassified by
placing them in wrong categories because of personal perceptions, interests and/or
hatred, which may result in unfair pain to them. Lastly, human judgments on
the same problem may produce a number of discrepancies because of different
sensitivities, moods, backgrounds and other subjective conditions among different
people (Razavi et al., 2010).
1.2 Statement of the Problem
Manual classification of the offensive messages may work well for small datasets.
This means that when dealing with a small number of groups of individuals, the
offensive messages are often few and can be eliminated easily. However, when the
recipient receives large volumes of data, it becomes difficult to detect offensive
messages; hence, the application of machine learning and data mining techniques
becomes crucial (Liu, 2012).
To automate the detection of offensive messages in social networks, several
approaches and techniques have been proposed. Bretschneider and Peters (2017) and
Papegnies et al. (2017) implemented automated approaches to detect offensive
language statements towards immigrants/foreigners and within online communities,
respectively. The proposed approaches were based on German- and French-language
social media datasets. Furthermore, a dictionary-based approach (Hilte et al.,
2016) was suggested and implemented to detect racism in Dutch social media.
Despite the approaches already suggested for offensive language detection, most
of the studies show that existing approaches are language dependent. Their datasets
were prepared from specific language settings such as English, German, French and
Dutch. Such approaches raise the need to conduct a study based on the Kiswahili
language, given the fact that it is a morphologically, syntactically and
semantically complex language with rapidly emerging words which demand special
treatment (Massamba et al., 1999; Tesha, 2015). In addition, Kiswahili is
a national language in Tanzania and a lingua franca in much of Eastern and Central
Africa (Mulokozi, 2000; Hinnebusch, 2003). Moreover, it is hard to find a
framework that has focused on the automated detection and discrimination of offensive
messages in social networks in the Kiswahili language.
Therefore, based on the aforementioned shortcomings, there is justification for
undertaking a study in this subject field. The aim of this research, therefore, was to
propose a framework for detecting offensive Kiswahili messages on social network
sites by applying machine learning models. First, the automated process will help to
eliminate or mark offensive messages from a list of messages before they are shared.
Second, the framework will increase trustworthiness and user experience, and promote
the wider adoption of social networking sites in Kiswahili-speaking countries.
1.3 Objective of the Study
1.3.1 General Objective
The purpose of this study was to propose a framework for automating the detection
of offensive messages in social networks under Kiswahili settings by applying
appropriate Machine Learning Algorithms (MLA).
1.3.2 Specific Objectives
i. To create a Kiswahili dataset of offensive messages from social networks for
generating feature vectors.
ii. To build and evaluate models by applying appropriate machine learning
techniques to the Kiswahili dataset.
iii. To propose an architectural framework that can be adopted in social network
sites for detecting offensive language in a Kiswahili setting.
1.4 Research Questions
i. How can a Kiswahili dataset of offensive messages from social networks be
created for generating feature vectors?
ii. How can machine learning algorithms be applied and evaluated in building
models which can detect improper messages in Kiswahili?
iii. How can a framework for detecting offensive Kiswahili messages in social
network sites be designed?
1.5 Significance of the Research
The study sought to contribute to the growing literature on the automated detection and
barring of offensive messages, which have become of increasing concern to most users
of online social media. The findings of the study are intended to benefit various groups
who are either directly or indirectly concerned with the proper usage of online social
networks (OSNs).
1.5.1 To other Researchers and Academicians
The study aims to contribute to the body of knowledge on how Machine
Learning Algorithms can be applied to detect improper Kiswahili messages in
social networks. Accurate MLAs applied to the Kiswahili language could also be
used in other tasks, such as business applications for Tanzanians.
1.5.2 To the OSNs Providers and Developers
The study findings will help social network developers and providers to adopt
and integrate the developed framework into their applications so that they can filter
improper posts automatically before they are shared with different users. Through
objective one, the study will also help to keep digital evidence for forensic
investigations in case it is required.
1.5.3 To the Policy Makers and Law Enforcers
Policy makers and law enforcers such as TCRA may adopt and use the framework
to develop a tool for detecting online offenders and taking appropriate disciplinary
measures against them. The model will help to eliminate human intervention and
reduce the labor-intensive methods of detecting social media messages that
promote hatred, while also saving time and thus resulting in more balanced
judgments. The framework will also help to keep digital evidence for forensic
investigations.
1.5.4 To the OSNs Administrators and Normal Users
The model will help normal users to feel free and more comfortable while being
online and joining groups, as there will be no more insults and abusive language in
social networks.
1.6 Scope of the Study
The study is confined to verbal Kiswahili messages posted on the Facebook,
JamiiForum and YouTube social networks only. The choice of these social networks
was due to the availability of public pages for easy extraction of messages.
1.7 Limitations of the study
i. Limited access to information: some messages were not accessible due to
confidentiality and the privileges required to join groups, and not all respondents
were cooperative enough to provide some of the information, which led to
difficulties in collecting data on time. To handle this issue, only publicly
available pages were considered.
ii. Time constraints: the time allocated was very short, and the study needed more
time to be more effective and efficient in collecting more messages from
social networks. The study handled this challenge by increasing the number of
hours spent collecting messages from the three social networks.
1.8 Organization of the Study
The remaining part of the study report is organized as follows. Chapter two
discusses the literature that was reviewed during the study. Chapter three presents
the methodology that was employed for conducting the study, including outline of
the data collection, experiment setup and analysis methods to accomplish each
specific objective, ethical issues that were considered during the entire study and the
validity and reliability of the data. Chapter four presents and discusses the
experimental findings obtained in order to address the research questions. Chapter five
presents the conclusion, recommendations and areas for further investigation.
CHAPTER TWO
LITERATURE REVIEW
2.0 Introduction
This chapter summarizes the various published materials that were
consulted in order to understand and investigate the research problem. The literature
review facilitated the researcher's efforts in acquiring a conceptual
understanding and definitions of key terms and text classification algorithms, as well as
acquaintance with empirical studies that focus on frameworks and detection
techniques for offensive messages in social network sites. It also presents the research
gap, the conceptual framework and concluding remarks.
2.1 Conceptual definitions
2.1.1 Offensive Language
Offensive language has been defined by Razavi et al. (2010) as phrases which
mock or insult somebody or a group of people (attacks such as aggression against
some culture, subgroup of the society, race or ideology in a tirade). Moreover, they
itemized several categories of offensive language.
Taunts: These phrases try to condemn or ridicule the reader in general.
References to handicaps: These phrases attack the reader using his/her shortcomings
(e.g., "IQ challenged").
Squalid language: These phrases target sexual fetishes or physical filth of the reader.
Slurs: These phrases try to attack a culture or ethnicity in some way.
Homophobia: These phrases usually express anti-homosexual sentiments.
Racism: These are phrases that intimidate the race or ethnicity of individuals.
Extremism: These phrases target some religion or ideologies.
There are also some other kinds of flames, in which the flamer abuses or
embarrasses the reader (not an attack) using some unusual words/phrases, such as:
Crude language: expressions that embarrass people, mostly because they refer to
sexual matters or excrement.
Disguise: expressions whose meaning or pronunciation is the same as another,
more offensive term.
Provocative language: expressions that may cause anger or violence.
Unrefined language: expressions that lack polite manners, where the speaker is
harsh and rude.
Although offensive language is still an ambiguous term, this study adopted and
referred to the above definitions, focusing more specifically on squalid language
(sexual) and politics-related offensive messages. Moreover, the study used the mentioned
areas to collect and analyze data from social network sites so as to create the Kiswahili
dataset.
2.1.2 Web Crawling and Data Extraction
Web crawling involves the automated collection of information from the web. Web
crawling is performed by specialized web crawlers. The crawlers collect and update
specific web content in order to perform web search and indexing of content
(Barcaroli et al., 2014).
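To make the idea concrete, a minimal crawler for a public page could be sketched with the Jsoup library mentioned in the acknowledgements; the URL and CSS selector below are hypothetical placeholders, not the actual sources or page structures used in this study.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SimpleCrawler {
    public static void main(String[] args) throws Exception {
        // Hypothetical public page; replace with the actual page to be crawled.
        String url = "https://example.com/public-comments";

        // Fetch and parse the HTML document.
        Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (research crawler)")
                .timeout(10000)
                .get();

        // Hypothetical CSS selector for comment text; depends on the page layout.
        for (Element comment : doc.select("div.comment p")) {
            System.out.println(comment.text());
        }
    }
}

Extracted messages would then be stored (for example, in an ARFF file) for annotation and preprocessing.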
2.1.3 Machine Learning
Mitchell (1997) defined machine learning as a computer program which learns from
experience E with respect to some task T and some performance measure P, if its
performance on T, as measured by P, improves with experience E. In point of fact,
this was similarly defined by Alpaydın (2010) as the process of searching for a good
function F : I → O, where I is the set of possible inputs and O the set of possible
outputs. Therefore, machine learning involves devising learning models which
can automatically adjust to external data or the environment.
2.1.3.1 The Categories of Machine Learning Methods
According to Alpaydın (2010), there are three typical categories of machine learning,
namely supervised learning, unsupervised learning and reinforcement learning:
2.1.3.2 Supervised Learning
In Supervised Learning (SL) the algorithm observes some example input-output
pairs and learns a function that maps from input to output. In this case, the training
set given to the algorithm is a labeled dataset, and the learning process is to
find the relationships between the feature set and the label set. The resulting
relationship is the estimated function F : X → Y, learned from the given labeled
training examples (x, y), and is known as a model (Kotsiantis, 2007). The resulting
classifier F is then used to assign class labels to the testing instances where the
values of the predictor features are known but the value of the class label is
unknown. Thus, if each feature vector x corresponds to a label
y ∈ L, where L = {l1, l2, ..., lc} (c usually ranges from 2 to a hundred), the learning
problem is denoted as classification. On the other hand, if each feature vector x
corresponds to a real value y ∈ R, the learning problem is defined as a regression
problem (Chao, 2011). The knowledge extracted from supervised learning is often
utilized and applied in prediction tasks.
2.1.3.3 Unsupervised Learning
The training set given to an unsupervised learning algorithm is an unlabeled
dataset. Given the feature vectors of a dataset D = {x0, x1, ..., xn}, the aim is to look
for a model F which gives some useful insight into the data D. The most common
unsupervised learning task is clustering: detecting potentially useful clusters of input
examples. Other examples include probability density estimation, finding
associations among features, and dimensionality reduction (Xu & Wunsch, 2005). In
general, an unsupervised algorithm may simultaneously learn more than one
property existing in the dataset, and the results from unsupervised learning could
be further used for supervised learning (Nilsson, 2005).
2.1.3.4 Reinforcement Learning
Reinforcement learning (RL) uses a scalar reward signal to evaluate input-output
pairs and hence discover, through trial and error with its environment, optimal
outputs for each input (Sathya & Abraham, 2013).
2.2 Text Classification Algorithms
In SL, one of the major outputs is a deduced function that can assign
instances to different classes, a task known as classification or categorization. One of the
commonly known tasks in classification is text classification. According to
Kotsiantis (2007), Text Categorization (TC) is used to automatically assign
previously unseen documents to a predefined set of categories. To accomplish the
TC task, several SL algorithms for text categorization have been suggested, applied
and evaluated accordingly.
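As an illustration of such a pipeline, the sketch below uses the WEKA Java API to turn labeled text messages into Bag-of-Words/TF-IDF feature vectors with word n-grams and to train one classifier on them; the ARFF file name and parameter values are assumptions for illustration, not the exact configuration of the experiments reported later.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;
import weka.classifiers.trees.RandomForest;

public class TextClassificationSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF file with a string attribute (message) and a class label.
        Instances raw = DataSource.read("kiswahili-messages.arff");
        raw.setClassIndex(raw.numAttributes() - 1);

        // Bag-of-Words features weighted by TF-IDF, using word 1- to 3-grams.
        StringToWordVector bow = new StringToWordVector();
        bow.setTFTransform(true);
        bow.setIDFTransform(true);
        bow.setLowerCaseTokens(true);
        NGramTokenizer tokenizer = new NGramTokenizer();
        tokenizer.setNGramMinSize(1);
        tokenizer.setNGramMaxSize(3);
        bow.setTokenizer(tokenizer);
        bow.setInputFormat(raw);
        Instances vectors = Filter.useFilter(raw, bow);

        // Train one of the evaluated classifiers on the generated feature vectors.
        RandomForest rf = new RandomForest();
        rf.buildClassifier(vectors);
        System.out.println(rf);
    }
}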
2.2.1 Naïve Bayes
Naïve Bayes is a simple classifier based on Bayes' theorem. It is a statistical
classifier which performs probabilistic prediction. In reality, the classifier works
under the assumption that the attributes are conditionally independent. Equation
2.1 shows the typical Naïve Bayes formula from the mathematical point of view
(Seif, 2016).

P(Ci | X) = P(X | Ci) P(Ci) / P(X) ................................................ (2.1)
Using equation (2.1), the classifier, or simple Bayesian classifier, works as follows:
1) Let D be a training set of tuples and their associated class labels. Each tuple is
represented by an n-dimensional attribute vector X = (x1, x2, ..., xn),
depicting n measurements made on the tuple from n attributes, respectively,
A1, A2, ..., An.
2) Suppose that there are m classes C1, C2, ..., Cm. Given a tuple X, the classifier
will predict that X belongs to the class having the highest posterior probability,
conditioned on X. That is, the Naïve Bayes classifier predicts that tuple X
belongs to the class Ci if and only if P(Ci | X) > P(Cj | X) for 1 ≤ j ≤ m, j ≠ i.
Thus we maximize P(Ci | X). The class Ci for which P(Ci | X) is maximized is
called the maximum posteriori hypothesis.
3) From equation (2.1), as P(X) is constant for all classes, only P(X | Ci) P(Ci) needs
to be maximized. The classifier then predicts that data item X belongs to class Ci
if and only if this quantity is the highest compared to the other class labels.
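As a small numerical illustration of step 3, the sketch below compares P(X | Ci) P(Ci) for a two-class toy message; all priors and word likelihoods are made-up values used only to show how the class with the largest unnormalized posterior is selected under the conditional-independence assumption.

public class NaiveBayesToyExample {
    public static void main(String[] args) {
        // Hypothetical priors P(C) estimated from a labeled training set.
        double priorOffensive = 0.4;
        double priorNormal = 0.6;

        // Hypothetical per-class word likelihoods P(w | C) for the two words
        // appearing in a short test message.
        double[] likelihoodOffensive = {0.08, 0.05};
        double[] likelihoodNormal = {0.01, 0.02};

        // Score each class with P(X | C) * P(C), multiplying word likelihoods
        // under the naive conditional-independence assumption.
        double scoreOffensive = priorOffensive;
        double scoreNormal = priorNormal;
        for (int i = 0; i < likelihoodOffensive.length; i++) {
            scoreOffensive *= likelihoodOffensive[i];
            scoreNormal *= likelihoodNormal[i];
        }

        // The predicted label is the class with the larger unnormalized posterior.
        String prediction = scoreOffensive > scoreNormal ? "offensive" : "normal";
        System.out.printf("offensive: %.6f, normal: %.6f -> %s%n",
                scoreOffensive, scoreNormal, prediction);
    }
}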
2.2.2 Support Vector Machine (SVM)
The SVM uses the concept of margin. It constructs a maximum-margin separator,
which is in essence a decision boundary with the largest possible distance to the
example points. Given training examples xi and target values yi ∈ {-1, 1},
SVM searches for a separating hyperplane which separates the positive and negative
examples from each other with maximal margin (the optimal hyperplane) (Christopher,
2006; Boser, Guyon and Vapnik, 1992).
Figure 2. 1: Support Vector Machine Margin (Chang and Lin, 2011)
If the training data is linearly separable, then a pair (w, b) exists such that

w^T xi + b ≥ +1, for all xi ∈ P
w^T xi + b ≤ -1, for all xi ∈ N .................................................. (2.2)
SVMs (Boser, Guyon and Vapnik, 1992) find a maximum separating hyperplane
between the examples from the two classes. K(xi, xj) = φ(xi)^T φ(xj) is called a
kernel function. Currently, there are several known kernel functions that can be
applied for solving various problems:

i. Linear: K(xi, xj) = xi^T xj
ii. Polynomial: K(xi, xj) = (γ xi^T xj + r)^d, γ > 0
iii. Gaussian radial basis function (RBF): K(xi, xj) = exp(-γ ||xi - xj||^2), for γ > 0
iv. Sigmoid: K(xi, xj) = tanh(γ xi^T xj + r)

In the kernel functions above, γ, r and d are kernel parameters that need to be tuned
(Chang and Lin, 2011).
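In WEKA, these kernels can be selected on the SMO implementation of SVM; the sketch below shows a linear configuration (a polynomial kernel with exponent 1) and an RBF configuration, with illustrative parameter values rather than the tuned settings of this study.

import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.PolyKernel;
import weka.classifiers.functions.supportVector.RBFKernel;

public class SvmKernelSketch {
    public static void main(String[] args) {
        // Linear SVM: a polynomial kernel with exponent 1 (WEKA's default kernel).
        SMO linearSvm = new SMO();
        PolyKernel linear = new PolyKernel();
        linear.setExponent(1.0);
        linearSvm.setKernel(linear);

        // RBF SVM: gamma is a tunable kernel parameter (the value here is illustrative).
        SMO rbfSvm = new SMO();
        RBFKernel rbf = new RBFKernel();
        rbf.setGamma(0.01);
        rbfSvm.setKernel(rbf);

        // Both classifiers would then be trained with buildClassifier(Instances)
        // on the TF-IDF feature vectors produced during preprocessing.
    }
}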
2.2.3 Artificial Neural Networks- Multilayer Perceptron (ANN-MLP)
The ANN-MLP is one of the most popular ANNs, consisting of multiple layers of
computational units, usually interconnected in a feed-forward way, representing a
nonlinear mapping between the input vector and the output vector (Tesha & Baraka,
2015). Each neuron in one layer has direct connections to the neurons of the
subsequent layer, as depicted in Figure 2.2 (Christopher, 2006). The MLP is most
suitable for approximating a classification function, and consists of the input layer, one
or more hidden layers of processing elements, and the output layer of processing
elements (Osmanbegović & Suljić, 2012).
Figure 2. 2: Artificial Neural Network-Multi-Layer Perceptron (Christopher, 2006)
The Multi-Layer Perceptron (MLP) is a supervised learning algorithm which uses
back-propagation to learn a function from given examples. Kumari and Godara
(2011), for their part, argue that the use of back-propagation to reduce classification
error by optimizing the weights makes the MLP the most commonly used and
well-studied ANN architecture, capable of learning arbitrarily complex nonlinear
functions to arbitrary accuracy levels. Based on Figure 2.2, Danjuma and Osofisan
(2015) define an ANN in terms of: a) the interconnection pattern between different
layers of neurons; b) the learning process for updating the weights of the
interconnections; and c) the activation function that converts a neuron's weighted
input to its output activation.
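A minimal WEKA configuration of such a network is sketched below; the hidden-layer size, learning rate, momentum and number of training epochs are illustrative assumptions, not the settings used in the experiments.

import weka.classifiers.functions.MultilayerPerceptron;

public class MlpSketch {
    public static void main(String[] args) {
        MultilayerPerceptron mlp = new MultilayerPerceptron();

        // One hidden layer with 20 neurons (illustrative value).
        mlp.setHiddenLayers("20");

        // Back-propagation parameters: learning rate, momentum and epochs.
        mlp.setLearningRate(0.3);
        mlp.setMomentum(0.2);
        mlp.setTrainingTime(500);

        // The network would then be trained with buildClassifier(Instances)
        // on the same feature vectors used by the other classifiers.
    }
}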
2.2.4 Random Forest
Random Forest (RF) was originally developed by Leo Breiman (2001). It
combines two machine learning techniques: bagging and random feature selection.
According to Biau (2012), RF is an ensemble learning classifier
consisting of a group of un-pruned or weak decision trees built from random
samples of the training data. Its premise is based on the concept of
building many small, weak decision trees in parallel and then combining the trees to
form a single, strong learner by aggregating (majority vote for classification or
averaging for regression) the predictions of the ensemble (Ali et al., 2012).
The algorithm works as follows: for each tree in the forest, select a bootstrap sample
S* of size n from the training data and then learn a decision tree using a modified
decision-tree learning algorithm. At each node of the tree, a subset of the
features v is randomly selected from the p features. The node then splits on the best
feature in v rather than in p. Finally, a random forest with M decision trees is formed by
repeating the above procedure M times, and the random forest is then used to predict
test data as depicted in Figure 2.3. During testing, each test point is simultaneously
pushed through all trees (starting at the root) until it reaches the corresponding leaves,
and the classification is decided by all the votes (Criminisi et al., 2012).
Figure 2. 3: Pseudo-code for Random Forests (Zhang & Haghani, 2015)
According to Jia et al. (2013), Random Forest has two most significant parameters:
one is the number of features used for splitting each node of a decision tree (m ≤ p,
where p is the total number of features), and the other is the number of trees (M).
Since Random Forest uses bagging and restricts each split-test to a small random
sample of features, it decreases the correlation between trees in the ensemble and
helps to learn more decision trees in a given amount of time. One obvious property of
the algorithm is that it does not produce an over-fitting phenomenon when
characteristic parameters of higher dimension are used.
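In WEKA these two parameters map to the -K option (number of randomly selected features per split, m) and the -I option (number of trees, M) of the RandomForest classifier; the values in the sketch below are illustrative, not the tuned settings of this study.

import weka.classifiers.trees.RandomForest;

public class RandomForestSketch {
    public static void main(String[] args) throws Exception {
        RandomForest rf = new RandomForest();

        // -I: number of trees M in the forest; -K: number of features m tried at
        // each split (0 lets WEKA pick a default based on the feature count).
        rf.setOptions(weka.core.Utils.splitOptions("-I 100 -K 0"));

        // The forest would then be trained with buildClassifier(Instances) and
        // its votes aggregated automatically when classifying new messages.
    }
}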
2.2.5 Decision Tree Classifier (J48)
A decision tree classifier builds a decision tree based on if-then rules. According to
Kumari et al. (2011), a decision tree separates the training dataset recursively into small
branches to construct a tree for the purpose of improving prediction accuracy. This
step is repeated at each leaf node until the complete tree is constructed. The tree uses
entropy to determine the similarity of the samples to be split at the same node, and
information gain, which determines the smallest entropy value (Criminisi et al., 2012).
2.2.6 Baseline Classifiers
A baseline classifier gives the baseline accuracy on the dataset, which should always be
checked before choosing more sophisticated classifiers. It is a method that uses heuristics,
simple summary statistics, randomness, or machine learning to create predictions for
a dataset. The resulting metrics then become the reference against which other
machine learning algorithms are compared.
ZeroR Classifier
ZeroR, or Zero Rule, is a classification method which depends only on the target and
ignores all predictors. ZeroR simply predicts the majority class and is useful for
determining a baseline performance as a benchmark for other classification methods
(Nasa, 2012).
OneR Classifier
OneR, short for "One Rule", is a classification algorithm that generates one rule for
each predictor in the data and selects the rule with the smallest total error as its one
rule. To create a rule for a predictor, a frequency table is constructed for each
predictor against the target (Nasa, 2012). The algorithm also serves as a
baseline for evaluating other classifiers on the same dataset.
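A hedged sketch of how these baselines can be trained and evaluated on a held-out test set with the WEKA API follows; the ARFF file names are placeholders rather than the actual dataset files of this study.

import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BaselineSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical preprocessed training and test sets (class attribute last).
        Instances train = DataSource.read("train-vectors.arff");
        Instances test = DataSource.read("test-vectors.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // ZeroR: always predicts the majority class of the training set.
        ZeroR zeroR = new ZeroR();
        zeroR.buildClassifier(train);

        // OneR: builds a single rule from the best single predictor.
        OneR oneR = new OneR();
        oneR.buildClassifier(train);

        // Baseline accuracies that stronger classifiers must beat.
        Evaluation zeroEval = new Evaluation(train);
        zeroEval.evaluateModel(zeroR, test);
        Evaluation oneEval = new Evaluation(train);
        oneEval.evaluateModel(oneR, test);
        System.out.printf("ZeroR: %.2f%%, OneR: %.2f%%%n",
                zeroEval.pctCorrect(), oneEval.pctCorrect());
    }
}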
2.3 Model Evaluation
2.3.1 Evaluation Metrics
The evaluation of classification methods is based on the number of correctly
classified messages against falsely classified messages (Ian & Frank, 2005). There
are four different situations that occur when a new message is classified:
True Positive (TP): The classifier correctly indicates the message as offensive. In
other words, the message is rightly classified as sexual or politics.
True Negative (TN): The classifier correctly indicates the message is not offensive.
In other words, the message is rightly classified as normal.
False Positive (FP): The classifier wrongly predicts the message as offensive (sexual
or politics) when it is actually a normal message.
False Negative (FN): The classifier wrongly indicates the message is normal when it
is actually offensive.
The simplest manner to evaluate the performance of a classification system is to
analyze its accuracy. Accuracy shows the general correctness of a classifier and is
calculated as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN) ............................................. (2.3)
However, a classification system that automatically labels all samples as normal
messages would yield very high accuracy results when dealing with a dataset that
contains only a very small number of positive samples (sexual and politics)
(Vandersmissen, 2012). Accuracy may not work well in an environment where one
category dominates the others. Therefore, the metrics precision, which is the proportion
of predicted positives which are actual positives, and recall, the proportion of actual
positives which are predicted positive, are used, as shown in equations 2.4 and 2.5
respectively.

Precision = TP / (TP + FP) .................................................................. (2.4)

Recall = TP / (TP + FN) ..................................................................... (2.5)

Furthermore, analyzing precision and recall provides a better understanding of the
performance of offensive message detection. The evaluation also involves the
F1-measure, which is an evenly weighted combination of both precision and recall:

F1 Measure = 2 x (Precision x Recall) / (Precision + Recall) ............................ (2.6)
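The sketch below computes the four metrics of equations 2.3 to 2.6 from hypothetical confusion-matrix counts; the counts are made up for illustration and do not come from the experiments.

public class EvaluationMetrics {
    public static void main(String[] args) {
        // Hypothetical confusion-matrix counts for the offensive class.
        double tp = 430, tn = 520, fp = 15, fn = 35;

        double accuracy = (tp + tn) / (tp + tn + fp + fn);         // equation 2.3
        double precision = tp / (tp + fp);                          // equation 2.4
        double recall = tp / (tp + fn);                             // equation 2.5
        double f1 = 2 * precision * recall / (precision + recall);  // equation 2.6

        System.out.printf("accuracy=%.3f precision=%.3f recall=%.3f f1=%.3f%n",
                accuracy, precision, recall, f1);
    }
}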
2.3.2 Cross-validation
Apart from testing data on a separate test set, it is also possible to extract a small part
of the training set to use as a validation set and repeat the process several times. This
method is called cross-validation or stratified cross-validation (Ian et al., 2005). In
this technique the training dataset is divided into k fixed parts called folds; the
classifier is then trained on k-1 parts and tested on the remaining part.
The procedure is repeated k times so that, in the end, every instance has been
used exactly once for testing. The error rates from the different iterations are averaged
to yield an overall error rate. This technique provides an accurate view of the
performance of a classifier.
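A minimal sketch of stratified 10-fold cross-validation with the WEKA Evaluation class is given below; the dataset file name is a placeholder and the random seed is arbitrary.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical preprocessed dataset with the class as the last attribute.
        Instances data = DataSource.read("kiswahili-vectors.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Stratified 10-fold cross-validation of a Random Forest classifier.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new RandomForest(), data, 10, new Random(1));

        System.out.printf("accuracy=%.2f%% weighted F1=%.3f weighted FPR=%.3f%n",
                eval.pctCorrect(), eval.weightedFMeasure(),
                eval.weightedFalsePositiveRate());
    }
}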
2.4 Empirical Studies
Bretschneider and Peters (2017) conducted a study on detecting offensive
statements toward foreigners in social media in the German language. They
proposed an approach to automatically detect such statements to aid
personnel in this labor-intensive task. They performed binary classification and
multi-class classification by applying a machine learning approach to the bag-of-words
(BOW) model. The developed models were evaluated using precision, recall
and F1-measure, yielding precision values of 75.26% and 73.8% and an F1-value of
67.91%.
Chen et al. (2012), for their part, offer a proposal on how offensive language may be
detected in social media to protect adolescent online safety. They
proposed the Lexical Syntactic Feature (LSF) architecture to detect offensive content
and identify potential offensive users in social media. Their experiment revealed
that the LSF achieved a precision of 98.24% and a recall of 94.34% in sentence
offense detection, as well as a precision of 79.9% and a recall of 71.8% in user
offense detection. The study applied English datasets, and the processing speed of
LSF was approximately 10 ms per sentence, signifying its suitability for adoption in
social media.
Garcia-Recuero (2016) conducted a study on discouraging abusive behavior in
privacy-preserving online social networking applications. The study collected data from
the Twitter social network and built Trollslayer to collect data from Twitter while
preserving users' privacy on the sensitive data.
Saleem et al. (2016), for their part, conducted a study titled "A Web of Hate: Tackling
Hateful Speech in Online Social Spaces". The basis of the study was to propose
hateful-speech solutions for social media. They proposed an approach using
self-identifying hateful communities as the training dataset. They applied multiple
machine learning algorithms to generate the language models of hateful communities,
specifically Naïve Bayes (NB), Support Vector Machine (SVM) and
Logistic Regression (LR). In addition, they used web scraping libraries to retrieve
all publicly available comments on the Reddit website, and a total of 50,000 comments
were collected for the training and testing sets. Their study revealed that SVM and NB
outperformed LR, and they further suggested that the same study could be
conducted on social networks like Twitter and Facebook.
Killam et al. (2016) classified Android malware through the analysis of string literals.
In their study, they applied a linear Support Vector Machine (SVM) and word-level
3-grams to classify Android malware through the analysis of string literals. The
resulting model correctly classified the malware applications with an accuracy of
99.20% while maintaining a false positive rate of 2.00%.
Freeman (2015) applied Naïve Bayes to detect spammy names in social
networks. The study was guided by the assumption that in social networks there
exist fictitious identities which might be used to send spam messages, engage in
abusive activities and post malicious links. Data was collected from LinkedIn for
training and validating the model. The resulting model was evaluated using the Area
Under the ROC Curve (AUC) and achieved an AUC of 0.85.
Dewan and Kumaraguru (2017), in their study on Facebook Inspector (FbI):
towards automatic real-time detection of malicious content on Facebook, collected
data over a 16-month period on Facebook to develop training and test datasets.
They applied supervised learning models including Naïve Bayes, Decision Tree,
Random Forest and Support Vector Machine-based models. Based on the learning
models, they implemented Facebook Inspector, a REST API-based browser plug-in
for identifying malicious Facebook posts in real time. Overall, the SVM model
achieved an accuracy of over 80% on public features, while the Random Forest
classifier had better recall and ROC AUC values.
Opesade et al. (2016) conducted a forensic investigation of linguistic sources of
electronic scam mail using a statistical language modeling approach. Their study
aimed at investigating the propensity of Nigeria's involvement in authoring fraudulent
scam mails. Their experiment included a total of 873 scam mails and 349 non-scam
mails from different scam-baiters' websites. The experiment was carried out in the
Waikato Environment for Knowledge Analysis (WEKA) data mining software and,
among all the applied machine learning algorithms, Instance-Based K-nearest
neighbor (IBK) was found to be the most precise model in terms of accuracy and
Kappa statistics for detecting the sources of scam mails.
Hee et al. (2015) conducted a study on the automatic detection and prevention of
cyber-bullying. Their study was conducted by applying SVM as the learning algorithm
in the Python programming language. During data pre-processing they applied
tokenization, Part-of-Speech (POS) tagging and lemmatization using the LeTs
Preprocess toolkit. Evaluation of the resulting model was done using 10-fold
cross-validation, with F-score and recall as evaluation metrics. The resulting model
was capable of detecting cyber-bullying with an F-score of 55.39% and an accuracy
of 78.50%.
Gerbet and Kumar (2014), on their part, implemented the Google Safe Browsing
database system, which classifies malicious URLs. It consists of an API interface to
which the state of a URL can be queried. From the client side, URLs are sent and
checked using HTTP GET or POST requests, and the server's response directly
contains an answer for each URL query.
A similar study on cyberbullying detection was conducted by Huang et al. (2014) using social and textual analysis. Their dataset was divided into two parts, with 70% of the dataset (both bullying and non-bullying messages) used as the training set and 30% used as the testing set. Different classifiers were used in the experiment, including J48, Naïve Bayes, SMO, Bagging and Dagging, with WEKA 3.0 as the implementation tool. The evaluation metrics employed were the Receiver Operating Characteristic (ROC) and the True Positive rate.
2.5 Research Gap
Despite the several studies that have been conducted on the detection of offensive messages in social networks, there is hardly any study that has been conducted in the context of the Kiswahili language. Furthermore, the solutions proposed by these studies do not directly transfer to Kiswahili offensive message detection, since they depend on the language context in which the data were originally collected. This raises the need to conduct a study that takes Kiswahili as its case. Because languages differ in syntax, pragmatics and semantics, the researcher cannot rely on the already conducted studies to serve the Kiswahili environment, but can consider them as benchmarks for carrying out a new study. It is believed that this study will bridge the currently existing gap and open up new research areas.
2.6 Conceptual Framework
Figure 2.4 presents the conceptual framework that guided this study.
Figure 2.4: Study Conceptual Framework
As shown in Figure 2.4, text messages were collected from social networks such as Facebook by using a web crawler. The messages containing offensive content were labelled by human annotators and used to generate feature vectors. The generated feature vectors were used as training and test sets to configure the MLAs, and the resulting models were evaluated based on statistical metrics. The results obtained served as inputs for proposing the framework for automating the detection of offensive messages in social networks.
2.7 Conclusion
This chapter has reviewed the literature that has a direct relation to the problem under investigation. It began by defining the key terms used, followed by a review of the mathematical theories guiding the study and an empirical analysis of similar existing studies. Finally, the chapter ends with the research gap, a descriptive conceptual framework and concluding remarks. The next chapter discusses the research design and methodology employed in this study.
CHAPTER THREE
METHODOLOGY
3.0 Introduction
This chapter describes the research design, research approach, data collection methods and tools, data analysis techniques, ethical considerations, as well as validity and reliability issues.
3.1 Research Design
The research design adopted in this dissertation combines experimental and case study approaches. With the case study, only Kiswahili verbal messages extracted from social networks such as Facebook, YouTube and JamiiForum were considered. The choice of this design was based on the nature of the study objectives and questions (Saunders et al., 2009).
3.2 Research Setting
The study was conducted at the College of Informatics and Virtual Education computer laboratory. The required software needed at least 5.5 GB of hard drive space, 4 GB of Random Access Memory (RAM) and a 1 GHz processor. For the sake of this study, the software was installed on a Windows 8.1 desktop computer with 16 GB of RAM, a Core i7 processor at 3.4 GHz and a 1 TB hard drive in order to carry out the experiments.
3.3 Research Approach
The study adopted a mixed research approach, aiming to collect and analyze quantitative data from experimentation and qualitative data from documentary review. The choice of a mixed research approach was influenced by the methods of data collection, analysis and interpretation (Creswel, 2014).
3.4 Data Collection Method and Tool
3.4.1 Primary Data
To respond to research question one, which demands the collection of relevant Kiswahili messages from social networks, a Jsoup API based web crawler was used. The choice of this tool was due to its platform independence and its ability to provide a robust application that can be applied across several social network sites (Nakash et al., 2015). Furthermore, with respect to what guided the researcher in the collection and analysis of Kiswahili data, the grammatical conceptual framework of Kiswahili syntax adopted by Massamba et al. (1999) was used as a guide.
Moreover, an observation technique was used to collect data from the experimental tool for further analysis. This technique facilitated the process of drawing tables, making comparisons and drawing conclusions for the defined research questions.
3.4.2 Secondary Data
A structured document review was used to determine the appropriate machine learning algorithms for the text classification task and to determine the evaluation metrics, which served research question two. The literature consulted mainly comprised journals, essays, dissertations and research projects, identified using search terms.
3.4.3 Experiment Tool and Environment
In the experimental setup, the study used the Waikato Environment for Knowledge Analysis (WEKA) toolkit. WEKA is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from one's own Java code. WEKA contains tools for data pre-processing, classification, regression, clustering, association rules and visualization. It is also well suited for developing new machine learning schemes (Srivastava, 2014). The tool can also be integrated with MEKA for multi-label learning and evaluation. In multi-label classification, the goal is to predict multiple output variables for each input instance. MEKA is based on the WEKA machine learning toolkit; it includes dozens of multi-label methods from the scientific literature, as well as a wrapper to the related MULAN framework.
3.4.4 Experimentation Steps
To respond to research question two, which focuses on developing and evaluating the models, several steps were performed to reach the desired output, as summarized in Figure 3.1.
Figure 3.1: Text Classification Framework, adapted from Ramya & Pinakas (2014)
3.4.5 Data Preprocessing
The collected messages were categorized into two parts. Category one comprised 75% of the total collected messages and was used for training the models, while the remaining 25% was used for testing the efficiency of the models. Since offensive language is a subjective term, the 75% training portion was distributed to three annotators, each of whom was required to label every message as offensive or not. A message was considered either offensive or non-offensive depending on the total scores obtained from the annotators. The labelled training dataset was preprocessed into a format relevant for the MLAs. The preprocessing task involved removing stop words and converting the strings of words into feature vectors (Read, 2016).
3.4.6 Feature Extraction
i. Feature selection
The generated features were selected based on their importance in order to improve the performance of the models. The feature selection method adopted in this study was Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF is composed of two terms: first, the Term Frequency (TF), which is the number of times a word appears in a document divided by the total number of words in that document; and second, the Inverse Document Frequency (IDF), which is the logarithm of the total number of documents in the dataset divided by the number of documents in which the specific term occurs. The product of TF and IDF determines the importance and uniqueness of a word in the dataset and provides high performance when combined with the Bag-of-Words model (Wu et al., 2008).
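Written out (a standard formulation consistent with the description above; the notation is not taken from the study itself), for a term t, a document d and a document collection D:

\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}, \qquad
\mathrm{idf}(t,D) = \log\frac{|D|}{|\{d \in D : t \in d\}|}, \qquad
\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t,D)

where f_{t,d} is the number of occurrences of term t in document d and |D| is the total number of documents in the dataset.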
3.4.7 Training the Model
The study applied the collected data to train the selected algorithms so as to configure and devise different models, which were then evaluated by classifying the test dataset.
3.4.8 Evaluating the Models
The last phase was to evaluate the resulting models based on the performance metrics. The evaluation metrics observed and recorded were True Positive (TP), True Negative (TN), False Positive (FP), accuracy, precision-recall, f1-measure (f1) and ROC AUC. The models were evaluated by using 10-fold cross-validation in one case and by supplying the independent test dataset (25% of the messages) in the other. The study applied 10-fold cross-validation because extensive tests on numerous datasets with different learning techniques have shown 10 to be about the right number of folds to get the best estimate, and it is widely used as the standard method in practice (Ian et al., 2005).
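A minimal WEKA sketch of this evaluation step is given below; the ARFF file name and the use of Random Forest are illustrative assumptions rather than the exact experiment scripts used in the study.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidate {
    public static void main(String[] args) throws Exception {
        // Hypothetical path to the preprocessed ARFF training file.
        Instances data = DataSource.read("training.arff");
        data.setClassIndex(data.numAttributes() - 1);           // class attribute is last

        RandomForest rf = new RandomForest();                   // one of the selected MLAs
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(rf, data, 10, new Random(1));   // 10-fold cross-validation

        System.out.println(eval.toSummaryString());             // accuracy and error rates
        System.out.println(eval.toClassDetailsString());        // precision, recall, F1, ROC AUC per class
    }
}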
3.5 Data Analysis
Both quantitative and qualitative data analysis techniques were used in this study. The quantitative data observed in the WEKA experimental environment were exported to Excel for plotting tables, histograms and line graphs. For comparison purposes, a paired t-test was performed to compare the performance of the classifiers on the 10-fold cross-validations. In this case the experiments were repeated 10 times, making a total of 100 computations, and the average results were recorded. This analysis helped to make inferences about the models against the performance metrics. The qualitative analysis helped to supplement the quantitative data analysis. These results facilitated the process of making comparisons, interpretations and inferences among the different models.
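For reference, the paired t-test statistic used for such comparisons (a standard formulation; the notation is not taken from the study) is

t = \frac{\bar{d}}{s_d / \sqrt{k}}

where d_i is the difference in a performance metric between two classifiers on the i-th fold, \bar{d} and s_d are the mean and standard deviation of these differences, and k is the number of paired results. WEKA's Experimenter applies a corrected variant of this test for repeated cross-validation.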
3.6 Ethical Issues
The researcher asked for permission from the University of Dodoma administration. In addition, the confidentiality and privacy of the messages from respondents on JamiiForum, YouTube and Facebook were preserved: none of the public pages, user accounts or phone numbers from which posts and comments were extracted are reported anywhere in this study. All of the reviewed works were acknowledged. The study focused on proving the research concepts rather than seeking to determine whether what the respondents were doing was right or wrong.
3.7 Reliability and Validity
Relevant messages concerning improper behaviour were collected from JamiiForum, YouTube and Facebook. To support the validity of the study, a large number of offensive messages were collected, and the models developed were evaluated based on the statistical evaluation metrics suggested in the literature. The experiment was repeated 10 times and the average results were recorded. In addition, the classification models were evaluated both with a separate test dataset and with 10-fold cross-validation to check their capability in detecting offensive messages, which provided reliable results.
3.8 Conclusion
The chapter has explained the research design and strategy adopted to answer the research questions. The experimentation settings and tools that were used have also been discussed in detail. The chapter also presented the research methods, data collection methods, data analysis tools and techniques, ethical considerations, and validity and reliability issues pertaining to this study. The next chapter discusses the analysis and findings in relation to the research questions under investigation.
CHAPTER FOUR
RESULTS AND DISCUSSION
4.0 Introduction
This chapter discusses the findings of the study in relation to the research questions. The chapter is organized into three parts. The first part presents the results and analysis with respect to the objective of creating the Kiswahili dataset; the second part presents the results and analysis with respect to building and evaluating the classification models; and the last part presents the findings on the proposed framework for automated detection of offensive messages. All of the findings attempt to answer the associated research questions set out in the introductory chapter.
4.1 Creating Kiswahili Dataset of offensive messages from social networks
4.1.1 Data Extraction from Social Networks
Since there was no dataset of offensive messages available for Kiswahili, the dataset was collected by accessing publicly available social network pages so as to acquire training and testing datasets. The posts with their corresponding comments were crawled using the Jsoup web crawler. A total of 12,000 messages were collected from social network sites in the period from 10-4-2017 to 10-7-2017, as depicted in Table 4.1.
Table 4.1: Message distribution

Social network      # messages collected
Facebook            7,500
Jamii Forum         3,000
YouTube             1,500
Total               12,000
The study was concerned with the sexual and political categories of offensive posts and comments, together with normal messages across all categories. To be able to collect and retrieve relevant sexual, political and normal messages, the public pages to be crawled were carefully chosen. As summarized in Table 4.1, the collected messages contained all three categories and were further preprocessed for use in the other parts of the study.
4.1.2 Messages Annotation
Before the collected messages were used to build the classification models, they were annotated to give them labels relevant to the machine learning algorithms. Since offensive language is a subjective term, the messages were given to three annotators for the labelling process; a message was manually labelled as offensive or normal if there was consensus between at least two of the three annotators. Thus, a total of 12,000 messages were provided to the annotators, who assigned a label of 1 if the message was offensive and belonged to the sexual category, a label of 2 if the message was offensive and belonged to the politics category, and a label of 0 if the message was normal (non-offensive). In addition, the study retained only the posts and comments for which all three annotators agreed upon the same label; the others were eliminated from the list. This was done to ensure that the ground truth of the dataset was of the best quality and to facilitate the learning ability of the MLAs. After careful annotation, consensus among the annotators and elimination of posts with only partial agreement, 11,000 messages remained. A sample of 8,000 messages was randomly selected, with 2,000 labelled as 1 (sexual), 1,000 labelled as 2 (politics) and 5,000 labelled as 0 (normal), which formed the training dataset as depicted in Table 4.2.
Table 4.2: Training dataset distribution category-wise

Message type      Total     Label
Sexual            2,000     1
Political         1,000     2
Normal            5,000     0
Total             8,000
Based on the resulting data, the percentage distribution of the offensive messages and normal messages is depicted in Figure 4.1.
Figure 4.1: Distribution of Messages in the Training Dataset
The remaining 3,000 messages were used for testing the developed models. The testing dataset was not manually labelled as offensive or non-offensive, although the categories of the messages were manually identified.
Table 4.3: Testing dataset distribution

Message type      Total     Label (unlabelled)
Sexual            700       ?
Political         300       ?
Normal            2,000     ?
Total             3,000
4.1.3 Data Preprocessing
The labelled training dataset was first preprocessed into the format required by the classification algorithms. The following techniques were used to prepare and tune the training data before using it as input to the various classifier algorithms in WEKA. During preprocessing, the text files containing posts and comments were converted into the Attribute-Relation File Format (ARFF) by using the TextDirectoryLoader Java class in the WEKA Simple CLI. The ARFF file contains two attributes: a text string representing a post or comment containing a sexual, political or normal message, and a class attribute denoting the category 1, 2 or 0 corresponding to the text string. A sample of the resulting ARFF file is presented in Appendix 1.
Furthermore, the collected posts and comments had irrelevant features such as stop words, special characters, alphanumeric tokens, English words and HTML tags, to mention but a few. A text file containing stop words such as "na", "kwaiyo", "mimi", "wewe" and "sisi", and punctuation marks such as "." and ",", was created and used as a filter to eliminate those words from the sentences and give precedence to the important words in the training dataset. In addition to the stop word list, duplicate words were eliminated so as to create the lexicon. The corresponding text file, presented in Appendix 2, was uploaded into WEKA by using the weka.core.stopwords.WordsFromFile stopword handler class. Figure 4.2 shows the most frequent words in the dataset.
Figure 4.2: The most frequent words in the dataset
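A minimal sketch of this loading and stop-word filtering step in WEKA is given below; the directory layout and file names are illustrative assumptions rather than the study's actual paths.

import java.io.File;
import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;
import weka.core.stopwords.WordsFromFile;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class PreprocessMessages {
    public static void main(String[] args) throws Exception {
        // Each sub-directory (e.g. "0", "1", "2") holds the text files of one class.
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File("dataset/train"));       // hypothetical path
        Instances raw = loader.getDataSet();                   // one string attribute + class

        // Kiswahili stop-word list (see Appendix 2); hypothetical file name.
        WordsFromFile stopwords = new WordsFromFile();
        stopwords.setStopwords(new File("stopwords_sw.txt"));

        StringToWordVector filter = new StringToWordVector();
        filter.setLowerCaseTokens(true);
        filter.setStopwordsHandler(stopwords);
        filter.setInputFormat(raw);
        Instances vectors = Filter.useFilter(raw, filter);     // word-count feature vectors
        System.out.println(vectors.numAttributes() + " features generated");
    }
}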
4.1.4 Feature Representation, Extraction and Selection
The features were extracted and represented in an efficient and complete manner in order to create numerical feature vectors. The techniques adopted in this study to create the feature vectors and select features were the Bag-of-Words (BOW) model, N-grams, TF-IDF and Principal Component Analysis. The BOW model represents features as an unordered set of words, disregarding grammar and even the exact position of each word. Each distinct word corresponds to a feature, with the frequency of the word in the document as its value. Only words that did not occur in the stop list were considered as features.
The BOW model was combined with the N-gram model to provide the ability to identify n-word expressions. Unigrams, bigrams, trigrams and so on were used to represent subsequences of continuous words in the text and were applied as units of measurement of the classifiers' performance. The ratio of the number of times a resulting token occurred in the text string, and hence its importance, was measured using TF-IDF. Only the relevant features were selected, by means of the StringToWordVector filter. Effective feature selection in text classification aims at improving the efficiency of the learning task and the overall accuracy.
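A sketch of the corresponding WEKA filter configuration for this BOW / n-gram / TF-IDF representation is shown below; it assumes the instances loaded in the previous step, and the parameter values (n-gram range, vocabulary size) are illustrative rather than the exact settings used in the experiments.

import weka.core.Instances;
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class FeatureRepresentation {
    // Turns raw string instances into n-gram TF-IDF feature vectors.
    public static Instances buildVectors(Instances raw) throws Exception {
        NGramTokenizer tokenizer = new NGramTokenizer();
        tokenizer.setNGramMinSize(1);            // unigrams ...
        tokenizer.setNGramMaxSize(3);            // ... up to trigrams

        StringToWordVector filter = new StringToWordVector();
        filter.setTokenizer(tokenizer);
        filter.setTFTransform(true);             // term-frequency weighting
        filter.setIDFTransform(true);            // inverse-document-frequency weighting
        filter.setWordsToKeep(2000);             // illustrative vocabulary size
        filter.setInputFormat(raw);
        return Filter.useFilter(raw, filter);
    }
}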
4.2 Build and Evaluate Models by Applying Some Machine Learning Algorithms
Selecting Machine Learning Algorithms and Configuration
A total of four machine learning algorithms were selected from the literature and configured to carry out the experiment based on the dataset prepared under objective one. The choice of MLAs was based on the task at hand, similar studies from the literature and the nature of the algorithms. The MLAs that were chosen and applied in this study, and whose configuration is sketched below, are:
i.   Naïve Bayes;
ii.  Random Forest;
iii. Decision Tree (J48); and
iv.  Support Vector Machine (SVM) with Linear, Polynomial and Radial Basis Function (RBF) kernels.
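A minimal sketch of how these classifiers can be instantiated in WEKA follows; default parameters are shown, and the exact settings used in the experiments are not reproduced here.

import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.NormalizedPolyKernel;
import weka.classifiers.functions.supportVector.PolyKernel;
import weka.classifiers.functions.supportVector.RBFKernel;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;

public class SelectedClassifiers {
    public static Classifier[] build() {
        // SVM in WEKA is provided by the SMO implementation; the kernel object
        // determines whether it behaves as a linear, polynomial or RBF SVM.
        SMO svmLinear = new SMO();
        svmLinear.setKernel(new PolyKernel());           // exponent 1 gives a linear kernel

        SMO svmPoly = new SMO();
        svmPoly.setKernel(new NormalizedPolyKernel());   // normalized polynomial kernel

        SMO svmRbf = new SMO();
        svmRbf.setKernel(new RBFKernel());               // radial basis function kernel

        return new Classifier[] {
            new NaiveBayes(), new RandomForest(), new J48(),
            svmLinear, svmPoly, svmRbf
        };
    }
}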
Performance Evaluation Metrics
The evaluation metrics observed and recorded were True Positive (TP), True Negative (TN), False Positive (FP), accuracy, precision-recall, f1-measure (f1) and ROC AUC. The recorded results were based on 10-fold cross-validation in some cases and on a separately supplied test set in other cases. As depicted in Figure 4.1, the training samples are not uniformly distributed across classes; therefore accuracy alone was not considered a strong measure of classifier performance, and the combination of accuracy, precision-recall and f1-measure was considered to convey more information. Moreover, a paired t-test was performed to analyze and compare the performance of the models on each fold based on the mentioned metrics.
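For reference, the standard definitions of these metrics (notation mine, with FN denoting false negatives) are:

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}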
Experiment Result
4.2.1 Baseline classifiers performance
The dataset was supplied to the baseline classifiers in order to determine the initial
performance for evaluating the selected classifiers.
Table 4.4: Baseline performance

Evaluation Metrics     ZeroR        OneR
Accuracy               39.2746 %    47.8756 %
TP Rate                0.393        0.479
FP Rate                0.394        0.338
Precision              0.307        0.547
Recall                 0.393        0.479
ROC AUC                0.499        0.570
f1-Measure             0.277        0.377
From Table 4.4 it can be observed that OneR achieved better results than ZeroR. The performances of the selected machine learning algorithms were therefore evaluated by comparing them with this baseline performance. This comparison helped to either reject or accept the selected algorithms for further evaluation in the experiment. A paired t-test was performed to evaluate the selected algorithms against the baseline, with the following results.
Figure 4.3: Performance Evaluation on Accuracy
Figure 4.3 indicates that all of the selected algorithms are statistically better (v) than the baseline algorithm (OneR) at the specified significance level of 0.05. This finding enabled the researcher to retain the selected algorithms for further experiments.
4.2.2 Detecting Offensive Messages
The MLAs were trained to detect the presence or absence of offensiveness in posts and comments. Because many factors affect classifier performance, the study strived to answer the following questions so as to properly address research objective two. First, how is the performance of the models affected by the size of the training data? How well do the data representations (unigram, bigram, trigram and higher n-grams) describe the performance of the models? Is it easier to detect offensive messages within specific categories of posts and comments (i.e. sexual or politics) than with a general-purpose model (configuration)? How do the results of the models coincide with the results of other scholars in the literature? The answers to these questions are presented in the following sections.
4.2.3 Size of Training Sample
The effect of the size of the training data on the performance of the models was examined by creating training samples of varying sizes; the micro-averaged f1-measure of each model was recorded and plotted as shown in Table 4.5 and Figure 4.4 respectively. In the WEKA tool, random samples of increasing size were selected as shown in Table 4.5.
Table 4.5: The performance of classifiers on varying data size (micro-averaged f1-measure)

Data Size    Naïve Bayes    J48      SVM-RBF    SVM-Polynomial    SVM-Linear    Random Forest
0            0.000          0.000    0.000      0.000             0.000         0.000
1000         0.731          0.774    0.605      0.705             0.789         0.779
2000         0.736          0.781    0.678      0.724             0.806         0.798
4000         0.739          0.790    0.700      0.780             0.830         0.813
6000         0.741          0.799    0.800      0.870             0.869         0.850
7000         0.741          0.821    0.879      0.930             0.903         0.931
8000         0.750          0.830    0.861      0.941             0.932         0.950
Figure 4.4: Classifiers Learning Rate
From Figure 4.4 it can be observed that the classification performance (micro-averaged f1-measure) of some classifiers increases markedly with the size of the dataset, while for other classifiers the improvement is smaller. This finding signifies that the MLAs are able to devise a better learning function as more examples are presented. Under 10-fold cross-validation, SVM-Linear, SVM-Polynomial and Random Forest provided reasonable results, with SVM-Polynomial and Random Forest outperforming the other classifiers. Since the volume of social network messages increases as more people connect to the OSNs, this finding signifies that any of these generated models would work well in social network sites to detect offensive messages despite the growing number of posts and comments.
4.2.4 Feature Representations
The performance of the models was also observed based on the feature representation, using the BOW model and TF-IDF while varying the n-gram size. The models were trained on a total of 8,000 messages containing 37% offensive messages and 63% normal messages, and were tested and evaluated using 10-fold cross-validation, as shown in Table 4.6 and Figure 4.5 respectively.
Table 4.6: Classifiers performance based on feature representation (micro-averaged f1-measure, BOW and TF-IDF)

Classifier         1-gram    2-gram    3-gram    4-gram
Naïve Bayes        0.686     0.757     0.563     0.721
J48                0.774     0.312     0.230     0.277
SVM-RBF            0.709     0.440     0.500     0.480
SVM-Polynomial     0.914     0.848     0.756     0.774
SVM-Linear         0.900     0.836     0.755     0.774
Random Forest      0.905     0.844     0.757     0.773
Figure 4.5: Classifiers performance based on feature representation
Figure 4.5 reveals that with the unigram (1-gram) feature representation, the f1-measure is highest for SVM-Polynomial at 91.40%, Random Forest at 90.50% and SVM-Linear at 90.00%. Furthermore, the finding indicates that a single keyword or phrase may be enough to mark a post or comment as either offensive or normal; unigrams combined with the BOW and TF-IDF feature representation could therefore serve the purpose better than the other representations.
Since the study is about finding an appropriate model for real-time implementation, a paired t-test was performed to determine whether the observed differences in f1-measure across the n-gram representations among SVM-Linear, SVM-Polynomial and Random Forest are statistically significant at the specified 0.05 significance level. The results are indicated in Figure 4.6.
Figure 4.6: Performance comparisons on n-gram features
From Figure 4.6 it can be observed that there is no statistical difference in performance between the SVM-Polynomial kernel and Random Forest, which can be explained by the fact that they yielded very similar results on the four different representations. The finding signifies that SVM with the normalized polynomial kernel and Random Forest perform equally well with any of the n-gram representations, as opposed to the other classifiers, which have worse results (*).
4.2.5 Categories
The performance of the models was also evaluated by observing whether it is easier to detect offensive messages within specific categories of posts and comments (i.e. sexual or politics) than with a general-purpose model (configuration). This finding aimed at determining whether a system for detecting offensive messages should be domain specific or a general-purpose system that can serve all categories of offensive content in the Kiswahili language. To perform this experiment, two training datasets were prepared: one containing the 8,000 messages labelled as {1, 0, 2}, and a second in which the same 8,000 messages were re-labelled with 1 indicating an offensive message and 0 a normal message {1, 0}. The developed models were tested by supplying a separate test set, and the observed results were recorded in Table 4.7 and Table 4.8 respectively.
Table 4.7: Performance on dataset with 3 categories (sexual, politics and normal: {1, 0, 2})

Evaluation Metrics    Naïve Bayes    J48         SVM-Linear    SVM-Polynomial    SVM-RBF    RF
Accuracy              68.757 %       78.964 %    94.197 %      94.696 %          76.42 %    95.026 %
TP Rate               0.688          0.790       0.942         0.942             0.764      0.950
FP Rate               0.229          0.122       0.030         0.079             0.153      0.028
Precision             0.685          0.789       0.942         0.943             0.807      0.951
Recall                0.688          0.790       0.942         0.942             0.764      0.960
ROC AUC               0.848          0.905       0.963         0.933             0.815      0.994
f1-Measure            0.750          0.785       0.942         0.947             0.720      0.950
Table 4.8: Performance for dataset with 2 categories ({1, 0})

Evaluation Metrics    Naïve Bayes    J48         SVM-Linear    SVM-Polynomial    SVM-RBF     RF
Accuracy              82.1555 %      80.772 %    95.583 %      92.4028 %         78.3569 %   94.258 %
TP Rate               0.822          0.808       0.956         0.924             0.784       0.943
FP Rate               0.320          0.250       0.112         0.235             0.679       0.156
Precision             0.820          0.806       0.956         0.930             0.817       0.943
Recall                0.822          0.808       0.956         0.924             0.784       0.943
ROC AUC               0.875          0.876       0.922         0.844             0.552       0.986
f1-Measure            0.821          0.804       0.955         0.919             0.673       0.941
Figure 4.7: Performance for general-purpose and categorical models
From Figure 4.7 it can be observed that limiting the training and testing dataset to a single labelling scheme leads to a significant difference in the performance of the classifiers. Sood et al. (2012), by contrast, did not observe any significant difference between general-purpose and categorical systems in their study. The finding signifies that the choice of whether to build a categorical or a general-purpose system will depend on the decision of whoever intends to implement the system in a real-world environment. The two-class implementation (1, 0) achieves better results with SVM-Linear, while SVM-Polynomial and Random Forest yield better results with more than two categories.
Figure 4.8: Comparison of False Positive Rate
As shown in Figure 4.8, a paired t-test was performed to compare the false positive rates obtained on the two different datasets against the baseline classifier. The lowest FP rate relative to the baseline was 0.02 (2.0%), which indicates that the classifiers wrongly classified normal messages as offensive at rates of 2.0%, 1.6% and 1.7% respectively, at the specified significance level of 0.05.
4.2.6 Time Taken to Train and Test model
Since the models utilize the CPU, a comparative analysis was performed to determine how each model utilizes the CPU during the training and testing periods, as depicted in Figure 4.9.
Figure 4.9: Time taken to train and test the models
From Figure 4.9 it can be observed that SVM-Polynomial has the best training time, with averages of 2.28 seconds and 2.49 seconds on the two datasets, while in terms of prediction time Random Forest gives the best results, with averages of 0.15 seconds and 0.13 seconds respectively.
From the above graphs and tables, the four selected algorithms provided different results. It can also be observed that Random Forest gives slightly better performance, with an average of 95.03% correct classification, compared to the SVM kernel functions and the other algorithms. Since Random Forest outperformed the other classifiers, this model was selected as an input to objective three.
4.3 Proposed framework for detecting offensive Kiswahili messages
4.3.1 Proposed Framework Architecture
Figure 4.10: Study Proposed Framework
4.3.2 Components details of the proposed framework
Figure 4.10 depicts the proposed framework for detecting offensive Kiswahili messages in social network sites. The architecture consists of an input section that contains raw messages originating from social network sites such as Facebook, Twitter, YouTube and JamiiForum. The OSNs are connected to a REST API that provides a POST method for submitting messages for analysis. Some OSNs provide an API to crawl data while others do not; in order to extract messages, a web crawler can be developed and linked to the proposed API.
Once a post is submitted to the feature extraction module, feature vectors are constructed using the Bag-of-Words tokenizer (string to feature vectors), TF-IDF and unigrams. The script then loads the pre-stored Random Forest classifier model from a database server so as to classify the post as either offensive or normal. This part forms the processing component of the framework.
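A minimal sketch of this processing step is shown below. It assumes that the trained Random Forest model and the fitted StringToWordVector filter have previously been serialized to disk; the file names and the two-class header are illustrative assumptions rather than part of the study's implementation.

import java.util.ArrayList;
import weka.classifiers.Classifier;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.Utils;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class OffensiveMessageScorer {
    public static String classify(String message) throws Exception {
        // Load the pre-trained model and the filter fitted on the training data.
        Classifier model = (Classifier) SerializationHelper.read("rf.model");
        StringToWordVector filter =
                (StringToWordVector) SerializationHelper.read("stwv.filter");

        // Build a one-instance dataset with the same structure as the training data:
        // one string attribute plus the nominal class attribute.
        ArrayList<String> labels = new ArrayList<>();
        labels.add("0");                                               // normal
        labels.add("1");                                               // offensive
        ArrayList<Attribute> attrs = new ArrayList<>();
        attrs.add(new Attribute("text", (ArrayList<String>) null));   // string attribute
        attrs.add(new Attribute("class", labels));
        Instances raw = new Instances("incoming", attrs, 1);
        raw.setClassIndex(raw.numAttributes() - 1);

        double[] vals = new double[raw.numAttributes()];
        vals[0] = raw.attribute(0).addStringValue(message);
        vals[1] = Utils.missingValue();                 // class is unknown at prediction time
        raw.add(new DenseInstance(1.0, vals));

        Instances vectors = Filter.useFilter(raw, filter);    // string -> feature vector
        double pred = model.classifyInstance(vectors.firstInstance());
        return vectors.classAttribute().value((int) pred);    // "0" or "1"
    }
}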
A message may be extracted together with its corresponding user ID, which may then be stored in the database once the message is classified as offensive; this will help it serve as evidence when requested. In addition, a JSON object indicating that a message is offensive is then sent to the public page to mark it as such; this forms the output part of the API. Furthermore, for real-time classification, a more seamless browser extension may be developed for each browser that communicates with the API and sends posts to the back-end as described above.
4.3.3 Framework Properties
In terms of reliability, the proposed framework is reliable since it depends on the evaluated model, which could detect 95% of offensive messages correctly. The model achieved a false positive rate of 2.80%, signifying that it wrongly classifies normal messages as offensive in 2.80% of cases. Furthermore, the implementation of the proposed framework will result in correct prediction of offensive messages with a precision of 95.10% and a recall of 95%. Since the model implements a Random Forest classifier, which uses bagging and bootstrapping, these techniques will minimize the error of incorrectly assigning a message.
Regarding robustness, most of the open-source packages for implementing machine learning classifiers are available in Java, Python or R and may be integrated with database servers to handle unstructured data; all of this makes the proposed framework robust because of the properties of these technologies.
With respect to usability, if the proposed framework is implemented correctly no user will experience any difficulty, since the whole process is automated and the output results will simply appear on the page.
4.3.4 Implementation Consideration
To implement the proposed application, detection and quick-response factors need to be considered, as does real-time classification. This is because the process involves identifying the messages and returning feedback to the client side as quickly as possible to facilitate real-time detection. The process may involve either marking a post or blocking it before it is shared.
In social network sites, not all posted content is offensive. Moreover, OSNs allow people to communicate privately; it is therefore recommended that the application be designed to respect privately communicated content which users do not intend to share publicly. This will help to preserve the goals of OSNs.
Computing resources and bandwidth are among the other considerations. The implementation of the above framework will need to be hosted on a powerful server with reasonable storage capacity, RAM, processors and bandwidth. These properties will facilitate fast computation and storage of models for quick responses.
4.4 Conclusion
This chapter discussed the findings with respect to the research objectives and the associated research questions set out in the introductory chapter. It presented findings with respect to the Kiswahili dataset and the configuration and evaluation of the selected machine learning algorithms. Building on objectives one and two, the chapter then presented the proposed framework for automating the detection of offensive messages. In the next chapter, a summary of the study, conclusions, recommendations and areas for further research are presented.
CHAPTER FIVE
SUMMARY, CONCLUSION AND RECOMMENDATION
5.0 Introduction
This chapter presents a summary of the study, conclusions based on the findings presented and discussed in chapter four, recommendations, and areas that may need further research.
5.1 Summary of the Study
The study focused on designing a framework, by applying Machine Learning Algorithms (MLAs), that can automatically detect offensive messages on social networks in the Kiswahili language. Specifically, the study examined three research questions in order to accomplish this objective, namely: (i) How can a Kiswahili dataset of offensive messages from social networks be created for generating feature vectors? (ii) How can machine learning algorithms be applied and evaluated in building models which can detect offensive messages in Kiswahili? (iii) How can an architectural framework for detecting offensive language on social networking sites be designed? An experimental research design was applied, employing both primary and secondary data collected by means of a software tool, observation and structured documentary review in order to address the above questions.
With regard to the first objective, which focused on creating a Kiswahili dataset of offensive messages from social networks, a total of 12,000 Kiswahili messages were collected from Facebook, JamiiForum and YouTube, of which 37% were offensive (from the perspectives of sexuality and politics) while 63% were normal. The collected messages were given to three annotators to manually assign the label 1 for sexual, 2 for politics and 0 for normal messages respectively. The result of the three annotators formed the ground truth for creating a quality Kiswahili dataset for training and testing the MLAs.
In response to the second objective, which aimed at building and evaluating models by applying selected machine learning algorithms, four classification algorithms were selected, namely Naïve Bayes, Decision Tree (J48), Random Forest and Support Vector Machine with Linear, Polynomial and RBF kernels. The findings revealed that, of all the machine learning algorithms applied in the experiment, Random Forest was capable of correctly assigning a message to its class with an accuracy of 95.03%, recall of 95.00%, precision of 95.10%, f1-measure of 0.950 (95.00%) and false positive rate of 2.80%, and it outperformed all the other classifiers applied in the experiment.
The first two objectives served as inputs to the third objective, which aimed at proposing a framework for automating the detection of offensive messages in social networks under Kiswahili settings by applying selected machine learning algorithms. As depicted in section 4.3, the proposed framework is a RESTful API that takes a post from a social network as input and passes it to a trained model stored in the database server. The model predicts the message as either offensive or normal; if it is offensive, the result is stored in the database and a JSON marker is sent to the public page to mark the post or stop it from spreading; otherwise the message is ignored.
5.2 Conclusion
Based on the study findings and the analysis made, the conclusions discussed hereunder are pertinent.
i.   The study has created a Kiswahili dataset containing sexually and politically offensive messages collected from a few of the existing social networks. The study examined only textual messages, despite the existence of offensive messages in the form of images that are widely shared among people in social networks. The researcher also observed that a reliable dataset containing a large number of relevant messages is necessary for the machine learning algorithms to produce reliable results.
ii.  The study applied supervised machine learning techniques to build and evaluate a few selected text classification algorithms. Through the observed metrics, the findings revealed that SVM with the normalized Polynomial kernel and Random Forest produced reasonable results, with Random Forest outperforming SVM-Polynomial slightly at the 0.05 significance level and with a low standard deviation of 0.02–0.04.
iii. The created Kiswahili dataset and the evaluated Random Forest model formed important components, among others, in designing a framework that would help in detecting offensive messages in social network sites.
5.3 Recommendations
Based on the findings of the study and the conclusions drawn, the study recommends
the following.
First, social network providers should adopt the proposed framework and deploy it as part of their applications so as to mitigate offensive behaviour in social networks. This will help users to trust and use their applications and will improve users' experience of online services. The model is also robust and reliable, as it eliminates manual activities in detecting offensive behaviour.
Second, the government may adopt the framework presented to create a system that
stores evidence from users who behave offensively in social networks. The
framework will eliminate human intervention that may be influenced by personal
malice and hatred, reduce labor intensive methods of detecting offensive behaviors
and encourage balanced judgments.
Third, end-users should stop promoting offensive words, which is against social network terms of service. An important aspect of the framework is to mark or block offensive posts, thus impressing on end-users that they should use the implemented services in positive ways.
5.4 Area for Further Research
Based on the findings of the study and the conclusions drawn, the study also
identified areas that may require further research.
i.   A study needs to be conducted to implement the proposed framework in real-time social network environments (Facebook, JamiiForum or YouTube) so as to assess the efficiency of the framework in detecting offensive messages in the Kiswahili context. An API can be developed to implement the proposed framework by considering the design techniques discussed. Such a study may evaluate the metrics discussed in this study, measure computation time and delays, and evaluate the framework's adaptability in the Tanzanian context.
ii.  In social networks, people use multilingual message constructions, such as code switching between Kiswahili and English within a single message, as well as images, to send offensive information. Moreover, people use shortened sentence forms to convey messages that may be offensive. Owing to these issues, a similar study may be conducted by adding multilingual messages, slang and images, collected over a longer period of time, to form feature vectors for the classification task. Such a study may apply the same or different algorithms, or evaluate the results of another domain of machine learning such as clustering.
iii. The study did not consider the multi-labelling task, whereby a single message may belong to different categories of offensiveness; thus a similar study may be conducted by applying multi-label classification techniques.
REFERENCES
Ali, J., Khan, R., Ahmad, N., & Maqsood, I. (2012). Random Forests and Decision Trees. IJCSI International Journal of Computer Science Issues, 9(5), 272–278.
Alpaydın, E. (2010). Introduction to Machine Learning (2nd ed.). The MIT Press.
Asur, S., & Huberman, B. A. (2010). Predicting the Future with Social Media. 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 492–499. https://doi.org/10.1109/WI-IAT.2010.63.
Barcaroli, G., Nurra, A., Scarnò, M., Summa, D., & Nazionale, I. (2014). Use of web scraping and text mining techniques in the Istat survey on "Information and Communication Technology in enterprises". In European Conference on Quality in Official Statistics.
Biau, G. (2012). Analysis of a Random Forests Model. Journal of Machine Learning Research, 13, 1063–1095.
Breiman, L. (2001). Random Forests. Machine Learning, 45, 5–32. Retrieved from http://dx.doi.org/10.1023/A:1010933404324.
Bretschneider, U., & Peters, R. (2017). Detecting Offensive Statements towards Foreigners in Social Media. In: Proceedings of the 50th Hawaii International Conference on System Sciences (HICSS), 2213–2222. Retrieved from http://hdl.handle.net/10125/41423.
Chao, W. (2011). Machine Learning Tutorial. DISP Lab, Graduate Institute of Communication Engineering, National Taiwan University. Retrieved from http://disp.ee.ntu.edu.tw/~pujols/Machine Learning Tutorial.pdf.
Chen, Y. (2012). Detecting Offensive Language in Social Media to Protect Adolescent Online Safety. 2012 ASE/IEEE International Conference on Privacy, Security, Risk and Trust, 71–80. https://doi.org/10.1109/SocialCom-PASSAT.2012.55.
Christopher, B. (2006). Pattern Recognition and Machine Learning. (J. Michael, K. Jon, & S. Bernhard, Eds.). Springer Science+Business Media, LLC.
Creswel, J. (2014). Research Design: Qualitative, Quantitative, and Mixed Methods Approaches (4th ed.). SAGE Publications.
Criminisi, A., Shotton, J., & Konukoglu, E. (2012). Decision Forests: A Unified Framework for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning. Computer Graphics and Vision, 7, 81–227. https://doi.org/10.1561/0600000035.
Danjuma, K., & Osofisan, A. (2015). Evaluation of Predictive Data Mining Algorithms in Erythemato-Squamous Disease Diagnosis.
Dasgupta, A., & Nath, A. (2016). Classification of Machine Learning Algorithms, 3(3), 6–11.
Dewan, P. (2017). Facebook Inspector (FbI): Towards automatic real-time detection of malicious content on Facebook. Social Network Analysis and Mining. https://doi.org/10.1007/s13278-017-0434-5.
Ellison, N. B., & Boyd, danah m. (2008). Social Network Sites: Definition, History, and Scholarship. Journal of Computer-Mediated Communication, 13, 210–230. https://doi.org/10.1111/j.1083-6101.2007.00393.x.
Gerbet, T., & Kumar, A. (2014). (Un)Safe Browsing.
Hee, C. Van, Lefever, E., Verhoeven, B., Mennes, J., & Desmet, B. (2015). Automatic Detection and Prevention of Cyberbullying. The First International Conference on Human and Social Analytics, 13–18.
Hilte, L., Lodewyckx, E., Verhoeven, B., & Daelemans, W. (2016). A Dictionary-based Approach to Racism Detection in Dutch Social Media. Proceedings of the Workshop on Text Analytics for Cybersecurity and Online Safety (TA-COS 2016), 11–17.
Hinnebusch, T. J. (2003). Swahili. In William J. Frawley (Ed.), International Encyclopedia of Linguistics (2nd ed.). Oxford: Oxford University Press.
Ian, W., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques (2nd ed.). Elsevier Inc.
Jia, S., Hu, X., & Sun, L. (2013). The Comparison between Random Forest and Support Vector Machine Algorithm for Predicting β-Hairpin Motifs. Engineering, 5(October), 391–395. https://doi.org/10.4236/eng.2013.510B079.
Killam, R., Cook, P., & Stakhanova, N. (2016). Android Malware Classification through Analysis of String Literals. TA-COS 2016 – Text Analytics for Cybersecurity and Online Safety. Retrieved from http://www.ta-cos.org.
Kotsiantis, S. B. (2007). Supervised Machine Learning: A Review of Classification Techniques, 31, 249–268. Retrieved from https://datajobs.com/data-science-repo/Supervised-Learning-[SB-Kotsiantis].pdf.
Kumari, M., & Godara, S. (2011). Comparative Study of Data Mining Classification Methods in Cardiovascular Disease Prediction. International Journal of Computer Science and Technology, 2(2), 304–308.
Laskari, N., & Sanampudi, S. (2016). Aspect based sentiment analysis. IOSR Journal of Computer Engineering (IOSR-JCE), 18(2), 72–75. https://doi.org/10.9790/0661-18212428.
LegalAid. (2010). Social Networks Terms of Services.
Liu, B. (2012). Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies. Morgan & Claypool. https://doi.org/10.2200/S00416ED1V01Y201204HLT016.
Lutu, P. (2015). Web 2.0 Computing and Social Media as Solution Enablers for Economic Development in Africa. In Computing in Research and Development in Africa: Benefits, Trends, Challenges and Solutions. Springer International Publishing Switzerland. https://doi.org/10.1007/978-3-319-08239-4_6.
Massamba, D. P. B., Kihore, Y. M., & Hokororo, J. I. (Eds.). (1999). Sarufi Miundo ya Kiswahili Sanifu: Sekondari na Vyuo. Dar es Salaam: Taasisi ya Uchunguzi wa Kiswahili, Chuo Kikuu cha Dar es Salaam.
Mulokozi, M. (2000). Language, Literature and the Forging of a Pan-African Identity. Kiswahili, 63, 71–80.
Msavange, M. (2015). Usage of Cell Phones in Morogoro Municipality, Tanzania. Journal of Information Engineering and Applications, 5(7), 52–66. Retrieved from www.iiste.org.
Muhammad, I., & Yan, Z. (2015). Supervised Machine Learning Approaches: A Survey, 946–952. https://doi.org/10.21917/ijsc.2015.0133.
Nakash, J., Anas, S., Ahmad, S. M., & Azam, A. M. (2015). Real Time Product Analysis using Data Mining. International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), 4(3), 815–820.
Nasa, C. (2012). Evaluation of Different Classification Techniques for WEB Data. International Journal of Computer Applications, 52(9). Retrieved from http://www.ijcaonline.org/archives/volume52/number9/8233-1389.
Nilsson, N. J. (2005). Introduction to Machine Learning: An Early Draft of a Proposed Textbook. Department of Computer Science.
Osmanbegović, E., & Suljić, M. (2012). Data mining approach for predicting student performance. Economic Review – Journal of Economics and Business, X(1), 3–12. Retrieved from http://ef.untz.ba/images/Casopis/Paper1Osmanbegovic.pdf.
Papegnies, E., Labatut, V., Dufour, R., & Linarès, G. (2017). Detection of abusive messages in an on-line community. Conférence en Recherche d'Information et Applications, 0–16.
Ramya, M., & Pinakas, J. (2014). Different Type of Feature Selection for Text Classification. International Journal of Computer Trends and Technology (IJCTT), 10(2), 102–107. Retrieved from http://www.ijcttjournal.org/Volume10/number-2/IJCTT-V10P118.pdf.
Razavi, A. H., Inkpen, D., Uritsky, S., & Matwin, S. (2010). Offensive Language Detection Using Multi-level Classification. Advances in Artificial Intelligence: 23rd Canadian Conference on Artificial Intelligence.
Read, J. (2016). MEKA: A Multi-label/Multi-target Extension to WEKA, 17, 1–5.
Reynolds, K. (2012). Using Machine Learning to Detect Cyberbullying.
Saleem, H. M., Dillon, K. P., Benesch, S., & Ruths, D. (2016). A Web of Hate: Tackling Hateful Speech in Online Social Spaces. First Workshop on Text Analytics for Cybersecurity and Online Safety (TA-COS 2016), Proceedings.
Sathya, R., & Abraham, A. (2013). Comparison of Supervised and Unsupervised Learning Algorithms for Pattern Classification, 2(2), 34–38.
Saunders, M., Lewis, P., & Thornhill, A. (2009). Research Methods for Business Students (5th ed.). Essex, England: Pearson Education Limited. https://doi.org/10.1007/s13398-014-0173-7.2.
Seif, H. (2016). Naïve Bayes and J48 Classification Algorithms on Swahili Tweets: Performance Evaluation. International Journal of Computer Science and Information Security, 14(1).
Sood, S. O., Churchill, E. F., & Antin, J. (2012). Automatic identification of personal insults on social news sites. Retrieved from https://pdfs.semanticscholar.org/3fa4/d63e0194cdbd909c579456830e0a7c909242.pdf.
Srivastava, S. (2014). Weka: A Tool for Data Preprocessing, Classification, Ensemble, Clustering and Association Rule Mining, 88(10), 26–29. https://doi.org/10.5120/15389-3809.
Tanzania. (2015). The Cybercrimes Act, 2015, (14).
TCRA. (2010). The United Republic of Tanzania: Report on Internet and Data Services in Tanzania – A Supply-Side Survey, (September).
Tesha, T. (2015). The Impact of Transformed Features in Automating the Swahili Document Classification. International Journal of Computer Applications, 127(16), 37–42.
Tesha, T., & Baraka, K. (2015). Analysis of Tanzanian Biomass Consumption Using Artificial Neural. Fundamentals of Renewable Energy and Applications, 5(4). https://doi.org/10.4172/20904541.1000169.
Vandersmissen, B. (2012). Automated detection of offensive language behavior on social networking sites. Universiteit Gent.
Vanhove, T., Leroux, P., Wauters, T., & Turck, F. (2013). Towards the Design of a Platform for Abuse Detection in OSNs using Multimedial Data Analysis. In Integrated Network Management. IFIP/IEEE International Symposium on Integrated Network Management (IM 2013).
World Newsmedia Network. (2015). Global Social Media Trends 2015. European Publishers Council. Retrieved from http://epceurope.eu/wp-content/uploads/2015/09/epc-trends-social-media.pdf.
Wu, H. O. C., Wing, R., & Luk, P. (2008). Interpreting TF-IDF Term Weights as Making Relevance Decisions. ACM Transactions on Information Systems, 26(3), 1–37. https://doi.org/10.1145/1361684.1361686.
Xu, R., & Wunsch, D. (2005). Survey of Clustering Algorithms. IEEE Transactions on Neural Networks, 16(3), 645–678.
Zhang, Y., & Haghani, A. (2015). A gradient boosting method to improve travel time prediction. Transportation Research Part C. https://doi.org/10.1016/j.trc.2015.02.019.
APPENDICES
Appendix 1: Sample ARFF message file
Appendix 2: Sample list of stop words
na
ili
hivyo
kwa
au
hiyo
ama
letu
hizo
ndiyo
kwenye
ikiwa
haya
kwamba
ipo
la
iwe
kati
aghalabu
kama
naye
ukubwa
hasa
katika
kuwa
huku
kila
sana
ajili
kuna
lakini
baadhi
mwa
si
hayo
sasa
cha
hii
sisi
za
vya
mimi
yote
sababu
wewe
yetu
ni
nyie
ya
yenye
ninyi
wote
yenyewe
siyo
wa
bali
wengi
yake
hili
wenye
yao
hivi
wingi
yoyote
zetu
ndiye
baadaye
ingekuwa
nyingi
hata
petu
hadi
hali
hawa
huyu
hicho
hizi
hilo
halisi
baada
hiki
ambapo
yuko
huo
huyo
nyingine
nzuri
chake
zote
yupo
wakati
ikawa
ambacho
pia
ambayo
akiwa
chenye
ila
ile
ambaye
pa
tu
zipo
ziko
hako
nao
yale
vizuri
vingi
kingi
huu
kubwa
watakuwa
uko
ukubwa
nzuri
kizuri
ambako
ambao
ambazo
hapa
hapo
nao
nalo
husika
haba
nacho
nani
uwe
hakika
halafu
up
ule
yeye
hao
yapo
yule
huna
yaweyana
huko
kile
humo
hana
awapo
zake
wale
chetu
yana
yako
wao
nina
yangu
wapi
nini
zozote
wangu
nipo
zile
wana
ndogo
zenye
wala
ndio
zikiwa
vyao
ndipo
zangu
vile
ndivyo
zaidi
vema
nayo
yeyote
upo
nazo
yangu
una
lisilo
yasiyo
u
litakuwa
yaliyokuwa
tuna
lini
yaliyopo
the
lipo
yamekuwa
tena
likiwa
yaani
tags
lile
yako
send
lilikuwa
wetu
sawa
limekuwa
wenyewe
peke
licha
wenzake
pekee
lenye
wenzao
papo
ni
wengine
pale
huyo
wawe
ole
wa
wasio
nzima
na
wala
tired
tena
aidha
like
sana
vilevile
fuck
tulia
zaidi
gdnyt
pana
bali
fakers
kubwa
lakini
kwan
vip
wewe
ama
wapi
ww
n
oho
ni
mpaka
duh
kwa
and
ahaa
naye
but
kweli
yeye
bt
kwake
wao
or
kwani
jana
yangu
kwanini
leo
jamani
kote
kesho
weka
juu
usiku
namba
je
mchana
kwake
is
asubuhi
vile
isiyo
au
haujui
itakuwa
vp
huyu
ina
ww
jamani
ingawa
am
sisi
lkn
wenzetu
kumi
kwao
wapo
ishirini
kwenda
wako
thelathini
wakiwa
arobaini
wake
hamsini
vyote
tisini
vipi
sitini
vingine
sabini
upya
themanini
to
mia
moja
elfu
mbili
sio
tatu
on
nne
mzima
tano
mzuri
sita
saba
namba
nane
namna
tisa
mwao
Appendix 3: Corrections Report as per External Supervisor's Observations

(a) Chapter Three: Methodology
    Comment from External Supervisor: To avoid turning the Research Methodology part into a literature review, it was recommended to either delete the cited literature, shift it into the literature review, or use it for justification.
    Correction done by Candidate: The mentioned parts were corrected by using citations for justification.
    Page number: 29 (sections 3.1–3.3)

(b) Chapter Four: Results and Discussion
    Comment from External Supervisor: To improve the quality of some figures.
    Correction done by Candidate: Quality of figures improved.
    Page numbers: 48, 51

(c) Dissertation Presentation and Writing issues
    Comment from External Supervisor: Grammar and other issues highlighted in the dissertation, including the use of a chapter-wise format for figures and tables and improving the quality of some figures.
    Correction done by Candidate: All of the noted issues were rectified throughout the entire dissertation, including spacing, commas and grammar issues.
    Page numbers: 3, 18, 19