A FRAMEWORK FOR AUTOMATED DETECTION OF OFFENSIVE MESSAGES IN SOCIAL NETWORKS IN KISWAHILI

EVERYJUSTUS BARONGO

MASTER OF SCIENCE IN COMPUTER SCIENCE
THE UNIVERSITY OF DODOMA
OCTOBER, 2017

A FRAMEWORK FOR AUTOMATED DETECTION OF OFFENSIVE MESSAGES IN SOCIAL NETWORKS IN KISWAHILI

By
Everyjustus Barongo

A Dissertation submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science of the University of Dodoma

The University of Dodoma
October, 2017

CERTIFICATION

The undersigned certify that they have read and hereby recommend for acceptance by the University of Dodoma the dissertation entitled "A Framework for Automated Detection of Offensive Messages in Social Networks in Kiswahili", in partial fulfillment of the requirements for the award of the Master of Science degree in Computer Science of the University of Dodoma.

………………………………………
PROF. LEONARD MSELLE (SUPERVISOR)

……………………………………
DR. MAJUTO MANYILIZU (SUPERVISOR)

Date……………………….…………

DECLARATION AND COPYRIGHT

I, Everyjustus Barongo, declare that this dissertation is my own original work and that it has not been presented, and will not be presented, to any other university or institution for a similar or any other degree award.

Signature ……………………………………

No part of this dissertation may be reproduced, stored in any retrieval system, or transmitted in any form or by any means without prior written permission of the author or the University of Dodoma.

ACKNOWLEDGEMENTS

First, I thank God the Almighty for keeping me alive and healthy. Secondly, I owe the deepest gratitude to my supervisors, Prof. L. Mselle and Dr. M. Manyilizu, for their support; their guidance helped me throughout this dissertation. My sincere thanks also go to Mr. T. Tesha and Ms. Basilisa for sharing important ideas and resources with me, such as a sample stop-list file and technological background (Jsoup). Their support is invaluable and highly appreciated. Finally, I would like to extend my appreciation to my father, brothers and sisters for the moral support and encouragement they gave me from the beginning to the end of my master's study.

DEDICATION

Dedicated to the memory of my mother, Ester T. Katebe

ABSTRACT

The diffusion of information generated on Social Network Sites is the result of more people being connected. Connected users chat and comment by posting content such as images, videos and messages. Social networks have been useful to communities in that they bring relatives together, especially for sharing experiences and feelings. Although social networks have been beneficial to users, some of the shared messages and comments contain sexual and political harassment. This is also the case in Kiswahili-speaking countries such as Tanzania. On most, if not all, Kiswahili social network sites, offensive messages are publicly posted. These messages harass, embarrass and even assault users, and to some extent lead to psychological effects. This study proposes a framework for automating the detection of offensive messages in social networks in Kiswahili settings by applying selected machine learning techniques. Specifically, the study created a Kiswahili dataset containing sexually and politically offensive messages as well as normal messages¹. All of these messages were collected from Facebook, YouTube and JamiiForum and were used to evaluate the performance of the selected text classification algorithms.
The collected messages were preprocessed using the Bag-of-Words (BoW) model, Term Frequency-Inverse Document Frequency (TF-IDF) and n-gram techniques to generate feature vectors. The experimental findings using the generated feature vectors showed that the Random Forest classifier was capable of correctly assigning a message to its correct class label with an accuracy of 95.0259%, an F1-measure of 0.950 (95.0%) and a false positive rate of 2.8% when applied to the three-category dataset. On the other hand, SVM-Linear showed better results when applied to the two-category dataset. The study suggests that the REST API-based framework, with the Random Forest classifier and the Kiswahili dataset, be deployed in real social network sites to facilitate real-time detection of offensive messages.

¹ Clean messages which do not contain any form of offensive content.

TABLE OF CONTENTS

CERTIFICATION
DECLARATION AND COPYRIGHT
ACKNOWLEDGEMENTS
DEDICATION
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
LIST OF APPENDICES
LIST OF ABBREVIATIONS
CHAPTER ONE: INTRODUCTION
1.0 Introduction
1.1 Background of the Study
1.1.1 Web 2.0 Technologies and Social Media
1.1.2 Statistical Usage of Social Network Sites
1.2 Statement of the Problem
1.3 Objective of the Study
1.3.1 General Objective
1.3.2 Specific Objectives
1.4 Research Questions
1.5 Significance of the Research
1.5.1 To other Researchers and Academicians
1.5.2 To the OSNs Providers and Developers
1.5.3 To the Policy Makers and Law Enforcers
1.5.4 To the OSNs Administrators and Normal Users
1.6 Scope of the Study
1.7 Limitations of the Study
1.8 Organization of the Study
CHAPTER TWO: LITERATURE REVIEW
2.0 Introduction
2.1 Conceptual Definitions
2.1.1 Offensive Language
2.1.2 Web Crawling and Data Extraction
2.1.3 Machine Learning
2.1.3.1 The Categories of Machine Learning Methods
2.1.3.2 Supervised Learning
2.1.3.3 Unsupervised Learning
2.1.3.4 Reinforcement Learning
2.2 Text Classification Algorithms
2.2.1 Naïve Bayes
2.2.2 Support Vector Machine (SVM)
2.2.3 Artificial Neural Networks-Multilayer Perceptron (ANN-MLP)
2.2.4 Random Forest
2.2.5 Decision Tree Classifier (J48)
2.2.6 Baseline Classifiers
2.3 Model Evaluation
2.3.1 Evaluation Metrics
2.3.2 Cross-validation
2.4 Empirical Studies
2.5 Research Gap
2.6 Conceptual Framework
2.7 Conclusion
CHAPTER THREE: METHODOLOGY
3.0 Introduction
3.1 Research Design
3.2 Research Setting
3.3 Research Approach
3.4 Data Collection Method and Tool
3.4.1 Primary Data
3.4.2 Secondary Data
3.4.3 Experiment Tool and Environment
3.4.4 Experimentation Steps
3.4.5 Data Preprocessing
3.4.6 Feature Extraction
3.4.7 Training the Model
3.4.8 Evaluating the Models
3.5 Data Analysis
3.6 Ethical Issues
3.7 Reliability and Validity
3.8 Conclusion
CHAPTER FOUR: RESULTS AND DISCUSSION
4.0 Introduction
4.1 Creating Kiswahili Dataset of Offensive Messages from Social Networks
4.1.1 Data Extraction from Social Networks
4.1.2 Messages Annotation
4.1.3 Data Preprocessing
4.1.4 Feature Representation, Extraction and Selection
4.2 Build and Evaluate Model by Applying Some Machine Learning Algorithms
4.2.1 Baseline Classifiers Performance
4.2.2 Detecting Offensive Messages
4.2.3 Size of Training Sample
4.2.4 Feature Representations
4.2.5 Categories
4.2.6 Time Taken to Train and Test Model
4.3 Proposed Framework for Detecting Offensive Kiswahili Messages
4.3.1 Proposed Framework Architecture
4.3.2 Component Details of the Proposed Framework
4.3.3 Framework Properties
4.3.4 Implementation Consideration
4.4 Conclusion
CHAPTER FIVE: SUMMARY, CONCLUSION AND RECOMMENDATION
5.0 Introduction
5.1 Summary of the Study
5.2 Conclusion
5.3 Recommendations
5.4 Area for Further Research
REFERENCES
APPENDICES

LIST OF TABLES
Table 1.1: Sample categories of social media
Table 4.1: Message distribution
Table 4.2: Training dataset distribution category-wise
Table 4.3: Testing dataset distribution
Table 4.4: Baseline performance
Table 4.5: The performance of classifiers on varying data size
Table 4.6: Classifiers performance based on feature representation
Table 4.7: Performance on dataset with 3 categories
Table 4.8: Performance for dataset with 2 categories

LIST OF FIGURES
Figure 1.1: Social Networks Usage (2010-2020)
Figure 2.1: Support Vector Machine Margin
Figure 2.2: Artificial Neural Network-Multi-Layer Perceptron
Figure 2.3: Pseudo-code for Random Forests
Figure 2.4: Study Conceptual Framework
Figure 3.1: Text Classification Framework
Figure 4.1: Distribution of Messages in Training Dataset
Figure 4.2: The most frequent words in dataset
Figure 4.3: Performance Evaluation on Accuracy
Figure 4.4: Classifiers Learning Rate
Figure 4.5: Classifiers performance based on feature representation
Figure 4.6: Performance comparisons on n-gram feature
Figure 4.7: Performance for general-purpose and categorical models
Figure 4.8: Comparison of False Positive Rate
Figure 4.9: Time taken to train and test model
Figure 4.10: Study Proposed Framework

LIST OF APPENDICES
Appendix 1: Sample ARFF message file
Appendix 2: Sample list of stop words
Appendix 3: Corrections Report as per External Supervisors' Observations

LIST OF ABBREVIATIONS
AI Artificial Intelligence
ANN-MLP Artificial Neural Network-Multi-Layer Perceptron
ANN Artificial Neural Network
API Application Programming Interface
BOW Bag-of-Words
CIVE College of Informatics and Virtual Education
DMT Data Mining Technique
HTTP Hypertext Transfer Protocol
IBK Instance-Based K-nearest neighbor
JSON JavaScript Object Notation
ML Machine Learning
MLA Machine Learning Algorithms
MNOs Mobile Network Operators
NLP Natural Language Processing
OSNs Online Social Networks
POS Part-of-Speech
REST Representational State Transfer
SA Sentiment Analysis
SL Supervised Learning
SVM Support Vector Machine
TC Text Categorization
TCRA Tanzania Communications Regulatory Authority
TF-IDF Term Frequency-Inverse Document Frequency
TOS Terms of Service
WEKA Waikato Environment for Knowledge Analysis

CHAPTER ONE
INTRODUCTION

1.0 Introduction
This chapter discusses the key concepts that define the disciplinary subject matter of the study whose design and findings are presented in this dissertation. A critical review of some of the key ideas that were instrumental in the choice of the research topic is presented. Background information on the research topic is followed by the definition of the research problem, the study objectives and the research questions that guided the study. The chapter concludes by discussing the significance of the study, its scope, its limitations and the organization of the research report.

1.1 Background of the Study
Online Social Network (OSN) sites are computer-driven social networks which provide users with the flexibility to create online communities and to share information, ideas and personal messages. Since their establishment, OSNs such as MySpace, Facebook, Twitter, YouTube, Google+, Cyworld and Bebo have attracted millions of users around the world. Many of these global users have integrated these sites into their daily practices, such as business (Ellison & Boyd, 2008). Among other factors, the prosperity of online social networks is due to Web 2.0 technologies. Web 2.0 technologies are social in nature and provide users with flexible collaboration. In addition, Web 2.0 technologies have resulted in different varieties of social media (Vanhove et al., 2013).

1.1.1 Web 2.0 Technologies and Social Media
The Web 2.0 technologies are categorized depending on their main purposes.
According to Lutu (2015), Web 2.0 technologies enable users to create and share social media. Table 1.1 provides a summary of some of the well-known categories of social media.

Table 1.1: Sample categories of social media (Lutu, 2015)

Media category   | Purpose                                                                                        | Example of service
Blogs (web-logs) | Facilitate the expression of personal opinions by the public                                  | Michuzi blog, JamiiForum
Micro blogs      | Facilitate the expression of personal opinions about what is happening right now              | Twitter
Social networks  | Professional or social networking sites which facilitate meeting people and sharing content   | Facebook, LinkedIn, Twitter
Collaborating    | Collaborative reference works (e.g. Wikipedia) that are built using wiki-style software tools | Wikipedia
Media sharing    | Facilitate the sharing of digital media, e.g. videos                                          | YouTube

The advancement of Web 2.0 technologies and social media, and the reasons for their widespread adoption, have been attributed to the rapid increase in mobile devices such as smartphones, tablets, laptops and desktop computers, and to the availability of affordable internet facilities provided by the Mobile Network Operators (MNOs). In addition, the social messengers or chat apps installed on mobile phones and computers have increased their growth. Furthermore, social networking sites have gained popularity because they provide users with the opportunity to meet new people and join groups of their own interest. Moreover, these sites are free to access and do not require users to have design or publishing skills (Asur & Huberman, 2010; Hee et al., 2015).

1.1.2 Statistical Usage of Social Network Sites
A large number of users are connected on social network sites. According to World Newsmedia Network (2015), Facebook is the most popular, followed by WhatsApp, WeChat and Twitter. WhatsApp is reported to average 600 million users per month, WeChat has about 500 million users, and Twitter averages 300 million users per month. Furthermore, according to Statista Inc. (2017), social media penetration worldwide is increasing, with 68.3% of internet users being social media users. Social networking is becoming one of the most popular online activities, with a high rate of user engagement, and is expected to increase as indicated in Figure 1.1.

Figure 1.1: Social Networks Usage (2010-2020) (Statista Inc., 2017)

A huge amount of information is generated as more people are connected on social networking sites. Whilst social media messages are important for enhancing communication and business, they can also offend others. The explosion of online social media has given rise to concerns about new forms of offensive messages. Offensive messages contain violence, aggression and volumes of inappropriate content, including messages likely to be assaulting, annoying or harassing to a recipient. Because of this phenomenon, a new field of study called Sentiment Analysis (SA) has been introduced. This field, also called Opinion Mining, analyzes people's opinions, sentiments, evaluations, appraisals, attitudes and emotions towards entities such as products, services, organizations, individuals, issues, events, topics and their attributes (Laskari & Sanampudi, 2016). Because of these emotions, opinions and attitudes toward entities, automated offensive message detection techniques form a part of SA (Sood et al., 2012).
SA involves the application of Machine Learning (ML) techniques, Data Mining Techniques (DMT), Natural Language Processing (NLP), Computational Linguistics (CL) and Mathematics to build models which extract insights or knowledge from social media and categorize data into different classes. Machine learning is the field of Artificial Intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Thus, it focuses on developing learning algorithms that perform learning tasks automatically and exhibit intelligent behaviors without human intervention (Muhammad & Yan, 2015). To be able to build accurate models, machine learning is categorized as supervised, unsupervised or reinforcement learning, and each category is applied depending on the task to be solved and the type of data to be processed (Dasgupta & Nath, 2016).

The same benefits of Web 2.0 technologies are observed in developing countries, particularly in Tanzania, where people are connected to these different social networks. Their penetration and adoption rates have been attributed to the relative availability and affordability of smartphones, laptops and desktop computers, as well as robust internet facilities offered by Mobile Network Operators (MNOs) (Msavange, 2015; TCRA, 2010). In addition, social network sites are language independent, so Tanzanians use Kiswahili for social networking. Despite the benefits observed in social networks, some people misuse the medium by promoting offensive and hateful language like "Mimi namchukia huyu baba namungu anisamehe nitahira na nimwehu mpaka anaowachunga wote matahira nawehu". The dimensions of these misconducts range from hate speech, cyberbullying and cyber-stalking, all targeting specific group characteristics such as race, ethnic origin, gender, religion and sexual innuendo (Reynolds, 2012). Also, some users send suspicious messages, insult or provoke other people. All of these behaviors are contrary to social network Terms of Service (TOS) (LegalAid, 2010) and Tanzania's Cybercrime Act of 2015 (Tanzanian government, 2015). To some extent, it may be argued that the initiative made by the Tanzanian government through the passing of the Cybercrime Act of 2015 has created some basis for rescuing the situation, whereby people may now be sued or fined if they behave abusively in OSNs. However, social networks' data are of high volume, velocity and variety. For administrators or legal authorities, manually reviewing these online messages and other posts to detect offensive content is an extremely labor-intensive and time-consuming endeavor, and it is neither suitable nor scalable in reality (Chen, 2012). Moreover, people's messages might be misclassified by placing them in wrong categories because of personal perceptions, interest and/or hatred, which may result in unfair pain to them. Lastly, with human judgments on the same problem, a number of discrepancies may result because of different sensitivities, moods, backgrounds and other subjective conditions among different people (Razavi et al., 2010).

1.2 Statement of the Problem
Manual classification of offensive messages may work well for small datasets. This means that when dealing with a small number of groups of individuals, the offensive messages are often few and can be eliminated easily. However, when the recipient receives a lot of data it becomes difficult to detect offensive messages; hence, the application of machine learning and data mining techniques becomes a crucial part (Liu, 2012).
To automate the detection of offensive messages in social networks, several approaches and techniques have been proposed. Bretschneider and Peters (2017) and Papegnies et al. (2017) implemented automated approaches to detect offensive statements towards immigrants/foreigners and within an online community, respectively. The proposed approaches were based on German and French social media datasets. Furthermore, a dictionary-based approach (Hilte et al., 2016) was suggested and implemented to detect racism in Dutch social media. Despite the approaches already suggested for offensive language detection, most studies show that existing approaches are language dependent. Their datasets were prepared from specific language settings such as English, German, French and Dutch. Such approaches raise the need to conduct a study based on the Kiswahili language, given the fact that it is morphologically, syntactically and semantically complex and has rapidly emerging words which demand special treatment (Massamba et al., 1999; Tesha, 2015). In addition, Kiswahili is a national language in Tanzania and a lingua franca in much of Eastern and Central Africa (Mulokozi, 2000; Hinnebusch, 2003). Moreover, it is hard to find a framework that has focused on automated detection and discrimination of offensive messages in social networks in the Kiswahili language. Therefore, based on the aforementioned shortcomings, there is justification for undertaking a study in this subject field.

The aim of this research, therefore, was to propose a framework for detecting offensive Kiswahili messages on social network sites by applying machine learning models. First, the automated process will help to eliminate or mark offensive messages from a list of messages before they are shared. Second, the framework will increase trustworthiness and user experience and promote wider adoption of social networking sites in Kiswahili-speaking countries.

1.3 Objective of the Study
1.3.1 General Objective
The purpose of this study was to propose a framework for automating the detection of offensive messages in social networks under Kiswahili settings by applying appropriate Machine Learning Algorithms (MLA).

1.3.2 Specific Objectives
i. To create a Kiswahili dataset of offensive messages from social networks for generating feature vectors.
ii. To build and evaluate models by applying appropriate machine learning algorithms and techniques to the Kiswahili dataset.
iii. To propose an architectural framework that can be adopted in social network sites for detecting offensive language in Kiswahili settings.

1.4 Research Questions
i. How can a Kiswahili dataset of offensive messages from social networks be created for generating feature vectors?
ii. How can machine learning algorithms be applied and evaluated in building models which can detect improper messages in Kiswahili?
iii. How can a framework for detecting offensive Kiswahili messages in social network sites be designed?

1.5 Significance of the Research
The study sought to contribute to the growing literature on the automated detection and barring of offensive messages, which have become of increasing concern to most users of online social media. Findings of the study are intended to benefit various groups who are either directly or indirectly concerned with proper usage of online social networks (OSNs).
1.5.1 To other Researchers and Academicians
The study aims to contribute to the body of knowledge on how Machine Learning Algorithms can be applied in detecting improper Kiswahili messages in social networks. Accurate MLAs applied to the Kiswahili language could also be used in other tasks, such as business applications for Tanzanians.

1.5.2 To the OSNs Providers and Developers
The study findings will help social network developers and providers to adopt and integrate the developed framework into their applications so that they can filter improper posts automatically before they are shared with different users. Through objective one, the study will also help to keep digital evidence for forensic investigations in case it is required.

1.5.3 To the Policy Makers and Law Enforcers
Policy makers and law enforcers such as the TCRA may adopt and use the framework to develop a tool for detecting online offenders and taking appropriate disciplinary measures against them. The model will help to eliminate human intervention and reduce labor-intensive methods of detecting social media messages that seek to promote hatred, while also saving time, thus resulting in balanced judgments. The framework will also help to keep digital evidence for forensic investigations.

1.5.4 To the OSNs Administrators and Normal Users
The model will help normal users to feel free and more comfortable while online and joining groups, as there will be no more insults and abusive language in social networks.

1.6 Scope of the Study
The study is confined to verbal Kiswahili messages posted on the Facebook, JamiiForum and YouTube social networks only. The choice of these social networks was due to the availability of public pages for easy extraction of messages.

1.7 Limitations of the Study
i. Limited access to information: some messages were not accessible due to confidentiality and the privileges required to join groups, and not all respondents were cooperative enough to provide some of the information, which led to difficulties in collecting data on time. To handle this issue, only publicly available pages were considered.
ii. Time constraints: the time allocated was very short, and the study needed more time to collect more messages from social networks effectively and efficiently. The study handled this challenge by increasing the number of hours spent collecting messages from the three social networks.

1.8 Organization of the Study
The remaining part of the study report is organized as follows. Chapter Two discusses the literature that was reviewed during the study. Chapter Three presents the methodology that was employed for conducting the study, including an outline of the data collection, experiment setup and analysis methods used to accomplish each specific objective, the ethical issues that were considered during the entire study, and the validity and reliability of the data. Chapter Four presents and discusses the experimental findings obtained in order to address the research questions. Chapter Five presents the conclusion, recommendations and areas for further investigation.

CHAPTER TWO
LITERATURE REVIEW

2.0 Introduction
This chapter discusses, in summary form, the various published materials that were consulted in order to understand and investigate the research problem.
The literature review facilitated the researcher's efforts in acquiring a conceptual understanding and definitions of key terms, text classification algorithms, as well as acquaintance with empirical studies that focus on frameworks and detection techniques for offensive messages in social network sites. It also presents the research gap, the conceptual framework and concluding remarks.

2.1 Conceptual Definitions
2.1.1 Offensive Language
Offensive language has been defined by Razavi et al. (2010) as phrases which mock or insult somebody or a group of people (attacks such as aggression against some culture, subgroup of the society, race or ideology in a tirade). Moreover, they itemized several categories of offensive language:
Taunts: phrases that try to condemn or ridicule the reader in general.
References to handicaps: phrases that attack the reader using his/her shortcomings (e.g., "IQ challenged").
Squalid language: phrases that target sexual fetishes or physical filth of the reader.
Slurs: phrases that try to attack a culture or ethnicity in some way.
Homophobia: phrases that usually express anti-homosexual sentiments.
Racism: phrases that intimidate the race or ethnicity of individuals.
Extremism: phrases that target some religion or ideology.

There are also some other kinds of flames, in which the flamer abuses or embarrasses the reader (not an attack) using some unusual words/phrases, such as:
Crude language: expressions that embarrass people, mostly because they refer to sexual matters or excrement.
Disguise: expressions whose meaning or pronunciation is the same as another, more offensive term.
Provocative language: expressions that may cause anger or violence.
Unrefined language: expressions that lack polite manners, where the speaker is harsh and rude.

Although offensive language is still an ambiguous term, this study adopted and referred to the above definitions, focusing specifically on squalid (sexual) language and politics-related offensive messages. Moreover, the study used the mentioned areas to collect and analyze data from social network sites so as to create the Kiswahili dataset.

2.1.2 Web Crawling and Data Extraction
Web crawling involves the automated collection of information from the web. Web crawling is performed by specialized web crawlers. The crawlers collect and update specific web content in order to perform web search and indexing of content (Barcaroli et al., 2014).

2.1.3 Machine Learning
Mitchell (1997) defined machine learning as a computer program which learns from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. This was similarly defined by Alpaydın (2010) as the process of searching for a good function F : I → O, where I is the set of possible inputs and O the set of possible outputs. Therefore, machine learning involves devising learning models which can automatically adjust to external data or the environment.

2.1.3.1 The Categories of Machine Learning Methods
According to Alpaydın (2010), there are three typical categories of machine learning, namely supervised learning, unsupervised learning and reinforcement learning.

2.1.3.2 Supervised Learning
In Supervised Learning (SL) the algorithm observes some example input-output pairs and learns a function that maps from input to output.
In this case, the training set given to the algorithm is a labeled dataset, and the learning process consists of finding the relationship between the feature set and the label set. The resulting relationship is the estimated function F : X → Y learned from the labeled training examples (x, y), known as a model (Kotsiantis, 2007). The resulting classifier F is then used to assign class labels to testing instances for which the values of the predictor features are known but the value of the class label is unknown. Thus, if each feature vector x corresponds to a label y ∈ L, where L = {l1, l2, ..., lc} (c usually ranges from 2 to a hundred), the learning problem is called classification. On the other hand, if each feature vector x corresponds to a real value y ∈ R, the learning problem is defined as a regression problem (Chao, 2011). The knowledge extracted from supervised learning is often utilized and applied in prediction tasks.

2.1.3.3 Unsupervised Learning
The training set given to an unsupervised learning algorithm is an unlabeled dataset. Given the feature vectors of the data D = {x0, x1, ..., xn}, the aim is to look for a model F which gives some useful insight into the data D. The most common unsupervised learning task is clustering: detecting potentially useful clusters of input examples. Other examples include probability density estimation, finding associations among features, and dimensionality reduction (Xu & Wunsch, 2005). In general, an unsupervised algorithm may simultaneously learn more than one property existing in the dataset, and the results from unsupervised learning can be further used for supervised learning (Nilsson, 2005).

2.1.3.4 Reinforcement Learning
Reinforcement Learning (RL) uses a scalar reward signal to evaluate input-output pairs and hence discover, through trial and error with its environment, optimal outputs for each input (Sathya & Abraham, 2013).

2.2 Text Classification Algorithms
One of the major outputs of an SL technique is a deduced function that can assign items to different classes, a task known as classification or categorization. One of the commonly known tasks in classification is text classification. According to Kotsiantis (2007), Text Categorization (TC) is used to automatically assign previously unseen documents to a predefined set of categories. To accomplish the TC task, several SL algorithms for text categorization have been suggested, applied and evaluated accordingly.

2.2.1 Naïve Bayes
Naïve Bayes is a simple classifier based on Bayes' theorem. It is a statistical classifier which performs probabilistic prediction. The classifier works under the assumption that the attributes are conditionally independent. Equation 2.1 shows the typical Naïve Bayes formula from the mathematical point of view (Seif, 2016).

P(Ci | X) = P(X | Ci) P(Ci) / P(X) .......................... (2.1)

Using equation (2.1), the classifier, or simple Bayesian classifier, works as follows:
1) Let D be a training set of tuples and their associated class labels. Each tuple is represented by an n-dimensional attribute vector X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n attributes A1, A2, ..., An, respectively.
2) Suppose that there are m classes C1, C2, ..., Cm. Given a tuple X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X.
That is, the Naïve Bayesian classifier predicts that tuple X belongs to class Ci if and only if P(Ci | X) > P(Cj | X) for 1 ≤ j ≤ m, j ≠ i. Thus we maximize P(Ci | X). The class Ci for which P(Ci | X) is maximized is called the maximum posteriori hypothesis.
3) From equation (2.1), since P(X) is constant for all classes, only P(X | Ci) P(Ci) needs to be maximized. The classifier then predicts that data item X belongs to class Ci if and only if this quantity is the highest compared with the other class labels.

2.2.2 Support Vector Machine (SVM)
The SVM uses the concept of margin. It constructs a maximum-margin separator, which is a decision boundary with the largest possible distance to the example points. Given training examples xi and target values yi ∈ {-1, 1}, the SVM searches for a separating hyperplane which separates positive and negative examples from each other with maximal margin (the optimal hyperplane) (Christopher, 2006; Boser, Guyon & Vapnik, 1992).

Figure 2.1: Support Vector Machine Margin (Chang & Lin, 2011)

If the training data are linearly separable, then a pair (w, b) exists such that

w^T xi + b ≥ +1, for all xi ∈ P
w^T xi + b ≤ -1, for all xi ∈ N .......................... (2.2)

SVMs (Boser, Guyon & Vapnik, 1992) find a maximum separating hyperplane between the examples from the two classes. K(xi, xj) = xi^T xj is called a kernel function. Currently, there are several known kernel functions that can be applied for solving various problems:
i. Linear: K(xi, xj) = xi^T xj
ii. Polynomial: K(xi, xj) = (γ xi^T xj + r)^d, γ > 0
iii. Gaussian radial basis function (RBF): K(xi, xj) = exp(-γ ||xi - xj||^2), for γ > 0
iv. Sigmoid: K(xi, xj) = tanh(γ xi^T xj + r)
In the kernel functions above, γ, r and d are kernel parameters that need to be tuned (Chang & Lin, 2011).

2.2.3 Artificial Neural Networks-Multilayer Perceptron (ANN-MLP)
The ANN-MLP is one of the popular ANNs, consisting of multiple layers of computational units, usually interconnected in a feed-forward way, representing a nonlinear mapping between an input vector and an output vector (Tesha & Baraka, 2015). Each neuron in one layer has direct connections to the neurons of the subsequent layer, as depicted in Figure 2.2 (Christopher, 2006). The MLP is most suitable for approximating a classification function, and consists of the input layer, one or more hidden layers of processing elements, and the output layer of processing elements (Osmanbegović & Suljić, 2012).

Figure 2.2: Artificial Neural Network-Multi-Layer Perceptron (Christopher, 2006)

The Multi-Layer Perceptron (MLP) is a supervised learning algorithm which uses back-propagation to learn a function from given examples. Kumari & Godara (2011), for their part, argue that the use of back-propagation for reducing classification error by optimizing the weights makes the MLP the most commonly used and well-studied ANN architecture, capable of learning arbitrarily complex nonlinear functions to arbitrary accuracy levels. With reference to Figure 2.2, Danjuma & Osofisan (2015) define an ANN based on:
a) the interconnection pattern between different layers of neurons;
b) the learning process for updating the weights of the interconnections; and
c) the activation function that converts a neuron's weighted input to its output activation.

2.2.4 Random Forest
Random Forest (RF) was originally developed by Leo Breiman (2001). It combines two machine learning techniques, one being bagging and the other random feature selection.
According to Biau (2012), RF is an ensemble learning classifier consisting of a group of un-pruned or weak decision trees built from random selections of samples of the training data. Its premise is based on the concept of building many small, weak decision trees in parallel and then combining the trees to form a single, strong learner by aggregating (majority vote for classification or averaging for regression) the predictions of the ensemble (Ali et al., 2012). The algorithm works as follows: for each tree in the forest, select a bootstrap sample S* of size n from the training data and then learn a decision tree using a modified decision-tree learning algorithm. At each node of the tree, a subset v of the p features is randomly selected; the node then splits on the best feature in v rather than in p. Finally, a random forest with M decision trees is formed by repeating the above procedure M times, and the random forest is then used to predict test data, as depicted in Figure 2.3. During testing, each test point is simultaneously pushed through all trees (starting at the root) until it reaches the corresponding leaves, and the classification is decided by all the votes (Criminisi et al., 2012).

Figure 2.3: Pseudo-code for Random Forests (Zhang & Haghani, 2015)

According to Jia et al. (2013), Random Forest has two most significant parameters: one is the number of features m used for splitting each node of a decision tree (with m smaller than the total number of features p), and the other is the number of trees (M). Since Random Forest uses bagging and restricts each split-test to a small random sample of features, it decreases the correlation between trees in the ensemble and helps to learn more decision trees in a given amount of time. One notable property of the algorithm is that it does not produce the overfitting phenomenon when characteristic parameters of higher dimension are used.

2.2.5 Decision Tree Classifier (J48)
A decision tree classifier builds a decision tree based on if-then rules. According to Kumari et al. (2011), a decision tree recursively separates the training dataset into small branches to construct a tree for the purpose of improving prediction accuracy. This step is repeated at each leaf node until the complete tree is constructed. The tree uses entropy to determine the similarity of the samples to be split at the same node, and information gain to determine the split with the smallest entropy value (Criminisi et al., 2012).

2.2.6 Baseline Classifiers
A baseline classifier gives a baseline accuracy on the dataset that must always be checked before choosing sophisticated classifiers. It is a method that uses heuristics, simple summary statistics, randomness or simple machine learning to create predictions for a dataset. The resulting metrics then become what other machine learning algorithms must be compared against.

ZeroR Classifier
ZeroR, or Zero Rule, is a classification method which depends only on the target and ignores all predictors. ZeroR simply predicts the majority class and is useful for determining a baseline performance as a benchmark for other classification methods (Nasa, 2012).

OneR Classifier
OneR, short for "One Rule", is a classification algorithm that generates one rule for each predictor in the data and selects the rule with the smallest total error as its single rule. To create a rule for a predictor, a frequency table is constructed for that predictor against the target (Nasa, 2012), as illustrated in the sketch below.
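The frequency table referred to by Nasa (2012) does not survive in this text, so the following is a minimal, hypothetical Java sketch (not taken from that source) of the idea: for one nominal predictor, count class frequencies per predictor value, keep the majority class per value as the rule, and record the rule's errors. The toy predictor ("contains a flagged word?") and labels are illustrative assumptions only.

```java
import java.util.*;

/** Minimal OneR illustration on hypothetical toy data (not the study's dataset). */
public class OneRSketch {

    public static void main(String[] args) {
        // Each row: {predictor value, class label}.
        String[][] rows = {
            {"yes", "offensive"}, {"yes", "offensive"}, {"yes", "normal"},
            {"no",  "normal"},    {"no",  "normal"},    {"no",  "offensive"}
        };

        // Frequency table: predictor value -> (class label -> count).
        Map<String, Map<String, Integer>> freq = new HashMap<>();
        for (String[] row : rows) {
            freq.computeIfAbsent(row[0], v -> new HashMap<>())
                .merge(row[1], 1, Integer::sum);
        }

        // One rule: for each predictor value, predict its majority class; sum the errors.
        int errors = 0;
        for (Map.Entry<String, Map<String, Integer>> e : freq.entrySet()) {
            String majorityClass = Collections.max(
                    e.getValue().entrySet(), Map.Entry.comparingByValue()).getKey();
            int total = e.getValue().values().stream().mapToInt(Integer::intValue).sum();
            errors += total - e.getValue().get(majorityClass);
            System.out.println("IF predictor = " + e.getKey() + " THEN class = " + majorityClass);
        }
        System.out.println("Total errors for this rule: " + errors + " / " + rows.length);
    }
}
```

OneR would repeat this procedure for every predictor in the dataset and keep the predictor whose rule produces the fewest total errors.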
The algorithm also serves as a baseline for evaluating other classifiers on the same dataset.

2.3 Model Evaluation
2.3.1 Evaluation Metrics
The evaluation of classification methods is based on the number of correctly classified messages against falsely classified messages (Ian & Frank, 2005). There are four different situations that can occur when a new message is classified:
True Positive (TP): the classifier correctly indicates the message as offensive. In other words, the message is rightly classified as sexual or political.
True Negative (TN): the classifier correctly indicates the message is not offensive. In other words, the message is rightly classified as normal.
False Positive (FP): the classifier wrongly predicts the message as offensive (sexual or political) when it is actually a normal message.
False Negative (FN): the classifier wrongly indicates the message is normal when it is actually offensive.

The simplest way to evaluate the performance of a classification system is to analyze its accuracy. Accuracy shows the general correctness of a classifier and is calculated as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN) .......................... (2.3)

However, a classification system that automatically labels all samples as normal messages would yield very high accuracy results when dealing with a dataset that contains only a very small number of positive samples (sexual and political) (Vandersmissen, 2012). Accuracy may not work well in an environment where one category dominates the others. Therefore, the evaluation also uses precision, the proportion of predicted positives which are actual positives, and recall, the proportion of actual positives which are predicted positive, as shown in equations 2.4 and 2.5 respectively.

Precision = TP / (TP + FP) .......................... (2.4)

Recall = TP / (TP + FN) .......................... (2.5)

Furthermore, analyzing precision and recall provides a better understanding of the performance of offensive message detection. The evaluation also involves the F1-measure, which is an evenly weighted combination of both precision and recall:

F1-measure = 2 × (precision × recall) / (precision + recall) .......................... (2.6)

2.3.2 Cross-validation
Apart from testing on a separate test set, it is also possible to extract a small part from the training set to use as a validation set and repeat the process several times. This method is called cross-validation or stratified cross-validation (Ian et al., 2005). In this technique the training dataset is divided into a fixed number of k parts called folds; the classifier is then trained on k-1 parts and tested on the remaining part. The procedure is repeated k times so that, in the end, every instance has been used exactly once for testing. The error rates from the different iterations are averaged to yield an overall error rate. This technique provides an accurate view of the performance of a classifier.

2.4 Empirical Studies
Bretschneider and Peters (2017) conducted a study on detecting offensive statements toward foreigners in social media in the German language; they proposed an approach to automatically detect such statements to aid personnel in this labor-intensive task. They performed binary and multi-class classification by applying machine learning to a bag-of-words (BoW) model. The developed models were evaluated using precision, recall and F1-measure, and yielded precision values of 75.26% and 73.8% and an F1-value of 67.91%.
Chen et al. (2012), for their part, offer a proposal on how offensive language may be detected in social media to protect adolescent online safety. They proposed the Lexical Syntactic Feature (LSF) architecture to detect offensive content and identify potential offensive users in social media. Their experiment revealed that the LSF achieved a precision of 98.24% and recall of 94.34% in sentence offensiveness detection, as well as a precision of 79.9% and recall of 71.8% in user offensiveness detection. The study used English datasets, and the processing speed of LSF was approximately 10 msec per sentence, signifying its suitability for adoption in social media.

Garcia-Recuero (2016) conducted a study on discouraging abusive behavior in privacy-preserving online social networking applications. The study collected data from the Twitter social network and built a tool, Trollslayer, to collect the data while preserving users' privacy over sensitive data.

Saleem et al. (2016), for their part, conducted a study on a web of hate: tackling hateful speech in online social spaces. The basis of the study was to propose hateful-speech solutions for social media. They proposed an approach using self-identifying hateful communities as the training dataset. They applied multiple machine learning algorithms to generate language models of hateful communities, in particular Naïve Bayes (NB), Support Vector Machine (SVM) and Logistic Regression (LR). In addition, they used web scraping libraries to retrieve all publicly available comments on the Reddit website, and a total of 50,000 comments were collected for the training and testing sets. Their study revealed that SVM and NB outperformed LR; they further suggested that the same study could be conducted on social networks like Twitter and Facebook.

Killam et al. (2016) classified Android malware through analysis of string literals. In their study, they applied a linear Support Vector Machine (SVM) and word-level 3-grams to classify Android malware through analysis of string literals. The resulting model correctly classified the malware applications with an accuracy of 99.20% while maintaining a false positive rate of 2.00%.
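As a brief aside, word-level n-grams such as the 3-grams in the study above also appear among this study's own feature representations (Section 4.2.4). The following is a minimal, hypothetical Java sketch, not drawn from any of the cited studies, of how word-level n-grams can be produced from a whitespace-tokenized Kiswahili message before they are counted into a feature vector; the example sentence is an assumption for illustration only.

```java
import java.util.*;

/** Minimal word-level n-gram extraction (illustrative only). */
public class NGramSketch {

    /** Returns all n-grams of the given size from a whitespace-tokenized message. */
    static List<String> wordNGrams(String message, int n) {
        String[] tokens = message.toLowerCase().trim().split("\\s+");
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= tokens.length; i++) {
            grams.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Hypothetical, non-offensive Kiswahili example sentence.
        String message = "huduma za mtandao zinasaidia watu kuwasiliana";
        System.out.println("Unigrams: " + wordNGrams(message, 1));
        System.out.println("Bigrams:  " + wordNGrams(message, 2));
        System.out.println("Trigrams: " + wordNGrams(message, 3));
    }
}
```

In practice, the extracted n-grams would then be counted (BoW) or weighted with TF-IDF to form the numeric feature vector consumed by a classifier.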
Opesade et al., (2016) conducted a Forensic Investigation of Linguistic Sources of Electronic Scam Mail using a Statistical Language Modeling Approach .Their study aimed at investigating the propensity of Nigeria‟s involvement in authoring the scam mails fraudulent. Their experiment study included a total of 873 scam mails 349 non scam mails from different scam Baiter‟s Websites. The experiment was carried on Waikato Environment for Knowledge Analysis (WEKA) data mining software and among of all applied Machine Learning Algorithms, Instance Based K-nearest (IBK) neighbor was found to be the most precise model in terms of accuracy and Kappa statistics to detect the sources of scam mails. 25 Hee et al.,(2015) conducted a study on automatic detection and prevention of Cyberbullying. Their study was conducted through applying SVM as a learning algorithm in Python programming. During data pre-processing they applied tokenization, Partof-speech (POS) tagging and lemmatization using LeTs pre-process Toolkit. Evaluation of the resulting model was done using 10-fold Cross-validation and Fscore and recall as evaluation metrics. The resulting model was capable of detecting cyber-bullying with F-score of 55.39% and accuracy of 78.50%. Gerbet & Kumar,( 2014) on their part implemented Google Safe browsing database system which classifies malicious URLs. It consists of API interface to which query the state of a URL. From the Clients side, URLs is sent and checked using HTTP GET or POST requests and the server‟s response contains directly an answer for each URL query. A similar study on cyber bullying detection was conducted by Huang et al (2014) using social and textual analysis, where their dataset was divided into two parts, with 70% of the dataset (both bullying and non-bullying messages) were used as training set and 30% being used as the testing set. Different classifiers were used in the experiment, these included J48, Naïve Bayes, SMO, Bagging and Dagging. The experiment was conducted in WEKA 3.0 as implementation tools. The evaluation metrics employed were Receiver Operating Characteristic (ROC) and True Positive rate were used. 2.5 Research Gap Despite several studies that have been conducted regarding offensive messages detection on social networks, still there is hardly any study that has been conducted 26 regarding Kiswahili language context. Furthermore, the proposed solutions from these studies do not match with how one may deal with Kiswahili offensive detection since they depends on the language context in which data were originally collected. This raises the need to conduct the study by considering Kiswahili language as the case of this study. This is because all existing languages in the world differ in syntax, pragmatic and semantic which force the researcher not to relay on the already conducted studies to save for Kiswahili environment but to considered these studies as benchmark for carrying a new study. It is believed that this study will bridge the currently existing gap and open up new research areas. 2.6 Conceptual Framework The Figure 2.4 represents the conceptual framework that guided this study. Figure 2. 4: Study Conceptual Framework From Figure 2.4 text messages were collected from social network like Facebook by using WebCrawler. The message containing offensiveness were labeled by using human annotators and used to generate a feature vectors. 
The generated feature vectors were used as training and test sets to configure the MLAs, and the resulting models were evaluated based on statistical metrics. The results obtained served as inputs for proposing the framework for automating the detection of offensive messages in social networks. 2.7 Conclusion This chapter has reviewed the literature that has a direct relation with the problem under investigation. It began by defining the key terms used, followed by reviewing the mathematical theories guiding the study and the empirical analysis of similar studies existing in the world. Finally, the chapter ended with a descriptive conceptual framework, the research gap and concluding remarks. The next chapter discusses the research design and methodology employed in this study. CHAPTER THREE METHODOLOGY 3.0 Introduction This chapter defines the research design, research approach, data collection methods and tools, data analysis techniques, ethical considerations, as well as validity and reliability issues. 3.1 Research Design The research design adopted in this dissertation combines experimental and case study approaches. With the case study, only Kiswahili verbal messages extracted from social networks such as Facebook, YouTube and JamiiForum were considered. The choice of this design was based on the nature of the study objectives and questions (Saunders et al., 2009). 3.2 Research Setting The study was conducted at the College of Informatics and Virtual Education computer laboratory. The required software needed 5.5 GB of hard drive space, 4 GB of Random Access Memory (RAM) and a 1 GHz processor. For the sake of this study, the software was installed on a Windows 8.1 desktop computer with 16 GB of RAM, a Core i7 3.4 GHz processor and a 1 TB hard drive in order to carry out the experiments. 3.3 Research Approach The study adopted a mixed research approach, aiming to collect and analyse quantitative data from experimentation and qualitative data from documentary review. The choice of the mixed research approach was influenced by the methods of data collection, analysis and interpretation (Creswel, 2014). 3.4 Data Collection Method and Tool 3.4.1 Primary Data To respond to research question one, which demands the collection of relevant Kiswahili messages from social networks, a Jsoup API based web crawler was used. The choice of this tool was due to its platform independence and its ability to provide a robust application that can be applied across several social network sites (Nakash et al., 2015). Furthermore, with respect to what guided the researcher in the collection and analysis of Kiswahili data, the grammatical conceptual framework of Kiswahili syntax adopted by Massamba et al. (1999) was used as a guide. Moreover, the observation technique was used to collect data from the experimental tool for further analysis. This technique facilitated the process of drawing tables, making comparisons and drawing conclusions for the defined research questions. 3.4.2 Secondary Data Structured document review was used to determine the appropriate machine learning algorithms for the text classification task and to determine the evaluation metrics; it served research question two. The literature mainly consulted included journals, essays, dissertations and research projects, retrieved using relevant search terms. 3.4.3 Experiment Tool and Environment In the experimental setup the study used the Waikato Environment for Knowledge Analysis (WEKA) toolkit. WEKA is a collection of machine learning algorithms for data mining tasks.
The algorithms can either be applied directly to a dataset or called from one's own Java code. WEKA contains tools for data pre-processing, classification, regression, clustering, association rules and visualization. It is also well suited for developing new machine learning schemes (Srivastava, 2014). The tool can also be integrated with MEKA for multi-label learning and evaluation. In multi-label classification, the goal is to predict multiple output variables for each input instance. MEKA is based on the WEKA machine learning toolkit; it includes dozens of multi-label methods from the scientific literature, as well as a wrapper to the related MULAN framework. 3.4.4 Experimentation Steps To respond to research question two, which focuses on developing and evaluating the models, several steps needed to be performed to reach the desired output. Figure 3.1: Text Classification Framework, adapted from Ramya & Pinakas (2014). 3.4.5 Data Preprocessing The messages collected were categorized into two parts. Category one held 75% of the total collected messages, which were used for training the models, while the remaining 25% were used for testing the efficiency of the models. Since offensive language is a subjective term, the 75% training portion was distributed to three annotators, and every participant was required to label each message as to whether it is offensive or not. A message was considered either offensive or non-offensive depending on the total scores obtained from the annotators. The labelled training dataset was preprocessed into a format relevant for the MLAs. The preprocessing task involved removing stop words and converting the strings of words into feature vectors (Read, 2016). 3.4.6 Feature Extraction i. Feature selection The generated features were selected based on their importance in the text and in order to improve the performance of the model. The feature selection method adopted in this study was Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF is composed of two terms: first, Term Frequency (TF), which calculates the number of times a word appears in a document divided by the total number of words in that document; secondly, Inverse Document Frequency (IDF), which computes the logarithm of the total number of documents in the dataset divided by the number of documents in which the specific term occurs. The product of TF and IDF determines the importance and uniqueness of a word in the dataset and provides high performance when combined with the Bag-of-Words model (Wu et al., 2008). 3.4.7 Training the Model The study applied the collected data to train the selected algorithms so as to configure and devise different models, which were then evaluated while classifying the test dataset. 3.4.8 Evaluating the Models The last phase was to evaluate the resulting models based on the performance metrics. The evaluation metrics that were observed and recorded were True Positive (TP), True Negative (TN), False Positive (FP), accuracy, precision-recall, f1-measure (f1) and ROC AUC. The models were evaluated by using 10-fold cross-validation in one case and by supplying the 25% independent test dataset in the other. The study applied 10-fold cross-validation because extensive tests on numerous datasets with different learning techniques have shown 10 to be about the right number of folds to get the best estimate, and it has been widely used as the standard method in practice (Ian & Frank, 2005).
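To make this evaluation step concrete, the following sketch shows one way a selected classifier could be evaluated with WEKA's Java API using 10-fold cross-validation. It is only an illustration under stated assumptions: the file name offensive_messages.arff is a hypothetical placeholder for the already vectorised dataset prepared in this study, and Random Forest stands in for any of the selected classifiers.

import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import java.util.Random;

public class CrossValidationExample {
    public static void main(String[] args) throws Exception {
        // Load the preprocessed (already vectorised) ARFF dataset; hypothetical file name.
        Instances data = DataSource.read("offensive_messages.arff");
        // The class attribute is assumed to be the last attribute.
        data.setClassIndex(data.numAttributes() - 1);

        // One of the classifiers selected in this study.
        RandomForest rf = new RandomForest();

        // 10-fold cross-validation, as adopted in section 3.4.8.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(rf, data, 10, new Random(1));

        System.out.printf("Accuracy : %.4f %%%n", eval.pctCorrect());
        System.out.printf("Precision: %.3f%n", eval.weightedPrecision());
        System.out.printf("Recall   : %.3f%n", eval.weightedRecall());
        System.out.printf("F1       : %.3f%n", eval.weightedFMeasure());
        System.out.printf("ROC AUC  : %.3f%n", eval.weightedAreaUnderROC());
    }
}

The same routine can be repeated for Naïve Bayes, J48 and the SVM variants simply by substituting the classifier object.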
3.5 Data Analysis Both quantitative and qualitative data analysis techniques were used in this study. The quantitative data, as observed from the WEKA experimental environment, were exported into Excel for plotting tables, histograms and line graphs. For comparison purposes, the paired t-test was performed. The paired t-test was used to compare the performance of the classifiers on the 10-fold cross-validations. In this case the experiments were repeated 10 times, making a total of 100 computations, and the average results were recorded. This analysis helped to make inferences about the models against the performance metrics. The qualitative analysis helped to supplement the quantitative data analysis. These results facilitated the process of making comparisons, interpretations and inferences among the different models. 3.6 Ethical Issues The researcher asked for permission from the University of Dodoma administration. Also, the confidentiality and privacy of the messages from respondents on JamiiForum, YouTube and Facebook were preserved. For confidentiality and privacy reasons, all the public pages and user accounts/phone numbers from which posts and comments were extracted were not reported anywhere in this study. In addition, all of the reviewed works were acknowledged. The study focused on proving the research concepts rather than seeking to judge whether what the respondents were doing was right or wrong. 3.7 Reliability and Validity The relevant messages concerning improper behaviour were collected from JamiiForum, YouTube and Facebook. To support the validity of the study, a large number of offensive messages was collected and the models developed were evaluated based on the statistical evaluation metrics suggested in the literature. The experiment was repeated 10 times and the average of the results was recorded. In addition, the classification models were supplied with a separate test dataset and with 10-fold cross-validation to check their capability of detecting offensive messages; this provided reliable results. 3.8 Conclusion The chapter has explained the research design and strategy adopted to answer the research questions. The experimentation settings and tools that were used have also been discussed in detail. The chapter has also presented the research methods, data collection methods, data analysis tools and techniques, ethical considerations, and validity and reliability issues pertaining to this study. The next chapter discusses the analysis and findings in relation to the research questions under investigation. CHAPTER FOUR RESULTS AND DISCUSSION 4.0 Introduction This chapter discusses the findings of the study in relation to the research questions. The chapter is organized into three parts. The first part presents results and analysis with respect to the research objective of creating the Kiswahili dataset, the second part presents the results and analysis with respect to building and evaluating the classification models, and the last part presents the findings about the proposed framework for automated detection of offensive messages. All of the findings attempt to answer the associated research questions set out in the introductory chapter. 4.1 Creating Kiswahili Dataset of Offensive Messages from Social Networks 4.1.1 Data Extraction from Social Networks Since there was no dataset of offensive messages available for Kiswahili, the dataset was created by accessing publicly available social network pages so as to acquire the training and testing datasets.
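As described in the next paragraph, the crawling of these public pages was done with a Jsoup-based web crawler. Purely as an illustration, the sketch below shows how such a crawler might fetch a public page and write the extracted comment text to a file; the URL and the CSS selector div.comment are hypothetical placeholders that would have to be adapted to the markup of each target page.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.FileWriter;
import java.io.IOException;

public class KiswahiliMessageCrawler {
    public static void main(String[] args) throws IOException {
        // Hypothetical public page; each target page has its own URL and markup.
        String url = "https://www.example.com/public-page";

        // Fetch and parse the HTML document.
        Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0")
                .timeout(10000)
                .get();

        // "div.comment" is a placeholder selector for the elements holding posts/comments.
        try (FileWriter out = new FileWriter("messages.txt", true)) {
            for (Element comment : doc.select("div.comment")) {
                String text = comment.text().trim();
                if (!text.isEmpty()) {
                    out.write(text + System.lineSeparator());
                }
            }
        }
    }
}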
The posts with their corresponding comments were crawled by using a Jsoup web crawler. A total of 12,000 messages were collected from social network sites in the period from 10-4-2017 to 10-7-2017, as depicted in Table 4.1.

Table 4.1: Message distribution
Social network        # messages collected
Facebook              7500
Jamii Forum           3000
YouTube               1500
Total                 12000

The study was concerned with the sexual and politics categories of offensive posts and comments, together with normal messages for all categories. To be able to collect and retrieve relevant sexual, politics and normal messages, the public pages to be crawled were properly chosen. As shown in Table 4.1, the collected messages contained all three categories, which were further preprocessed to be used in the other parts of the study. 4.1.2 Message Annotation Before the collected messages were used to build the classification models, they were annotated to give them labels relevant to the machine learning algorithms. Since offensive language is a subjective term, the messages were given to three annotators for the labelling process. A message was manually labelled as offensive or normal if there was a consensus between at least two of the three annotators. Thus, a total of 12,000 messages were provided to the annotators, who assigned a label of 1 if the message is offensive and belongs to the sexual category, a label of 2 if the message is offensive and belongs to the politics category, and a label of 0 if the message is normal (non-offensive). In addition, the study considered only the posts and comments for which all three annotators agreed on the same label; the others were eliminated from the list. This was done to ensure that the ground truth of the dataset is of the best quality and to facilitate the learning ability of the MLAs. After careful annotation, consensus among annotators and elimination of posts with partial agreement, only 11,000 messages remained. A sample of 8,000 messages was randomly selected, with 2,000 labelled as 1 (sexual), 1,000 labelled as 2 (politics) and 5,000 labelled as 0 (normal), which formed the training dataset as depicted in Table 4.2.

Table 4.2: Training dataset distribution category-wise
Message type     Total    Label
Sexual           2000     1
Political        1000     2
Normal           5000     0
Total            8000

Based on the resulting data, the percentage distribution of the offensive messages and normal messages is depicted in Figure 4.1. Figure 4.1: Distribution of Messages in Training Dataset. The remaining 3,000 of the total messages were used for testing the developed models. Moreover, the testing dataset was not manually labelled as offensive or non-offensive, although the categories of the messages were manually identified.

Table 4.3: Testing dataset distribution
Message type     Total    Label
Sexual           700      ?
Political        300      ?
Normal           2000     ?
Total            3000

4.1.3 Data Preprocessing The labelled training dataset was first preprocessed into the format relevant for the classification algorithms. The following are the techniques used to prepare and tune the training data before using it as an input to the various classifier algorithms in WEKA. During preprocessing, the text files containing posts and comments were converted into the Attribute-Relation File Format (ARFF) by using the TextDirectoryLoader Java class in the WEKA Simple CLI. The ARFF file contains two attributes: a text string that represents a post or comment containing a sexual, politics or normal message, and a class attribute denoting the category 1, 2 or 0 that corresponds to the text string.
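As an illustration of this conversion step (a sketch under assumptions, not the exact procedure used in the study), WEKA's TextDirectoryLoader can also be called from Java code. It expects the labelled messages to be stored in sub-directories named after their class labels (the directory names dataset/train, normal, sexual and politics below are hypothetical) and produces exactly the two-attribute structure described above.

import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.TextDirectoryLoader;
import java.io.File;

public class ArffConversionExample {
    public static void main(String[] args) throws Exception {
        // Root directory containing one sub-directory per class label,
        // e.g. dataset/train/normal, dataset/train/sexual, dataset/train/politics (hypothetical paths).
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File("dataset/train"));

        // Each text file becomes one instance with a string attribute and a class attribute.
        Instances raw = loader.getDataSet();

        // Save the result as an ARFF file for later filtering and classification.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(raw);
        saver.setFile(new File("offensive_messages_raw.arff"));
        saver.writeBatch();
    }
}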
A sample of the resulting ARFF file is presented in Appendix 1. Furthermore, the collected posts and comments had irrelevant features such as stop words, special characters, alphanumeric strings, English words and HTML tags, to mention but a few. A text file containing stop words such as "na", "kwaiyo", "mimi", "wewe" and "sisi", punctuation marks such as "." and ",", and many more was created and used as a filter to eliminate those words from the sentences and give precedence to the important words in the training dataset. In addition to the stop word list, duplicate words were also eliminated so as to create the lexicon. The corresponding text file, presented in Appendix 2, was loaded into WEKA by using the weka.core.stopwords.WordsFromFile stop words handler class. Figure 4.2 shows the most frequent words in the dataset. Figure 4.2: The most frequent words in the dataset. 4.1.4 Feature Representation, Extraction and Selection The features were extracted and represented in an efficient and complete manner in order to create numerical feature vectors. The techniques adopted in this study to create the feature vectors and to select features are the Bag-of-Words (BoW) model, N-grams, TF-IDF and Principal Component Analysis. The BoW model represents a text as the set of its words, disregarding grammar and even the exact position of each word. Each distinct word corresponds to a feature, with the frequency of the word in the document as its value. Only words that did not occur in the stop list were considered as features. The BoW model was combined with the N-gram model to provide the ability to identify n-word expressions. Unigrams, bigrams, trigrams and so on were used to represent subsequences of continuous words in the text and were applied as units of measurement of the classifiers' performance. The importance of each resulting token, relative to the number of times it occurred in the text strings, was measured by using TF-IDF. The relevant features were then selected by means of the StringToWordVector filter. Effective feature selection in text classification aims at improving the efficiency of the learning task and the overall accuracy. 4.2 Build and Evaluate Models by Applying Some Machine Learning Algorithms Selecting Machine Learning Algorithms and Configuration A total of four machine learning algorithms were selected from the literature and configured to carry out the experiment based on the dataset prepared in objective one. The choice of MLAs was based on the task at hand, similar studies from the literature and the nature of the algorithms. The MLAs chosen and applied in this study are: i. Naïve Bayes, ii. Random Forest, iii. Decision Tree (J48), and iv. Support Vector Machine (SVM) with Linear, Polynomial and Radial Basis Function (RBF) kernels. Performance Evaluation Metrics The evaluation metrics that were observed and recorded are True Positive (TP), True Negative (TN), False Positive (FP), accuracy, precision-recall, f1-measure (f1) and ROC AUC. The recorded results were based on 10-fold cross-validation in some cases and on supplying a separate test set in other cases. As depicted in Figure 4.1, the training samples are not uniformly distributed; therefore accuracy alone was not considered a strong measure of the performance of the classifiers, and thus the combination of accuracy, precision-recall and f1-measure was considered to convey more information. Moreover, the paired t-test was performed to analyse and compare the performance of the models on each fold based on the mentioned metrics.
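For reference, the standard textbook definitions of these metrics, of the TF-IDF weight used during feature selection, and of the paired t-statistic used to compare two classifiers over the k = 10 cross-validation folds are summarised below (these are the conventional formulations, not equations reproduced from this dissertation):

\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]
\[
\mathrm{tfidf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}} \cdot \log\frac{N}{|\{d' : t \in d'\}|}, \qquad
t = \frac{\bar{\delta}}{s_{\delta} / \sqrt{k}}
\]

where \(f_{t,d}\) is the count of term \(t\) in document \(d\), \(N\) is the total number of documents in the dataset, and \(\bar{\delta}\) and \(s_{\delta}\) are the mean and standard deviation of the per-fold performance differences between the two classifiers being compared.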
Experiment Results 4.2.1 Baseline Classifiers Performance The dataset was supplied to the baseline classifiers in order to determine the initial performance against which the selected classifiers would be evaluated.

Table 4.4: Baseline performance
Evaluation Metrics    ZeroR       OneR
Accuracy              39.2746%    47.8756%
TP Rate               0.393       0.479
FP Rate               0.394       0.338
Precision             0.307       0.547
Recall                0.393       0.479
ROC AUC               0.499       0.570
f1-Measure            0.277       0.377

From Table 4.4 it can be observed that OneR had better results compared to ZeroR. Therefore, the performances of the selected machine learning algorithms were evaluated by comparing them with this baseline performance. This finding helped to either reject or accept the selected algorithms for further evaluation in the experiment. A paired t-test was performed to evaluate the selected algorithms against the baseline, and the results were as follows. Figure 4.3: Performance Evaluation on Accuracy. From Figure 4.3, the results indicate that all of the selected algorithms are statistically better (marked "v") than the baseline algorithm (OneR) at the specified significance level of 0.05. In addition, this finding enabled the researcher to consider the selected algorithms for further experiments. 4.2.2 Detecting Offensive Messages The MLAs were trained to detect the presence or absence of offensiveness in posts and comments. Due to the existence of many factors that affect classifier performance, the study strove to answer the following questions so as to properly address research objective two. Firstly, how are the performances of the models affected by the size of the training data? Secondly, which data representation (unigram, bigram, trigram and beyond) best describes the performance of the models? Thirdly, is it easier to detect offensive messages within specific categories of posts and comments (i.e. sexual or politics) than with a general-purpose model (configuration)? Finally, how do the results of the models coincide with the results of other scholars in the literature? The answers to these questions are presented in the following sections. 4.2.3 Size of Training Sample The effect of the size of the training data on the performance of the models was addressed by creating training samples of varying size; the micro-averaged f1-measure of each model was recorded and plotted as shown in Table 4.5 and Figure 4.4 respectively. In the WEKA tool, random samples of the words to keep were selected as in Table 4.5.

Table 4.5: The performance of classifiers on varying data size (micro-averaged f1-measure)
Data Size    Naïve Bayes    J48      SVM-RBF    SVM-Polynomial    SVM-Linear    Random Forest
0            0.000          0.000    0.000      0.000             0.000         0.000
1000         0.731          0.774    0.605      0.705             0.789         0.779
2000         0.736          0.781    0.678      0.724             0.806         0.798
4000         0.739          0.790    0.700      0.780             0.830         0.813
6000         0.741          0.799    0.800      0.870             0.869         0.850
7000         0.741          0.821    0.879      0.930             0.903         0.931
8000         0.750          0.830    0.861      0.941             0.932         0.950

Figure 4.4: Classifiers Learning Rate. From Figure 4.4 it can be observed that the micro-averaged classification performance of some classifiers increases significantly as the size of the dataset increases, while in some other classifiers it decreases at certain points. This finding signifies that the MLAs are able to devise a better learning function as more examples are presented. Under 10-fold cross-validation, SVM-Linear, SVM-Polynomial and Random Forest provided reasonable results, with SVM-Polynomial and Random Forest outperforming the other classifiers.
Since social network messages increase as more people connect to the OSNs, this finding signifies that any of these generated models would work well in social network sites to detect offensive messages despite the growing number of posts and comments. 4.2.4 Feature Representations The performances of the models were observed based on feature representation using the BoW model and TF-IDF while varying the n-grams. The models were trained on a total of 8000 messages containing 37% offensive messages and 63% normal messages, and were tested and evaluated using 10-fold cross-validation, as shown in Table 4.6 and Figure 4.5 respectively.

Table 4.6: Classifiers performance based on feature representation (micro-averaged f1-measure, BoW and TF-IDF)
Classifiers        1-gram    2-gram    3-gram    4-gram
Naïve Bayes        0.686     0.757     0.563     0.721
J48                0.774     0.312     0.230     0.277
SVM-RBF            0.709     0.440     0.500     0.480
SVM-Polynomial     0.914     0.848     0.756     0.774
SVM-Linear         0.900     0.836     0.755     0.774
Random Forest      0.905     0.844     0.757     0.773

Figure 4.5: Classifiers performance based on feature representation. From Figure 4.5, the findings reveal that with the unigram (1-gram) feature representation the f1-measure is highest, with SVM-Linear at 90.0%, SVM-Polynomial at 91.4% and Random Forest at 90.5%. Furthermore, the finding indicates that to detect offensive messages a single keyword or phrase may be enough to represent a post or comment as either offensive or normal; unigrams combined with the BoW and TF-IDF feature representation could serve the purpose better than the other representations. Since the study is about finding an appropriate model for real-time implementation, a paired t-test was performed in order to find out whether the observed differences in f1-measure across the n-gram representations among SVM-Linear, SVM-Polynomial and Random Forest are statistically significant at the specified 0.05 significance level. The results are indicated in Figure 4.6. Figure 4.6: Performance comparisons on n-gram features. From Figure 4.6 it can be observed that there is no statistically significant difference in performance between the SVM-Polynomial kernel and Random Forest, which can be explained by the fact that they yielded the same results on four different datasets. The finding signifies that the SVM normalized PolyKernel and Random Forest perform equally well with either n-gram representation, as opposed to the other classifiers, which had significantly worse results (marked "*"). 4.2.5 Categories The performance of the models was also evaluated by observing whether it is easier to detect offensive messages within specific categories of posts and comments (i.e. sexual or politics) than with a general-purpose model (configuration). This experiment aimed at determining whether the system to be built for detecting offensive messages should be domain specific or a general-purpose system that can serve all categories of offensiveness in the Kiswahili language. To perform this experiment, two training datasets were prepared: one containing the 8000 messages labelled as {1, 0, 2} and the second containing the same 8000 messages re-labelled with 1 indicating an offensive message and 0 a normal message {1, 0}. The developed models were tested by supplying a separate test set, and the observed results were recorded in Table 4.7 and Table 4.8 respectively.
Table 4.7: Performance on the dataset with 3 categories (sexual, politics and normal, {1, 0, 2})
Evaluation Metrics    Naïve Bayes    J48        SVM-Linear    SVM-Polynomial    SVM-RBF    Random Forest
Accuracy              68.757%        78.964%    94.197%       94.696%           76.42%     95.026%
TP Rate               0.688          0.790      0.942         0.942             0.764      0.950
FP Rate               0.229          0.122      0.03          0.079             0.153      0.028
Precision             0.685          0.789      0.942         0.943             0.807      0.951
Recall                0.688          0.790      0.942         0.942             0.764      0.96
ROC AUC               0.848          0.905      0.963         0.933             0.815      0.994
f1-Measure            0.75           0.785      0.942         0.947             0.720      0.950

Table 4.8: Performance on the dataset with 2 categories ({1, 0})
Evaluation Metrics    Naïve Bayes    J48        SVM-Linear    SVM-Polynomial    SVM-RBF     Random Forest
Accuracy              82.1555%       80.772%    95.583%       92.4028%          78.3569%    94.258%
TP Rate               0.822          0.808      0.956         0.924             0.784       0.943
FP Rate               0.320          0.250      0.112         0.235             0.679       0.156
Precision             0.820          0.806      0.956         0.930             0.817       0.943
Recall                0.822          0.808      0.956         0.924             0.784       0.943
ROC AUC               0.875          0.876      0.922         0.844             0.552       0.986
f1-Measure            0.821          0.804      0.955         0.919             0.673       0.941

Figure 4.7: Performance for general-purpose and categorical models. From Figure 4.7 it can be observed that by collapsing the offensive messages into a single category (the two-class configuration) there is a significant difference in the performance of the classifiers. Sood et al. (2012), in their study, did not observe any significant difference between general-purpose and categorical systems. The finding signifies that the choice of whether to build a categorical or a general-purpose system will depend on the decision of the one intending to implement the system in a real-world environment. The implementation with two classes (1, 0) achieves better results with SVM-Linear, while SVM-Polynomial and Random Forest yield better results with more than two categories. Figure 4.8: Comparison of False Positive Rate. As shown in Figure 4.8, a paired t-test was performed to compare the false positive rates on the two different datasets against the baseline classifier. As observed, the lowest FP rate from the baseline was 0.02 (2.0%); the compared classifiers wrongly classify normal messages as offensive at rates of 2.0%, 1.6% and 1.7% respectively, at the specified significance level of 0.05. 4.2.6 Time Taken to Train and Test the Models Since the models utilize the CPU, a comparative analysis was performed to determine how each model utilizes the CPU during the training and testing periods, as depicted in Figure 4.9. Figure 4.9: Time taken to train and test the models. From Figure 4.9 it can be observed that SVM-Polynomial is best in terms of training time, with averages of 2.28 seconds and 2.49 seconds on the two different datasets, but in the case of prediction time Random Forest gives the best results, with averages of 0.15 seconds and 0.13 seconds respectively. From the above graphs and tables, all four selected algorithms provided different results. It can also be observed that Random Forest gives slightly better performance, with an average of 95.03% correct classification, as compared to the SVM kernel functions and the other algorithms. Since Random Forest outperformed the other classifiers, this model was selected as an input to objective three. 4.3 Proposed Framework for Detecting Offensive Kiswahili Messages 4.3.1 Proposed Framework Architecture Figure 4.10: Study Proposed Framework. 4.3.2 Component Details of the Proposed Framework Figure 4.10 depicts the proposed framework for detecting offensive Kiswahili messages from social network sites.
The architecture consists of an input section that contains raw messages originating from social network sites such as Facebook, Twitter, YouTube and JamiiForum. The OSNs are connected to the REST API, which provides a POST method to submit messages for analysis. Some of the OSNs provide an API to crawl data while others do not; in order to extract messages, a web crawler can be developed and linked with the proposed API. Once a post is submitted to the feature extraction module, feature vectors are constructed by using the Bag-of-Words tokenizer (string to feature vector), TF-IDF and unigrams. The script then loads the pre-trained Random Forest classifier model stored on a database server so as to classify the post as either offensive or normal. This part forms the processing component of the framework. The message may be extracted with its corresponding user ID, which may then be stored in the database once the message is classified as offensive; this part will help to serve as evidence when requested. In addition, a JSON object indicating that a message is offensive is then sent to the public page to mark it as offensive; this forms the output part of the API. Furthermore, for real-time classification, a more seamless browser application may be developed specifically for each browser that can communicate with the API to send a post to the back end as described above. 4.3.3 Framework Properties In terms of reliability, the proposed framework is reliable since it depends on the evaluated model, which could detect 95% of offensive messages correctly. The model achieved a false positive rate of 2.80%, signifying that it wrongly classifies normal messages as offensive 2.80% of the time. Furthermore, the implementation of the proposed framework will result in correctly predicting offensive messages with a precision of 95.10% and a recall of 95%. Since the model implements the Random Forest classifier, which uses bagging and bootstrapping, these techniques will minimize the error of incorrectly assigning a message. Regarding robustness, most of the open-source packages for implementing machine learning classifiers are available in Java, Python or R and may be integrated with database servers to handle unstructured data; all of these make the proposed framework robust because of the properties of these technologies. Regarding usability, if the proposed framework is implemented correctly, no user will experience any difficulty, since the whole process is automated and the output results will simply appear on the page. 4.3.4 Implementation Considerations To implement the proposed application, detection and quick-response factors need to be considered. In addition, real-time classification should also be considered, because the process involves identifying the messages and returning the feedback on the client side as quickly as possible to facilitate real-time detection. The process may involve either marking the post or blocking the post before it is shared. In social network sites, not all posted content is offensive. Moreover, OSNs allow people to communicate privately; it is therefore recommended that the application be designed to respect privately communicated content which users do not intend to share publicly. This will help to preserve the goals of the OSNs. Computing resources and bandwidth are among the other considerations.
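Before turning to hosting requirements, the sketch below illustrates, purely as an example under stated assumptions, one way the processing component described in section 4.3.2 could be realised with WEKA's Java API: a helper class that loads a previously serialised Random Forest model together with the fitted StringToWordVector filter and returns a predicted label for an incoming post. The file names rf.model and stwv.filter, the class-label names and the helper class itself are hypothetical illustrations, not the framework's actual implementation; the attribute structure built here must match the header used at training time.

import weka.classifiers.Classifier;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.Utils;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OffensiveMessageScorer {

    private final Classifier model;            // e.g. the trained Random Forest
    private final StringToWordVector filter;   // the filter fitted on the training data

    public OffensiveMessageScorer(String modelPath, String filterPath) throws Exception {
        // Hypothetical file names, e.g. "rf.model" and "stwv.filter", produced during training.
        this.model = (Classifier) SerializationHelper.read(modelPath);
        this.filter = (StringToWordVector) SerializationHelper.read(filterPath);
    }

    // Returns the predicted label for a single raw post, e.g. "normal", "sexual" or "politics"
    // (illustrative label names; the study encodes these classes as 0, 1 and 2).
    public String classify(String post) throws Exception {
        // Build a one-instance dataset with the same structure used at training time:
        // a string attribute for the text plus a nominal class attribute.
        ArrayList<Attribute> attrs = new ArrayList<>();
        attrs.add(new Attribute("text", (List<String>) null));
        attrs.add(new Attribute("class", new ArrayList<>(Arrays.asList("normal", "sexual", "politics"))));
        Instances raw = new Instances("incoming", attrs, 1);
        raw.setClassIndex(raw.numAttributes() - 1);

        double[] vals = new double[raw.numAttributes()];
        vals[0] = raw.attribute(0).addStringValue(post);
        vals[1] = Utils.missingValue();        // class is unknown at prediction time
        raw.add(new DenseInstance(1.0, vals));

        // Apply the stored filter so the instance matches the training feature space.
        Instances vectorised = Filter.useFilter(raw, filter);
        double predicted = model.classifyInstance(vectorised.instance(0));
        return vectorised.classAttribute().value((int) predicted);
    }
}

A REST endpoint would simply call classify() for each submitted post and wrap the returned label in the JSON object described above.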
The implementation of the above framework will need to be hosted on a powerful server with reasonable storage capacity, RAM, processors and bandwidth. These properties will facilitate a faster computing process and the storage of models, for quick responses. 4.4 Conclusion This chapter has discussed the findings with respect to the research objectives and the associated research questions set out in the introductory chapter. It has presented findings with respect to the Kiswahili dataset, and the configuration and evaluation of the selected machine learning algorithms. Based on study objectives one and two, the chapter has also presented the proposed framework for automating the detection of offensive messages. In the next chapter, a summary of the study, conclusions, recommendations and areas for further research are presented. CHAPTER FIVE SUMMARY, CONCLUSION AND RECOMMENDATION 5.0 Introduction In this chapter, a summary of the study, conclusions based on the findings presented and discussed in Chapter Four, recommendations and areas that may need further research are presented. 5.1 Summary of the Study The study specifically focused on designing a framework, by applying Machine Learning Algorithms (MLAs), that can automatically detect offensive messages on social networks in the Kiswahili language. Specifically, the study examined three research questions in order to accomplish the above objective, namely: (i) How can a Kiswahili dataset of offensive messages from social networks be created for generating feature vectors? (ii) How can machine learning algorithms be applied and evaluated in building models which can detect offensive messages in Kiswahili? (iii) How can an architectural framework for detecting offensive language on social networking sites be designed? An experimental research design was applied, employing both primary and secondary data collected by means of software tools, observation and structured documentary review in order to address the above-mentioned questions. With regard to the first objective, which focused on creating a Kiswahili dataset of offensive messages from social networks, a total of 12,000 Kiswahili messages were collected from Facebook, JamiiForum and YouTube, in which 37% of the total messages were offensive (from the perspectives of both sexuality and politics) while 63% of all messages were normal. The collected messages were given to three annotators to manually assign the label 1 for sexual, 2 for politics and 0 for normal messages respectively. The results of the three annotators formed the ground truth for creating a quality Kiswahili dataset for training and testing the MLAs. In response to the second objective, which aimed at building and evaluating models by applying some machine learning techniques, four classification algorithms were selected, namely Naïve Bayes, Decision Tree (J48), Random Forest, and Support Vector Machine with Linear, Polynomial and RBF kernels respectively. The findings revealed that, of all the machine learning algorithms applied in the experiment, Random Forest was capable of correctly assigning a message to its correct class with an accuracy of 95.03%, recall of 95.00%, precision of 95.10%, f1-measure of 0.950 (95.00%) and false positive rate of 2.80%, and it outperformed all the other classifiers applied in the experiment.
The first two objectives served as inputs to the third objective, which aimed at proposing a framework for automating the detection of offensive messages in social networks under Kiswahili settings by applying some selected machine learning algorithms. As depicted in section 4.3, the proposed framework is a RESTful API that takes a post from social networks as input and passes it to a trained model stored on the database server. The model predicts the message as either offensive or normal, stores the result in the database if it is offensive and sends the mark as a JSON string to the public page to mark the post or stop it from being spread; otherwise, it ignores the message. 5.2 Conclusion Based on the study findings and the analysis made, the conclusions discussed hereunder are pertinent. i. The study has created a Kiswahili dataset containing sexually and politically offensive messages collected from a few of the existing social networks. The study examined only verbally presented (textual) messages despite the existence of messages in the form of images, which are dominant and widely shared among people in social networks. The researcher also observed that reliable datasets containing a large amount of relevant messages are necessary for machine learning algorithms to produce reliable results. ii. The study has applied supervised machine learning techniques to build and evaluate a few of the selected text classification algorithms. Through the observed metrics, the findings reveal that the SVM-Polynomial kernel (normalized) and Random Forest had reasonable results, with Random Forest slightly outperforming SVM-Polynomial at the 0.05 significance level and with a low standard deviation of 0.02-0.04. iii. The created Kiswahili dataset and the evaluated Random Forest model formed important components, among others, in designing a framework that would help in facilitating the detection of offensive messages in social network sites. 5.3 Recommendations Based on the findings of the study and the conclusions drawn, the study recommends the following. First, social network providers should adopt the proposed framework and deploy it as part of their applications so as to mitigate offensive behaviours in social networks. This will help users to trust and use their application(s) and improve users' experiences of using online services. The model is also more robust and reliable, as it eliminates manual activities in detecting offensive behaviours. Second, the government may adopt the framework presented to create a system that stores evidence from users who behave offensively in social networks. The framework will eliminate human intervention that may be influenced by personal malice and hatred, reduce labour-intensive methods of detecting offensive behaviours and encourage balanced judgments. Third, end-users should stop promoting offensive words, which is against social network terms of service. An important aspect of the framework is to mark or block offensive posts, thus impressing on end-users that they should use the implemented services in positive ways. 5.4 Areas for Further Research Based on the findings of the study and the conclusions drawn, the study also identified areas that may require further research. i. A study needs to be conducted in order to implement the proposed framework in real-time social network (Facebook, JamiiForum or YouTube) environments so as to assess the efficiency of the framework in detecting offensive messages in the Kiswahili context.
An API can be developed to implement the proposed framework by considering the design techniques discussed. The study may strive to evaluate the metrics discussed in this study, measure computation time and delays, and evaluate its adaptability in the Tanzanian context. ii. In social networks, people use multilingual message constructions, such as code switching between Kiswahili and English in a single message, and images to send offensive information. Moreover, people use short-form sentences to convey messages that may be offensive. Due to the mentioned issues, a similar study may be conducted by adding multilingual messages, slang and images to form a dataset collected over a long period of time so as to form feature vectors for the classification task. The study may apply the same or different algorithms, or evaluate the results of other domains of machine learning such as clustering. iii. The study did not consider the multi-labelling task, whereby a single message may belong to different categories of offensiveness; thus a similar study may be conducted by applying multi-label classification techniques. REFERENCES Ali, J., Khan, R., Ahmad, N., & Maqsood, I. (2012). Random Forests and Decision Trees. IJCSI International Journal of Computer Science Issues, 9(5), 272–278. Alpaydın, E. (2010). Introduction to Machine Learning (Second edition). The MIT Press. Asur, S., & Huberman, B. A. (2010). Predicting the Future with Social Media. 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 492–499. https://doi.org/10.1109/WIIAT.2010.63 Barcaroli, G., Nurra, A., Scarnò, M., Summa, D., & Nazionale, I. (2014). Use of web scraping and text mining techniques in the Istat survey on "Information and Communication Technology in enterprises". In European Conference on Quality in Official Statistics. Biau, G. (2012). Analysis of a Random Forests Model. Journal of Machine Learning Research, 13, 1063–1095. Breiman, L. (2001). Random Forests. Machine Learning, 45, 5–32. Retrieved from http://dx.doi.org/10.1023/A:1010933404324. Bretschneider, U., & Peters, R. (2017). Detecting Offensive Statements towards Foreigners in Social Media. In: Proceedings of the 50th Hawaii International Conference on System Sciences (HICSS), 2213–2222. Retrieved from http://hdl.handle.net/10125/41423. Chao, W. (2011). Machine Learning Tutorial. DISP Lab, Graduate Institute of Communication Engineering, National Taiwan University. Retrieved from http://disp.ee.ntu.edu.tw/~pujols/Machine Learning Tutorial.pdf. Chen, Y. (2012). Detecting Offensive Language in Social Media to Protect Adolescent Online Safety. 2012 ASE/IEEE International Conference on Privacy, Security, Risk and Trust, 71–80. https://doi.org/10.1109/SocialCom-PASSAT.2012.55. Christopher, B. (2006). Pattern Recognition and Machine Learning. (J. Michael, K. Jon, & S. Bernhard, Eds.). Springer Science+Business Media, LLC. Creswel, J. (2014). Research Design: Qualitative, Quantitative, and Mixed Methods Approaches (4th edition). SAGE Publications. Criminisi, A., Shotton, J., & Konukoglu, E. (2012). Decision Forests: A Unified Framework for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning. Computer Graphics and Vision, 7(201), 81–227. https://doi.org/10.1561/0600000035. Danjuma, K., & Osofisan, A. (2015).
Evaluation of Predictive Data Mining Algorithms in Erythemato-Squamous Disease Diagnosis. Dasgupta, A., & Nath, A. (2016). Classification of Machine Learning Algorithms, 3(3), 6–11. Dewan, P., & Kumaraguru, P. (2017). Facebook Inspector (FbI): Towards automatic real-time detection of malicious content on Facebook. Social Network Analysis and Mining. https://doi.org/10.1007/s13278-017-0434-5. Ellison, N. B., & Boyd, d. m. (2008). Social Network Sites: Definition, History, and Scholarship. Journal of Computer-Mediated Communication, 13, 210–230. https://doi.org/10.1111/j.1083-6101.2007.00393.x. Gerbet, T., & Kumar, A. (2014). (Un)Safe Browsing. Hee, C. Van, Lefever, E., Verhoeven, B., Mennes, J., & Desmet, B. (2015). Automatic Detection and Prevention of Cyberbullying. The First International Conference on Human and Social Analytics, 13–18. Hilte, L., Lodewyckx, E., Verhoeven, B., & Daelemans, W. (2016). A Dictionary-based Approach to Racism Detection in Dutch Social Media. Proceedings of the Workshop on Text Analytics for Cybersecurity and Online Safety (TA-COS 2016), 11–17. Hinnebusch, T. J. (2003). Swahili. In William J. Frawley (Ed.), International Encyclopedia of Linguistics (2nd ed.). Oxford: Oxford University Press. Ian, W., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques (Second edition). Elsevier Inc. Jia, S., Hu, X., & Sun, L. (2013). The Comparison between Random Forest and Support Vector Machine Algorithm for Predicting β-Hairpin Motifs. Engineering, 5(October), 391–395. https://doi.org/10.4236/eng.2013.510B079. Killam, R., Cook, P., & Stakhanova, N. (2016). Android Malware Classification through Analysis of String Literals. TA-COS 2016 – Text Analytics for Cybersecurity and Online Safety. Retrieved from http://www.ta-cos.org. Kotsiantis, S. B. (2007). Supervised Machine Learning: A Review of Classification Techniques, 31, 249–268. Retrieved from https://datajobs.com/datascience-repo/Supervised-Learning-[SB-Kotsiantis].pdf. Kumari, M., & Godara, S. (2011). Comparative Study of Data Mining Classification Methods in Cardiovascular Disease Prediction. International Journal of Computer Science and Technology, 2(2), 304–308. Retrieved from http://ef.untz.ba/images/Casopis/Paper1Osmanbegovic.pdf. Laskari, N., & Sanampudi, S. (2016). Aspect based sentiment analysis. IOSR Journal of Computer Engineering (IOSR-JCE), 18(2), 72–75. https://doi.org/10.9790/0661-18212428. LegalAid. (2010). Social Networks Terms of Services. Liu, B. (2012). Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies. Morgan & Claypool. https://doi.org/10.2200/S00416ED1V01Y201204HLT016. Lutu, P. (2015). Web 2.0 Computing and Social Media as Solution Enablers for Economic Development in Africa. In Computing in Research and Development in Africa: Benefits, Trends, Challenges and Solutions. Springer International Publishing Switzerland. https://doi.org/10.1007/978-3-319-08239-4_6. Massamba, D. P. B., Kihore, Y. M., & Hokororo, J. I. (Eds.) (1999). Sarufi miundo ya Kiswahili Sanifu: Sekondari na Vyuo. Dar es Salaam: Taasisi ya Uchunguzi wa Kiswahili, Chuo Kikuu cha Dar es Salaam. Mulokozi, M. (2000). Language, Literature and the Forging of a Pan-African Identity. Kiswahili, 63, 71–80. Msavange, M. (2015). Usage of Cell Phones in Morogoro Municipality, Tanzania. Journal of Information Engineering and Applications, 5(7), 52–66. Retrieved from www.iiste.org. Muhammad, I., & Yan, Z.
(2015). Supervised Machine Learning Approaches: A Survey, 946–952. https://doi.org/10.21917/ijsc.2015.0133. Nakash, J., Anas, S., Ahmad, S. M., & Azam, A. M. (2015). Real Time Product Analysis using Data Mining. International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), 4(3), 815–820. Nasa, C. (2012). Evaluation of Different Classification Techniques for WEB Data. International Journal of Computer Applications, 52(9). Retrieved from http://www.ijcaonline.org/archives/volume52/number9/8233-1389. Nilsson, N. J. (2005). Introduction to Machine Learning: An Early Draft of a Proposed Textbook. Department of Computer Science. Osmanbegović, E., & Suljić, M. (2012). Data mining approach for predicting student performance. Economic Review – Journal of Economics and Business, X(1), 3–12. Retrieved from http://ef.untz.ba/images/Casopis/Paper1Osmanbegovic.pdf. Papegnies, E., Labatut, V., Dufour, R., & Linarès, G. (2017). Detection of abusive messages in an on-line community. Conférence en Recherche d'Information et Applications, 0–16. Ramya, M., & Pinakas, J. (2014). Different Type of Feature Selection for Text Classification. Ijcttjournal.Org, 10(2), 102–107. Retrieved from http://www.ijcttjournal.org/Volume10/number-2/IJCTT-V10P118.pdf. Razavi, A. H., Inkpen, D., Uritsky, S., & Matwin, S. (2010). Offensive Language Detection Using Multi-level Classification. Advances in Artificial Intelligence: 23rd Canadian Conference on Artificial Intelligence. Read, J. (2016). MEKA: A Multi-label/Multi-target Extension to WEKA, 17, 1–5. Reynolds, K. (2012). Using Machine Learning to Detect Cyberbullying. Saleem, H. M., Dillon, K. P., Benesch, S., & Ruths, D. (2016). A Web of Hate: Tackling Hateful Speech in Online Social Spaces. First Workshop on Text Analytics for Cybersecurity and Online Safety (TA-COS 2016), Proceedings. Sathya, R., & Abraham, A. (2013). Comparison of Supervised and Unsupervised Learning Algorithms for Pattern Classification, 2(2), 34–38. Saunders, M., Lewis, P., & Thornhill, A. (2009). Research Methods for Business Students (5th edition). Essex, England: Pearson Education Limited. https://doi.org/10.1007/s13398-014-0173-7.2. Seif, H. (2016). Naïve Bayes and J48 Classification Algorithms on Swahili Tweets: Performance Evaluation. International Journal of Computer Science and Information Security, 14(1). Sood, S. O., Churchill, E. F., & Antin, J. (2012). Automatic identification of personal insults on social news sites. Retrieved from https://pdfs.semanticscholar.org/3fa4/d63e0194cdbd909c579456830e0a7c909242.pdf. Srivastava, S. (2014). Weka: A Tool for Data Preprocessing, Classification, Ensemble, Clustering and Association Rule Mining, 88(10), 26–29. https://doi.org/10.5120/15389-3809. Tanzania. (2015). The Cybercrimes Act, 2015, (14). TCRA. (2010). The United Republic of Tanzania: Report on Internet and Data Services in Tanzania, A Supply-Side Survey, (September). Tesha, T. (2015). The Impact of Transformed Features in Automating the Swahili Document Classification. International Journal of Computer Applications, 127(16), 37–42. Tesha, T., & Baraka, K. (2015). Analysis of Tanzanian Biomass Consumption Using Artificial Neural Networks. Fundamentals of Renewable Energy and Applications, 5(4). https://doi.org/10.4172/20904541.1000169. Vandersmissen, B. (2012). Automated detection of offensive language behavior on social networking sites.
Universiteit gent. Vanhove, T., Leroux, P., Wauters, T., & Turck, F. (2013). Towards the Design of a Platform for Abuse Detection in OSNs using Multimedial Data Analysis. In Integrated Network Management. IFIP/IEEE International Symposium on Integrated Network Management (IM 2013). World Newsmedia Network. (2015). Global Social Media Trends 2015. European Publishers Council. Retrieved from http://epceurope.eu/wpcontent/uploads/2015/09/epc-trends-social-media.pdf. Wu, H. O. C., Wing, R., & Luk, P. (2008). Interpreting TF-IDF Term Weights as Making Relevance Decisions. ACM Transactions on Information Systems, 26(3), 1–37. https://doi.org/ 10.1145/1361684. 1361686. Xu, R., & Wunsch, D. (2005). Survey of Clustering Algorithms. IEEE TRANSACTIONS ON NEURAL NETWORKS,16(3), 645–678. Zhang, Y., & Haghani, A. (2015). A gradient boosting method to improve travel time prediction. TRANSPORTATION RESEARCH PART C. https:// doi.org/ 10.1016/j. trc.2015.02.019. 64 APPENDICES Appendix 1: Sample arff Message file 65 Appendix 2: Sample list of stop words na ili hivyo kwa au hiyo ama letu hizo ndiyo kwenye ikiwa haya kwamba ipo la iwe kati aghalabu kama naye ukubwa hasa katika kuwa huku kila sana ajili kuna lakini baadhi mwa si hayo sasa cha hii sisi za vya mimi yote sababu wewe yetu ni nyie ya yenye ninyi wote yenyewe siyo wa bali wengi yake hili wenye yao hivi wingi 66 yoyote zetu ndiye baadaye ingekuwa nyingi hata petu hadi hali hawa huyu hicho hizi hilo halisi baada hiki ambapo yuko huo huyo nyingine nzuri chake zote yupo wakati ikawa ambacho pia ambayo akiwa chenye ila ile ambaye pa tu zipo ziko hako nao yale vizuri vingi kingi huu kubwa watakuwa uko ukubwa nzuri kizuri ambako ambao ambazo hapa hapo nao nalo husika haba nacho nani uwe hakika halafu up ule yeye hao yapo yule huna yaweyana huko kile humo hana awapo zake wale chetu yana 67 yako wao nina yangu wapi nini zozote wangu nipo zile wana ndogo zenye wala ndio zikiwa vyao ndipo zangu vile ndivyo zaidi vema nayo yeyote upo nazo yangu una lisilo yasiyo u litakuwa yaliyokuwa tuna lini yaliyopo the lipo yamekuwa tena likiwa yaani tags lile yako send lilikuwa wetu sawa limekuwa wenyewe peke licha wenzake pekee lenye wenzao papo ni wengine pale huyo wawe ole wa wasio nzima na 68 wala tired tena aidha like sana vilevile fuck tulia zaidi gdnyt pana bali fakers kubwa lakini kwan vip wewe ama wapi ww n oho ni mpaka duh kwa and ahaa naye but kweli yeye bt kwake wao or kwani jana yangu kwanini leo jamani kote kesho weka juu usiku namba je mchana kwake is asubuhi vile isiyo au haujui itakuwa vp huyu ina ww jamani ingawa am sisi lkn 69 wenzetu kumi kwao wapo ishirini kwenda wako thelathini wakiwa arobaini wake hamsini vyote tisini vipi sitini vingine sabini upya themanini to mia moja elfu mbili sio tatu on nne mzima tano mzuri sita saba namba nane namna tisa mwao 70 Appendix 3: Corrections Report as per External Supervisors Observation Presented Chapters (a) Chapter Three Comments from External Supervisor To avoid Research Methodology Methodology literature Correction Area done by Candidate turning Mentioned part as parts review. Page number. 
Page numbers: 29 (sections 3.1–3.3). The supervisor recommended to either delete the noted parts, shift them into the literature review, or use them for justification; the mentioned parts were corrected by using citations for justification. (b) Chapter Four: Results and Discussion. Comment from external supervisor: to improve the quality of some figures. Correction done by candidate: the quality of the noted figures was improved. Page numbers: 48, 51. (c) Dissertation Presentation and Writing. Comment from external supervisor: grammar and other issues highlighted in the dissertation, including spacing, commas, and the format of figures and tables. Correction done by candidate: all of the noted issues were rectified chapter-wise throughout the entire dissertation, and the quality of some figures was improved. Page numbers: 3, 18, 19.