BANGLA FAKE NEWS DETECTION USING MACHINE LEARNING, DEEP NEURAL NETWORK AND TRANSFORMER MODELS

by
Anower Hossen Zihad
18102008

BACHELOR OF SCIENCE IN COMPUTER SCIENCE AND ENGINEERING

Chittagong Independent University
Chittagong, Bangladesh
November, 2022

This project titled, "Bangla Fake News Detection using Machine Learning, Deep Neural Network and Transformer models", submitted by Anower Hossen Zihad, Roll No.: 18102008, Session: September 2018, has been accepted as satisfactory in partial fulfillment of the requirement for the degree of BACHELOR OF SCIENCE in Computer Science and Engineering on 10th November, 2022.

BOARD OF EXAMINERS

Risul Islam Rasel
Assistant Professor, SSE
HOD (CSE)
(Supervisor)

Candidate's Declaration

This is to certify that the work presented in this thesis entitled, "Bangla Fake News Detection using Machine Learning, Deep Neural Network and Transformer models", is the outcome of the research carried out by Anower Hossen Zihad under the supervision of Risul Islam Rasel, Assistant Professor, School of Science and Engineering, Chittagong Independent University, Chittagong-4000, Bangladesh.

It is also declared that neither this thesis nor any part thereof has been submitted anywhere else for the award of any degree, diploma, or other qualifications.

Signature of the Candidate
Anower Hossen Zihad
18102008

Dedication

This study is dedicated to my parents for always believing in me.

Contents

Certification
Candidate's Declaration
Dedication
List of Figures
List of Tables
Acknowledgement
Abstract

1 Introduction
  1.1 Problem Statement
  1.2 Motivation of the Study
  1.3 Scope of the Study
  1.4 Objectives of the Project

2 Literature Review

3 Data Description and Feature Selection
  3.1 Data Collection and Merging
    3.1.1 Primary Collection of Fake News [19]
    3.1.2 BanFakeNews [20]
    3.1.3 Bangla Fake-Real News Small Dataset [6]
  3.2 Data Pre-Processing
    3.2.1 Stop Words Removal
    3.2.2 Punctuation Removal
    3.2.3 Numeric Digits Removal
    3.2.4 Special Character/Emoticons Removal
  3.3 Feature Extraction
    3.3.1 Count Vectorizer
    3.3.2 TF-IDF

4 Methodology
  4.1 Machine Learning Classifier Algorithms
    4.1.1 Multinomial Naive Bayes
    4.1.2 Logistic Regression
    4.1.3 K-Nearest Neighbors
    4.1.4 Support Vector Machine
    4.1.5 Decision Tree
    4.1.6 AdaBoost
  4.2 Deep Neural Network Classifiers
    4.2.1 Convolutional Neural Network
    4.2.2 Long Short-Term Memory Network
    4.2.3 Bidirectional LSTM
    4.2.4 CNN-LSTM
    4.2.5 CNN-BiLSTM
  4.3 Transformer Classifiers
    4.3.1 BERT
5 Experimental Results
  5.1 Result on Final Dataset
  5.2 Recall and Accuracy Comparison on Separate Datasets
  5.3 Comparison of the Proposed Models with Previous Best Performing Studies
  5.4 ROC Curves and AUC of Proposed Models
    5.4.1 ROC Curves and AUC of ML Classifiers
    5.4.2 ROC Curves and AUC of DNN Classifiers
    5.4.3 ROC Curves and AUC of Transformer Models
  5.5 Developing a Web App and Integrating the Model

6 Conclusions
  6.1 Conclusions
  6.2 Future Prospects of Our Work

References

List of Figures

3.1 Sample Pre-processing
4.1 Architecture of the proposed models
4.2 K-Nearest Neighbors
4.3 Support Vector Machine
4.4 Decision Tree
4.5 AdaBoost
4.6 Convolutional Neural Network
4.7 Long Short-Term Memory unit
4.8 Bidirectional LSTM
4.9 Architecture of CNN-LSTM model
4.10 Architecture of CNN-BiLSTM model
4.11 The Transformer – Model Architecture
5.1 ROC and AUC - ML Classifiers
5.2 ROC and AUC - DNN Classifiers
5.3 ROC and AUC - Bangla-BERT
5.4 Abstract workflow of the app
5.5 Example of real news prediction in web app
5.6 Example of fake news prediction in web app

List of Tables

3.1 Dataset description
5.1 Classification report of all the models
5.2 Performance of the ML and DNN models on separate datasets
5.3 Comparison with previous two benchmark scores

Acknowledgement

I want to start by thanking Allah for His grace, which allowed me to complete my Bachelor's degree without experiencing major setbacks. I also want to express my gratitude to my project supervisor, Mr. Risul Islam Rasel, who served as my mentor all along the journey.
Without the assistance of my experienced supervisor, it would have been challenging to prepare a solid research paper. He made himself available whenever I needed help and guided me all the way from gathering data to creating classifier models and, finally, writing a research paper.

Abstract

News categorization is one of the primary applications of Text Classification. In this era of virtualization, ensuring the circulation of factual information is more necessary than ever. Even though there are hundreds of studies in this field for resource-rich languages, there has not been much significant work in Bangla, the main reasons being a lack of resources and limited language processing tools. For this study, a collection of fake news is gathered and then merged with the publicly available Bangla fake news datasets. The final dataset has 4678 news items in total, of which half are real and the other half are fake. The dataset is experimented on with multiple Machine Learning (LR, SVM, KNN, MNB, AdaBoost, and DT), Deep Neural Network (LSTM, BiLSTM, CNN, LSTM-CNN, BiLSTM-CNN), and Transformer (Bangla-BERT, m-BERT) models to attain state-of-the-art results. The three best performing models are CNN, CNN-LSTM, and BiLSTM with 95.9%, 95.5%, and 95.3% accuracy, respectively. All of the developed ML, DNN, and Transformer models are also applied to the previously existing datasets separately, and a 1.4% to 3.8% improvement in accuracy is seen. Besides accuracy, the models show a great increase in recall on fake news data compared to the previous studies.

Chapter 1
Introduction

Fake news classification is the task of identifying real and fake news given a set of news items as input. Because of the internet, e-papers, online news portals, and social media have become substitutes for newspapers in many ways. The flow of information has never been this easy. This free flow of information, however, comes at the cost of misinformation being circulated from many sources. The circulation of fake news, ill-intended agendas, or distasteful satire creates confusion, conflict, and unrest among the public. In the worst cases, it can lead to violence. We have witnessed communal violence fueled by fake news on social media many times before, from Ramu [1] to Bhola [2]. There are many other instances [3] of violence and unrest sparked by fake news on social media or online news portals. Therefore, monitoring the virtual space and filtering out fake news is a necessity.

1.1 Problem Statement

Bangla Fake News Detection using Machine Learning, Deep Neural Network and Transformer models.

1.2 Motivation of the Study

Text Classification has been a hot topic in Natural Language Processing (NLP) for a while. Over the past few decades, a variety of statistical and machine learning techniques have been used to extract significant features and correctly classify textual data. Bag of Words, TF-IDF, and word embeddings (Word2Vec) are examples of such feature extraction techniques. These features have been used to train supervised learning models like LR, SVM, or Naive Bayes to categorize documents. Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) are two examples of deep learning algorithms that have been used to extract useful features from unstructured textual data to categorize documents. Though fewer in number, there have been some good works in Bangla language processing as well.
However, modern Deep Learning and Transformer models are still not being taken full advantage of, mainly because of the scarcity of collected data and the limitations of language processing tools. In the fake news classification category, there have been a handful of works using ML models, but no mentionable work with DL and Transformer technology. Moreover, previous works in this field were done using imbalanced datasets and a limited collection of news.

1.3 Scope of the Study

Whilst expanding the news collection has been a challenge, a primary dataset is added to the previously available datasets of real and fake Bangla news. The final dataset is then balanced so that there is the same number of items for each label. In this work, I experiment with traditional ML models and DNN models, with and without ensembling. Finally, transformers (mBERT and Bangla-BERT) are applied to attain some interesting results. Comparative results of all these models are presented to determine the best performing model. Six ML, five DL, and two Transformer models are used to train on and predict fake news. Then the best performing model is integrated into a web app that can classify real or fake news data.

1.4 Objectives of the Project

The main additions of this study can be summarized as follows:

• Expand the previously available Bangla real and fake news datasets by adding collected fake news and merging all of them together.
• Build ML, DNN, and Transformer models to categorize Bangla real and fake news.
• Present comparative analyses and results of the developed models on the final dataset, as well as compare the performance of the models on separate datasets to find out whether the proposed models have better prediction accuracy than previously proposed models.
• Integrate the best performing model into a web app that can predict real or fake news data.

Chapter 2
Literature Review

There are some works on Bangla fake news classification with ML models, but DNN and Transformer models have not been utilised to their full potential yet. In this chapter, the works related to mine are featured below.

Paper: Detection of Bangla Fake News using MNB and SVM Classifier
  Field: Fake news detection | Classes: 2
  Dataset: Fake news = 993, Real news = 1548
  Algorithms (Result): SVM (96.67%), MNB (93%)
  Limitations: i. Limited data; ii. Imbalanced dataset; iii. Duplicate data

Paper: Using Social Networks to Detect Malicious Bangla Text Content
  Field: Spam detection | Classes: 2
  Dataset: Spam = 646, Ham = 1319
  Algorithms (Result): MNB (82.42%)
  Limitations: i. Very limited data; ii. Only one ML model employed

Paper: Hateful Speech Detection in Public Facebook Pages for the Bengali Language
  Field: Hate speech detection | Classes: 2
  Dataset: Hate speech = 3127, Non-hateful comments = 1999
  Algorithms (Result): SVC (48%), NB (24%), RF (52%), AB (39%), GRU-based (70%)
  Limitations: i. Underwhelming performance

Paper: Evaluating Machine Learning Algorithms For Bengali Fake News Detection
  Field: Fake news detection | Classes: 2
  Dataset: Real news = 239, Fake news = 239
  Algorithms (Result): SVM (57.32%), LR (78.6%), RF (61.1%), VEC (76.3%), GNB (84.4%)
  Limitations: i. Very limited data; ii. Only news headlines; iii. Only ML models

Paper: FakeDetect: Bangla Fake News Detection Model based on Different Machine Learning Classifiers
  Field: Fake news detection | Classes: 2
  Dataset: Real news = 49.5k, Fake news = 2.3k
  Algorithms (Result): PAC (93.8%), MNB (86.9%), SVM (93.5%), LR (92.5%), DT (86.8%), RF (93.2%)
  Limitations: i. Imbalanced dataset; ii. Duplicate data

Paper: Bengali Fake News Detection
  Field: Fake news detection | Classes: 2
  Dataset: Total = 726
  Algorithms (Result): LR (82.90%)
  Limitations: i. Limited data; ii. No clear dataset description
Paper: A Study towards Bangla Fake News Detection Using Machine Learning and Deep Learning
  Field: Fake news detection | Classes: 2
  Dataset: Real news = 55k, Fake news = 7k
  Algorithms (Result): SVM (91%), GRU (70.1%), CNN (96%), BiLSTM+FastText (96%)
  Limitations: i. Imbalanced dataset; ii. Duplicate data; iii. Reusing different versions of the same dataset

Paper: Bengali text document categorization based on very deep convolution neural network
  Field: Text categorization | Classes: 13
  Dataset: 969,000 text documents of 13 categories
  Algorithms (Result): GloVe+CNN+LSTM (76.96%), GloVe+VDCNN (95.2%), m-BERT (92.45%)
  Limitations: No mentionable limitation

Paper: Automatic Detection of Satire in Bangla Documents: A CNN Approach Based on Hybrid Feature Extraction Model
  Field: Satire detection | Classes: 2
  Dataset: Satire news = 1480, Real news = 1480
  Algorithms (Result): CNN (96%)
  Limitations: i. Limited data

Paper: Bangla Fake News Detection Based On Multichannel Combined CNN-LSTM
  Field: Fake news detection | Classes: 2
  Dataset: Real news = 48k, Fake news = 1.3k
  Algorithms (Result): CNN-LSTM (75.05%)
  Limitations: i. Imbalanced dataset; ii. Duplicate data

Paper: Bangla Text Classification using Transformers
  Field: News categorization | Classes: 6
  Dataset: 14106 news items of 6 categories
  Algorithms (Result): BERT-base (91.28%), XLM-RoBERTa-base (92.7%), XLM-RoBERTa-large (93.4%)
  Limitations: No mentionable limitation

Paper: Abusive Bangla comments detection on Facebook using transformer-based deep learning models
  Field: Spam detection | Classes: 2
  Dataset: 44001 Facebook comments in total
  Algorithms (Result): BERT (85%), ELECTRA (84.92%)
  Limitations: i. Data split unknown

Paper: Bangla-BERT: Transformer-Based Efficient Model for Transfer Learning and Language Understanding
  Field: Fake news detection | Classes: 2
  Dataset: Real news = 48k, Fake news = 1.3k
  Algorithms (Result): Bangla-BERT (Accuracy = 99.2%, Recall = 94.2%)
  Limitations: i. Imbalanced dataset; ii. Duplicate data; iii. Poor recall

Chapter 3
Data Description and Feature Selection

3.1 Data Collection and Merging

Real news data are abundant, but the scarcity of publicly available Bangla fake news collections makes the work challenging. Moreover, collecting Bangla fake news has not been easy, as all the likely fake news sources are blocked by the government. I had to use the Internet Archive (https://archive.org/) to go back in the timeline and collect fake news data. I then combined the dataset I created with the two publicly available datasets of Bangla news, both of which contain labeled real and fake news.

3.1.1 Primary collection of fake news [19]:

I have collected 500 Bangla fake news items via the Internet Archive. The news mostly comes from the domains ChannelDhaka, Earki, and Motikontho. These websites contain fake news of two sorts: misinformation and satire.

3.1.2 BanFakeNews: [20]

This dataset contains 48000 real news items from popular Bangla newspapers like Kalerkantho, Prothom Alo, etc., and 1300 fake news items from different news portals and e-papers such as Motikontho, Banglabeats, Bangaliviralnews, Shadhinbangla24, Prothombhor, and more. However, among the 1300 fake news items there are 123 duplicate entries of the same news, which are removed.

3.1.3 Bangla Fake-Real News Small Dataset: [6]

This dataset has an additional 993 fake news items and around 1500 real news items from sources similar to the other datasets. This dataset contained 45 duplicate fake news items as well. Lastly, after merging all the datasets, another duplicate check is performed to remove redundancy across the different datasets. The final dataset contains 2339 fake news items. The number of real news items is also adjusted so that it matches the number of fake news items. From the final dataset, 25% of the data is kept aside for testing and the other 75% is used for training. The final dataset has the following structure:

Number of attributes: 3
Ratio of training and test data: 75 : 25

Category     Training data size   Test data size
Real News    1754                 585
Fake News    1754                 585

Table 3.1: Dataset description
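The merging, de-duplication, balancing, and splitting steps described above could be reproduced along the following lines. This is a minimal sketch assuming pandas and scikit-learn; the file names and the 'content'/'label' column layout are hypothetical, not the actual files used in this study.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the three sources (hypothetical file names; each CSV is assumed to have
# a 'content' column and a 'label' column, where 1 = real and 0 = fake).
frames = [pd.read_csv(f) for f in
          ("fake500.csv", "banfakenews.csv", "fakereal_small.csv")]
df = pd.concat(frames, ignore_index=True)

# Remove null bodies and duplicate news within and across the merged datasets.
df = df.dropna(subset=["content"]).drop_duplicates(subset="content")

# Balance the classes by down-sampling real news to the fake news count.
fake = df[df["label"] == 0]
real = df[df["label"] == 1].sample(n=len(fake), random_state=7)
balanced = pd.concat([fake, real]).sample(frac=1, random_state=7)  # shuffle

# 75:25 train/test split, stratified so both labels keep the same ratio.
train, test = train_test_split(balanced, test_size=0.25,
                               stratify=balanced["label"], random_state=7)
print(len(train), len(test))  # 3508 and 1170 for the 4678-item corpus
```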
3.2 Data Pre-Processing:

A necessary step when working with text in Natural Language Processing (NLP) is text cleaning, or text pre-processing. Real-life human-written text often contains misspelled words, shortened words, special symbols, emojis, etc., and such noisy text data needs to be cleaned before being fed into a machine learning model. Text pre-processing also ensures better results from the classifiers in most cases. Unnecessary symbols, emoticons, or stopwords in texts do not help the models in classification and instead add noise. To reduce the noise, I removed columns that are unrelated to the study, as well as null values. All kinds of punctuation marks such as !, ', , . etc. and special characters like %, (, <, etc. are removed from the texts as well. Stopwords are removed using the BNLP toolkit's collection of Bangla stopwords. Unfortunately, I could not find a stemmer for the Bangla language that works properly, so no stemmer is used on the dataset.

Figure 3.1: Sample Pre-processing

3.2.1 Stop Words Removal

One of the preprocessing techniques most frequently utilized across many NLP applications is stop word removal. The simple idea is to exclude words that appear frequently throughout all of the corpus's documents. Pronouns and articles are typically categorized as stop words. These words are not highly discriminative because they carry little relevance in NLP tasks like information retrieval and classification. I used the BNLP toolkit's Bangla stopwords collection to eliminate these frequently used but not very polarising words.

3.2.2 Punctuation Removal

The removal of punctuation marks, which are used to separate text into sentences, paragraphs, and phrases, is an essential NLP preprocessing step. Since punctuation marks are frequently used in text, their removal affects the outcome of any text processing approach, particularly those that depend on the occurrence frequencies of words and phrases.

3.2.3 Numeric Digits Removal

Depending on the use case, numbers sometimes don't carry any essential information in the text. Fake news detection is one of those cases. Therefore, getting rid of the digits is preferable to keeping them.

3.2.4 Special Character/Emoticons Removal

Non-alphanumeric characters are known as special characters. These characters are most commonly found in comments, references, currency figures, etc. They introduce noise into the algorithms and don't improve text comprehension. Such characters and symbols can be removed using regular expressions (regex). Any emoticons are also eliminated from the texts in a similar manner.
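As a rough illustration of Sections 3.2.1 through 3.2.4, the cleaning pipeline might look like the following sketch. It assumes the BNLP toolkit's Bangla stopword list; the regex patterns and the helper name are illustrative stand-ins, not the exact ones used in this study.

```python
import re
from bnlp.corpus import stopwords  # BNLP toolkit's Bangla stopword collection

bangla_stopwords = set(stopwords)

def clean_text(text: str) -> str:
    text = re.sub(r"[!?,;:'\"()%<>*._-]", " ", text)        # punctuation/special chars
    text = re.sub(r"[0-9\u09E6-\u09EF]+", " ", text)        # English and Bangla digits
    text = re.sub(r"[\U0001F300-\U0001FAFF]+", " ", text)   # emoticon/emoji blocks
    words = [w for w in text.split() if w not in bangla_stopwords]  # stop words
    return " ".join(words)
```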
3.3 Feature Extraction:

The feature extraction process is carried out on the input texts for the Machine Learning classifier algorithms. A Count Vectorizer and a TF-IDF (Term Frequency-Inverse Document Frequency) Vectorizer are used to extract features and represent them as numerical values. The Count Vectorizer converts a given text into a vector based on the number of times each word appears across the full text. A vocabulary of 86795 feature words is created from the unique words in the corpus. TF-IDF is a metric that quantifies the significance of a word in a corpus or collection of documents. The same 86k-word vocabulary is used to extract unigram features.

3.3.1 Count Vectorizer

Characters and words are not understood by machines, so for a machine to understand text data, it must be represented in numerical form. Text can be transformed into numerical data with the CountVectorizer tool. Thanks to CountVectorizer, text data can be used directly in machine learning and deep learning models, including for text categorization.

3.3.2 TF-IDF

In information retrieval, tf-idf, short for term frequency-inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval searches, text mining, and user modeling. The tf-idf value increases proportionally with the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.

Term frequency, tf(t, d), is the relative frequency of term t within document d,

\[ \mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} \tag{3.1} \]

where f_{t,d} is the raw count of a term in a document, i.e., the number of times that term t occurs in document d. Note that the denominator is simply the total number of terms in document d (counting each occurrence of the same term separately).

The inverse document frequency, idf, is a measure of how much information the word provides, i.e., whether it is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient):

\[ \mathrm{idf}(t,D) = \log \frac{N}{|\{d \in D : t \in d\}|} \tag{3.2} \]

Then tf-idf is calculated as

\[ \mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t,D) \tag{3.3} \]

A high weight in tf-idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents.
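A minimal sketch of how the two vectorizers might be set up with scikit-learn follows; `cleaned_texts` is a hypothetical list of pre-processed news bodies, and the unigram setting mirrors the 86795-word vocabulary described above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# cleaned_texts: list of pre-processed news strings from the training split
count_vec = CountVectorizer(ngram_range=(1, 1))   # bag-of-words unigram counts
X_counts = count_vec.fit_transform(cleaned_texts)

tfidf_vec = TfidfVectorizer(ngram_range=(1, 1))   # tf-idf weighted unigrams
X_tfidf = tfidf_vec.fit_transform(cleaned_texts)

print(len(count_vec.vocabulary_))  # number of unique feature words (86795 here)
```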
Chapter 4
Methodology

Figure 4.1: Architecture of the proposed models

4.1 Machine Learning Classifier Algorithms:

Six common Machine Learning classifiers, namely Multinomial Naive Bayes (MNB), Logistic Regression (LR), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Decision Tree (DT), and AdaBoost, are applied to assess performance on the news dataset. The parameters are examined and tuned to get optimized results: alpha = 1.0 is set for MNB; LR uses the lbfgs solver with the default penalty and C value; KNN shows the best result when n_neighbors is set to 12; SVM uses the rbf kernel with random_state = 7; DT operates with the gini criterion to measure the quality of a split; and AdaBoost performs best when n_estimators is set to 100.

4.1.1 Multinomial Naive Bayes

With a multinomial event model, samples (feature vectors) represent the frequencies with which certain events have been generated by a multinomial (p_1, ..., p_n), where p_i is the probability that event i occurs (or K such multinomials in the multiclass case). A feature vector x = (x_1, ..., x_n) is then a histogram, with x_i counting the number of times event i was observed in a particular instance. This event model represents the occurrence of a word in a single document and is frequently used for document classification. The likelihood of observing a histogram x is given by

\[ p(x \mid C_k) = \frac{\left(\sum_{i=1}^{n} x_i\right)!}{\prod_{i=1}^{n} x_i!} \prod_{i=1}^{n} p_{ki}^{x_i} \tag{4.1} \]

If a given class and feature value never occur together in the training data, then the frequency-based probability estimate will be zero, because the probability estimate is directly proportional to the number of occurrences of a feature's value. This is problematic because it will wipe out all information in the other probabilities when they are multiplied. Therefore, it is often desirable to incorporate a small-sample correction, called a pseudocount, in all probability estimates such that no probability is ever set to exactly zero. This way of regularizing naive Bayes is called Laplace smoothing when the pseudocount is one, and Lidstone smoothing in the general case.

4.1.2 Logistic Regression

Predictive analytics and categorization frequently make use of this kind of statistical model, also referred to as a logit model. Based on a given dataset of independent variables, logistic regression estimates the likelihood of an event occurring, such as raining versus not raining. Given that the result is a probability, the dependent variable's range is 0 to 1. In logistic regression, the odds, that is, the probability of success divided by the probability of failure, are transformed using the logit formula. This logistic function, also frequently referred to as the log odds or the natural logarithm of odds, is expressed by the following formula:

\[ p(x) = \frac{1}{1 + e^{-(x-\mu)/s}} \tag{4.2} \]

4.1.3 K-Nearest Neighbors

K-Nearest Neighbors (KNN) is an example of a supervised learning method used for both classification and regression. To predict the proper class for the test data, KNN calculates the distance between the test data point and all of the training points. Then, the K points most similar to the test data are selected. The KNN algorithm estimates the likelihood that the test data belong to each of the K training data classes, and the class with the highest likelihood is chosen. The ideal choice of K depends on the data; typically, higher values of K lessen the impact of noise on classification but blur the lines between classes. A good K can be chosen using a variety of heuristic methods.

Figure 4.2: K-Nearest Neighbors

4.1.4 Support Vector Machine

Support Vector Machine (SVM) is a supervised machine learning approach used for both classification and regression. Although it can also address regression problems, classification is where it fits best. The goal of the SVM method is to find a hyperplane in an N-dimensional space that clearly separates the data points. The number of features determines the hyperplane's dimensionality: the hyperplane is essentially a line if there are just two input features, and it turns into a 2-D plane if there are three input features. Imagining something with more than three features gets challenging.

Figure 4.3: Support Vector Machine
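The hyperparameter choices listed at the start of Section 4.1 could be wired up roughly as follows. The sketch assumes scikit-learn and the TF-IDF features of Section 3.3; the variable names are placeholders.

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report

classifiers = {
    "MNB": MultinomialNB(alpha=1.0),
    "LR": LogisticRegression(solver="lbfgs"),        # default penalty and C
    "KNN": KNeighborsClassifier(n_neighbors=12),
    "SVM": SVC(kernel="rbf", random_state=7),
    "DT": DecisionTreeClassifier(criterion="gini"),
    "AdaBoost": AdaBoostClassifier(n_estimators=100),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                        # TF-IDF features and labels
    print(name)
    print(classification_report(y_test, clf.predict(X_test)))
```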
4.1.5 Decision Tree

A decision tree is a decision support tool that employs a tree-like model of options and their potential outcomes, including utility, resource costs, and chance outcomes. It is one way to display an algorithm that consists solely of conditional control statements. Although they are also a common technique in machine learning, decision trees are frequently employed in operations research, notably in decision analysis, to help find the strategy most likely to succeed. A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (the decision taken after computing all attributes). The paths from root to leaf represent classification rules. In decision analysis, a decision tree and the closely related influence diagram are used as visual and analytical decision support tools, where the expected values (or expected utility) of competing alternatives are calculated.

Figure 4.4: Decision Tree

4.1.6 AdaBoost

Adaptive Boosting, often known as AdaBoost, is a statistical classification meta-algorithm. Performance can be enhanced by combining it with a variety of other learning methods. The results of the other learning algorithms, the "weak learners", are merged into a weighted sum that represents the boosted classifier's final result. Although AdaBoost can be applied to multiple classes or bounded intervals on the real line, it is most often seen in binary classification. AdaBoost is adaptive in that it adjusts subsequent weak learners in favor of instances that prior classifiers incorrectly classified. In particular situations it may be less prone to overfitting than other learning algorithms. It can be demonstrated that the final model converges to a strong learner even if the performance of each individual learner is only marginally better than random guessing.

Figure 4.5: AdaBoost

4.2 Deep Neural Network Classifiers:

Different Deep Neural Networks (DNN) such as the Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (BiLSTM), and combinations of CNN with LSTM (CNN-LSTM) and CNN with BiLSTM (CNN-BiLSTM) are used to conduct experiments on our data. Before the data is fed to the DNN models, it is transformed into numerical values with the help of one-hot encoding. The maximum news length is set to 300 words, the vocabulary size to 86000, and pre-padding is used while encoding. All the models use the adam optimizer with the default learning rate of 0.001. These models are trained for 12 epochs with a batch size of 64. Early stopping (ES) is incorporated into the models to avoid overfitting; ES monitors validation accuracy with a min_delta of 0 and a patience of 3. This shared input pipeline and training setup is sketched below.
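The sketch assumes Keras; the one-hot vocabulary size, padding length, optimizer, epoch count, batch size, and early-stopping settings come from the text, while the validation split is an assumption.

```python
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping

VOCAB_SIZE, MAX_LEN = 86000, 300

# One-hot encode every cleaned news text as word indices, then pre-pad to 300 words.
encoded = [one_hot(text, VOCAB_SIZE) for text in cleaned_texts]
X = pad_sequences(encoded, maxlen=MAX_LEN, padding="pre")

es = EarlyStopping(monitor="val_accuracy", min_delta=0, patience=3)

# Every DNN model in this chapter is then compiled and trained the same way:
# model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X, y, validation_split=0.1,  # the validation split is an assumption
#           epochs=12, batch_size=64, callbacks=[es])
```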
4.2.1 Convolutional Neural Network

A Convolutional Neural Network (ConvNet/CNN) is a Deep Learning method that can take in an input image and assign importance (learnable weights and biases) to different aspects and objects in the image while also being able to distinguish between them. CNNs are regularized variants of multilayer perceptrons. Multilayer perceptrons are fully connected networks, in which every neuron in one layer is connected to every neuron in the following layer; due to this "full connectivity", such networks are vulnerable to overfitting the data. Regularization, or overfitting prevention, frequently involves penalizing training parameters (such as weight decay) or cutting connectivity (skipped connections, dropout, etc.). CNNs adopt a different strategy for regularization: they utilize the hierarchical structure in the data and assemble patterns of increasing complexity from smaller and simpler patterns embedded in their filters. CNNs are therefore at the lower end of the connectivity and complexity spectrum.

Figure 4.6: Convolutional Neural Network

A three-layer Convolutional Neural Network (CNN) is used with three different kernel sizes, namely 4, 6, and 8. These layers use 32 filters each. A dropout layer with a rate of 0.5 is introduced to avoid overfitting. Downsampling of the features is done by a max-pooling layer with pool size 2. The relu activation function adds non-linearity, and the probability distribution of the classes is calculated in the final step using an output layer with sigmoid activation.

4.2.2 Long Short-Term Memory Network

A Long Short-Term Memory network (LSTM) is a special kind of Recurrent Neural Network (RNN) capable of learning long-term dependencies. LSTM features feedback connections, as opposed to typical feedforward neural networks. Such a recurrent neural network may analyze whole data sequences (such as speech or video) in addition to single data points (such as photos). A cell, an input gate, an output gate, and a forget gate make up a standard LSTM unit. The three gates control the flow of data into and out of the cell, and the cell remembers values across arbitrary time intervals. Since there may be lags of unknown length between significant occurrences in a time series, LSTM networks are well suited to classifying, processing, and making predictions based on time series data.

Figure 4.7: Long Short-Term Memory unit

In this work, the embedding layer has 40 embedding vector features with a maximum input length of 300 words. I apply 100 LSTM hidden units, the default dropout of 0, and finally a sigmoid activation function.

4.2.3 Bidirectional LSTM

A Bidirectional LSTM (BiLSTM) is a sequence processing model made up of two LSTMs, one of which processes the data in the forward direction and the other in the backward direction. It can use data from both sides and, unlike a standard LSTM, the input flows in both directions. It is an effective tool for modeling the sequential dependencies between words and phrases in both directions of the sequence.

Figure 4.8: Bidirectional LSTM

In short, a BiLSTM adds one more LSTM layer, which reverses the direction of information flow: the input sequence flows backward in the additional LSTM layer. The outputs of the two LSTM layers are then combined using one of a variety of methods, including average, sum, multiplication, and concatenation. The implementation of the BiLSTM model is quite similar to that of the LSTM model in this study, with the same parameter values.

4.2.4 CNN-LSTM

The embedding layer is followed by a 1D convolutional layer with 32 filters of size three and a 1D max-pooling layer with pool size 2, using the relu activation function to add non-linearity. After that, an LSTM layer of 100 units is added, followed by the sigmoid activation function. A sketch of this model is given after the next subsection.

Figure 4.9: Architecture of CNN-LSTM model

4.2.5 CNN-BiLSTM

CNN-BiLSTM follows the CNN-LSTM architecture, with the only difference being that the convolutional layer is combined with a BiLSTM instead of an LSTM.

Figure 4.10: Architecture of CNN-BiLSTM model
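As an illustration, the CNN-LSTM of Section 4.2.4 might be assembled as in the sketch below (assuming Keras). The layer sizes follow the text; the 40-dimensional embedding is carried over from the LSTM setup and is an assumption for this particular model.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, Dense

model = Sequential([
    Embedding(input_dim=86000, output_dim=40, input_length=300),  # word vectors
    Conv1D(filters=32, kernel_size=3, activation="relu"),  # 1D convolution
    MaxPooling1D(pool_size=2),                             # downsampling
    LSTM(100),                                             # 100 LSTM hidden units
    Dense(1, activation="sigmoid"),                        # real vs. fake score
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```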
4.3 Transformer Classifiers:

The Transformer is a newer NLP architecture that addresses sequence-to-sequence problems while handling long-distance dependencies with ease. It relies solely on self-attention to compute its input and output representations, employing neither convolutions nor sequence-aligned RNNs.

Figure 4.11: The Transformer – Model Architecture

In the encoder block, a Multi-Head Attention layer precedes a Feed-Forward Neural Network layer. The decoder, on the other hand, features an additional Masked Multi-Head Attention layer. Several identical encoders and decoders are stacked on top of one another to form the encoder and decoder blocks; the encoder stack and the decoder stack contain an equal number of units, and this number is a hyperparameter.

4.3.1 BERT

Bidirectional Encoder Representations from Transformers (BERT) is a transformer-based machine learning approach for pre-training natural language processing (NLP) models, created by Google. At its core, BERT is a transformer language model with a variable number of encoder layers and self-attention heads. There are two models of the original English-language BERT: BERT-BASE comprises 12 encoders with 12 bidirectional self-attention heads, while BERT-LARGE comprises 24 encoders with 16 bidirectional self-attention heads.

4.3.1.1 mBERT [21]

The transformer model m-BERT is pre-trained on 104 languages, taking about 110M parameters into account. bert-base-multilingual-cased is employed and fine-tuned on the news dataset, with a batch size of 32, trained for five epochs.

4.3.1.2 Bangla-BERT [18]

Bangla-BERT is a pre-trained language model of the Bengali language built using masked language modeling. I use the sagorsarker/Bangla-bert-base model and fine-tune it so that the best performance can be obtained on our dataset. The batch size is set to 8 while training.
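Fine-tuning could proceed roughly as follows, assuming the Hugging Face transformers Trainer API. The model names and batch sizes come from the text; the truncation length, the epoch count for Bangla-BERT, and the dataset objects are assumptions.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

name = "sagorsarker/bangla-bert-base"  # "bert-base-multilingual-cased" for m-BERT
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

def tokenize(batch):
    # truncation length is an assumption; 512 is BERT's maximum input size
    return tokenizer(batch["content"], truncation=True,
                     padding="max_length", max_length=512)

args = TrainingArguments(
    output_dir="bangla-bert-fake-news",
    per_device_train_batch_size=8,   # 32 for m-BERT, as stated in the text
    num_train_epochs=5,              # stated for m-BERT, assumed for Bangla-BERT
)

# train_ds / test_ds are assumed Hugging Face datasets with 'content' and 'label'
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds.map(tokenize, batched=True),
                  eval_dataset=test_ds.map(tokenize, batched=True))
trainer.train()
```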
Chapter 5
Experimental Results

5.1 Result on Final Dataset:

We can see the performance of all the models from the three approaches (ML, DNN, and Transformer) in Table 5.1.

Method               Classifier        Pr     Re     f1-score   Acc
ML models            SVM               95.2   95.2   95.2       95.2
                     LR                94.1   94.1   94.1       94.1
                     MNB               91.6   91.5   91.4       91.5
                     KNN               88.9   88.5   88.5       88.5
                     DT                85     85     85         85
                     Adaboost          84.9   84.9   84.9       84.9
DNN models           CNN               95.9   95.9   95.9       95.9
                     LSTM              94.5   94.4   94.4       94.4
                     BiLSTM            95.3   95.3   95.3       95.3
                     CNN+LSTM          95.5   95.5   95.5       95.5
                     CNN+BiLSTM        94.9   94.9   94.9       94.9
Transformer models   Bangla-Bert-Base  94.1   94.1   94.1       93.3
                     mBert             93.9   93.9   93.9       93.8

Table 5.1: Classification report of all the models

It is evident that among all the Machine Learning approaches, SVM achieves better scores on Pr (95.2%), Re (95.2%), f1-score (95.2%), and Acc (95.2%) than all the other models. LR is the closest to SVM's numbers, with an f1-score of 94.1%. MNB, KNN, DT, and Adaboost reach accuracies of 91.5%, 88.5%, 85%, and 84.9%, respectively. The DNN models bring a significant improvement in results. The CNN model has the best accuracy, 95.9%, among all the models, but the CNN-LSTM and BiLSTM models are not far behind, with accuracies of 95.5% and 95.3%, respectively. LSTM reaches 94.4% and CNN-BiLSTM offers 94.9% accuracy. The Transformer models perform moderately on the fake news dataset: the Bangla-BERT model scores a 94.1% f1-score, whereas mBERT makes it to 93.9%.

5.2 Recall and Accuracy Comparison on Separate Datasets:

I apply the fake news detection models built for this study to the publicly available datasets separately. This helps me gauge the efficiency of my models by comparing them with previous benchmark scores.

Dataset: BanFakeNews [20] (55k real, 2k fake, with duplicates)
  Recall (fake news): MNB 46.6, LR 47.8, KNN 35.9, SVM 87.6, DT 93, AB 92.5 | LSTM 91.5, BiLSTM 99.4, CNN 94, CNN-LSTM 90.3, CNN-BiLSTM 92.9
  Accuracy: MNB 95.8, LR 97.8, KNN 97, SVM 99.4, DT 98.9, AB 99 | LSTM 99.4, BiLSTM 92.5, CNN 99.8, CNN-LSTM 99.4, CNN-BiLSTM 99.6

Dataset: Bangla Fake-Real News Small Dataset [6] (1.5k real, 1k fake, with duplicates)
  Recall (fake news): MNB 77.1, LR 88.4, KNN 88.4, SVM 91.6, DT 97, AB 96.4 | LSTM 98.8, BiLSTM 96.8, CNN 92.5, CNN-LSTM 95.6, CNN-BiLSTM 99.6
  Accuracy: MNB -, LR 95.4, KNN 95.3, SVM 96.8, DT 98.1, AB 98 | LSTM 97.5, BiLSTM 97.5, CNN 96.2, CNN-LSTM 97.6, CNN-BiLSTM 88.5

Dataset: Final Dataset (2.3k real, 2.3k fake, no duplicates)
  Recall (fake news): MNB 88.2, LR 94.1, KNN 83.3, SVM 96, DT 86, AB 85.6 | LSTM 93.8, BiLSTM 94.2, CNN 96.3, CNN-LSTM 93.5, CNN-BiLSTM 94.9
  Accuracy: MNB 91.5, LR 94.1, KNN 88.5, SVM 95.2, DT 85, AB 84.9 | LSTM 94.4, BiLSTM 95.3, CNN 95.9, CNN-LSTM 95.5, CNN-BiLSTM 94.9

Table 5.2: Performance of the ML and DNN models on separate datasets

From Table 5.2 we can see that the ML classifier models suffer badly at detecting fake news, which is why their recall on fake news is low. Despite that, these ML models still manage to score high accuracy because of the massively imbalanced datasets; the accuracy therefore does not represent actual fake news detection efficiency. On the BanFakeNews dataset, however, the DNN models perform well: the BiLSTM model reaches 99.4% recall on fake news and CNN the highest accuracy of 99.8%. The Bangla Fake-Real News Small Dataset [6] shows a similar pattern in terms of recall, with LSTM scoring 98.8% on fake news recall, whereas Decision Tree surprisingly has the best accuracy of 98.1%. The best performing model on the Final Dataset, which combines the three datasets BanFakeNews, Bangla Fake-Real News Small Dataset, and Fake500, is CNN with an accuracy of 95.9%. BiLSTM has a recall of 96.3% on fake news detection on that dataset. Due to computational limitations, the BanFakeNews dataset has not been trained with the transformer models, so that comparison is kept out of this table.

5.3 Comparison of the Proposed Models with Previous Best Performing Studies:

Dataset: BanFakeNews [20]
  Previous best performing model: Bi-LSTM with FastText - 96
  Best performing model of this study: LSTM - 99.8
Dataset: Bangla Fake-Real Small Dataset [6]
  Previous best performing model: SVM - 96.7
  Best performing model of this study: Adaboost - 98.1

Table 5.3: Comparison with previous two benchmark scores (accuracy, %)

Table 5.3 shows that, on the biggest news dataset, BanFakeNews, [12] gains up to 96% accuracy with GloVe and FastText on top of a BiLSTM model. My proposed LSTM model has a 3.8% improvement in accuracy on that dataset. The Bangla Fake-Real Small Dataset showed 96.67% accuracy with the SVM model in [6], which is very consistent with my finding for the SVM model. However, in my experiment, Adaboost performs better on this dataset with an accuracy of 98.1%, setting a better benchmark for the dataset.

5.4 ROC curves and AUC of proposed models

A Receiver Operating Characteristic (ROC) curve is a graphical representation of the True Positive Rate plotted against the False Positive Rate, which is used to show the diagnostic ability of binary classifiers. The area underneath the ROC curve is called the Area Under the Curve (AUC). The closer the curve is to the top-left corner, the better the performance. AUC scores are added to the legends of the classifiers; a sketch of how such curves can be produced follows.
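This is a sketch assuming scikit-learn and matplotlib; `clf` and the test matrices are placeholders.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Probability scores for the positive (real) class; for SVC, enable
# probability=True or use decision_function instead of predict_proba.
scores = clf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)

plt.plot(fpr, tpr, label=f"SVM (AUC = {auc:.3f})")  # AUC shown in the legend
plt.plot([0, 1], [0, 1], linestyle="--")            # chance diagonal
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```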
5.4.1 ROC curves and AUC of ML classifiers

The SVM and LR classifiers have the best ROC curves among the ML algorithms.

Figure 5.1: ROC and AUC - ML Classifiers

5.4.2 ROC curves and AUC of DNN classifiers

All the DNN classifiers have very similar curves, which makes them look like they are lying on top of one another.

Figure 5.2: ROC and AUC - DNN Classifiers

5.4.3 ROC curves and AUC of Transformer models

Bangla-BERT has an AUC score of 0.925.

Figure 5.3: ROC and AUC - Bangla-BERT

5.5 Developing a Web App and Integrating the Model

From the experimentation I have found that the CNN model has the best accuracy and f1-score among all the models. Therefore, I designed and implemented a web app and integrated this model into it, so that it can give a real-time verdict on whether a news item is real or fake. The app is deployed on Heroku at the following URL: https://bfake.herokuapp.com

To develop the web app, the following technologies have been used:

• Flask: A small-scale Python web app framework. I used it to build the backend server; a sketch of the serving endpoint is given after the workflow figure below.
• HTML, CSS, Bootstrap: To create the web page architecture and style it.
• Ajax: Used to show the prediction for the given input on the same page, without having to redirect or reload.

The internal workflow of the app is given below:

Figure 5.4: Abstract workflow of the app
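The Flask serving path might look roughly like the sketch below. The route name, model file, and pickled tokenizer are hypothetical, and a fitted Keras tokenizer stands in for the one-hot encoding step used during training.

```python
import pickle
from flask import Flask, request, jsonify
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

app = Flask(__name__)
model = load_model("cnn_fake_news.h5")        # best performing CNN model
with open("tokenizer.pkl", "rb") as f:        # text encoder fitted during training
    tokenizer = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    text = request.form["news"]               # news text sent by the Ajax call
    seq = tokenizer.texts_to_sequences([text])
    padded = pad_sequences(seq, maxlen=300, padding="pre")
    score = float(model.predict(padded)[0][0])  # nearer 1 = real, nearer 0 = fake
    return jsonify({"score": score})

if __name__ == "__main__":
    app.run()
```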
‘Detecting Opinion Spams and Fake News Using Text Classification’. Security and Privacy, vol. 1, no. 1, Jan. 2018, p. e9. DOI.org (Crossref), https://doi.org/10.1002/spy2.9. [5] Umer, Muhammad, et al. ‘Fake News Stance Detection Using Deep Learning Architecture (CNN-LSTM)’. IEEE Access, vol. 8, 2020, pp. 156695–706. IEEE Xplore, https://doi.org/10.1109/ACCESS.2020.3019735. [6] M. G. Hussain, M. Rashidul Hasan, M. Rahman, J. Protim and S. Al Hasan, ”Detection of Bangla Fake News using MNB and SVM Classifier,” 2020 International Conference on Computing, Electronics Communications Engineering (iCCECE), 2020, pp. 81-85, doi: 10.1109/iCCECE49321.2020.9231167. [7] Islam, Tanvirul, et al. ‘Using Social Networks to Detect Malicious Bangla Text Content’. 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), 2019, pp. 1–4. IEEE Xplore, https://doi.org/10.1109/ICASERT.2019.8934841. [8] Ishmam, Alvi Sharmin, Sadia. (2019). Hateful Speech Detection in Public Facebook Pages for the Bengali Language. 555-560. 10.1109/ICMLA.2019.00104. REFERENCES 35 [9] Mugdha, Shafayat Bin Shabbir, et al. ‘Evaluating Machine Learning Algorithms For Bengali Fake News Detection’. 2020 23rd International Conference on Computer and Information Technology (ICCIT), 2020, pp. 1–6. IEEE Xplore, https://doi.org/10.1109/ICCIT51783.2020.9392662. [10] F. Islam et al., ”Bengali Fake News Detection,” 2020 IEEE 10th International Conference on Intelligent Systems (IS), 2020, pp. 281-287, doi: 10.1109/IS48319.2020.9199931. [11] Sraboni, Tasnuba, et al. FakeDetect: Bangla Fake News Detection Model Based on Different Machine Learning Classifiers. Brac University, 2021. dspace.bracu.ac.bd, http://dspace.bracu.ac.bd/xmlui/handle/10361/14979. [12] Hossain, Elias Kaysar, Md Jalal Uddin Joy, Abu Zahid Rahman, Md. Mizanur Rahman, Md. (2021). A Study Towards Bangla Fake News Detection Using Machine Learning and Deep Learning. 10.1007/978-981-16-5157-1 7. [13] M. Z. H. George, N. Hossain, M. R. Bhuiyan, A. K. M. Masum and S. Abujar, ”Bangla Fake News Detection Based On Multichannel Combined CNNLSTM,” 2021 12th International Conference on Computing Communication and Networking Technologies (ICCCNT), 2021, pp. 1-5, doi: 10.1109/ICCCNT51525.2021.9580035. [14] Sen, Arnab Mridul, Maruf Islam, Md Saiful. (2019). Automatic Detection of Satire in Bangla Documents: A CNN Approach Based on Hybrid Feature Extraction Model. [15] Hossain, Md. Rajib, et al. ‘Bengali Text Document Categorization Based on Very Deep Convolution Neural Network’. Expert Systems with Applications, vol. 184, Dec. 2021, p. 115394. ScienceDirect, https://doi.org/10.1016/j.eswa.2021.115394. [16] Alam, Tanvirul Khan, Akib Alam, Firoj. (2020). Bangla Text Classification using Transformers. [17] Aurpa, Tanjim Taharat, et al. ‘Abusive Bangla Comments Detection on Facebook Using Transformer-Based Deep Learning Models’. Social Network Analysis and Mining, vol. 12, no. 1, Dec. 2021, p. 24. Springer Link, https://doi.org/10.1007/s13278-021-00852-x. [18] M. Kowsher, A. A. Sami, N. J. Prottasha, M. S. Arefin, P. K. Dhar and T. Koshiba, ”Bangla-BERT: Transformer-Based Efficient Model for Transfer Learning and REFERENCES 36 Language Understanding,” in IEEE Access, vol. 10, pp. 91855-91870, 2022, doi: 10.1109/ACCESS.2022.3197662. [19] Anower Hossen Zihad. 500 Bangla Fake News. Kaggle. DOI.org (Datacite), https://doi.org/10.34740/KAGGLE/DSV/4222728. Accessed 19 Sept. 2022. [20] BanFakeNews. https://www.kaggle.com/datasets/cryptexcode/banfakenews. 
Accessed 19 Sept. 2022.

[21] Devlin, Jacob, et al. 'BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding'. 2018. https://doi.org/10.48550/ARXIV.1810.04805.