A PROJECT REPORT ON INFORMATION RETRIEVAL FROM MICROBLOGS DURING DISASTERS

Mini project submitted in partial fulfillment of the requirements for the award of the degree of BACHELOR OF TECHNOLOGY IN INFORMATION TECHNOLOGY (2019-2023)

BY
S. SOURAV CHOUDHARY (19241A12A7)
T. SAI RAJEEV (19241A12B4)
A. NIKHIL (19241A1261)
CHUKKA SACHIN (19241A1272)

Under the esteemed guidance of K. SWAPNIKA, Assistant Professor, Dept. of IT

DEPARTMENT OF INFORMATION TECHNOLOGY
GOKARAJU RANGARAJU INSTITUTE OF ENGINEERING AND TECHNOLOGY (AUTONOMOUS)
HYDERABAD

CERTIFICATE
This is to certify that this is a bonafide record of the Mini Project work entitled "INFORMATION RETRIEVAL FROM MICROBLOGS DURING DISASTERS" done by S. SOURAV CHOUDHARY (19241A12A7), T. SAI RAJEEV (19241A12B4), A. NIKHIL (19241A1261) and CHUKKA SACHIN (19241A1272) of B.Tech (IT) in the Department of Information Technology, Gokaraju Rangaraju Institute of Engineering and Technology, during the period 2019-2023, in partial fulfillment of the requirements for the award of the degree of BACHELOR OF TECHNOLOGY IN INFORMATION TECHNOLOGY from GRIET, Hyderabad.

K. SWAPNIKA, Assistant Professor (Internal Project Guide)
Dr. N.V. GANAPATI RAJU, Head of the Department (Project External)

ACKNOWLEDGEMENT
We take immense pleasure in expressing gratitude to our internal guide, K. SWAPNIKA, Assistant Professor, Dept. of IT, GRIET. We express our sincere thanks for her encouragement, suggestions and support, which provided the impetus and paved the way for the successful completion of the project work. We wish to express our gratitude to Dr. N.V. GANAPATI RAJU and to our Project Coordinator, K. Swanthana, for their constant support during the project. We express our sincere thanks to Dr. Jandhyala N Murthy, Director, GRIET, and Dr. J. Praveen, Principal, GRIET, for providing us a conducive environment for carrying through our academic schedules and project with ease. We also take this opportunity to convey our sincere thanks to the teaching and non-teaching staff of GRIET College, Hyderabad.

Email: souravc492@gmail.com, Contact No: 7013154142, Address: Ameerpet, Hyderabad
Email: nikhiladdla@gmail.com, Contact No: 6281474767, Address: Ameerpet, Hyderabad
Email: sairajeev1234@gmail.com, Contact No: 8639799683, Address: Nizampet, Hyderabad
Email: sachinchukka02@gmail.com, Contact No: 9347680442, Address: Lingampally, Hyderabad

DECLARATION
This is to certify that the project entitled "INFORMATION RETRIEVAL FROM MICROBLOGS DURING DISASTERS" is a bonafide work done by us in partial fulfillment of the requirements for the award of the degree BACHELOR OF TECHNOLOGY IN INFORMATION TECHNOLOGY from Gokaraju Rangaraju Institute of Engineering and Technology, Hyderabad. We also declare that this project is a result of our own effort and has not been copied or imitated from any source. Citations from any websites, books and paper publications are mentioned in the Bibliography. This work was not submitted earlier at any other University or Institute for the award of any degree.
S. SOURAV CHOUDHARY (19241A12A7)
T. SAI RAJEEV (19241A12B4)
A. NIKHIL (19241A1261)
CHUKKA SACHIN (19241A1272)

TABLE OF CONTENTS
Certificates – ii
Contents – v
Abstract – viii
1 INTRODUCTION – 1
1.1 Introduction to Project – 1
1.2 Existing System – 1
1.3 Proposed System – 2
2 REQUIREMENT ENGINEERING – 3
2.1 Hardware Requirements – 3
2.2 Software Requirements – 3
3 LITERATURE SURVEY – 4
4 TECHNOLOGY – 6
5 DESIGN REQUIREMENT ENGINEERING – 13
5.1 UML Diagrams – 13
5.2 Use-Case Diagram – 13
5.3 Sequence Diagram – 15
5.4 Activity Diagram – 16
5.5 Class Diagram – 18
5.6 Architecture – 20
6 IMPLEMENTATION – 21
6.1 Procedure – 21
6.2 Sample Code – 22
7 SOFTWARE TESTING – 30
7.1 Unit Testing – 30
7.2 Integration Testing – 30
7.3 Acceptance Testing – 31
7.4 Testing on Our System – 32
8 RESULTS – 33
9 CONCLUSION AND FUTURE ENHANCEMENTS – 35
10 BIBLIOGRAPHY – 37

LIST OF FIGURES
1 Text Classification – 2
2 Classification of Text – 9
3 Transformer-Model Architecture – 11
4 Encoders in BERT – 12
5 Use Case Diagram – 14
6 Sequence Diagram – 15
7 Activity Diagram – 17
8 Class Diagram – 18
9 Architecture – 20

ABSTRACT
In the last few years, microblogging sites like Twitter have evolved into a repository of critical situational information during various mass emergencies. However, messages posted on microblogging sites often contain non-actionable information such as sympathy and prayers for victims. Moreover, messages sometimes contain rumours and overstated facts. In such situations, identification of tweets that report some relevant and actionable information is extremely important for effective coordination of post-disaster relief operations. Thus, efficient IR methodologies are required to identify such critical information. Additionally, cross-verification of such critical information is a practical necessity to ensure its trustworthiness. Our present study provides a brief description of the tasks (research problems) given in these tracks, a summary of the methodologies of all submitted runs and, finally, a brief description of our proposed methodologies to address the research problems of the IRMiDis track.
Domain: Deep Learning, Natural Language Processing, BERT

1. INTRODUCTION
1.1 Introduction to Project
Microblogging sites like Twitter are increasingly being used for aiding relief operations during various mass emergencies. A lot of critical situational information is posted on microblogging sites during disaster events. However, messages posted on microblogging sites often contain rumors and overstated facts. In such situations, identification of claims or fact-checkable tweets, i.e., tweets that report some relevant and verifiable fact (other than sympathy or prayer), is extremely important for effective coordination of post-disaster relief operations.

1.2 Existing Systems
There has been a lot of recent interest in addressing various challenges on microblogs posted during disaster events, such as classification, summarization, event detection, and so on. Some datasets of social media posts during disasters have also been developed, but they are primarily meant for evaluating methodologies for classification among different types of posts (and not for retrieval methodologies). A few methodologies for retrieving specific types of microblogs have also been proposed, such as tweets asking for help and tweets reporting infrastructure damage. However, all such studies have used different datasets.
To our knowledge, there is no standard test collection for evaluating strategies for microblog retrieval in a disaster scenario; this work attempts to develop such a test collection. Many existing techniques are available for text processing, such as word2vec, SVM classifiers, POS tagging, word embeddings and the CBOW model.

1.3 Proposed System
The proposed system starts from the collected tweets; cleaning of the data (words, sentences) and stopword removal are then applied. A model is created with the help of a pre-trained BERT model, and a custom activation function and loss function are provided.

Fig 1 Text Classification (tweets are split into sentences and each sentence is classified as fact or non-fact)

Fig 1 shows that our system's working is based on binary classification of the text. We identify important phrases or sentences from the original text and perform encoding so that it is better used by the models.
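As a small illustration of the cleaning and stopword-removal step mentioned above, the sketch below uses NLTK's English stopword list. It is illustrative only, and the steps shown are assumptions; the preprocessing actually used in this project is listed in Section 6.2.

# Illustrative sketch: cleaning a tweet and removing stopwords (assumes NLTK).
# The project's actual preprocessing code is given in Section 6.2.
import re
import string

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))

def clean_tweet(text):
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)                  # drop URLs
    text = text.translate(str.maketrans("", "", string.punctuation))   # drop punctuation
    tokens = [w for w in text.split() if w not in STOPWORDS]           # drop stopwords
    return " ".join(tokens)

print(clean_tweet("Nepal earthquake: Tribhuvan International Airport bans landing of big aircraft http://t.co/x"))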
2. REQUIREMENT ENGINEERING
2.1 Hardware Requirements
Processor – i3 and above (64-bit OS)
Memory – 4 GB RAM (higher specs are recommended for high performance)
Input devices – Keyboard, Mouse
2.2 Software Requirements
Windows/Mac
Visual Studio Code
Python 3
TensorFlow, Keras
Scikit-learn, pandas, numpy libraries

3. LITERATURE SURVEY
Text processing using NLP was introduced in the late 1950s; before that, it was based on statistical methods, such as the work published in 1958. Those methods mainly involved selecting large blocks of text for generating relative and rational abstracts of the same. Furthermore, this work progressed to better and more persuasive results with the help of graph-based ranking models for text processing and with the Maximal Marginal Relevance (MMR) criterion, as detailed. Meanwhile, evaluation measures like Bilingual Evaluation Understudy (BLEU) were invented that determined how well an automatic summary covered the matter present in an original text. Datasets like the DUC series, Medline, the TAC series and many others were also developed so that comparing and contrasting various summarization methods would be possible for any kind of large text.

One very broad and highly active field of research in AI (artificial intelligence) is NLP: Natural Language Processing. Scientists have been trying to teach machines how to understand and even write natural languages (such as English or Chinese) since the very beginning of computer science and artificial intelligence. One of the founding fathers of artificial intelligence, Alan Turing, suggested this as a possible application for the "learning machines" he imagined as early as the late 1940s. Other pioneers, such as Claude Shannon, who founded the mathematical theory of information and communication, have also suggested natural languages as a playground for the application of information technology and computer science.

The world has moved on since the days of these early pioneers, and today we use NLP solutions without even realizing it. We live in the world Turing dreamt of, but are scarcely aware of doing so! The history of NLP is long and complex, involving several techniques once considered state of the art that are now barely remembered. Certain turning points in this history changed the field forever and focused the attention of thousands of researchers on a single path forward. In recent years, the resources required to experiment and forge new paths in NLP have largely only been available outside academia.

Such resources are most available to private hi-tech companies: hardware and large groups of researchers are more easily allocated to a particular task by Google, Facebook and Amazon than by the average university, even in the United States. Consequently, more and more new ideas arise out of big companies rather than universities. In NLP, at least two such ideas followed this pattern: the word2vec and BERT algorithms. The former is a word embedding algorithm devised by Tomas Mikolov and others in 2013. Another important class of algorithms – BERT – was published by Google researchers in 2018. Within just a few months these algorithms replaced previous NLP algorithms in the Google Search Engine. In both cases, the researchers released their solutions as open source, disclosing results, datasets and, of course, the full code. Such rapid progress and impact on widely-used products is amazing and worthy of deeper analysis, and offers hints for developers who wish to play with these new tools.

4. TECHNOLOGY
4.1 ABOUT PYTHON
Python is powerful and fast, plays well with others, is user friendly and easy to learn, and is open source. It is an all-around valuable programming language used in Dialogflow. It is used as a base for the most prominent AI-based programming in light of its versatility, straightforwardness and longstanding reputation. Python is an interpreted, object-oriented, high-level, general-purpose programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together. Python's simple, easy-to-learn syntax emphasizes readability and therefore reduces the cost of program maintenance. Python supports modules and packages, which encourages program modularity and code reuse. Often, programmers fall in love with Python because of the increased productivity it provides. Since there is no compilation step, the edit-test-debug cycle is incredibly fast. Debugging Python programs is easy: a bug or bad input will never cause a segmentation fault. Instead, when the interpreter discovers an error, it raises an exception. When the program doesn't catch the exception, the interpreter prints a stack trace. A source-level debugger allows inspection of local and global variables, evaluation of arbitrary expressions, setting breakpoints, stepping through the code a line at a time, and so on. The debugger is written in Python itself, testifying to Python's introspective power. On the other hand, often the quickest way to debug a program is to add a few print statements to the source: the fast edit-test-debug cycle makes this simple approach very effective.

4.2 APPLICATIONS OF PYTHON
One significant advantage of learning Python is that it is a general-purpose language that can be applied in a large variety of projects. Below are just some of the most common fields where Python has found its use:
Data science
Scientific and mathematical computing
Web development
Computer graphics
Basic game development
Mapping and geography (GIS software)

4.3 PYTHON IS WIDELY USED IN DATA SCIENCE
Python's ecosystem has been growing over the years and it is more and more capable of statistical analysis. It is the best compromise between scale and sophistication (in terms of data processing). Python emphasizes productivity and readability.
Python is used by programmers who want to delve into data analysis or apply statistical techniques (and by developers who turn to data science). There are plenty of Python scientific packages for data visualization, machine learning, natural language processing, complex data analysis and more. All of these factors make Python a great tool for scientific computing and a solid alternative to commercial packages such as MATLAB. The most popular libraries and tools for data science are:

4.3.1 PANDAS
The name pandas is derived from the term "panel data", an econometrics term for multidimensional structured data sets. It is a library for data manipulation and analysis. The library provides data structures and operations for manipulating numerical tables and time series. It is also known as the "Python Data Analysis Library".

4.3.2 NUMPY
NumPy is a general-purpose array-processing package. It provides a high-performance multidimensional array object and tools for working with these arrays. This is a fundamental package for scientific computing with Python, adding support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays.

4.3.3 MATPLOTLIB
Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib allows you to generate plots, histograms, power spectra, bar charts, error charts, scatterplots, and more.

4.3.4 SCIKIT-LEARN
Scikit-learn is a machine learning library. It features various classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

4.3.5 CSV
The so-called CSV (Comma Separated Values) format is the most common import and export format for spreadsheets and databases. The csv module implements classes to read and write tabular data in CSV format. It allows programmers to say, "write this data in the format preferred by Excel," or "read data from this file which was generated by Excel," without knowing the precise details of the CSV format used by Excel. Programmers can also describe the CSV formats understood by other applications or define their own special-purpose CSV formats.

4.4 Dataset Description
The data contains around 10,000 microblogs (tweets) from Twitter that were posted during disasters from 2015 to 2017, taken from IRMiDis FIRE 2021. Along with the dataset, samples of a few claim or fact-checkable tweets and non-fact-checkable tweets are provided. The dataset is a text file in the following format: Tweetid <||> Location <||> Keywords <||> Tweettext <||> Target.
Example of a claim or fact-checkable tweet: ibnlive <||> Nepal earthquake: Tribhuvan International Airport bans landing of big aircraft.

Fig 2 Classification of Text (text from microblogs is classified as Factual or Non-Factual)

Fig 2 shows that there are two kinds of text classification, namely Factual and Non-Factual.
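As an illustration of how this "<||>"-delimited text format could be read into a pandas DataFrame, a minimal sketch is given below. The file name and the column handling are assumptions; the code actually used in this project (Section 6.2) loads a CSV file directly.

# Illustrative sketch: parsing the '<||>'-delimited tweet file into a DataFrame.
# The file name 'tweets.txt' is an assumption; Section 6.2 loads a CSV instead.
import pandas as pd

columns = ["id", "location", "keyword", "text", "target"]
rows = []
with open("tweets.txt", encoding="utf-8") as f:
    for line in f:
        parts = [p.strip() for p in line.split("<||>")]
        if len(parts) == len(columns):   # skip malformed lines
            rows.append(parts)

tweets = pd.DataFrame(rows, columns=columns)
print(tweets.head())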
4.5 Working of a BERT Model
BERT is a transformer-based architecture. The transformer uses a self-attention mechanism, which is suitable for language understanding. The transformer has an encoder-decoder architecture; it is composed of modules that contain feed-forward and attention layers.

Fig 3 Transformer-Model Architecture

In order to have a deeper sense of language context, BERT uses bidirectional training. Sometimes, it is also referred to as "non-directional". So, it takes both the previous and next tokens into account simultaneously. BERT applies the bidirectional training of the Transformer to language modeling and learns text representations. BERT is just an encoder; it does not have a decoder. BERT is a multi-layered encoder. In the original paper, two models were introduced, BERT base and BERT large. BERT large has double the number of layers compared to the base model. By layers, we indicate transformer blocks. BERT-base was trained on 4 cloud-based TPUs for 4 days and BERT-large was trained on 16 TPUs for 4 days.

Fig 4 Encoders in BERT

As you can see from the above image, BERT base is a stack of 12 encoders. Each of them is a transformer block. The input has to be provided to the first encoder. The BERT encoder expects a sequence of tokens. It uses two methods: MLM (Masked LM) and NSP (Next Sentence Prediction).
MLM (Masked Language Modelling): Some percentage of words are randomly masked by replacing them with the token [MASK]. The model is then trained to predict these masked words using the context from the remaining words.
Next Sentence Prediction (NSP): To understand the relationship between two sentences, BERT uses NSP training. The model receives pairs of sentences as input, and it is trained to predict whether the second sentence is the next sentence to the first or not. During training, we provide 50-50 inputs of both cases. The assumption is that the random sentence will be disconnected from the first sentence in contextual meaning.
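To make the two pre-training objectives more concrete, the sketch below builds toy MLM and NSP training examples. It is a simplification: real BERT uses WordPiece sub-word tokens, [CLS]/[SEP] markers and an 80/10/10 masking rule, none of which are shown here.

# Simplified illustration of how MLM and NSP training examples are constructed.
# Real BERT uses WordPiece tokens, [CLS]/[SEP] markers and an 80/10/10 masking
# rule; this sketch only shows the basic idea.
import random

def make_mlm_example(sentence, mask_prob=0.15):
    tokens = sentence.split()
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok          # the model must predict the original token
            tokens[i] = "[MASK]"
    return tokens, labels

def make_nsp_example(sent_a, sent_b, corpus):
    # 50% of the time keep the true next sentence, 50% pick a random one.
    if random.random() < 0.5:
        return (sent_a, sent_b, 1)            # IsNext
    return (sent_a, random.choice(corpus), 0) # NotNext

corpus = ["relief teams reached the affected villages",
          "the airport banned landing of big aircraft"]
print(make_mlm_example("tribhuvan international airport bans landing of big aircraft"))
print(make_nsp_example(corpus[0], corpus[1], corpus))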
5. DESIGN REQUIREMENT ENGINEERING
5.1 UML Diagrams
Concept of UML: UML is a standard language for specifying, visualizing, constructing, and documenting the artifacts of software systems. UML stands for Unified Modeling Language. UML is different from common programming languages such as C++, Java, COBOL, etc.; it is a pictorial language used to make software blueprints. There are a number of goals for developing UML, but the most important is to define a general-purpose modeling language which all modelers can use and which is simple to understand and use.

5.2 Use-Case Diagram
A use case diagram in the Unified Modeling Language (UML) is a type of behavioral diagram defined by and created from a use-case analysis. Its purpose is to present a graphical overview of the functionality provided by a system in terms of actors, their goals (represented as use cases), and any dependencies between those cases. The main purpose of a use case diagram is to show which system functions are performed for which actor. The roles of the actors in the system can be depicted.

Fig 5 Use Case Diagram

Fig 5 shows the use case diagram of our system, which describes the interaction between the actors, i.e., those who interact with the subject. In our project there are mainly two actors involved, namely the User and the System, i.e., the model. This diagram shows the interaction between them. Initially, the user tweets, and we take that data from Twitter. Then the system collects the data and performs the necessary processes. Finally, the model displays the result, which the user can view; this is our required output. So, this diagram helps model the interaction between the system and the user.

5.3 Sequence Diagram
A sequence diagram in UML is a kind of interaction diagram which shows how the processes of the system operate with one another and in what order. It is constructed as a message sequence chart. Sequence diagrams are sometimes called event diagrams or timing diagrams.

Fig 6 Sequence Diagram

Fig 6 shows the sequence diagram of our system, which is an interaction diagram in which information flows from one object to another in a specified order to represent the time order of a process. It aims at a specific functionality of the model. There are mainly five lifelines in our project, namely User, MicroBlog, PreProcessor, BERT and Model. The information flows from one lifeline to the next. Initially, the user tweets on a microblogging site as input and the data is taken from the microblog (Twitter in this case). With the help of preprocessing, the data is cleaned; if it is ok, then it continues further. Then BERT encodes the text and performs NSP and MLM. Then the model performs its process in order to obtain the binary classification. So this diagram represents the sequence of our information flowing from one object to another.

5.4 Activity Diagram
The activity diagram is another important behavioral diagram in UML, used to describe the dynamic aspects of the system. An activity diagram is essentially an advanced version of a flow chart that models the flow from one activity to another.

Fig 7 Activity Diagram

Fig 7 shows the activity diagram of our system, which models the workflow of the system from one activity to another, involving different components or states like initial, final and activity states, etc. It represents the execution of the system. The extracted data is preprocessed using Python. If the given input data is cleaned, then it is loaded into the BERT preprocessor and encoder. The next activity involves loading the training data into the model. It gives the classified text. So, it is one of the UML diagrams that gives a logical representation of the model, involving branching, loops, conditions, etc.

5.5 Class Diagram
A class diagram is a static diagram. It represents the static view of an application. A class diagram is not only used for visualizing, describing, and documenting different aspects of a system but also for constructing executable code of the software application. A class diagram describes the attributes and operations of a class and also the constraints imposed on the system. Class diagrams are widely used in the modelling of object-oriented systems because they are the only UML diagrams which can be mapped directly to object-oriented languages.

Fig 8 Class Diagram

Fig 8 shows the class diagram of our system, which defines the different classes of the system. It also describes the attributes and operations of each class and the constraints imposed on the system. It normally shows the static view of the system.

5.6 Architecture

Fig 9 System Architecture

Fig 9 shows the architecture of our system, which describes all the processes, methods, functions and more that are involved in our model from the input to the output. The user can give any type of large data or text as input through a microblog. Then, using pandas, we read the CSV file. The following are the processes involved in our algorithm. The given input, which is generally text, is processed by our algorithm. Removal of unnecessary words, repeated words and noise in the text takes place. Using BERT, the text is processed using the MLM and NSP techniques; these lead to an understanding of the sentences. The sentences are encoded based on the techniques used above.
The BERT layers and the neural network layers are created using the TensorFlow library and Keras. Finally, we get the required output, which is the classified information from the given input, and the required result is shown to the user. So, these are the main processes involved in our architecture.

6. IMPLEMENTATION
6.1 Procedure
Using Colab, and after mounting the drive, the CSV file is uploaded. Then the data is cleaned by dropping unnecessary attributes. The cleaned data is then processed to remove the noise in the data. The data is then found to be imbalanced based on the class labels. A number of techniques can be used to balance the data, like oversampling, undersampling, SMOTE, clustering, etc. We have performed undersampling, and the data frame is split into test and train data. This is done with the help of the sklearn train_test_split function; then the required BERT layers are added to the model, layers like dropout and dense layers are added, and the model is created. The model is trained on the train data and then evaluated on the test data. Then a confusion matrix is obtained to understand the f1-score, precision and recall of each class label.

6.2 Sample Code
This is the main code, which shows how the classification process happens:

# Mount Google Drive in Colab and load the dataset
from google.colab import drive
drive.mount('/my-drive')

import pandas as pd
data = pd.read_csv('/my-drive/MyDrive/Colab Notebooks/MINI PROJECT/tweets.csv')
data

print(data[data['location']=='arohaonces'])

# Drop attributes that are not needed for classification
data_new = data.drop(['id','keyword','location'], axis=1)
data_new.shape
data_new.dtypes

import re
import string

def remove_punct(text):
    # Remove all punctuation characters
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

def remove_emoji(text):
    # Remove emojis and other pictographic symbols
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

def remove_url(text):
    # Remove URLs
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

def remove_html(text):
    # Remove HTML tags
    html = re.compile(r'<.*?>')
    return html.sub(r'', text)

# Clean the tweet text
data_new['text'] = data_new['text'].apply(lambda x: x.lower())
data_new['text'] = data_new['text'].apply(lambda x: remove_url(x))
data_new['text'] = data_new['text'].apply(lambda x: remove_emoji(x))

for col in data_new:
    if data_new[col].dtypes == 'object':
        print(data_new[col].unique())
        print("new")

print(data_new.text[11366])
data_new.groupby('target').count()

X = data_new.drop('target', axis='columns')
y = data_new['target']
X
y.value_counts()

# The classes are imbalanced, so undersample the majority class
df_class_1 = data_new[data_new['target']==1]
df_class_0 = data_new[data_new['target']==0]
df_class_0.shape, df_class_1.shape
df_class_0_undersam = df_class_0.sample(df_class_1.shape[0])
df_class_0_undersam.shape, df_class_1.shape
df = pd.concat([df_class_0_undersam, df_class_1])
df

# Split the balanced data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['target'], stratify=df['target'])
X_train.head(4)

!pip install tensorflow_text
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text

# Pre-trained BERT preprocessor and encoder from TensorFlow Hub
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

def get_sentence_embedding(sentences):
    preprocessed_text = bert_preprocess(sentences)
    return bert_encoder(preprocessed_text)['pooled_output']
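# 'pooled_output' is the fixed-size sentence vector (768 values for BERT base)
# that the encoder produces for the whole input; the classification layers that
# are built next are trained on this embedding. The Keras functional model
# below takes raw strings as input, converts them to embeddings with
# get_sentence_embedding(), applies dropout for regularisation, and ends in
# sigmoid dense layers that output the probability of a tweet being
# fact-checkable (target = 1).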
# BERT layer
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
output = get_sentence_embedding(text_input)

# Neural network layers
hidden1 = tf.keras.layers.Dropout(0.1, name='dropout')(output)
hidden2 = tf.keras.layers.Dense(1, activation='sigmoid', name='hidden')(hidden1)
l = tf.keras.layers.Dense(1, activation='sigmoid', name='output')(hidden2)

# Final model
model = tf.keras.Model(inputs=[text_input], outputs=[l])
model.summary()

METRICS = [
    tf.keras.metrics.BinaryAccuracy(name='accuracy'),
    tf.keras.metrics.Precision(name='precision'),
    tf.keras.metrics.Recall(name='recall'),
]

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=METRICS)

# Train and evaluate
model.fit(X_train, y_train, epochs=20)
model.evaluate(X_test, y_test)

import numpy as np
y_predicted = model.predict(X_test)
y_predicted = y_predicted.flatten()
y_predicted = np.where(y_predicted > 0.5, 1, 0)  # threshold the probabilities at 0.5

# Confusion matrix and per-class precision, recall and f1-score
from sklearn.metrics import confusion_matrix, classification_report
matrix = confusion_matrix(y_test, y_predicted)
matrix

from matplotlib import pyplot as plt
import seaborn as sn
sn.heatmap(matrix, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Truth')

print(classification_report(y_test, y_predicted))

7. SOFTWARE TESTING
7.1 Unit Testing
Unit testing is carried out for testing the modules constructed from the system design. Each part is compiled using inputs for specific modules. Every module is assembled into a larger unit during the unit testing process. Testing has been performed on each phase of project design and coding. The testing of the module interface is carried out to ensure the proper flow of information into and out of the program unit while testing. The local data structures are examined to ensure that temporarily generated output data maintains its integrity throughout the algorithm's execution. Finally, all error-handling paths are also tested.
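As a concrete illustration of unit testing at this level, the sketch below tests the text-cleaning helpers from Section 6.2 using pytest. The module name sample_code is hypothetical, and these tests are not part of the project code.

# Illustrative unit tests for the text-cleaning helpers from Section 6.2.
# Assumes pytest is installed; the module name 'sample_code' is hypothetical.
from sample_code import remove_punct, remove_url

def test_remove_url_strips_links():
    assert remove_url("flood update http://t.co/abc now") == "flood update  now"

def test_remove_punct_strips_punctuation():
    assert remove_punct("help!!! needed, urgently.") == "help needed urgently"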
7.2 Integration Testing
We usually perform system testing to find errors resulting from unanticipated interaction between the sub-systems and system components. Software must be tested to detect and rectify all possible errors once the source code is generated, before delivering it to the customers. For finding errors, a series of test cases must be developed which ultimately uncover all the possibly existing errors. Different software techniques can be used for this process. These techniques provide systematic guidance for designing tests that exercise the internal logic of the software components and exercise the input and output domains of a program to uncover errors in program function, behaviour and performance. We test the software using two methods:
White Box testing: Internal program logic is exercised using this test case design technique.
Black Box testing: Software requirements are exercised using this test case design technique.
Both techniques help in finding the maximum number of errors with minimal effort and time.

7.3 Acceptance Testing
The testing process is part of a broader subject referring to verification and validation. We must acknowledge the system specifications and try to meet the customer's requirements, and for this sole purpose we must verify and validate the product to make sure everything is in place. Verification and validation are two different things: one is performed to ensure that the software correctly implements a specific functionality, and the other is done to ensure whether the customer requirements are properly met or not by the product. Verification of the project was carried out to ensure that the project met all the requirements and specifications of our project. We made sure that our project is up to the standard we planned at the beginning of our project development.

7.4 Testing on Our System
Every model involves a coding part and a testing part, since both are needed to complete a system successfully. Our code contains Python code for classification of the text. We are using the BERT model, which is one of the techniques for NLP. We have already discussed the methodology used in our project; these modules undergo unit testing in our system. The model preprocesses the text, trains on the text and classifies it into binary values. BERT (Bidirectional Encoder Representations from Transformers) is a deep learning algorithm which can take input and assign importance (biases and weights) to better learn the flow of text. The pre-trained BERT preprocessor from TensorFlow is used to learn the features/characteristics of the sentences to the utmost. The confusion matrix provides information on the amount of data correctly predicted and the data that is not correctly predicted.

8. RESULTS
The interface contains an input option and a submit button. A user can enter sentences (tweets) and click the submit button.
Input and Output:
Tribhuvan International Airport bans landing of big aircraft. – Fact
They've gone ambulance chasing in Australia – Non-Factual
You must be annhilated! – Non-Factual
Arsonist sets cars ablaze at dealership – Fact
A other Hard hit – Non-Factual
Stones were pelted on Muslims' houses and some houses and vehicles were set ablaze. – Fact

9. CONCLUSION AND FUTURE ENHANCEMENTS
9.1 Conclusion
This work makes available to the community a novel test collection for evaluating microblog retrieval strategies for practical information needs in a disaster situation. Information retrieval is thus important to avoid any false claims and to avoid rumors. We believe that the contributed test collection will help the research community to develop better models for microblog retrieval in the future.

9.2 Future Enhancements
In this project, NLP takes a very important role in the new human-machine interface. When we look at some of the products based on these technologies, we can see that they are very advanced and very useful as well. Researchers from Carnegie Mellon University and Google have developed a new model, XLNet, for natural language processing (NLP) tasks such as reading comprehension, text classification, sentiment analysis, and others. This model outperforms BERT. Otherwise, the performance of BERT can also be increased using the BERT tokenizer. The challenge can be extended to retrieve relevant microblogs dynamically from live streams of microblogs. We plan to explore this direction in the coming years.
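To make the live-streaming idea more concrete, the sketch below shows how the trained model from Section 6.2 could be applied to a stream of incoming tweet texts in small batches. This is illustrative only: connecting to an actual Twitter stream is not shown, incoming_tweets is a stand-in, and it assumes that target 1 corresponds to fact-checkable tweets.

# Illustrative sketch: applying the trained model from Section 6.2 to a stream
# of incoming tweet texts in small batches. 'incoming_tweets' is a stand-in for
# a real tweet stream, and it is assumed that target 1 means fact-checkable.
def classify_stream(model, incoming_tweets, batch_size=32, threshold=0.5):
    """Yield (tweet, label) pairs, labelling each tweet as Fact or Non-Factual."""
    batch = []
    for tweet in incoming_tweets:
        batch.append(tweet)
        if len(batch) == batch_size:
            yield from _classify_batch(model, batch, threshold)
            batch = []
    if batch:
        yield from _classify_batch(model, batch, threshold)

def _classify_batch(model, batch, threshold):
    probs = model.predict(batch).flatten()
    for tweet, prob in zip(batch, probs):
        yield tweet, ("Fact" if prob > threshold else "Non-Factual")

# Example usage with a sentence from Section 8:
for tweet, label in classify_stream(model, ["Arsonist sets cars ablaze at dealership"]):
    print(label, "-", tweet)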
10. BIBLIOGRAPHY
[1] Imran, M., Castillo, C., Diaz, F., Vieweg, S.: Processing Social Media Messages in Mass Emergency: A Survey. ACM Computing Surveys 47(4), 67:1–67:38 (Jun 2015)
[2] Feldman, R. & Sanger, J., 2007. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. United States of America: s.n.
[3] Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA (2008)
[4] Ponte, J.M., Croft, W.B.: A language modeling approach to information retrieval. In: Proc. ACM SIGIR. pp. 275–281 (1998)
[5] Sunni, I. & Widyantoro, D. H., 2012. Analisis Sentimen dan Ekstraksi Topik Penentu. Jurnal Sarjana Institut Teknologi Bandung Bidang Teknik Elektro dan Informatika, 1(2).
[6] https://www.codemotion.com/magazine/dev-hub/machine-learning-dev/bert-how-google-changed-nlp-and-how-to-benefit-from-this/
[7] https://medium.com/analytics-vidhya/evolution-of-nlp-part-4-transformers-bert-xlnet-roberta-bd13b2371125
[8] M. Basu, K. Ghosh, S. Das, R. Dey, S. Bandyopadhyay, and S. Ghosh. 2017. Identifying Post-Disaster Resource Needs and Availabilities from Microblogs. In Proc. ASONAM
[9] Basu M, Ghosh S, Ghosh K, Choudhury M. Overview of the FIRE 2017 track: Information retrieval from microblogs during disasters (IRMiDis). In: Working Notes of FIRE 2017: Forum for Information Retrieval Evaluation (CEUR Workshop Proceedings), FIRE'17; 2017. Vol 2036, pp. 28–33. http://ceur-ws.org/Vol-2036/T2-1.pdf
[10] Basu M, Ghosh K, Das S, Bandyopadhyay S, Ghosh S. Microblog retrieval during disasters: Comparative evaluation of IR methodologies. In: Text Processing – FIRE 2016 International Workshop, Kolkata, India, December 7–10, 2016, Revised Selected Papers; 2016. pp. 20–38. https://doi.org/10.1007/978-3-319-73606-8_2
[11] https://medium.com/analytics-vidhya/understanding-bert-architecture-3f35a264b187
[12] AIDR – Artificial Intelligence for Disaster Response, 2013. https://irevolutions.org/2013/10/01/aidr-artificial-intelligence-for-disaster-response/
[13] Lin, J., Efron, M., Wang, Y., Sherman, G., Voorhees, E.: Overview of the TREC-2015 Microblog Track (2015)
[14] Strohman, T., Metzler, D., Turtle, H., Croft, W.B.: Indri: A language model-based search engine for complex queries. In: Proc. ICIA. Available at: http://www.lemurproject.org/indri/ (2004)
[15] Cleverdon, C.: The Cranfield tests on index language devices. In: Sparck Jones, K., Willett, P. (eds.) Readings in Information Retrieval, pp. 47–59. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1997)