A PROJECT REPORT ON
INFORMATION RETRIEVAL FROM
MICROBLOGS DURING DISASTERS
Mini project submitted in partial fulfillment of the requirements for the
award of the degree of
BACHELOR OF TECHNOLOGY
IN
INFORMATION TECHNOLOGY
(2019-2023)
BY
S. SOURAV CHOUDHARY    19241A12A7
T. SAI RAJEEV          19241A12B4
A. NIKHIL              19241A1261
CHUKKA SACHIN          19241A1272

Under the esteemed guidance of
K. SWAPNIKA,
Assistant Professor,
Dept. of IT
DEPARTMENT OF INFORMATION TECHNOLOGY
GOKARAJU RANGARAJU INSTITUTE OF ENGINEERING AND
TECHNOLOGY
(AUTONOMOUS)
HYDERABAD
CERTIFICATE
This is to certify that this is a bonafide record of the Mini Project work entitled “INFORMATION RETRIEVAL FROM MICROBLOGS DURING DISASTERS” done by S. SOURAV CHOUDHARY (19241A12A7), T. SAI RAJEEV (19241A12B4), A. NIKHIL (19241A1261) and CHUKKA SACHIN (19241A1272) of B.Tech (IT) in the Department of Information Technology, Gokaraju Rangaraju Institute of Engineering and Technology, during the period 2019-2023, in partial fulfillment of the requirements for the award of the degree of BACHELOR OF TECHNOLOGY IN INFORMATION TECHNOLOGY from GRIET, Hyderabad.
K. SWAPNIKA,
Assistant Professor
(Internal Project Guide)

Dr. N. V. GANAPATI RAJU,
Head of the Department
(Project External)
ACKNOWLEDGEMENT
We take immense pleasure in expressing our gratitude to our internal guide, K. SWAPNIKA, Assistant Professor, Department of IT, GRIET. We express our sincere thanks for her encouragement, suggestions and support, which provided the impetus and paved the way for the successful completion of the project work.

We wish to express our gratitude to Dr. N. V. GANAPATI RAJU, Head of the Department, and our Project Coordinator, K. Swanthana, for their constant support during the project.

We express our sincere thanks to Dr. Jandhyala N. Murthy, Director, GRIET, and Dr. J. Praveen, Principal, GRIET, for providing us a conducive environment for carrying through our academic schedules and project with ease.

We also take this opportunity to convey our sincere thanks to the teaching and non-teaching staff of GRIET, Hyderabad.
Email: souravc492@gmail.com
Contact No: 7013154142
Address: Ameerpet, Hyderabad

Email: nikhiladdla@gmail.com
Contact No: 6281474767
Address: Ameerpet, Hyderabad

Email: sairajeev1234@gmail.com
Contact No: 8639799683
Address: Nizampet, Hyderabad

Email: sachinchukka02@gmail.com
Contact No: 9347680442
Address: Lingampally, Hyderabad
DECLARATION
This is to certify that the project entitled “INFORMATION RETRIEVAL FROM MICROBLOGS DURING DISASTERS” is a bonafide work done by us in partial fulfillment of the requirements for the award of the degree of BACHELOR OF TECHNOLOGY IN INFORMATION TECHNOLOGY from Gokaraju Rangaraju Institute of Engineering and Technology, Hyderabad.
We also declare that this project is a result of our own effort and has not been
copied or imitated from any source. Citations from any websites, books and
paper publications are mentioned in the Bibliography.
This work was not submitted earlier at any other University or Institute for
the award of any degree.
S. SOURAV CHOUDHARY    19241A12A7
T. SAI RAJEEV          19241A12B4
A. NIKHIL              19241A1261
CHUKKA SACHIN          19241A1272
TABLE OF CONTENTS
Serial No   Name                                     Page No
            Certificates                             ii
            Contents                                 v
            Abstract                                 viii
1           INTRODUCTION                             1
1.1         Introduction to Project                  1
1.2         Existing System                          1
1.3         Proposed System                          2
2           REQUIREMENT ENGINEERING                  3
2.1         Hardware Requirements                    3
2.2         Software Requirements                    3
3           LITERATURE SURVEY                        4
4           TECHNOLOGY                               6
5           DESIGN REQUIREMENT ENGINEERING           13
5.1         UML Diagrams                             13
5.2         Use-Case Diagram                         13
5.3         Sequence Diagram                         15
5.4         Activity Diagram                         16
5.5         Class Diagram                            18
5.6         Architecture                             20
6           IMPLEMENTATION                           21
6.1         Procedure                                21
6.2         Sample Code                              22
7           SOFTWARE TESTING                         30
7.1         Unit Testing                             30
7.2         Integration Testing                      30
7.3         Acceptance Testing                       31
7.4         Testing on Our System                    32
8           RESULTS                                  33
9           CONCLUSION AND FUTURE ENHANCEMENTS       35
10          BIBLIOGRAPHY                             37
11. LIST OF FIGURES
S No   Figure Name                        Page No
1      Text Classification                2
2      Classification of Text             9
3      Transformer-model Architecture     11
4      Encoders in a BERT                 12
5      Use Case Diagram                   14
6      Sequence Diagram                   15
7      Activity Diagram                   17
8      Class Diagram                      18
9      Architecture                       20
ABSTRACT
In the last few years, microblogging sites like Twitter have evolved into a repository of critical situational information during various mass emergencies. However, messages posted on microblogging sites often contain non-actionable information such as sympathy and prayers for victims. Moreover, messages sometimes contain rumours and overstated facts. In such situations, identification of tweets that report relevant and actionable information is extremely important for effective coordination of post-disaster relief operations. Thus, efficient IR methodologies are required to identify such critical information. Additionally, cross-verification of such critical information is a practical necessity to ensure its trustworthiness. Our present study provides a brief description of the tasks (research problems) given in these tracks, a summary of the methodologies of all submitted runs and, finally, a brief description of our proposed methodologies to address the research problems of the IRMiDis track.

Domain: Deep Learning, Natural Language Processing, BERT
1. INTRODUCTION
1.1 Introduction to Project
Microblogging sites like Twitter are increasingly being used for aiding relief
operations during various mass emergencies.
A lot of critical situational information is posted on microblogging sites during
disaster events. However, messages posted on microblogging sites often contain
rumors and overstated facts.
In such situations, identification of claims or fact-checkable tweets, i.e., tweets that report some relevant and verifiable fact (other than sympathy or prayer), is extremely important for effective coordination of post-disaster relief operations.
1.2 Existing Systems
There has been a lot of recent interest in addressing various challenges on microblogs posted during disaster events, such as classification, summarization, event detection, and so on. Some datasets of social media posts during disasters have also been developed, but they are primarily meant for evaluating methodologies for classification among different types of posts (and not for retrieval methodologies). A few methodologies for retrieving specific types of microblogs have also been proposed, such as tweets asking for help and tweets reporting infrastructure damage. However, all such studies have used different datasets. To our knowledge, there is no standard test collection for evaluating strategies for microblog retrieval in a disaster scenario; this work attempts to develop such a test collection. Many existing techniques are available for text processing, like word2vec, SVM classifiers, POS tagging, word embeddings and the CBOW model.
1.3 Proposed System
The proposed system starts from the collected tweets; the data (words and sentences) is then cleaned and stopwords are removed. A model is created with the help of a pre-trained BERT model, and a custom activation function and loss function are provided.

[Figure: tweets are split into sentences and passed to a text classifier that labels each sentence as Fact or Non-Fact.]
Fig 1 Text Classification

Fig 1 shows that our system's working is based on binary classification of the text. We identify important phrases or sentences from the original text and perform encoding so that the text is better used by the model.
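As an illustration of the kind of classification head and loss described above, the following minimal sketch (an assumption for illustration only; the project's actual code appears in Section 6.2) attaches a dropout layer, a sigmoid output and a binary cross-entropy loss to 768-dimensional sentence embeddings using Keras:

import tensorflow as tf

# Minimal sketch: a binary classification head over 768-dimensional
# sentence embeddings (the size produced by BERT base).
embeddings = tf.keras.layers.Input(shape=(768,), name="sentence_embedding")
x = tf.keras.layers.Dropout(0.1)(embeddings)
out = tf.keras.layers.Dense(1, activation="sigmoid", name="fact_probability")(x)

model = tf.keras.Model(inputs=embeddings, outputs=out)
model.compile(optimizer="adam",
              loss="binary_crossentropy",   # loss for the fact / non-fact labels
              metrics=["accuracy"])
model.summary()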
2. REQUIREMENT ENGINEERING
2.1 Hardware Requirements
 Processor – i3 and above (64-bit OS).
 Memory – 4GB RAM (Higher specs are recommended for high performance)
 Input devices – Keyboard, Mouse
2.2 Software Requirements
 Windows/Mac
 Visual studio code
 Python3
 TensorFlow, Keras
 scikit-learn, pandas, NumPy libraries
3. LITERATURE SURVEY
Text processing using NLP was introduced during the late 1950s; before that, it was based on statistical methods published in 1958. Those methods mainly involved selecting large blocks of text to generate relative and rational abstracts for them. Furthermore, this work progressed to better and more persuasive results with the help of graph-based ranking models for text processing and the Maximal Marginal Relevance (MMR) criterion. Meanwhile, evaluation measures like Bilingual Evaluation Understudy (BLEU) were invented, which determined how well an automatic summary covered the matter present in an original text. Datasets like the DUC series, Medline, the TAC series and many others were also developed so that comparison and contrasting of various summarization methods would be possible for any kind of large text.
One very broad and highly active field of research in AI (artificial
intelligence) is NLP: Natural Language Processing. Scientists have been trying to
teach machines how to understand and even write natural languages (such as English
or Chinese) since the very beginning of computer science and artificial intelligence.
One of the founding fathers of artificial intelligence, Alan Turing, suggested this as
a possible application for the “learning machines” he imagined as early as the late
1940s. Other pioneers, such as Claude Shannon, who founded the mathematical
theory of information and communication, have also suggested natural languages as
a playground for the application of information technology and computer science.
The world has moved on since the days of these early pioneers, and today we
use NLP solutions without even realizing it. We live in the world Turing dreamt of,
but are scarcely aware of doing so!
The history of NLP is long and complex, involving several techniques once considered state of the art that are now barely remembered. Certain turning points in this history changed the field forever and focused the attention of thousands of researchers on a single path forward. In recent years, the resources required to experiment and forge new paths in NLP have largely only been available outside academia. Such resources are most available to private hi-tech companies: hardware and large groups of researchers are more easily allocated to a particular task by Google, Facebook and Amazon than by the average university, even in the United States. Consequently, more and more new ideas arise out of big companies rather than universities. In NLP, at least two such ideas followed this pattern: the word2vec and BERT algorithms.
The former is a word embedding algorithm devised by Tomas Mikolov and others
in 2013. Another important class of algorithms – BERT – was published
by Google researchers in 2018. Within just a few months these algorithms replaced
previous NLP algorithms in the Google Search Engine. In both cases, the
researchers released their solutions as open source, disclosing results, datasets and
of course, the full code.
Such rapid progress and impact on widely used products is amazing and worthy of deeper analysis. The following sections offer hints for developers who wish to work with this new tool.
4. TECHNOLOGY
4.1 ABOUT PYTHON
Python is powerful and fast, plays well with others, is user-friendly and easy to learn, and is open source. It is an all-around valuable programming language used in Dialogflow. It is used as a base for much of the most prominent AI-based programming in light of its versatility, straightforwardness and longstanding reputation. Python is an interpreted, high-level, general-purpose programming language.
Python is an interpreted, object-oriented, high-level programming language
with dynamic semantics. Its high-level built in data structures, combined with
dynamic typing and dynamic binding, make it very attractive for Rapid Application
Development, as well as for use as a scripting or glue language to connect existing
components together. Python's simple, easy to learn syntax emphasizes readability
and therefore reduces the cost of program maintenance. Python supports modules
and packages, which encourages program modularity and code reuse. Often,
programmers fall in love with Python because of the increased productivity it
provides. Since there is no compilation step, the edit-test-debug cycle is incredibly
fast. Debugging Python programs is easy: a bug or bad input will never cause a
segmentation fault. Instead, when the interpreter discovers an error, it raises an
exception. When the program doesn't catch the exception, the interpreter prints a
stack trace. A source level debugger allows inspection of local and global variables,
evaluation of arbitrary expressions, setting breakpoints, stepping through the code a
line at a time, and so on. The debugger is written in Python itself, testifying to
Python's introspective power. On the other hand, often the quickest way to debug a
program is to add a few print statements to the source: the fast edit-test-debug cycle
makes this simple approach very effective.
4.2 APPLICATIONS OF PYTHON
One significant advantage of learning Python is that it’s a general-purpose
language that can be applied in a large variety of projects. Below are just some of
the most common fields where Python has found its use:
 Data science
 Scientific and mathematical computing
 Web development
 Computer graphics
 Basic game development
 Mapping and geography (GIS software)
4.3 PYTHON IS WIDELY USED IN DATA SCIENCE
Python's ecosystem has been growing over the years and is more and more capable of statistical analysis. It is the best compromise between scale and sophistication (in terms of data processing). Python emphasizes productivity and readability. Python is used by programmers who want to delve into data analysis or apply statistical techniques (and by developers who turn to data science). There are plenty of Python scientific packages for data visualization, machine learning, natural language processing, complex data analysis and more. All of these factors make Python a great tool for scientific computing and a solid alternative to commercial packages such as MATLAB. The most popular libraries and tools for data science are:
4.3.1 PANDAS
The name pandas is derived from the term “panel data”, an econometrics term for multidimensional structured data sets. It is a library for data manipulation and analysis. The library provides data structures and operations for manipulating numerical tables and time series. It is also known as the “Python Data Analysis Library”.
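A minimal usage sketch (the file name tweets.csv and the target column are assumptions for illustration; the project's actual loading code is in Section 6.2):

import pandas as pd

# Load a CSV of tweets into a DataFrame (hypothetical file name).
df = pd.read_csv("tweets.csv")
print(df.shape)                       # (number of rows, number of columns)
print(df["target"].value_counts())    # distribution of the class labels
print(df.head())                      # first few rows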
4.3.2 NUMPY
NumPy is a general-purpose array-processing package. It provides a high-performance multidimensional array object and tools for working with these arrays. It is a fundamental package for scientific computing with Python, adding support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays.
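A small sketch of the array operations mentioned above (illustration only):

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # a 2-D array with 2 rows and 3 columns
print(a.shape)                          # (2, 3)
print(a.mean(axis=0))                   # column means: [2.5 3.5 4.5]
print(np.dot(a, a.T))                   # matrix product via a high-level routine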
4.3.3 MATPLOTLIB
Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib allows you to generate plots, histograms, power spectra, bar charts, error charts, scatter plots, and more.
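A small sketch of a bar chart of class counts (the counts are made up purely for illustration):

import matplotlib.pyplot as plt

labels = ["Fact", "Non-Fact"]
counts = [120, 95]                  # made-up counts, for illustration only

plt.bar(labels, counts)             # draw one bar per class
plt.xlabel("Class")
plt.ylabel("Number of tweets")
plt.title("Example class distribution")
plt.show()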
4.3.4 SCIKIT-LEARN
Scikit-learn is a machine learning library. It features various classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
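As a sketch of the kind of classical baseline that scikit-learn supports (a linear SVM over TF-IDF features; this is not the BERT model used in the project, and the four example tweets are invented):

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

texts = ["airport bans landing of big aircraft",
         "prayers for the victims of the earthquake",
         "relief trucks dispatched to the affected district",
         "thoughts are with everyone affected"]
labels = [1, 0, 1, 0]                      # 1 = fact-checkable, 0 = not

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, stratify=labels, random_state=0)

vec = TfidfVectorizer()                    # bag-of-words features weighted by TF-IDF
clf = LinearSVC()                          # linear support vector classifier
clf.fit(vec.fit_transform(X_train), y_train)
print(classification_report(y_test, clf.predict(vec.transform(X_test))))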
4.3.5 CSV
The so-called CSV (Comma Separated Values) format is the most common import
and export format for spreadsheets and databases. The csv module implements
classes to read and write tabular data in CSV format. It allows programmers to say,
“write this data in the format preferred by Excel,” or “read data from this file which
was generated by Excel,” without knowing the precise details of the CSV format
used by Excel. Programmers can also describe the CSV formats understood by other
applications or define their own special-purpose CSV format.
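A minimal sketch of reading and writing CSV with the csv module (the file names and column names are assumptions for illustration):

import csv

# Read a hypothetical tweets.csv with "text" and "target" columns.
with open("tweets.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):          # the header row supplies the field names
        print(row["text"], "->", row["target"])

# Write predictions back out in a format Excel understands.
with open("predictions.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "predicted_label"])
    writer.writerow(["airport bans landing of big aircraft", "fact"])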
4.4 Dataset Description
The data contains around 10,000 microblogs (tweets) from Twitter that were posted during disasters in the years 2015-2017, taken from IRMiDis FIRE 2021. Along with the dataset, samples of a few claims (fact-checkable tweets) and non-fact-checkable tweets are provided.
The dataset is a text file in the following format:
Tweetid <||> Location <||> Keywords <||> Tweettext <||> Target
Example of claim or fact-checkable:
ibnlive <||> Nepal earthquake:
Tribhuvan International Airport bans landing of big aircraft.
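A small sketch of parsing one record of this "<||>"-separated format (the full record below is invented to show all five fields; the real files should be checked for the exact layout):

# Hypothetical five-field record in the format described above.
RECORD = "123456 <||> Kathmandu <||> earthquake <||> Tribhuvan International Airport bans landing of big aircraft. <||> 1"

def parse_record(line):
    # Split one line of the dataset into its five fields.
    tweet_id, location, keywords, text, target = [p.strip() for p in line.split("<||>")]
    return {"id": tweet_id, "location": location, "keywords": keywords,
            "text": text, "target": int(target)}

print(parse_record(RECORD))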
[Figure: text from microblogs is divided into two classes, Factual and Non-Factual.]
Fig 2 Classification of Text

Fig 2 shows that there are two kinds of text classification, namely Factual and Non-Factual.
4.5 Working of a BERT Model
BERT is a transformer-based architecture. The Transformer uses a self-attention mechanism, which is suitable for language understanding. The Transformer has an encoder-decoder architecture; it is composed of modules that contain feed-forward and attention layers.
Fig 3 Transformer-model Architecture
In order to have a deeper sense of language context, BERT uses bidirectional training. Sometimes, it is also referred to as "non-directional": it takes both the previous and the next tokens into account simultaneously. BERT applies the bidirectional training of the Transformer to language modelling and learns the text representations. BERT is just an encoder; it does not have a decoder.
BERT is a multi-layered encoder. In the original BERT paper, two models were introduced, BERT base and BERT large. BERT large has double the layers compared to the base model; by layers, we mean transformer blocks. BERT base was trained on 4 cloud TPUs for 4 days and BERT large was trained on 16 TPUs for 4 days.
Fig 4 Encoders in a BERT
As you can see from the above image, the BERT base is a stack of 12 encoders. Each
of them is a transformer block. The input has to be provided to the first encoder. The
BERT encoder expects a sequence of tokens.
It uses two methods: MLM (Masked LM) and NSP (Next Sentence Prediction).
MLM (Masked Language Modelling):
Some percentage of the words are randomly masked by replacing them with the token [MASK]. The model is then trained to predict these masked words using the context from the remaining words.

Next Sentence Prediction (NSP):
To understand the relationship between two sentences, BERT uses NSP training. The model receives pairs of sentences as input and is trained to predict whether the second sentence follows the first or not. During training, we provide a 50-50 split of both cases. The assumption is that a random sentence will be disconnected from the first sentence in contextual meaning.
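As a quick way to see masked language modelling in action, the Hugging Face transformers library (not used in this project, which relies on TensorFlow Hub; shown here only as an illustration) exposes a fill-mask pipeline:

from transformers import pipeline

# BERT predicts the hidden word from both its left and right context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The airport bans [MASK] of big aircraft."):
    print(prediction["token_str"], round(prediction["score"], 3))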
5. DESIGN REQUIREMENT ENGINEERING
5.1 UML Diagrams

UML stands for Unified Modeling Language. It is a standard language for specifying, visualizing, constructing, and documenting the artifacts of software systems. UML is different from common programming languages such as C++, Java and COBOL; it is a pictorial language used to make software blueprints. There are a number of goals for developing UML, but the most important is to define a general-purpose modeling language which all modelers can use and which is simple to understand and use.

5.2 Use-Case Diagram:
A use case diagram in the Unified Modeling Language (UML) is a type of behavioral diagram defined by and created from a use-case analysis. Its purpose is to present a graphical overview of the functionality provided by a system in terms of actors, their goals (represented as use cases), and any dependencies between those use cases. The main purpose of a use case diagram is to show which system functions are performed for which actor. The roles of the actors in the system can also be depicted.
Fig 5 Use Case Diagram
Fig 5 shows the use case diagram of our system, which describes the interaction between the actors, i.e., the ones who interact with the subjects. In our project there are mainly two actors involved, namely the User and the System (the Model), and the diagram shows the interactions between them. Initially, the User tweets, and we take that data from Twitter. The System then collects the data and performs the necessary processing. Finally, the Model displays the result, which the user can view as the required output. So, this diagram helps model the interaction between the system and the user.
5.3 Sequence Diagram:
A sequence diagram in UML is a kind of interaction diagram which shows how the processes of the system operate with one another and in what order. It is constructed as a message sequence chart. Sequence diagrams are sometimes called event diagrams or timing diagrams.
Fig 6 Sequence Diagram
Fig 6 shows the sequence diagram of our system, which is an interaction diagram in which information flows from one object to another in a specified order to represent the time ordering of a process. It aims at a specific functionality of the model.

There are mainly five timelines in our project, namely User, MicroBlog, PreProcessor, BERT and Model, and the information flows from one timeline to the next. Initially, the User tweets on a microblog site as input and the data is taken from the microblog (Twitter in this case). With the help of preprocessing, the data is cleaned; if it is acceptable, processing continues further. BERT then encodes the text and performs NSP and MLM. The model then performs its processing in order to produce the binary classification. So this diagram represents the sequence of information flowing from one object to another.
5.4 Activity Diagram:
The activity diagram is another important behavioral diagram in UML, used to describe the dynamic aspects of the system. An activity diagram is essentially an advanced version of a flow chart that models the flow from one activity to another.
Fig 7 Activity Diagram
Fig 7 shows the activity diagram of our system, which helps model the workflow of the system from one activity to another, involving different components or states such as initial, final and activity states. It represents the execution of the system.

The extracted data is preprocessed using Python. If the given input data is clean, it is loaded into the BERT preprocessor and encoder. The next activity involves loading the training data into the model, which then gives the classified text. So, it is one of the UML diagrams that gives a logical representation of a model, involving branching, loops, conditions, etc.
5.5 Class Diagram:
Class diagram is a static diagram. It represents the static view of an application.
Class diagram is not only used for visualizing, describing, and documenting
different aspects of a system but also for constructing executable code of the
software application.
A class diagram describes the attributes and operations of a class and also the constraints imposed on the system. Class diagrams are widely used in the modelling of object-oriented systems because they are the only UML diagrams which can be mapped directly to object-oriented languages.
Fig 8 Class Diagram
Fig 8 shows the class diagram of our system, which defines the different classes of the system. It also describes the attributes and operations of each class and the constraints imposed on the system. It shows the static view of the system.
5.6 Architecture
Fig 9 System Architecture
Fig 9 shows the architecture of our system, which describes all the processes, methods and functions involved in our model from input to output. The user can give any type of large data or text as input through a microblog. Then, using pandas, we read the CSV file. The following are the processes involved in our algorithm.

The given input, which is generally text, is processed by our algorithm. Removal of unnecessary words, repeated words and noise in the text takes place. Using BERT, the text is processed with the MLM and NSP techniques, which lead to an understanding of the sentences. The sentences are encoded based on the techniques described above. The BERT layers and the neural network layers are created using the TensorFlow library and Keras. Finally, we get the required output, which is the classified information from the given input, and the result is shown to the user. So, these are the main processes involved in our architecture.
6. IMPLEMENTATION
6.1 Procedure:
Using Colab, after mounting Google Drive, the CSV file is loaded. The data is then cleaned by dropping unnecessary attributes, and the cleaned data is processed to remove noise.
The data is then found to be imbalanced with respect to the class labels. A number of techniques can be used to balance the data, such as oversampling, undersampling, SMOTE, clustering, etc. We performed undersampling, and the data frame was split into test and train data. This is done with the help of the sklearn function train_test_split. The required BERT layers are then added to the model, along with dropout and dense layers, and the model is created. The model is trained on the training data and then evaluated on the test data. A confusion matrix is then obtained to understand the F1-score, precision and recall of each class label.
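As a reminder of how precision, recall and F1-score follow from a binary confusion matrix, here is a small sketch on toy labels (the project computes the same quantities on the BERT model's test predictions, as shown in Section 6.2):

from sklearn.metrics import confusion_matrix, classification_report

y_true = [1, 0, 1, 1, 0, 0, 1, 0]      # toy ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]      # toy predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)             # 3 / 4 = 0.75
recall = tp / (tp + fn)                # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
print(classification_report(y_true, y_pred))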
6.2 Sample code
This is the main code, which shows how the classification process is carried out.
from google.colab import drive
import pandas as pd

# Mount Google Drive and load the tweets dataset.
drive.mount('/my-drive')
data = pd.read_csv('/my-drive/MyDrive/Colab Notebooks/MINI PROJECT/tweets.csv')
data

print(data[data['location'] == 'arohaonces'])

# Drop attributes that are not needed for classification.
data_new = data.drop(['id', 'keyword', 'location'], axis=1)
data_new.shape
data_new.dtypes

import re
import string

def remove_punct(text):
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

def remove_emoji(text):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

def remove_url(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

def remove_html(text):
    html = re.compile(r'<.*?>&')
    return html.sub(r'', text)

# Clean the tweet text: lower-case it and strip URLs and emojis.
data_new['text'] = data_new['text'].apply(lambda x: x.lower())
data_new['text'] = data_new['text'].apply(lambda x: remove_url(x))
data_new['text'] = data_new['text'].apply(lambda x: remove_emoji(x))

for col in data_new:
    if data_new[col].dtypes == 'object':
        print(data_new[col].unique())
        print("new")

print(data_new.text[11366])
data_new.groupby('target').count()

X = data_new.drop('target', axis='columns')
y = data_new['target']
X
y.value_counts()

# Undersample the majority class (target == 0) to balance the dataset.
df_class_1 = data_new[data_new['target'] == 1]
df_class_0 = data_new[data_new['target'] == 0]
df_class_0.shape, df_class_1.shape

df_class_0_undersam = df_class_0.sample(df_class_1.shape[0])
df_class_0_undersam.shape, df_class_1.shape

df = pd.concat([df_class_0_undersam, df_class_1])
df

# Stratified train/test split on the balanced data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['target'], stratify=df['target'])
X_train.head(4)

!pip install tensorflow_text
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text

# Pre-trained BERT preprocessor and encoder from TensorFlow Hub.
bert_preprocess = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

def get_sentence_embedding(sentences):
    preprocessed_text = bert_preprocess(sentences)
    return bert_encoder(preprocessed_text)['pooled_output']

# BERT layer
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
output = get_sentence_embedding(text_input)

# Neural network layers
hidden1 = tf.keras.layers.Dropout(0.1, name='dropout')(output)
hidden2 = tf.keras.layers.Dense(1, activation='sigmoid', name='hidden')(hidden1)
l = tf.keras.layers.Dense(1, activation='sigmoid', name='output')(hidden2)

# Final model
model = tf.keras.Model(inputs=[text_input], outputs=[l])
model.summary()

METRICS = [
    tf.keras.metrics.BinaryAccuracy(name='accuracy'),
    tf.keras.metrics.Precision(name='precision'),
    tf.keras.metrics.Recall(name='recall'),
]

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=METRICS)

# Train and evaluate.
model.fit(X_train, y_train, epochs=20)
model.evaluate(X_test, y_test)

import numpy as np
y_predicted = model.predict(X_test)
y_predicted = y_predicted.flatten()
y_predicted = np.where(y_predicted > 0.5, 1, 0)

# Confusion matrix and per-class precision/recall/F1.
from sklearn.metrics import confusion_matrix, classification_report
matrix = confusion_matrix(y_test, y_predicted)
matrix

from matplotlib import pyplot as plt
import seaborn as sn
sn.heatmap(matrix, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Truth')
print(classification_report(y_test, y_predicted))
7. SOFTWARE TESTING
7.1 Unit Testing:
Unit testing is carried out for testing modules constructed from the system design. Each part is compiled using inputs for the specific modules. The modules are then assembled into a larger unit during the unit testing process. Testing has been performed on each phase of project design and coding. Testing of the module interface is carried out to ensure the proper flow of information into and out of the program unit. The local data structures are examined to ensure that temporarily generated output data maintains its integrity throughout the algorithm's execution. Finally, all error-handling paths are also tested.
7.2 Integration Testing:
We usually perform system testing to find errors resulting from unanticipated interaction between the sub-systems and system components. Once the source code is generated, the software must be tested to detect and rectify all possible errors before delivering it to the customers. For finding errors, a series of test cases must be developed which ultimately uncover all possibly existing errors. Different software testing techniques can be used for this process. These techniques provide systematic guidance for designing tests that exercise the internal logic of the software components and exercise the input and output domains of a program to uncover errors in program function, behaviour and performance. We test the software using two methods. White box testing: internal program logic is exercised using these test case design techniques. Black box testing: software requirements are exercised using these test case design techniques. Both techniques help in finding the maximum number of errors with minimal effort and time.
7.3 Acceptance Testing:
The testing process is part of a broader subject referring to verification and validation. We must acknowledge the system specifications and try to meet the customer's requirements, and for this sole purpose we must verify and validate the product to make sure everything is in place. Verification and validation are two different things: one is performed to ensure that the software correctly implements a specific functionality, and the other is done to ensure that the customer requirements are properly met by the product. Verification of the project was carried out to ensure that the project met all the requirements and specifications of our project. We made sure that our project is up to the standard we planned at the beginning of our project development.
7.4 Testing on our System:
Every model involves a coding part and a testing part, since both are needed to complete a system successfully. Our code contains the Python code for classification of the text. We are using the BERT model, which is one of the techniques for NLP. We have already discussed the methodology used in our project; these modules undergo unit testing in our system.

The model preprocesses the text, trains on it and classifies it into binary values. BERT (Bidirectional Encoder Representations from Transformers) is a deep learning algorithm which takes input and assigns importance (biases and weights) to better learn the flow of text. The pre-trained BERT preprocessor from TensorFlow Hub is used to learn the features and characteristics of the sentences to the utmost. The confusion matrix provides information on the amount of data correctly predicted and the amount that is not correctly predicted.
8. RESULTS
The interface contains an input option and a submit button. Anyone can enter input sentences and click the submit button.
Input sentences and their predicted outputs:

Tribhuvan International Airport bans landing of big aircraft.  ->  Fact
They've gone ambulance chasing in Australia  ->  Non-Factual
You must be annhilated!  ->  Non-Factual
Arsonist sets cars ablaze at dealership  ->  Fact
A other Hard hit  ->  Non-Factual
Stones were pelted on Muslims' houses and some houses and vehicles were set ablaze.  ->  Fact
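Assuming the trained Keras model from Section 6.2 is in scope (it accepts raw strings as input), new tweets could be classified along these lines (a sketch for illustration, not output actually produced for this report):

import numpy as np
import tensorflow as tf

new_tweets = tf.constant([
    "Tribhuvan International Airport bans landing of big aircraft.",
    "You must be annhilated!",
])
probabilities = model.predict(new_tweets)                 # sigmoid outputs in [0, 1]
labels = np.where(probabilities.flatten() > 0.5, "Fact", "Non-Factual")
for tweet, label in zip(new_tweets.numpy(), labels):
    print(tweet.decode("utf-8"), "->", label)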
9. CONCLUSION AND FUTURE ENHANCEMENTS
9.1 Conclusion
This work makes available to the community a novel test collection for evaluating microblog retrieval strategies for practical information needs in a disaster situation. Information retrieval is thus important to avoid false claims and rumours. We believe that the contributed test collection will help the research community to develop better models for microblog retrieval in the future.
9.2 Future enhancements
NLP plays a very important role in the new machine-human interface. When we look at some of the products based on these technologies, we can see that they are very advanced and very useful as well. Researchers from Carnegie Mellon University and Google have developed a new model, XLNet, for natural language processing (NLP) tasks such as reading comprehension, text classification, sentiment analysis and others; this model outperforms BERT. Otherwise, the performance of BERT can also be increased by using the BERT tokenizer. The challenge can be extended to retrieving relevant microblogs dynamically from a live stream of microblogs. We plan to explore this direction in the coming years.
10. BIBLIOGRAPHY
[1] Imran, M., Castillo, C., Diaz, F., Vieweg, S.: Processing Social Media Messages in Mass Emergency: A Survey. ACM Computing Surveys 47(4), 67:1-67:38 (Jun 2015)
[2] Feldman, R., Sanger, J.: The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data (2007)
[3] Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA (2008)
[4] Ponte, J.M., Croft, W.B.: A language modeling approach to information retrieval. In: Proc. ACM SIGIR, pp. 275-281 (1998)
[5] Sunni, I., Widyantoro, D.H.: Analisis Sentimen dan Ekstraksi Topik Penentu. Jurnal Sarjana Institut Teknologi Bandung Bidang Teknik Elektro dan Informatika, 1(2) (2012)
[6] https://www.codemotion.com/magazine/dev-hub/machine-learning-dev/bert-how-google-changed-nlp-and-how-to-benefit-from-this/
[7] https://medium.com/analytics-vidhya/evolution-of-nlp-part-4-transformers-bert-xlnet-roberta-bd13b2371125
[8] Basu, M., Ghosh, K., Das, S., Dey, R., Bandyopadhyay, S., Ghosh, S.: Identifying Post-Disaster Resource Needs and Availabilities from Microblogs. In: Proc. ASONAM (2017)
[9] Basu, M., Ghosh, S., Ghosh, K., Choudhury, M.: Overview of the FIRE 2017 track: Information Retrieval from Microblogs during Disasters (IRMiDis). In: Working Notes of FIRE 2017: Forum for Information Retrieval Evaluation (CEUR Workshop Proceedings), Vol. 2036, pp. 28-33 (2017). http://ceur-ws.org/Vol-2036/T2-1.pdf
[10] Basu, M., Ghosh, K., Das, S., Bandyopadhyay, S., Ghosh, S.: Microblog retrieval during disasters: Comparative evaluation of IR methodologies. In: Text Processing - FIRE 2016 International Workshop, Kolkata, India, December 7-10, 2016, Revised Selected Papers, pp. 20-38 (2016). https://doi.org/10.1007/978-3-319-73606-8_2
[11] https://medium.com/analytics-vidhya/understanding-bert-architecture-3f35a264b187
[12] AIDR - Artificial Intelligence for Disaster Response. https://irevolutions.org/2013/10/01/aidr-artificial-intelligence-for-disaster-response/
[13] Lin, J., Efron, M., Wang, Y., Sherman, G., Voorhees, E.: Overview of the TREC-2015 Microblog Track (2015)
[14] Strohman, T., Metzler, D., Turtle, H., Croft, W.B.: Indri: A language model-based search engine for complex queries. In: Proc. ICIA (2004). Available at: http://www.lemurproject.org/indri/
[15] Cleverdon, C.: The Cranfield tests on index language devices. In: Sparck Jones, K., Willett, P. (eds.) Readings in Information Retrieval, pp. 47-59. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1997)