Naive Bayes Twitter Sentiment Analysis of BPJS Health

Naive Bayes Classifier on Twitter Sentiment
Analysis BPJS of HEALTH
Sepyan Purnama Kristanto, Junaedi Adi Prasetyo
Department of Informatics Engineering
Politeknik Negeri Banyuwangi
Banyuwangi, Indonesia
sepyan@poliwangi.ac.id, junaedi.prasetyo@poliwangi.ac.id
Abstract— Public health insurance is one indicator of how
well a government manages and serves its citizens. Good
health services and facilities undoubtedly have a positive
impact on the development of society, especially at this time.
BPJS, as the government health program for the people of
Indonesia, must bring change and offer a solution to the
imbalance in health services for lower- and middle-income
people. Sentiment analysis of BPJS products is one way to
obtain information from the community as the primary user of
these health products. The sentiment analysis uses social
media as the primary source of data. In this study, the initial
stage was data collection, followed by POS tagging of the
community tweet data. The data were then classified using
the Naïve Bayes model to obtain optimal results. The results
show that BPJS health services are rated 70% negative on
payment topics and 72% positive on information topics, with a
65% likelihood score for users using BPJS services as their health
Keywords—Twitter sentiment analysis, naïve Bayes, bpjs of
Health is one of the main areas of public service and a
measure of government success in delivering its policies [1].
Equitable distribution of healthcare has a positive impact on
development and on the economy of communities at all
levels. BPJS is one form of the government's commitment to
serve its people, organized as a state-owned enterprise under
the protection of the state as the parent of each service.
Society and government both hope that BPJS can close the
health gaps that many of us know from online news sites,
where service and drug costs often soar beyond the reach of
some parts of the community.
Social security in health, or social health insurance, is
further regulated in Article 19 paragraph (2) of the UU SJSN,
which states that health insurance is held to guarantee that
participants obtain health care benefits and protection in
meeting primary health needs [2]. Badan Penyelenggara
Jaminan Sosial (BPJS) was established under Law Number 24
of 2011 on BPJS; one of its missions is to improve the quality
of service so that it is fair to participants, health service
providers, and other stakeholders through an effective and
efficient working system [2]. The BPJS we know today is a
state-owned enterprise that resulted from the structural and
cultural transformation of social security, Askes, and
ASABRI, which already had large customer bases.
BPJS has two primary services, BPJS Manpower and BPJS
Healthcare, each with a different protection focus. Across
both services, the entire system and service model have been
well integrated, covering medical centers, private doctors,
and many hospitals that have joined the BPJS service.
Edwin Pramana
Department of Information Technology
Sekolah Tinggi Teknik Surabaya
Surabaya, Indonesia
According to the 2018 health statistics compiled by the
Ministry of Health, use of the JKN-KIS program increases
every year. The increase was significant: 10% at the end of
2017 and 15% at the end of 2018 [1]. The growth in BPJS
users as a public health service is in line with the
government's plan to reduce poverty through well-integrated
health programs. This program makes it much easier to
address health problems, since Indonesia is one of the
countries with a low level of health participation, ranked
third in ASEAN according to the 2017 ASEAN Index of
Health [3]. Although the number of users has increased year
by year, previous research found that the service level and
access to BPJS features received much negative feedback and
needed to be improved. Health workers and health facilities
appear to treat BPJS participants differently from general
patients, so many negative comments emerge from BPJS
The growing number of negative comments shapes public
opinion about the service gap and the difficulty of accessing
some features provided by BPJS [4]. Sentiment analysis is a
form of computational research on collections of users'
comments, sentiments, and emotions. The purpose of this
computational analysis is to determine the polarity of a
document: whether it is negative, positive, or neutral toward
the subject discussed [4].
This research aims to analyze public sentiment on the
services and benefits of BPJS. The documents are obtained
from the social media platform Twitter and analyzed in
RapidMiner with the Naïve Bayes classification method.
A previous study titled Twitter Sentiment Analysis of
Movie Reviews Using Information Gain and Naïve Bayes
Classifier aimed to rate films by analyzing user ratings and
comments on the website. In that study, Naïve Bayes was
used as the classification model, assisted by the Information
Gain method, and reached an accuracy of 82.19% on 317
test data. Although only the positive and negative classes
were used, much of the data was neutral, so the results were
less than optimal [5].
Another study, titled Public Services Satisfaction based
on Sentiment Analysis, achieved a significant score with a
lower share of neutral responses in the overall crawled data
collected in Table I: 162 negative, 12 positive, and 31
neutral. The resulting score shows that the data can be
processed well, in contrast to the previous research, which
produced less useful data because of differences in the
preprocessing phase [6].
Another study, Twitter Sentiment Analysis of Online
Transportation Service Providers, focuses on online
transportation in Indonesia. It found that Go-Jek received
many positive responses from users, unlike Grab. In that
study, the authors combined the SVM model with Naïve
Bayes to get maximum results [7].
A. Sentiment analysis
Sentiment analysis, or opinion mining, refers to the broad
field of natural language processing, computational
linguistics, and text mining that aims to analyze the
opinions, sentiments, evaluations, attitudes, judgments, and
emotions of a speaker or writer with respect to a topic,
product, service, organization, individual, or specific
activity [6]. The primary task in sentiment analysis is to
classify the text in a sentence or document and determine
whether the opinion it expresses is positive, negative, or
neutral [8].
Expressions refer to the focus of a particular topic: a
statement on one topic may differ in meaning from the same
statement on a different subject. Therefore, in some studies,
especially on product reviews, the analysis is preceded by
determining which elements of a product are being discussed
before the opinion mining process begins [9].
B. Text Mining
Text mining is the process of analyzing text data, where
the primary data are obtained from documents [10]. Text
mining is used in the classification of textual documents,
where each document is classified according to its topic.
With text mining, the subject of an article can be identified
from the words it contains. Words that represent the contents
of the article are analyzed and matched against a predefined
keyword database. Text mining therefore helps to group
documents in a short time.
The stages of text mining are to collect the data and then
extract the features to be used. Text mining can be broadly
defined as a knowledge-intensive process in which users
interact with a document collection over time using a suite
of analysis tools. Text mining seeks to extract useful
information from data sources through the identification and
exploration of interesting patterns. Text mining tends to lead
toward the field of data mining research. Therefore, it is not
surprising that text mining and data mining are at the same
architectural level [11].
Text mining can be considered a two-step process that
begins with imposing structure on the text data sources and
continues with extracting relevant information and
knowledge from this text data using the same techniques and
tools as data mining [12]. Text mining is used in text
document classification, in which documents are classified
according to topic. With text mining, the category of an
article or paper can be identified from the words or phrases
it contains. Words that represent the contents of the article
are analyzed and matched against predetermined keyword
data. Text mining therefore helps to group documents in a
short time.
C. Naïve Bayes
In the Naïve Bayes Classifier method, a text document is
represented as a collection of words, where each word in the
document is assumed to be independent of the others. The
advantage of this method is that it is simple yet achieves a
good level of accuracy.
According to a study titled Sentiment Analysis of
Indonesian-language Tweets with Deep Belief Network, the
Naïve Bayes Classifier is a machine learning method that
uses probability calculations. The advantage of the Naïve
Bayes method is that it requires only a small amount of
training data to estimate the parameters needed for
classification [13]. That study obtained an accuracy of up to
88.5% on 3000 tweets collected from social media. At
classification time, Naïve Bayes looks for the highest
probability value using the following formula:
$$K_a = \arg\max_{K} P(X_1, X_2, X_3, \ldots, X_n \mid K)\, P(K)$$
Description:
$K_a$ = the category chosen for the test tweet from all categories
$X_1, X_2, X_3, \ldots, X_n$ = each word in a tweet
$P(K)$ = probability of category $K$
In this classification, we use the formula:
$$P(K \mid X_1, X_2, X_3, \ldots, X_n) \propto P(K) \prod_{i=1}^{n} P(X_i \mid K)$$
$P(X_i \mid K)$ = probability of word $X_i$ in category $K$
$P(K)$ = probability of category $K$
$X_1, X_2, X_3, \ldots, X_n$ = each word in a tweet
$K$ = each category
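The classification rule above can be sketched in a few lines of Python. This is a minimal illustration, not the RapidMiner pipeline used in the study: it assumes a multinomial word-count model with Laplace (add-one) smoothing and works in log space to avoid underflow, neither of which the paper specifies. The function names and the toy training tweets are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_tweets):
    """Count tweets and words per category from (tokens, category) pairs."""
    category_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, category in labeled_tweets:
        category_counts[category] += 1
        word_counts[category].update(tokens)
        vocabulary.update(tokens)
    return category_counts, word_counts, vocabulary


def classify(tokens, model):
    """Return the category K maximizing log P(K) + sum_i log P(X_i | K)."""
    category_counts, word_counts, vocabulary = model
    total_docs = sum(category_counts.values())
    best_category, best_score = None, -math.inf
    for category, doc_count in category_counts.items():
        score = math.log(doc_count / total_docs)  # log P(K)
        total_words = sum(word_counts[category].values())
        for word in tokens:
            # Laplace smoothing is an assumption of this sketch, so that
            # unseen words do not force the probability to zero.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical toy training data, for illustration only.
training = [
    (["iuran", "mahal"], "negative"),
    (["pelayanan", "lambat"], "negative"),
    (["informasi", "membantu"], "positive"),
    (["obat", "lengkap"], "positive"),
]
model = train_naive_bayes(training)
predicted = classify(["iuran", "mahal", "sekali"], model)
```

With this toy data the tweet is assigned to the negative class, because "iuran" and "mahal" appear only in negative training tweets.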
D. Term Weighting Tf-Idf
Term frequency (TF) is a simple measure in the weighting
method. In this method, every term is assumed to have an
importance proportional to its number of occurrences in the
text (document). Term frequency can improve the recall of
information retrieval, but does not always improve
precision [14].
Inverse document frequency, commonly abbreviated IDF,
is a weighting method that focuses on the occurrence of a
term across the whole text collection. In IDF, a term that
rarely appears in the entire collection is judged to be more
valuable. The importance of each term is assumed to be
inversely proportional to the number of texts containing the
term [9].
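As an illustration of the weighting described above, the sketch below computes TF-IDF weights for a small tokenized collection. The paper does not state which variant it uses, so this assumes one common formulation: raw term count multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in documents:
        term_freq = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


# Hypothetical tokenized tweets.
docs = [
    ["iuran", "bpjs", "mahal"],
    ["pelayanan", "bpjs", "lambat"],
    ["informasi", "bpjs", "membantu"],
]
weights = tf_idf(docs)
```

Here "bpjs" occurs in every document and therefore receives a weight of zero, while a term found in only one document, such as "iuran", receives log(3).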
E. Natural Language Toolkit
The Natural Language Toolkit (NLTK) is a tool developed
specifically for the Python programming language and used
in natural language processing tasks. It provides easy-to-use
interfaces to more than 50 corpora and lexical resources such
as WordNet, along with text processing libraries for
classification, tokenization, stemming, tagging, parsing, and
semantic reasoning [15].
A. Data Collection
In this research, the data used are tweets from Indonesian
citizens and BPJS Health users, gathered from the official
BPJS account shown in Fig. 1.
The downloaded tweets, in plain-text form, then go
through stemming and stopword removal to produce clean
data free of junk text; the subsequent steps are shown in
Fig. 2.
Fig. 2. The process of data collection and preprocessing
B. Text Preprocessing
Fig. 2 shows the research process, which starts with
collecting data from the social media platform Twitter,
known to hold many unstructured text data sets that can be
turned into data objects. The collected data are then
preprocessed in several stages: abbreviations and acronyms,
lowercase selection, URL removal, and hashtag removal.
Lowercase Selection: Lowercase selection is a text
mining technique in which all downloaded data are
converted to lowercase [13]. At this stage, the test
data obtained from Twitter are unstructured
documents with mixed letter case, so all the text is
converted to lowercase without capitalization.
Remove URL: URL removal is a text mining
technique that filters out all text containing links or
URLs to other websites [13]. At this stage, every
URL found in the collected text documents is
identified and removed, so the text becomes more
structured.
Hashtag Remove: Hashtag removal is a cleaning
technique in the text preprocessing stage that
focuses on text containing hashtags (#) so that the
text data set becomes more structured. At this
stage, the collected text data are filtered again to
find words with hashtags that appear repeatedly.
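The lowercase, URL removal, and hashtag removal stages above can be sketched with Python's standard `re` module. This is only an illustration under stated assumptions: the regular expressions are hypothetical, the sketch removes the whole hashtag token (the paper does not say whether only the `#` symbol is dropped), and the abbreviation and acronym stage is omitted because its rules are not described. The sample URL and hashtag are invented.

```python
import re

# Hypothetical patterns; the study's actual filters are not given.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_PATTERN = re.compile(r"#\w+")


def preprocess(tweet):
    """Lowercase a tweet, then strip URLs and hashtags."""
    text = tweet.lower()
    text = URL_PATTERN.sub(" ", text)
    text = HASHTAG_PATTERN.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


cleaned = preprocess(
    "Pelayanan Tiap Faskes dibedakan #BPJS https://example.com/info"
)
```

For the sample input, `cleaned` holds only the lowercased words of the tweet, with the hashtag and link removed.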
Fig. 1. Official account BPJS
The extracted tweets span three months, from February to
April 2019, when the increase in BPJS Health dues was a hot
issue. The documents were extracted with RapidMiner tools
for a predefined timeframe, and the Twitter crawl focused on
the search words dues, services, information, medicines, and
health facilities. The keywords and some sample response
tweets from the community are shown in Table II.
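“Pembayaran Iuran Terlalu Mahal” (The dues payment is too expensive)
“Pelayanan Tiap Faskes dibedakan” (Service differs at each health facility)
“Pengobatan yang digunakan lebih
“Informasi dari CS sangat membantu” (Information from CS is very helpful)
“Banyak Fasilitas Kesehatan kelas 1 kurang lengkap” (Many class 1 health facilities are not fully equipped)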
After preprocessing, the resulting data sets are ready for
classification with the Naïve Bayes classifier. In the
classification process, shown in Fig. 3, the existing data sets
are first labeled. The training documents were labeled
manually by domain into several predetermined categories.
The labels guide the classification of documents into the
appropriate group. Three labels (positive, negative, and
neutral) are used to sort the tweets containing user responses.
Testing 100 random tweets whose polarity had been
classified manually, using 1400 training data, yielded an
accuracy of 90%.
Fig. 3. Classification process
C. Satisfaction Testing
User satisfaction is measured with the Derived
Dissatisfaction model, where the data are taken from the
number of existing tweets and then calculated using the
following formulas:
$$x = \frac{r \cdot j}{A_r}$$
$$y = \frac{r \cdot j}{A_r}$$
$r$ = Number of Positive Tweets
$j$ = Number of Negative Tweets
$A_r$ = Total Tweets
C. Distributed Tweet Result
Fig. 4 is a graph of the distribution of community opinion
about BPJS Health on social media. The 3728 crawled tweets
were labeled by class, and neutral responses were eliminated
because they are considered not to contribute to the
sentiment assessment of a product. Across the labeled
documents, the negative response appears larger than the
positive response from users on the official BPJS Health
account. Positive opinion is dominated by 52% of users,
mostly in discussions about information and drugs, while the
negative response is dominated by the topic of dues or
payment and the services BPJS provides to patients. Details
of the tweet distribution are shown in the response mapping
in Fig. 5.
A. Research Data Collection Results
A total of 3728 documents were downloaded from Twitter
over a three-month retrieval period. Data collection focused
on Indonesia because the primary users of BPJS are the
Indonesian people in general. The period was chosen in
consideration of the many issues surrounding the dues
increase and the commitments of presidential candidates to
develop health services in the community. Details of the
data collection period can be seen in Table III.
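Topics: Pelayanan Kesehatan (health services), Iuran pembayaran BPJS (BPJS dues payment), Fasilitas Kesehatan BPJS (BPJS health facilities), Informasi Pelayanan (service information), Obat-Obatan BPJS (BPJS medicines)
Period: February – April 2019
Fig. 4. User tweet distributions
Fig. 5. Tweet graph of BPJS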
B. Naïve Bayes Classifier Testing
The data obtained from mining the social media platform
Twitter, totaling 3728 tweets, were labeled with the Naïve
Bayes Classifier. The labeled data were then tested to
measure the accuracy of the Naïve Bayes method on BPJS
Health, calculated as follows.
$$\text{Accuracy} = \frac{\text{Total Correct Predictions}}{\text{Total Data}} \times 100\%$$
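As a worked example of the accuracy formula, the sketch below scores a set of predicted labels against manually assigned ones. The 90-out-of-100 outcome mirrors the 90% reported for the 100-tweet test, but the per-class split of the labels is hypothetical and chosen only to make the arithmetic visible.

```python
def accuracy(predicted_labels, actual_labels):
    """Percentage of predictions that match the actual labels."""
    if len(predicted_labels) != len(actual_labels):
        raise ValueError("label lists must have the same length")
    if not actual_labels:
        raise ValueError("at least one label is required")
    correct = sum(p == a for p, a in zip(predicted_labels, actual_labels))
    return 100 * correct / len(actual_labels)


# Hypothetical labels: 100 test tweets, 90 of them predicted correctly.
actual = ["negative"] * 60 + ["positive"] * 40
predicted = (
    ["negative"] * 55 + ["positive"] * 5      # 5 negatives misclassified
    + ["positive"] * 35 + ["negative"] * 5    # 5 positives misclassified
)
score = accuracy(predicted, actual)
```

In this example 90 of the 100 predictions are correct, so `score` is 90.0.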
The payment topic drew 2030 negative tweets and 1324
positive tweets from the community. Further negative
response came from the topic of services provided by BPJS,
with 1950 negative tweets and 1678 positive tweets. Among
the four hot topics, however, drugs and information received
many positive responses from the community, with a total of
2167 tweets for drugs and 2300 tweets about ease of
information. Community opinion shows that payments and
services are discussed more heatedly than drugs and the ease
of information provided by BPJS.
D. Data Satisfaction Test
At this stage, testing measures the level of user
satisfaction based on user responses and the likelihood that a
user will use BPJS services. User satisfaction is measured
with the Derived Dissatisfaction model using the 3728 tweets
that were collected, preprocessed, and labeled; the data are
accumulated by label category and response in Table IV.
