2019 2nd International Conference of Computer and Informatics Engineering (IC2IE)

Naive Bayes Classifier on Twitter Sentiment Analysis BPJS of HEALTH

Sepyan Purnama Kristanto, Junaedi Adi Prasetyo
Department of Informatics Engineering, Politeknik Negeri Banyuwangi, Banyuwangi, Indonesia
sepyan@poliwangi.ac.id, junaedi.prasetyo@poliwangi.ac.id

Edwin Pramana
Department of Information Technology, Sekolah Tinggi Teknik Surabaya, Surabaya, Indonesia
epramana@stts.edu

978-1-7281-2384-4/19/$31.00 ©2019 IEEE

Abstract: Public health insurance is one indicator of how successfully a government manages and facilitates the needs of its citizens. Good health services and facilities have a positive impact on the development of society. BPJS, as the government health insurance provider for the people of Indonesia, is expected to bring change and to address the imbalance in health services for lower- and middle-income people. Sentiment analysis of BPJS products is one way to obtain information from the community as the primary user of these health products. The analysis uses social media as the primary source of data. In this study, the initial stage was data collection, followed by POS tagging of the community tweet data. The data were then classified using the Naïve Bayes model. The results show that the response to BPJS health services is 70% negative for payment topics and 72% positive for information topics, and that the likelihood of users choosing BPJS as their health service scores 65%.

Keywords: Twitter sentiment analysis, Naïve Bayes, BPJS of health

I. INTRODUCTION

Health is one of the main points of public service and a measure of government success in delivering its policies [1]. Equitable distribution of healthcare has a positive impact on development and on the community economy at all levels. BPJS is one form of the government's commitment to facilitate its people; it takes the form of a state-owned enterprise, with the state acting as the parent of each service.

Both society and government hope that BPJS can close the health gap, since service and drug costs often soar beyond the reach of parts of the community. Social security in health, or social health insurance, is regulated in Article 19 paragraph (2) of UU SJSN, which states that health insurance is held to guarantee that participants obtain health care and protection benefits in meeting primary health needs [2]. Badan Penyelenggara Jaminan Sosial (BPJS) was established under Law number 24 of 2011 on BPJS. One of its missions is to improve the quality of service fairly for participants, health service providers, and other stakeholders through an effective and efficient working system [2]. The BPJS we know today is a state-owned enterprise that resulted from the structural and cultural transformation of social security, Askes, and ASABRI, which already had large customer bases.

BPJS has two primary services, BPJS Manpower and BPJS Healthcare, each with a different protection focus. The system and service model of both have been well integrated, ranging from medical centers and private doctors to the many hospitals that have joined the BPJS service program. According to the 2018 health statistics from the Ministry of Health, use of the JKN-KIS program increases annually. The increase was significant: 10% at the end of 2017 and 15% at the end of 2018 [1]. The growth in BPJS users as a public health service is in line with the government plan to reduce poverty through well-integrated health programs.
This program should make it much easier to address health problems, since Indonesia is one of the countries with a low level of health participation according to the ASEAN Index of Health 2017, where it ranks third in ASEAN [3]. Although the number of users has grown year by year, previous research found that the service level and access to BPJS features drew many adverse responses and needed improvement. Health workers and health facilities appear to treat BPJS participants differently from general patients, which has led to many negative comments from BPJS patients. The growing number of negative comments shapes public opinion about the service gap and about how complex it is to access some of the features BPJS provides [4].

Sentiment analysis is a form of computational research on collections of users' comments, sentiments, and emotions. Its purpose is to determine the polarity of a document, that is, whether the document is negative, positive, or neutral toward the subject discussed [4]. This research aims to analyze public sentiment on the services and benefits of BPJS. The documents are obtained from the social media platform Twitter and analyzed in RapidMiner using the Naïve Bayes classification method.

II. RELATED RESEARCH

A previous study titled Twitter Sentiment Analysis of Movie Reviews Using Information Gain and Naïve Bayes Classifier aimed to rate films by analyzing user ratings and comments on the website. In that study, Naïve Bayes was used as the classification model, assisted by the Information Gain method, and produced an accuracy score of 82.19% on 317 test data.
Across the positive and negative classes used, much of the data turned out to be neutral, so the resulting responses were less than optimal [5].

10-11 September, Indonesia-Banyuwangi, East Java

Another study is titled Public Services Satisfaction based on Sentiment Analysis. In that research the score was markedly better, with a lower share of neutral responses in the crawled data, as shown in Table I: 162 negatives, 12 positives, and 31 neutrals. The result indicates that the data could be processed well; it differs from the previous research, which yielded less useful data, because of differences in the preprocessing phase [6].

TABLE I. NAÏVE BAYES RESULT

Actual Class   Positive   Negative   Neutral   Total
Positive       12         17         0         29
Negative       30         162        8         198
Neutral        8          5          31        44

Another study, Twitter Sentiment Analysis of Online Transportation Service Providers, focuses on online transportation in Indonesia. It found that Go-Jek received many more positive responses from users than Grab. The authors combined the SVM model with Naïve Bayes to obtain the best results [7].

A. Sentiment analysis

Sentiment analysis, or opinion mining, refers to the broad field of natural language processing, computational linguistics, and text mining that aims to analyze the opinions, sentiments, evaluations, attitudes, judgments, and emotions of a speaker or writer with respect to a topic, product, service, organization, individual, or specific activity [6]. The primary task in sentiment analysis is to classify the text in a sentence or document and determine whether the opinion it expresses is positive, negative, or neutral [8].
Expressions depend on the topic they refer to: a statement about one topic may carry a different meaning from the same statement about a different subject. Therefore, in some studies, especially on product reviews, the analysis is preceded by determining which elements of a product are being discussed before the opinion mining process begins [9].

B. Text Mining

Text mining is the process of analyzing text data, where the primary data are obtained from documents [10]. It is used to classify textual documents according to their topic. With text mining, the subject of an article can be identified from the words it contains: words that represent the content of the article are analyzed and matched against a predefined keyword database. Text mining therefore helps to group documents in a short time. The stages of text mining are to collect the data and then to extract the features to be used.

Text mining can be broadly defined as a knowledge-intensive process in which users interact with a document collection over time using a set of analysis tools. It seeks to extract useful information from data sources through the identification and exploration of interesting patterns. Text mining draws heavily on data mining research, so it is not surprising that text mining and data mining share a similar high-level architecture [11]. Text mining can be considered a two-step process that begins by imposing structure on the text data sources and continues with the extraction of relevant information and knowledge from the text data using the same techniques and tools as data mining [12].

In this research, the concept of text mining is applied to text document classification, in which documents are classified according to topic. The category of an article or paper can be identified through the words or phrases it contains.
Words that represent the content of the article are analyzed and matched against predetermined keyword data, so text mining helps to group documents in a short time.

C. Naïve Bayes

In the Naïve Bayes Classifier method, a text document is represented as a collection of words, and each word in the document is assumed to be independent of the others. The method is simple yet achieves a good level of accuracy. According to a study titled Sentiment Analysis of Indonesian-language Tweets with Deep Belief Network, the Naïve Bayes Classifier is a machine learning method based on probability calculations. Its advantage is that it requires only a small amount of training data to estimate the parameters needed for classification [13]. That study obtained an accuracy of up to 88.5% on 3000 tweets collected from social media.

At classification time, Naïve Bayes looks for the highest probability value using the following formula:

\[ K_a = \arg\max_{K} P(X_1, X_2, X_3, \ldots, X_n \mid K) \cdot P(K) \quad (1) \]

Description:
K_a = the category selected from all categories under test
X_1, X_2, X_3, ..., X_n = each word in a tweet
P(K) = probability of the category

For this classification, we use the formula:

\[ P(X_1 \mid K) = \frac{P(K)\,(X_1, X_2, X_3, \ldots, X_n)}{K} \quad (2) \]

Description:
P(X_1 | K) = probability evaluated for all categories
P(K) = probability of the category
X_1, X_2, X_3, ..., X_n = each word in a tweet
K = the total value of each category

D. Term Weighting TF-IDF

Term frequency (TF) is a simple weighting measure. In this method, every term is assumed to have an importance proportional to the number of times it occurs in the text (document).
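The term-frequency weighting just described can be illustrated with a minimal sketch. This is not the authors' implementation (the paper reports using RapidMiner); the function name `term_frequency`, the whitespace tokenization, and the sample text are assumptions made only for illustration.

```python
from collections import Counter


def term_frequency(text):
    """Return raw occurrence counts for each whitespace-separated term."""
    # Assumption: lowercase the text and split on whitespace as a stand-in tokenizer.
    tokens = text.lower().split()
    return dict(Counter(tokens))


# Hypothetical example document, not taken from the paper's data set.
tf = term_frequency("iuran bpjs naik iuran mahal")
print(tf)
```

Here the term "iuran" occurs twice and therefore receives twice the weight of the other terms, which is exactly the proportionality the TF measure assumes.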
Term frequency can improve recall in information retrieval, but it does not always improve precision [14]. Inverse document frequency, commonly abbreviated IDF, is a weighting method that pays more attention to how a term occurs across the whole set of texts. In IDF, a term that appears rarely in the entire collection is judged to be more valuable. The importance of each term is assumed to be inversely proportional to the number of texts containing the term [9].

E. Natural Language Toolkit

The Natural Language Toolkit (NLTK) is a tool developed specifically for the Python programming language and used for tasks related to natural language processing. It provides easy-to-use interfaces to more than 50 corpora and lexical resources such as WordNet, along with text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning [15].

III. METHODOLOGY

A. Data Collection

In this research, the data are a collection of tweets from Indonesian society and BPJS Health users, gathered from the official BPJS account shown in Fig. 1. The downloaded tweet data set, in plain text form, then goes through the stemming and stopword processes to produce data that are clean of junk text. The subsequent steps are shown in Fig. 2.

Fig. 2. The process of data collection and preprocessing

B. Text Preprocessing

Fig. 2 shows that the research process starts with the collection of data from Twitter, which contains large sets of unstructured text that can be turned into data objects. The collected data are then preprocessed in several stages:

• Abbreviations and Acronyms.
• Lowercase Selection: Lowercase Selection is a text mining technique in which all downloaded data are converted to lowercase [13]. The test data obtained from Twitter are unstructured documents with mixed letter case, so at this stage all text is converted to lowercase.
• Remove URL: URL removal is a text mining technique that filters out links or URLs to other websites [13]. At this stage, the URLs found in the collected text documents are removed so that the text becomes more structured.
• Hashtag Remove: Hashtag Remove is a cleaning technique in the text preprocessing stage that targets text containing a hashtag (#) so that the text data set becomes more structured. At this stage, the collected text data are filtered again to find words in which hashtags appear and repeat.

Fig. 1. Official account BPJS

The collected tweets span 3 months, from February to April 2019, a period in which the increase in BPJS Health dues was a hot issue. The documents were extracted with RapidMiner tools for the predefined timeframe. The crawling of Twitter data focused on searches around the words dues, services, information, medicines, and health facilities. Sample response tweets from the community for each keyword are shown in Table II.

TABLE II. DATA GROUP TWEET

No   Keyword                                    Result                                                                                                        Class
1    Iuran (dues)                               "Pembayaran Iuran Terlalu Mahal" (The dues payment is too expensive)                                          Negative
2    Pelayanan (service)                        "Pelayanan Tiap Faskes dibedakan" (Service differs at each health facility)                                   Negative
3    Obat (medicine)                            "Pengobatan yang digunakan lebih manfaat" (The treatment used is more beneficial)                             Positive
4    Informasi (information)                    "Informasi dari CS sangat membantu" (Information from customer service is very helpful)                       Positive
5    Fasilitas Kesehatan (health facilities)    "Banyak Fasilitas Kesehatan kelas 1 kurang lengkap" (Many class 1 health facilities are not fully equipped)   Negative

Preprocessing produces data sets that are ready for classification with the Naïve Bayes classifier method.
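The lowercase, URL removal, and hashtag removal stages listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the RapidMiner process used in the study: the regular expressions, the function name `preprocess_tweet`, the choice to drop the whole hashtag token, and the sample tweet with its URL are all hypothetical.

```python
import re

# Assumption: a URL is anything starting with http(s):// or www. up to the next space.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Assumption: a hashtag is '#' followed by word characters; the whole token is dropped.
HASHTAG_PATTERN = re.compile(r"#\w+")


def preprocess_tweet(text):
    """Apply lowercase conversion, URL removal, and hashtag removal to one tweet."""
    text = text.lower()                    # Lowercase Selection
    text = URL_PATTERN.sub(" ", text)      # Remove URL
    text = HASHTAG_PATTERN.sub(" ", text)  # Hashtag Remove
    return " ".join(text.split())          # collapse leftover whitespace


# Hypothetical tweet built from a Table II sample plus an invented hashtag and URL.
cleaned = preprocess_tweet("Pelayanan Tiap Faskes dibedakan #BPJS https://example.com/info")
print(cleaned)
```

Stemming and stopword removal, which the paper also mentions, are left out here because they depend on an Indonesian-language resource that the paper does not name.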
In the classification process, shown in Fig. 3, the existing data sets are first labeled. The training documents were labeled manually, using several categories predetermined for the domain. The labels guide the classification of documents, so that each document is grouped under the appropriate label. Three labels (positive, negative, and neutral) are used to sort the tweet data, which contain a set of responses from users.

Fig. 3. Classification process

C. Satisfaction Testing

Satisfaction is tested using the probabilistic dissatisfaction model, where the data are taken from the number of existing tweets and then calculated using the following formulas:

\[ x = \frac{r \cdot j}{A_r} \quad (3) \]

\[ y = \frac{r \cdot j}{A_r} \quad (4) \]

Information:
r = Number of Positive Tweets
j = Number of Negative Tweets
A_r = Total Tweets

In the worked calculation of Section IV.D, x is evaluated from the positive count and y from the negative count, each multiplied by 2 and divided by the total number of tweets.

IV. RESULT AND DISCUSSIONS

A. Research Data Collection Results

A total of 3728 documents were downloaded from Twitter over a data retrieval span of three months. The collection focused on the Indonesian area because the primary users of BPJS are Indonesian people in general. The period was chosen because of the many issues around the dues increase and the commitments of presidential candidates to developing health services in the community. Details of the data collection period can be seen in Table III.

TABLE III. DATA GROUP TWEET

Keyword                      Time                      Total
Pelayanan Kesehatan          February – April 2019     724
Iuran pembayaran BPJS        February – April 2019     822
Fasilitas Kesehatan BPJS     February – April 2019     836
Informasi Pelayanan BPJS     February – April 2019     804
Obat-Obatan BPJS             February – April 2019     542

B. Naïve Bayes Classifier Testing

The data obtained from the mining process on Twitter amount to 3728 tweets, which were then labeled with the Naïve Bayes Classifier. A test was then performed on the labeled data to measure the accuracy of the Naïve Bayes method on BPJS Health:

\[ \text{Accuracy} = \frac{\text{Total Correct Predictions}}{\text{Total Data}} \times 100\% \quad (5) \]

Testing 100 random data items whose polarity had been classified manually, with 1400 training data, gave an accuracy of 90%.

C. Distributed Tweet Result

Fig. 4 is a graph of how community opinion about BPJS Health is spread on social media. The 3728 tweets that were successfully crawled were labeled by class type, and neutral responses were eliminated because they are considered not to support the sentiment assessment of a product. Across all labeled documents, the negative response looks more prominent than the positive response from users on the official BPJS Health account. Positive opinion accounts for 52% of users, with most of the discussion related to information and drugs, while the negative response is dominated by the topics of dues or payment and the services BPJS provides to patients. Details of the tweet distribution data are given in the response mapping in Fig. 5.

Fig. 4. User tweet distributions

Fig. 5. Tweet graph of BPJS

Payment drew 2030 negative tweets and 1324 positive tweets from the community. A further negative response concerns the topic of service provided by BPJS, with 1950 negative tweets and 1678 positive tweets. However, all four topics are actively discussed. The topics of drugs and information received many positive responses from the community, with a total of 2167 tweets for drugs and 2300 tweets about ease of information. Overall, the community discusses payments and services more heatedly than drugs and the ease of information provided by BPJS.
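The Naïve Bayes testing in Section IV.B, together with the accuracy measure of equation (5), can be sketched as follows. This is a minimal multinomial Naïve Bayes written from scratch, not the RapidMiner operator used in the study. Add-one (Laplace) smoothing is an assumption, since the paper does not state how unseen words are handled, and the function names and the tiny training and test sets (loosely based on the Table II samples) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """docs: list of (tokens, label). Returns class counts, per-class word counts, vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(tokens, class_counts, word_counts, vocab):
    """Pick the category with the highest log P(K) + sum of log P(X_i | K)."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # prior P(K)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            # Assumption: add-one smoothing so unseen words do not zero the product.
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(predictions, labels):
    """Equation (5): correct predictions divided by total data, as a percentage."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100 * correct / len(labels)


# Hypothetical, already preprocessed training and test tweets.
train = [
    ("iuran terlalu mahal".split(), "negative"),
    ("pelayanan dibedakan".split(), "negative"),
    ("informasi sangat membantu".split(), "positive"),
    ("pengobatan lebih manfaat".split(), "positive"),
]
test = [("iuran mahal".split(), "negative"), ("informasi membantu".split(), "positive")]

model = train_naive_bayes(train)
preds = [predict(tokens, *model) for tokens, _ in test]
gold = [label for _, label in test]
print(preds, accuracy(preds, gold))
```

Working in log space avoids numeric underflow when many word probabilities are multiplied. With the figures reported in the paper, 90 correct predictions out of 100 test items would give `accuracy` a value of 90.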
D. Data Satisfaction Test

At this stage, a test is carried out to measure the level of user satisfaction based on users' responses and the likelihood of a user using BPJS services. User satisfaction is measured with the Derived Dissatisfaction model using the 3728 tweets that were collected, preprocessed, and labeled. The data are accumulated by labeling category and response in Table IV.

TABLE IV. RESPONSE USER

Keyword                      Total   Positive   Negative
Pelayanan Kesehatan          724     311        413
Iuran pembayaran BPJS        822     255        567
Fasilitas Kesehatan BPJS     836     455        381
Informasi Pelayanan BPJS     804     577        227
Obat-Obatan BPJS             542     344        198

Satisfaction testing with the Derived Dissatisfaction model focuses on the positive and negative responses of users. The column totals of Table IV are 1942 positive and 1786 negative tweets. The entire data set is then calculated with the relative frequency probability model, with the following result:

\[ X = \frac{1942 \times 2}{3728} = 1.041845494 \]

\[ Y = \frac{1786 \times 2}{3728} = 0.958154506 \]

The final score is derived from the ratio (1.041845494 / 0.958154506) × 100% and is reported as P = 65%.

The calculation of the likelihood of users using BPJS therefore yields a value of 65% from 3728 user tweets.

V. CONCLUSIONS

Based on the test results, the classification performed by the Naïve Bayes method achieves a good level of accuracy. The analysis of community response tweets on several critical BPJS Health topics shows that 70% of the community responds negatively about the payments they have to make. Further, 60% respond negatively on the topic of service provided by BPJS Health, while the majority respond positively on medications (55%) and on the information provided about services (72%). From the aggregate data, we can conclude that the majority of society responds negatively to the actual service of BPJS Healthcare. We also obtain a 65% likelihood that users will use BPJS services intensively for their healthcare.

The tweet data in this research were taken in only one period, with a capture span of 3 months. In subsequent studies, the range and frequency can be extended, and the region of the users can be taken into account, since each region may have a different experience of and response to BPJS services.

REFERENCES

[1] P. Rakyat, “Kesehatan Rakyat,” [People Health] 2017. [Online]. Available: https://www.pikiran-rakyat.com/bandungraya/2017/05/18/kesehatan-kunci-penting-kesuksesan-401483.
[2] R. Y. Yanis, “Sentiment Analysis of Bpjs Kesehatan Services To Smk Eklesia and Bina Insani Jailolo Teachers,” J. Terap. Teknol. Inf., vol. 2, no. 2, pp. 25–34, 2018.
[3] ASEAN, “ASEAN ANNUAL,” ASEAN ANNUAL REPORT, 2017. [Online]. Available: https://asean.org/wpcontent/uploads/2018/08/ASEAN-Annual-Report-2017-2018.pdf.
[4] KOMPASIANA, “Asean,” positif dan negatif. [Online]. Available: https://www.kompasiana.com/zamaldikjemen/59db26d8bde57521d11c7742/positif-dan-negatif-bpjs?page=all.
[5] S. Widya Sihwi, I. Prasetya Jati, and R. Anggrainingsih, “Twitter Sentiment Analysis of Movie Reviews Using Information Gain and Naïve Bayes Classifier,” Proc. - 2018 Int. Semin. Appl. Technol. Inf. Commun. Creat. Technol. Hum. Life, iSemantic 2018, pp. 190–195, 2018.
[6] E. Susilawati, “Public services satisfaction based on sentiment analysis: Case study: Electrical services in Indonesia,” 2016 Int. Conf. Inf. Technol. Syst. Innov. ICITSI 2016 - Proc., 2017.
[7] S. Anastasia and I. Budi, “Twitter sentiment analysis of online transportation service providers,” 2016 Int. Conf. Adv. Comput. Sci. Inf. Syst. ICACSIS 2016, pp. 359–365, 2017.
[8] B. Liu, “LIBRO_NLP-handbook-sentiment-analysis,” pp. 1–38, 2010.
[9] W. A. Luqyana, I. Cholissodin, and R. S. Perdana, “Analisis Sentimen Cyberbullying Pada Komentar Instagram dengan Metode Klasifikasi Support Vector Machine,” [Analysis of Cyberbullying Sentiments on Instagram Comments by the Support Vector Machine Classification] J. Pengemb. Teknol. Inf. dan Ilmu Komput. Univ. Brawijaya, vol. 2, no. 11, pp. 4704–4713, 2018.
[10] D. S. Indraloka and B. Santosa, “Penerapan Text Mining untuk Melakukan Clustering Data Tweet Shopee Indonesia,” [Application of Text Mining to Cluster the Tweet Shopee Indonesia Data] J. Sains dan Seni ITS, vol. 6, no. 2, pp. 6–11, 2017.
[11] S. M. Weiss, N. Indurkhya, T. Zhang, and F. J. Damerau, Predictive Methods for Analyzing Unstructured Information, vol. 24, no. 1. 2018.
[12] J. Votano, M. Parham, and L. Hall, The Text Mining Hand. 2004.
[13] A. Pandhu and H. Agus, “Naive Bayes Classification pada Klasifikasi Dokumen Untuk Identifikasi Konten E-Government,” [Naive Bayes Classification in Document Classification for Identification of E-Government Content] J. Appl. Intell. Syst., vol. 1, no. 1, pp. 48–55, 2016.
[14] J. Ling, I. P. E. N. Kencana, and T. B. Oka, “Analisis Sentimen Menggunakan Metode Naïve Bayes Classifier Dengan Seleksi Fitur Chi Square,” [Sentiment Analysis Using the Naïve Bayes Classifier Method with Chi Square Feature Selection] EJurnal Mat., vol. 3, no. 3, p. 92, 2014.
[15] Y. Garg and N. Chatterjee, “Sentiment analysis of twitter feeds,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 8883, pp. 33–52, 2014.