An Efficient Approach for Extraction of Disease-Treatment Relation

Ancy Sudhakar
M-tech Computer Science and Engineering, KMCT College of Engineering Calicut, Calicut University
Calicut City, India
aachasudhakar@gmail.com

Abstract: Machine learning and natural language processing techniques are used in various medical domains to extract relevant information. In this paper we describe an efficient health care system for identifying relations between diseases and treatments. It also identifies the symptoms associated with a particular disease. The Multinomial Naive Bayes algorithm is used for this extraction, together with stop word removal and a stemming algorithm. The proposed system helps users make efficient medical decisions.

Keywords: Health care system, Machine learning, Medline, Natural Language Processing, Stop words Removal.

I. INTRODUCTION

Health is the level of functional or metabolic efficiency of a living organism. People are increasingly concerned about their health. Nowadays there are many tools, such as Google Health and Microsoft HealthVault, that help us manage our health. However, these tools have significant disadvantages: they have difficulty extracting information from a group of data, and their classification performance is weak. The proposed system addresses these problems and provides more reliable access to information. In this work we make use of medical databases such as Medline, in which relevant medical articles are published. People rarely have time to read a complete article to learn about a particular disease and its treatment.
They therefore need a system that extracts the relevant information from these articles easily and efficiently, and the proposed system provides this facility. For this we use a text mining approach. The system extracts information about diseases, treatments and symptoms, and three semantic relations between diseases and treatments: cure, prevent and side effect. To extract this information, all unwanted information in the article must first be eliminated. For this, different methods such as stop word removal, repeated word removal and the Multinomial Naive Bayes algorithm are used. The methods used in this system present the user with only informative sentences, so that users, especially doctors, need not spend much time reading these articles to analyze a particular disease. In addition, classification performance is improved.

II. LITERATURE SURVEY

The most important prior work is that of Rosario and Hearst [1], who also distributed the dataset used in this work. The dataset consists of sentences from Medline abstracts annotated with diseases and treatments. Their work makes use of Hidden Markov Models and deals with mapping biomedical information into a structured format. They consider seven relations: cure, prevent, side effect, only disease, only treatment, no cure and vague. For extracting the semantic relations they use the Naive Bayes algorithm.

Information extraction is costly, and a method for decreasing its cost is described by M. Craven [2]. That work covers information extraction via relational learning and via text classification. A Naive Bayes classifier with a bag of words representation is used for classification, which is a statistical approach.
Relations between different entities are dealt with by Ray and Craven [3]. In their work they find relations between genes and disorders, and also the location of a particular protein within a cell. Srinivasan and Rindflesch [4] extract various relations via medical subject headings and semantic representation, exploring some of the text mining opportunities offered by the Medline database. In some cases an AdaBoost classifier [5] with a bag of words representation is used; in these cases the whole abstract, rather than a single sentence, is considered as the source of information. Other methods such as pre-processing, parsing and error recovery [6] are used for identifying the relation between two different entities. Representation techniques such as bag of words and classification algorithms such as Naive Bayes, Complement Naive Bayes and Support Vector Machine are described by Oana Frunza and Diana Inkpen [7]. They measure the performance of different classifiers using precision, recall and F-measure, compare the classifiers both directly and indirectly, and compare each against a baseline classifier.

III. PROPOSED APPROACH

Medical databases such as Medline contain a large number of medical articles. The main problem with these databases is that research discoveries enter the repository at a high rate, making the process of identifying and disseminating healthcare information a difficult task [7]. The proposed approach addresses this problem and introduces a new method for easily extracting relevant medical information from these articles. The proposed system displays information about diseases, treatments and symptoms to the user, and in addition identifies three semantic relations between diseases and treatments.
An article always contains both relevant and irrelevant information. We therefore first discard the irrelevant information and then find the required information in the remaining relevant data. Removing irrelevant information makes the text mining process much easier, and different methods are used for it. The first step is stop word removal, in which all stop words such as a, an and is are removed from the article. After stop word removal we remove repeated words with the help of a stemming algorithm. Both steps reduce the content of the document but increase the quality of the result. Figure 1 shows the system architecture of the proposed system:

Figure 1. The System Architecture of Proposed System

All the processes employed here are performed in a pipelined manner. The first is stop word removal, in which all stop words such as a, an, is and the are removed. There are about 174 English stop words, and all are removed from the text file. Although stop word removal reduces the size of the document, the quality of the result is improved. The document obtained after stop word removal is then the input for repeated word removal, for which we use a stemming algorithm. Many stemming algorithms exist; we use a suffix stripping stemming algorithm. For example, the document may contain words like connected and connecting, and the stemming algorithm reduces both to the stem connect. After the removal of repeated words, the next step is to apply the Multinomial Naive Bayes algorithm to find the relations between diseases and treatments. In addition to the three relations cure, prevent and side effect, this step also identifies the symptoms associated with a particular disease.
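The pipeline described above can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the implementation used in this work: the stop word list is a short stand-in for the full list of about 174 words, the suffix rules are a simplified form of suffix stripping, the add-one smoothing is an assumption, and the training sentences are invented for the example. The names STOP_WORDS, stem, preprocess and MultinomialNB are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Short illustrative stop list (assumption; the full list has about 174 words).
STOP_WORDS = {"a", "an", "and", "as", "in", "is", "of", "the", "to"}

# Suffixes tried in order; a simplified stand-in for a suffix stripping stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if suffix == "s" and word.endswith("ss"):
            continue  # leave words such as "express" intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem the remaining words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


class MultinomialNB:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, sentences, labels):
        self.word_counts = defaultdict(Counter)  # label -> stem -> count
        self.doc_counts = Counter(labels)        # label -> number of sentences
        self.vocabulary = set()
        for text, label in zip(sentences, labels):
            tokens = preprocess(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = preprocess(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # log prior plus the sum of smoothed log likelihoods of each word
            score = math.log(n_docs / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training sentences, one label per sentence (not from the dataset).
train = [
    ("Antibiotics cured the bacterial infection in most patients", "cure"),
    ("The drug treated and cured the disease", "cure"),
    ("Vaccination prevents measles in children", "prevent"),
    ("Regular exercise is preventing heart disease", "prevent"),
    ("Chemotherapy caused nausea as a side effect", "side_effect"),
    ("The treatment caused headaches and other side effects", "side_effect"),
]
model = MultinomialNB().fit([s for s, _ in train], [l for _, l in train])

print(preprocess("connected connecting"))
print(model.predict("The vaccine prevented infection"))
print(model.predict("Radiation therapy caused fatigue"))
```

Because each occurrence of a word adds its own term to the score, a word that appears several times in a sentence influences the decision more than one that appears once. This is the word count behaviour that distinguishes the multinomial model from a presence-only model.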
Earlier, the Naive Bayes algorithm was used for finding the semantic relations between two different entities. Because there are drawbacks associated with the basic Naive Bayes algorithm, we use the Multinomial Naive Bayes algorithm for text classification. Multinomial Naive Bayes is a specialized version of the Naive Bayes algorithm. The main drawback of Naive Bayes is that it assumes all attributes are independent of each other given the class. For example, consider a classifier for assessing the risk of issuing a credit card. Here Naive Bayes assumes that there is no dependency between a customer's age, income, education and salary, provided the customer is classified as a worthy customer. The multinomial variant keeps this independence assumption but takes into account the number of times each word occurs in a document, which makes it better suited to text classification.

The proposed approach saves time for users, especially doctors, in analyzing a particular disease. The methods used in this system can also be used for finding relations between other kinds of entities and in various other applications. The results can be evaluated using precision, recall and F-measure values. The important modules of the system are described in detail below.

A. Relevant Information Extraction from Medical Articles

There are many medical databases in which medical articles are published, for example the Medline database. From these databases we extract only the relevant information. For this purpose we use the bag of words representation, in which each word is treated as a feature for training the classifier.

1. Removal of stop words: Stop words are words that are filtered out before or after the processing of natural language data. There are about 174 English stop words, for example a, an and the.
These stop words are removed from the document. Removing them reduces the content of the document but improves the quality of the result.

2. Repeated words removal: After the removal of stop words, repeated word removal is performed using a stemming algorithm. The stemming algorithm used here is suffix stripping: it strips away word endings to produce the base or stem form of the word. In some cases suffix substitution is used. For example, the words expressed and expressing are reduced to their base form express.

B. Sentence Identification and Relation Identification Process

In the sentence identification process, sentences containing information about diseases, treatments and symptoms are identified. The next step is to find the relation that exists between diseases and treatments. Three semantic relations are identified: cure, prevent and side effect. For this we use the Multinomial Naive Bayes algorithm, which takes into account the number of times a word occurs in a document or text file. A training set and a test set are required for the relation identification task: the training set is used to train the system and the test set to test the performance of the algorithm.

C. Output Performance Evaluation

The output performance of the system is evaluated on other documents. The result shows information about diseases, treatments and symptoms and the three semantic relations between diseases and treatments. The result we obtain is more efficient and accurate.

IV. CONCLUSION

The proposed approach presents only informative sentences by discarding uninformative content. The system helps users, especially doctors, in making medical decisions and in analyzing a particular disease and its treatments.
Relations between other kinds of entities can be found in the same way, and the system can be used in various healthcare domains and other medical applications.

REFERENCES

1. B. Rosario and M.A. Hearst, “Semantic Relations in Bioscience Text,” Proc. 42nd Ann. Meeting on Assoc. for Computational Linguistics, vol. 430, 2004.
2. M. Craven, “Learning to Extract Relations from Medline,” Proc. Assoc. for the Advancement of Artificial Intelligence, 1999.
3. S. Ray and M. Craven, “Representing Sentence Structure in Hidden Markov Models for Information Extraction,” Proc. Int’l Joint Conf. Artificial Intelligence (IJCAI ’01), 2001.
4. P. Srinivasan and T. Rindflesch, “Exploring Text Mining from Medline,” Proc. Am. Medical Informatics Assoc. (AMIA) Symp., 2002.
5. O. Frunza and D. Inkpen, “Textual Information in Predicting Functional Properties of the Genes,” Proc. Workshop Current Trends in Biomedical Natural Language Processing (BioNLP) in conjunction with Assoc. for Computational Linguistics (ACL ’08), 2008.
6. C. Friedman, P. Kra, H. Yu, M. Krauthammer, and A. Rzhetsky, “GENIES: A Natural Language Processing System for the Extraction of Molecular Pathways from Journal Articles,” Bioinformatics, vol. 17, pp. S74-S82, 2001.
7. O. Frunza et al., “A Machine Learning Approach for Identifying Disease-Treatment Relations in Short Texts,” May 2011.

Ancy Sudhakar received her B-tech degree in Computer Science and Engineering from College of Engineering Vadakara, Cochin University, Kerala, India and is presently in the final year of her Master of Technology at KMCT College of Engineering Calicut, Calicut University, Kerala, India. Her research interests include Data Mining and Knowledge Engineering.