International Journal of Engineering Trends and Technology (IJETT) – Volume 28 Number 4 - October 2015

Web-Based Automatic Health Information Guide

Premchand#1, Yogesh Rai#2
1 M.Tech. Student, 2 Assistant Professor
Department of Computer Science, Shree Institute of Science and Technology, Bhopal, India

Abstract: A question answering system is characterized by its ability to produce specific answers to natural language questions. With internet access now continuously available, generating answers through a question answering system can be very effective: the required information is extracted from the webpages returned by a web search on the question. Because existing systems perform poorly on complex medical-domain questions, an efficient system is highly desirable. Websites exist that provide information for physicians and for end users (patients, internet users), but they return not only the required, specific answer but also a great deal of irrelevant information and explanation that may not be helpful. This paper illustrates how these limitations can be overcome so that only the specific information is selected.

Keywords: Question answering system, natural language, medical domain

I. INTRODUCTION

In the last few decades, a lot of work has been done on the development of question answering systems, but there is still room for further development, as they are limited to short queries. They fail to answer questions made up of more than two sentences, or their performance decreases drastically once the query length grows beyond 20-25 words. Most of the available systems take their input in a structured or semi-structured form. Existing health information systems often have difficulty supporting long user queries, which can also be noisy. In recent times, the number of internet users has increased drastically.
Nowadays, many ordinary users who are unable to write formal English access web-based health information systems. They provide information as informal text; the system fails to interpret the question correctly and therefore fails to answer it. Such systems also return a lot of irrelevant information that the user may not need, and when a system fails to provide an answer, it offers additional links, which must be visited in turn and may or may not be relevant.

Usually, when a patient falls ill, it may take some time to reach a doctor. Meanwhile, the patient's condition may worsen, which could be avoided if he or she had the correct information.

ISSN: 2231-5381

If the user learns about the medical tests that may be required before meeting the doctor, the patient can save precious time, and treatment can start at the first visit. Today, users are well informed: they gather information about their problem before going to the doctor, and so can get better treatment. In the clinic, the patient receives only the medicine for the problem concerned. Normally, the physician gives the patient the medicine but not the information, although this information can save the patient a lot of time and help prevent the same problem from recurring.

In this work, we aim to develop a system for end users that is able to generate the most specific information the user is searching for. The system is capable of handling very long user queries (around 100 words). Apart from the disease description, we also provide the tests required, probable symptoms, and common home remedies. First, we analyze the user's (very long) query and generate a template from the information in the question. The template contains the important terms, such as disease name, symptoms, age, sex, and disease duration. We use this information to form a query, which is then used for a web search. From the web, we retrieve the information.
Unlike other systems, we use the user's question not only at the beginning but also for the final selection of information. The health information retrieved from the various webpages is compared again against the information from the user's question to obtain the most specific answer.

The following sections describe the system in detail. Section 2 presents a study of existing systems. Section 3 discusses our system. Section 4 presents the results and discussion.

II. Related Works

Question answering can be considered an advanced form of information retrieval. Question answering systems focus mainly on the extraction of information. For document generation, work has been done on generic question answering with summaries (Amigo et al., 2004), question answering with query-focused summaries (Dang, 2005), and clustering-based information retrieval (Hearst and Pedersen, 1996; Dumais et al., 2001; Lawrie and Croft, 2003).

http://www.ijettjournal.org

Work on machine learning in medical diagnosis started a few decades ago. An early example of a medical diagnosis system is MYCIN. MYCIN was developed in the early 1970s to identify infection-causing bacteria and to recommend antibiotics, with the dosage adjusted for the patient's body weight, but it was never used in practice. In the 1980s, several attempts were made to develop automatic medical diagnosis systems (AMDS), but these systems mostly focused on a particular type of disease, for example thyroid disease diagnosis, craniostenosis syndrome diagnosis (Baim, 1988), dermatoglyphic diagnosis (Chan & Wong, 1989), and cardiology (Bratko et al., 1989). In the 1990s, a number of works were carried out on automatic diagnosis system development.
Cestnik (1990) developed a system for medical problems using the m-estimate of probabilities in a naive Bayesian classifier. Kononenko (1991) used a semi-naive Bayesian classifier to detect dependencies between attributes. Spiegelhalter et al. (1993) developed a system that diagnoses heart disease in newborn babies using a naive Bayesian classifier, with an efficiency of 67.3%.

Work on question answering related to patient care started in the early 1990s. Doctors used to collect patient information in ways that were helpful to them. In 1993, Cimino et al. proposed a framework that analyzes ad hoc medical questions such as “What is the treatment of X disease”, which can be mapped to “How to treat X disease”. In 1999, the Text REtrieval Conference (TREC) introduced a QA track focused on answering factoid questions. Since 2003, TREC has also considered definitional questions, which need long and complex answers. TREC also introduced passage retrieval for question answering in the genomics domain.

Automatically analyzing clinical questions is an important step towards answering them. Physicians often ask complex and verbose questions comprising a wide variety of types. In line with the objectives of the QA track at TREC, question answering development has mainly focused on improving answer extraction performance. Some well-performing systems are PubMed, MedQA, and AskHERMES. They work with definitional questions such as “What is X?”, which require long and complex answers. These systems are made mainly for experts and physicians, who get related informational answers (possible tests, symptoms, precautions, etc.) about the disease, which helps them treat patients better.

Many systems are available in the field of medical question answering. The number of internet users is increasing rapidly, but the available systems are not designed to give general users or patients medical help for every health problem.
These systems perform well but are limited to short queries (normally up to 30-35 words). There is therefore a need for a system that can work on a general user's or patient's query, which may be informal and very long, and still provide results for any disease.

III. Our System

We have built a system that provides relevant health information in answer to diagnosis-related queries given by a user. The long queries are preprocessed and searched on the web, and the retrieved pages are processed to extract the relevant information. The overall architecture of the system is given in Figure 1. The system consists of the following core modules: question identification, question analysis, query generation, web search, information extraction, and finalization of a coherent answer. The individual modules are discussed below.

Fig 1: System Architecture

A. Question Identification:

The system is designed to handle health questions posed by any user, and there is a possibility that a question does not belong to that type. The system is a closed-domain system that works for medical or health-related questions, so it is important to identify the nature of the question. This step detects whether a question posed by a user is a health question. The system classifies the question into two categories: medical and non-medical. For example, “What are the symptoms of seizures” is selected as a medical question and “What medications are prescribed for seizures” is selected as a non-medical question. To achieve this, we collected a large number of questions (posted by real users) from different websites, belonging to both categories: medical and non-medical. From the questions, we extracted some specific n-grams.
These n-grams represent the probability of belonging to the medical domain or the non-medical domain. When a user submits a question, the system searches for these specific n-grams and sums up their values. Based on these values, it decides whether the question falls under the medical domain and whether it should be accepted. This module accepts only medical questions and forwards them for further processing; all other questions are rejected.

B. Question Analysis:

Once the question is accepted, it is passed on for analysis. This part is the core of the system, because the medical terms on which the final result depends are identified here. We extract the important data and classify them according to their properties, such as disease, symptoms, sex, and duration; this is the toughest job of the project. We have observed that many questions are quite lengthy and that users often write informal English with many grammatical and spelling errors. So we run a set of preprocessing phases that rectify the question and extract specific information such as disease name, sex, age, and symptoms. This information plays a vital role in later steps, so it is represented as a template.

Every word in the question is important, because the information is searched and the answer is prepared on the basis of these words. So there should not be any spelling errors. For spelling error identification and correction, we use a dictionary of common English words and a dictionary of disease names. We used various relevant web sources to prepare the disease and medicine name dictionaries. If a word is not present in any of the dictionaries, it is flagged as misspelled. The spelling is corrected using the edit distance method. Next, we fill a template by extracting the appropriate entries from the user's question.
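The dictionary lookup and edit-distance correction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tiny `DICTIONARY` set stands in for the common-word and disease-name dictionaries, the `max_distance` threshold and all function names are hypothetical, and the distance used is plain Levenshtein distance.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute (or keep) a character
            ))
        prev = curr
    return prev[-1]


# Hypothetical stand-in for the common-word and disease-name dictionaries.
DICTIONARY = {"i", "have", "diabetes", "seizures", "depression", "since", "years"}


def correct_word(word, dictionary=DICTIONARY, max_distance=2):
    """Return the word itself if known, else the closest dictionary entry."""
    w = word.lower()
    if w in dictionary:
        return w
    # sorted() makes the choice deterministic when several entries tie.
    best = min(sorted(dictionary), key=lambda cand: edit_distance(w, cand))
    # Leave the word unchanged when nothing in the dictionary is close enough.
    return best if edit_distance(w, best) <= max_distance else w


def correct_question(text):
    """Correct every whitespace-separated token of a question (lowercased)."""
    return " ".join(correct_word(tok) for tok in text.split())
```

For instance, `correct_question("I have diabetis")` yields `"i have diabetes"` with this toy dictionary. A real dictionary would be far larger, so a production version would likely need an index or candidate pruning rather than a scan over every entry.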
The template contains the following fields: disease name, symptom, sex, age, duration of suffering, and past medicine history (if any). Template filling, i.e., extraction of the corresponding n-grams, is done by defining a set of rules. Here we use the NLTK toolkit, which provides basic NLP analysis of the text such as part-of-speech tags, parse information, and named entity recognition. Examples of the rules used for template extraction are given below.

suffering from 20 years (yr, yrs, months): in this sequence, the combination of the words tagged 'CD' and 'NN' or 'NNS' is selected as the disease duration.
from over 20 years (yr, yrs, months): in this sequence, the combination of the words tagged 'CD' and 'NN' or 'NNS' is selected as the disease duration.
5 years old: in this sequence, the combination of the words tagged 'CD' and 'NN' or 'NNS' is selected as the age.

C. Web Search:

When a website receives a sentence to answer, it selects only some specific words from it according to its own strategy and generates the answer based on those selected words. So, in order to get the best result for the user's question, we generate a query that condenses the question. The query can be formed in two ways: from the most frequent and relevant words or phrases, or from the template information. We generate the query from the template information and then search the web with it. Our objective is to extract the most important information relevant to the disease name. Webpages do not contain only text about the disease; they also contain advertisements, forms, and irrelevant links, or the required information is spread across various inner links or tabs. There is no uniformity in their structure, so it is not easy to extract the required information from these webpages of various types. Additionally, the information that we aim to extract must be authentic so that a user can rely on it. So we cannot search just any website (or medical website).
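The template-extraction rules for age and disease duration given in the Question Analysis section can be sketched as follows. Note the substitution: to keep the example runnable without downloading tagger models, regular expressions are used here in place of NLTK part-of-speech tags, with a digit sequence standing in for a 'CD' token and a time-unit word standing in for an 'NN'/'NNS' token. The pattern names, the `fill_template` function, and the two-field template are assumptions made for illustration only.

```python
import re

# A number (stand-in for a 'CD' token) followed by a time unit
# (stand-in for an 'NN'/'NNS' token).
NUM_UNIT = r"(\d+)\s*(years?|yrs?|months?)"

# Trigger phrases mirror the example rules: "suffering from ..." and "from over ...".
DURATION_RE = re.compile(
    r"\b(?:suffering\s+from|from\s+over)\s+" + NUM_UNIT, re.IGNORECASE
)
# "<number> <unit> old" is read as the age.
AGE_RE = re.compile(r"\b" + NUM_UNIT + r"\s+old\b", re.IGNORECASE)


def fill_template(question: str) -> dict:
    """Fill the age and duration fields of a (simplified) template."""
    template = {"age": None, "duration": None}
    age = AGE_RE.search(question)
    if age:
        template["age"] = f"{age.group(1)} {age.group(2)}"
    duration = DURATION_RE.search(question)
    if duration:
        template["duration"] = f"{duration.group(1)} {duration.group(2)}"
    return template
```

With this sketch, a question such as "My son is 5 years old and suffering from 2 months with cough" fills the age field with "5 years" and the duration field with "2 months". A tagger-based version would generalize better to numbers written as words, which these patterns do not handle.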
The extracted information should be authentic and regularly updated, as it is medical information. To achieve this, we have made a list of trusted websites from which the information is extracted. To compile the list, we used various web sources and web reviews such as http://medical.nettop20.com and www.top20sites.com. Our trusted website list for this domain contains 36 websites. Some of them are given in Table 1.

Medlineplus.com
Healthcentral.com
Webmd.com
Medindia.net
Healthline.com
Bettermedicine.com
Drugs.com
www.webdiagnosis.com
www.medicalsymptoms.in
medicinenet.com
www.netdoctor.co.uk
www.healthatoz.com
www.mayohealth.org
emedicine.medscape.com
www.answermed.com
www.healthanswers.com
Table 1: Trusted Websites

From the web search results, we pick only the pages that belong to the trusted website list.

D. Information Extraction:

The system accesses each webpage as HTML, and in order to select any sentence or word(s), the related tags and their contents are selected. At the start, however, some specific tags holding irrelevant information are selected and removed. This reduces the webpage content to a great extent without removing any important information and makes the relevant information easier to identify. By selecting specific tags, the system picks the most appropriate sentences from the webpage. The system extracts information such as dosage of use. To achieve better results, we currently search the question query on two websites and combine their results to obtain more specific results. Searching more than one website gives better results because sometimes a website provides no information for a disease, and in that case the result from the other website can be helpful.
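The removal of irrelevant tags before sentence selection, as described above, can be sketched with Python's standard `html.parser` module. This is an illustrative sketch only: the paper does not say which tags are removed, so the `SKIP_TAGS` set is an assumption, as are the class and function names.

```python
from html.parser import HTMLParser

# Assumed set of tags whose content is treated as irrelevant
# (scripts, styling, navigation, forms); the paper does not list them.
SKIP_TAGS = {"script", "style", "nav", "footer", "form", "aside"}


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring everything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside at least one skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> list:
    """Return the text chunks of a page that lie outside the skipped tags."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

The chunks returned by `extract_text` would then be the candidates from which the most appropriate sentences are chosen. Real pages are often malformed, so a more tolerant HTML library may be preferable in practice.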
In the selected webpages, there may be a lot of content other than the required information. The same type of information can be available on more than one webpage. The system should not present all instances of the same type of information as the result, but only unique data. So we keep an internal list of the names of the required information types, and after any information is extracted, its name is removed from the list. This helps to search for the remaining information on other websites in less time.

IV. Result and Discussion

We worked on a set of complex questions collected from online users and some common users. We ran these questions through the system to get the relevant health information together with the relevant instructions. For example:

“I am suffering from bipolar disorder since year 2000 ,at that time i got a maniac attack ,after that i did not get any episodes for next 3-4 years then in year 2006 i got depression ,and in year 2007 again depression ,then in year 2012 out of the blue moon i got maniac attack. since then i am on medication depakote + licab XL daily at night .I would like to know whether continuing the tablets for long will have any side effects ,and whether this is the best treatment ?”

The system accepts this question as a medical diagnosis question and processes it. During the analysis phase, the system identifies the important terms, such as bipolar disorder as the disease name, 15 years as the disease duration, and maniac attack and depression as additional symptoms. In the next phase, the system creates short queries based on this information, such as “bipolar disorder treatment with headache” and “bipolar disorder medicine”. These queries are searched on different medical websites to get the health information. Currently, we are working with two websites: drugs.com and healthline.com. The system checks many pages of web results but selects only the specific web links. From the resulting webpages, the system retrieves only health information such as tests required, precautions, disease symptoms, and home remedies.
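The query-formation step in the example above, where a filled template is condensed into short search queries, can be sketched as follows. The paper does not specify the exact procedure, so this is an assumption-laden illustration: the `build_queries` function, the template field names, and the `intents` suffixes are all hypothetical.

```python
def build_queries(template: dict, intents=("treatment", "medicine")) -> list:
    """Turn a filled template into short search queries."""
    disease = template.get("disease")
    if not disease:
        # Without a disease name there is nothing to anchor the query on.
        return []
    # One query per intent, e.g. "<disease> treatment", "<disease> medicine".
    queries = [f"{disease} {intent}" for intent in intents]
    # One extra treatment query per symptom found in the question.
    for symptom in template.get("symptoms", []):
        queries.append(f"{disease} treatment with {symptom}")
    return queries
```

Called with `{"disease": "bipolar disorder", "symptoms": ["depression"]}`, the sketch produces "bipolar disorder treatment", "bipolar disorder medicine", and "bipolar disorder treatment with depression". Returning an empty list when no disease name is present is a design choice of this sketch that makes the dependence on the disease field explicit.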
Our system works well, but with some limitations. If the system fails to identify the disease name, there will be no result. For example:

“I have massive body swelling in my legs feet and stomach, I'm not on any current medication that could be side effects. I'm weighing 5lbs more then my regular weight. This has been going on for approximately 6 days. Should I seek medical attention?”

The system accepts this question but fails to produce a result, because it is unable to identify the disease name.

We evaluated our system on the relevance of its answers; we do not consider any other measure. Currently, the development of our system is at an early stage, and the data resources it has indexed are very limited. Selection of the final prescribed medicine would be more effective with more data resources. Moreover, the system works well on complex medical questions. Currently, the system gives limited results: it provides only health information, not medicines (such as the best prescribed medicine or a required supplement).

References
[1] Cao Y., Liu F., Simpson P., Antieau L., Bennett A., Cimino J., Ely J., Yu H. 2011. AskHERMES: An Online Question Answering System for Complex Clinical Questions. Journal of Biomedical Informatics, 44, 277–288.
[2] Spiegelhalter D.J., Dawid A.P., Lauritzen S.L., Cowell R.G. 1993. Bayesian analysis in expert systems. Statistical Science, 8(3), 219–283.
[3] Voorhees E.M. 1999. The TREC-8 question answering track report. In: Proceedings of TREC.
[4] Ely J., Osheroff J., Ebell M., Bergus G., Levy B., Chambliss M., Evans E. 1999. Analysis of questions asked by family doctors regarding patient care. BMJ, 319, 358–361.
[5] Yu H., Cao Y.G. 2008. Automatically Extracting Information Needs from Ad Hoc Clinical Questions. AMIA Annual Symposium Proceedings, 2008, 96–100.
[6] Kononenko I. 2001. Machine learning for medical diagnosis: history, state of the art and perspective.
Artificial Intelligence in Medicine, August 2001, 23(1), 89–109.
[7] Blair-Goldensohn S., McKeown K.R., Schlaikjer A.H. 2003. A Hybrid Approach for QA Track Definitional Questions. In Proceedings of the 12th Annual Text Retrieval Conference. 2006.
[8] Hersh W., Cohen A., Ruslen L., Roberts P. 2007. In the TREC genomics track conference 2007.
[9] Hersh W., Cohen A., Roberts P., Rekapalli H. 2006. In the TREC genomics track conference 2006.
[10] Yu H., Lee M., Kaufman D., Ely J., Osheroff J.A., Hripcsak G., Cimino J. 2007. Development, implementation, and a cognitive evaluation of a definitional question answering system for physicians. Journal of Biomedical Informatics, June 2007, 40(3), 236–251.
[11] Cao Y.G., Cimino J.J., Ely J., Yu H. 2010. Automatically extracting information needs from complex clinical questions. Journal of Biomedical Informatics, 43, 962–971.