MedicineAsk: An intelligent search facility for information about medicines

Ricardo João Galamba de Sepúlveda Ferrão

Thesis to obtain the Master of Science Degree in Information Systems and Computer Engineering

Supervisors: Prof. Helena Isabel de Jesus Galhardas
Prof. Maria Luísa Torres Ribeiro Marques da Silva Coheur

Examination Committee
Chairperson: Prof. Miguel Nuno Dias Alves Pupo Correia
Supervisor: Prof. Helena Isabel de Jesus Galhardas
Member of the Committee: Prof. Ana Teresa Correia de Freitas

October 2014

Resumo
O acesso rápido e fácil a informação é muito importante no campo da medicina. As interfaces em Língua Natural são uma das maneiras de aceder a este tipo de informação. O MedicineAsk é um protótipo de software que procura responder a perguntas em Português sobre medicamentos e substâncias activas. Foi concebido para ser fácil de usar tanto por pessoal médico como por utilizadores comuns. As respostas às perguntas são obtidas através de informação previamente extraída do Prontuário Terapêutico do Infarmed e armazenada numa base de dados relacional. Esta tese descreve a extensão do módulo de processamento de Língua Natural do MedicineAsk. Focamo-nos em aumentar a quantidade de perguntas de utilizadores a que é possível responder. Em primeiro lugar, adicionámos técnicas de aprendizagem automática para classificação de perguntas usando Support Vector Machines. Em segundo lugar, foi implementado suporte para perguntas que incluem anáforas e elipses. Finalmente, melhorámos o detector de sinónimos implementado na versão anterior do MedicineAsk. Realizámos uma validação sobre cada nova adição ao MedicineAsk e identificámos as limitações encontradas, sugerindo algumas soluções. A versão melhorada do Processador de Língua Natural do MedicineAsk respondeu a 17% mais perguntas que a versão anterior, e ainda a 5% mais perguntas ao tratar anáforas. Esta tese relata também o estado da arte de sistemas de pergunta-resposta no domínio médico, de outros tipos de aplicações web na área da medicina e de sistemas de recuperação de informação médica.
Palavras-chave: Língua Natural, Medicina, Support Vector Machines, Resolução de Anáforas

Abstract
Obtaining information quickly and easily is very important in the medical field. Natural Language Interfaces are one way to access this kind of information. MedicineAsk is a prototype that seeks to answer Portuguese Natural Language questions about medicines and active substances. It was designed to be easy to use so that questions may be posed by both medical staff and common users. Questions are answered through information previously extracted from Infarmed's Therapeutic Handbook and stored in a relational database. This thesis describes the extension of the Natural Language processing module of MedicineAsk. We focused on increasing the quantity of answerable user questions. First, we added machine learning techniques for question classification by using Support Vector Machines. Second, support for questions including anaphora and ellipsis has been implemented. Third, we extended the synonym detection feature implemented in the previous version of MedicineAsk. We performed a validation of each of the new MedicineAsk features. Our improved MedicineAsk NLI answered 17% more questions than the previous version of MedicineAsk, with a further 5% increase when handling anaphora. We identified current limitations of MedicineAsk and suggested some solutions.
This document also shows the state of the art on medical domain question answering systems, on other types of web-based systems in the area of medicine and on information retrieval systems for medical information.
Keywords: Natural Language, Medicine, Support Vector Machines, Anaphora Resolution

Contents
Resumo
Abstract
List of Tables
List of Figures
Nomenclature
Glossary
1 Introduction
1.1 Web-based system for querying data about medicines in Portuguese
1.2 Proposed solution
1.3 Contributions
1.4 Thesis Outline
2 Related Work
2.1 Medical Question Answering systems
2.2 General Domain Question Answering systems
2.3 Medical Text Information Extraction
2.4 Web-based Systems
2.5 Existing Algorithms for Anaphora Resolution
3 Background
3.1 The MedicineAsk Prototype
3.1.1 Information Extraction
3.1.2 Natural Language Interface
3.1.3 Validation of the MedicineAsk prototype
3.2 LUP: A Language Understanding Platform
4 Improvements to the Natural Language Interface module
4.1 Automatic question classification
4.1.1 Answering MedicineAsk questions with SVM
4.1.2 Adding SVM to MedicineAsk
4.2 Anaphora and Ellipsis
4.2.1 Proposed Solution
4.2.2 Implementation
4.3 Synonyms
4.3.1 Motivation
4.3.2 Implementation
5 Validation
5.1 Experimental Setup
5.2 Rule-based versus Automatic Question Classification
5.2.1 Results
5.2.2 Discussion and Error Analysis
5.3 Integrating SVM into the MedicineAsk NLI
5.4 First Experiment - Combining Question Answering approaches
5.4.1 Results
5.4.2 Discussion and Error Analysis
5.5 Second Experiment - Increasing the training data
5.5.1 Results
5.5.2 Discussion and Error Analysis
5.6 Anaphora Evaluation
5.6.1 Results
5.6.2 Discussion and Error Analysis
5.7 Testing Test Corpus B with anaphora resolution
5.7.1 Results
5.7.2 Discussion and Error Analysis
5.8 Conclusions
6 Conclusions
6.1 Summary and Contributions
6.2 Future Work
6.2.1 Additional question types
6.2.2 Additional strategies for question answering technique combinations
6.2.3 Addressing common user mistakes
6.2.4 User evaluation
6.2.5 Question type anaphora
6.2.6 Updates to Information Extraction
6.2.7 MedicineAsk on mobile platforms
6.2.8 Analysing Portuguese NL questions with information in other languages
Bibliography
A Questionnaire used to obtain test corpora
B Dictionary used to identify named medical entities in a user question (Excerpt)
C Training Corpus B Excerpt

List of Tables
2.1 Question Analysis steps in MEANS. The input of a step is the output resulting from the previous step.
5.1 Number of questions for each question type in Training Corpus A.
5.2 Number of user questions and expected answer type for each scenario for Test Corpus A.
5.3 Number of questions for each question type in Training Corpus B.
5.4 Number of user questions and expected answer type for each scenario for Test Corpus B.
5.5 Percentage of questions with anaphora correctly classified.
List of Figures
1.1 Results of searching for medicines for headache using the eMedicine website.
3.1 Architecture of MedicineAsk Information Extraction module. Image taken from [Mendes, 2011].
3.2 Architecture of MedicineAsk Natural Language Interface module. Image taken from [Mendes, 2011].
3.3 LUP architecture.
4.1 Strategy 1.
4.2 Strategy 2.
4.3 Strategy 3.
4.4 Answering a question with no anaphora and storing its information in the Antecedent Storage.
4.5 Solving anaphora in a question. Paracetamol is ignored because it is an active substance.
5.1 Scenario used to encourage users to use anaphora.
5.2 Percentage of correctly classified questions by scenario for common users after improvements.
5.3 Percentage of correctly classified questions by scenario for users in the medical field after improvements.
5.4 Percentage of correctly classified questions by scenario for all users after improvements.
5.5 Percentage of correctly classified questions by scenario for common users, using string similarity techniques.
5.6 Percentage of correctly classified questions by scenario for users in the medical field, using string similarity techniques.
5.7 Percentage of correctly classified questions by scenario for common users for the First Experiment.
5.8 Percentage of correctly classified questions by scenario for users in the medical field for the First Experiment.
5.9 Percentage of correctly classified questions by scenario for all users for the First Experiment.
5.10 Percentage of correctly classified questions by scenario for all users for the Second Experiment.
5.11 Percentage of correctly classified questions by scenario for all users for the Second Experiment with Anaphora Resolution.

Chapter 1 Introduction
Every day, the data available on the Internet, in particular medical information, increases significantly. Medical staff and common users may be interested in accessing this information. After visiting their doctors, common users may want to complement the information they received about their diseases and medication.
Medical staff, such as doctors, nurses and medical students, may want to consult the information available on the web to clarify doubts, confirm a medication or stay updated with the latest information. Due to its nature, medical information often has to be accessed quickly. For example, a doctor may need to quickly access information in order to treat an emergency patient. A common user may have lost the information regarding one of his medicines and thus needs to urgently access the correct dosage for that medicine. For these reasons, on-line medical information should be available through an interface that is fast and easy to use by most people.
There is a large amount of medical information of many different types and formats currently available on-line. This information is contained in either databases of medicines and diseases or collections of papers containing important medical findings. Currently, to access this on-line information, users must either do it by hand (i.e., by manually reading a large volume of information and/or navigating through an index), learn a language to query the data (e.g., learn SQL to query a database with medical information), or use a keyword-based system. Manually navigating through medical information can be a complex task for a common user because the medical terms used to organize the information are often too complicated for the common user to understand. Learning a language to query data stored in a database is also something that cannot be expected from a common user. As for keyword-based systems, keywords are considered without taking into account the connection between them. For example, if a user searches for medicines for headache, the system will retrieve any document referencing headache, even if that term is only mentioned briefly. This makes it difficult for the user to find the information he/she is looking for. To illustrate this issue, Figure 1.1 shows the result of searching for medicines for headache on eMedicine (http://emedicine.medscape.com/), a website which allows users to search for medical information through keyword-based search. Note that even after filtering the results to show only results related to Drugs there are still 1180 results. The terms used in the results are also not trivial. For a common user, it would not be easy to find the required information among these results.
Figure 1.1: Results of searching for medicines for headache using the eMedicine website.
One alternative way to access medical information is through a Natural Language Interface (NLI). It has been shown that users often prefer Natural Language Interfaces over other methods such as keyword-based search [Kaufmann and Bernstein, 2007]. While some NLIs have been developed for the medical field [Ben Abacha, 2012], they are still relatively new and none is available for the Portuguese language. This means that a Portuguese user who wants to access medical information available on-line must either use a system in a foreign language, which the user may not be fluent in, or use a traditional method, such as the keyword-based search available on the Infarmed website (http://www.infarmed.pt/portal/page/portal/INFARMED). Among various on-line services, Infarmed provides the Prontuário Terapêutico (Therapeutic Handbook, http://www.infarmed.pt/prontuario/index.php), which publishes data about medicines and active substances approved to be sold in the Portuguese market. Henceforth, we refer to this information source as the Infarmed website.
The Infarmed website contains data about medicines and active substances, such as their indications, adverse reactions, precautions, dosages and prices, among others. The user may access the data available on the Infarmed website by navigating a hierarchical index (which works similarly to the index of a book) or by using a keyword-based search. As previously mentioned, navigating an index requires some knowledge of medical terms, and keyword-based search can provide incorrect or irrelevant results.
1.1 Web-based system for querying data about medicines in Portuguese
MedicineAsk is a question-answering prototype for information about medicines and active substances. This prototype was developed in the context of two master's theses [Bastos, 2009] [Mendes, 2011]. MedicineAsk intends to solve the previously explained problems of the Infarmed website by providing an NLI for it. The idea is that, by using an NLI, both common users and medical staff will be able to access the data on the Infarmed website in an easier and faster way.
The MedicineAsk architecture is divided into two modules: Information Extraction and a Natural Language Interface. The Information Extraction module is responsible for extracting information from the Infarmed website, processing it and inserting it into a relational database. The Natural Language Interface enables users to access the data on the Infarmed website. Users can query specific data regarding active substances and medicines, such as the price of a specific medicine or the indications of an active substance. It is also possible to ask more sophisticated questions, such as for medicines for a certain disease that do not have precautions associated with a given medical condition. The second version of MedicineAsk improved on the first one, since it was able to answer a larger number of questions.
1.2 Proposed solution
The NLI of MedicineAsk still has limitations regarding what a user can ask the system. Questions that contain anaphora or ellipsis cannot be answered. In other words, if a question makes a reference to a medical entity mentioned in a previous question, MedicineAsk cannot answer that question. Furthermore, the previous version of MedicineAsk uses rule-based techniques to answer questions. While these techniques can achieve good results, they also require a user's question to match a certain pattern, or to contain certain keywords. Machine learning techniques suffer less from these issues, and can possibly achieve better results than the techniques used by the previous version of MedicineAsk. Previously, some MedicineAsk features were not finished or fully explored due to time constraints. Namely, a synonym detection feature was added to MedicineAsk but, because no comprehensive list of synonyms was found at the time, this feature was not fully explored.
1.3 Contributions
This thesis resulted in a paper and poster that were published in the 10th International Conference on Data Integration in the Life Sciences. The main contributions of this thesis are:
1) Incorporation of automatic question classification techniques for question answering
The techniques used in the MedicineAsk NLI module are rule-based and keyword spotting. Rule-based techniques have the advantage of providing very good results if a user poses a question that exactly matches the patterns specified by the rules.
However, it is common for a user to pose a question in a way that the developers of the rules did not think of. A keyword spotting technique classifies a question based on the presence of certain keywords; a question about dosages would contain the keyword "dose", for example. This technique relies on dictionaries which contain these keywords. The contents of these dictionaries are either manually inserted or automatically collected from a given resource. Manually constructed dictionaries are difficult to expand. Dictionaries that were collected automatically, for example by extracting all the medicine names from a website and inserting them into the dictionary, may contain incorrect keywords. This can lead to questions being wrongly classified. Machine Learning techniques can be used to minimize the developers' work. These techniques learn how to analyse a question without following strict rules. For this reason, Machine Learning techniques can classify user questions without having to match any developer-made patterns. We integrate Support Vector Machines (SVM) into MedicineAsk and observe how the result compares to the previous version of MedicineAsk. We performed tests to compare different combination strategies, such as using only SVM to interpret questions versus combining SVM with rule-based and keyword spotting techniques. The results, reported in Chapter 5, show that SVM does in fact improve the question answering capabilities of MedicineAsk. Machine learning techniques are integrated into MedicineAsk by using the LUP system. LUP is a platform that can be used to run different Natural Language Understanding techniques (including SVMs) and compare the results of those techniques.
2) Anaphora and Ellipsis
The current version of MedicineAsk does not support questions which use anaphora and ellipsis. An anaphora occurs when there is a reference in a sentence to a previously mentioned entity. For example, consider that a user inputs the following two questions: "What are the indications of Mizollen?" and "And the adverse reactions of that medicine?". In the second question, "that medicine" refers to Mizollen, but MedicineAsk does not know this and thus will not be able to answer the question. Ellipsis is a special type of anaphora where the word referencing the previous entity is implicit, as in the pair of questions "What are the indications of paracetamol?" and "What are the adverse reactions?", where the reference to paracetamol is omitted from the second question. In this work we add support for these types of questions by keeping a short history of questions. In the case of anaphora, MedicineAsk analyses the history and chooses the most recent entity to answer the question containing the anaphora. This part of the work was made portable, meaning it can be used by other question answering environments that represent questions in a similar way to MedicineAsk. We performed tests to determine whether anaphora resolution brings improvements to MedicineAsk and found the results favourable.
3) Synonym detection
In the previous version of MedicineAsk, a list of synonyms of medical conditions, stored in the database, was implemented. This feature is useful because websites such as Infarmed often use terms for medical conditions that are not common language and thus not used by the common user. For example, most people say "febre" to refer to fever in Portuguese, but in the medical domain the term "pirexia" is often used instead. Without synonyms, users could potentially think that no information about fever was available in our system, because our system is only aware of "pirexia". We enriched the list of synonyms stored in the database with synonyms of medical terms extracted from the Priberam website (http://www.priberam.pt/DLPO/), an online Portuguese dictionary.
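As a rough illustration of how such a synonym list can be consulted at question time, the sketch below normalises a user's term to the vocabulary used by the medical source before the database is queried. This is only a sketch under assumed names: in MedicineAsk the synonyms live in a relational table and the lookup is done from Java/SQL, and the dictionary entries shown here are illustrative.

```python
# Minimal sketch (not the thesis implementation): map a user's everyday term to
# the term used by the Infarmed text before querying the database.
# The dictionary contents and function name are assumptions.
SYNONYMS = {
    "febre": "pirexia",          # common word -> term used by the medical source
    "dor de cabeça": "cefaleia",
}

def normalize_term(user_term: str) -> str:
    """Return the medical-domain term for a user-supplied word, if one is known."""
    key = user_term.strip().lower()
    return SYNONYMS.get(key, key)

if __name__ == "__main__":
    # A question about "febre" can then be answered with the data stored under "pirexia".
    print(normalize_term("febre"))   # -> pirexia
    print(normalize_term("tosse"))   # unknown terms are left unchanged -> tosse
```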
1.4 Thesis Outline
This document is organized into six chapters. Chapter 2 details the related work and, in particular, describes other question answering systems in the field of medicine. Chapter 3 explains the previous versions of MedicineAsk and the LUP system. Chapter 4 describes the improvements made to the NLI module of MedicineAsk in the scope of this thesis. Chapter 5 describes the validation performed on the new NLI module of MedicineAsk. Finally, Chapter 6 concludes and summarises future work.
Chapter 2 Related Work
This chapter describes different types of systems related to Natural Language and/or medicine. Section 2.1 details a question answering system for the medical domain, which has a goal similar to MedicineAsk. Section 2.2 details NaLIR, which translates user natural language queries into queries to a relational database, for the general domain. Section 2.3 describes cTAKES, an information extraction system for the medical domain, with some similarities to the MedicineAsk Information Extraction module. Section 2.4 lists several online systems in the medical field that are not question answering systems, but still have similarities to MedicineAsk. Finally, Section 2.5 lists some existing algorithms for anaphora resolution.
2.1 Medical Question Answering systems
MEANS [Ben Abacha, 2012] is a question-answering system for the medical domain. Its purpose is to obtain an answer to Natural Language questions based on medical information obtained from publicly available resources. Table 2.1 illustrates the steps taken to analyse a user question in MEANS. A short explanation of these steps follows.
Table 2.1: Question Analysis steps in MEANS, illustrated with the example question "What is the best treatment for pneumonia?". The input of a step is the output resulting from the previous step.
1) Question Type Identification. Output: WH or Yes/No question. Example: WH question.
2) Expected Answer Type Identification. Output: EAT. Example: EAT = Treatment.
3) Question Simplification. Output: simplified question. Example: new Q = "ANSWER for pneumonia".
4) Medical Entity Recognition. Output: identified medical entities. Example: "ANSWER for <PB>pneumonia</PB>" (PB = Problem).
5) Relations Extraction. Output: identified relations. Example: treats(ANSWER, PB), EAT = Treatment.
6) SPARQL Query Construction. Output: SPARQL query.
In the first step, called Question Type Identification, questions are divided into Yes/No or WH questions. WH questions start with How, What, Which or When and have an answer with more information than a simple Yes or No. The question type is determined by applying a simple set of rules to the input questions (e.g., if the question begins with a How, a What, a Which, or a When, then it is a WH question).
The second step, Expected Answer Type Identification (EAT), is necessary when dealing with WH questions, to discover what type of answer will be returned. The EAT is determined using lexical patterns previously built by hand for each question type. WH questions are matched against these patterns and any EAT found is saved. The possible answer types were defined by the authors and are MEDICAL PROBLEM, TREATMENT, MEDICAL TEST, SIGN OR SYMPTOM, DRUG, FOOD and PATIENT. Multiple Expected Answer Types for a single question are saved in order to answer multiple-focus questions such as "What are the symptoms and treatments for pneumonia?".
In this particular case, the expected answer types are S IGN OR S YMPTOM and T REATMENT. The third step, Question Simplication, applies a simplication to the question to improve the analysis of the question. Interrogative pronouns are replaced by the `ANSWER' keyword (e.g., What is the best treatment for pneumonia? becomes ANSWER for pneumonia). This `ANSWER' keyword is a special word that the system will ignore in later steps. The question is also turned into its afrmative form. Treatment is the EAT of the question, identied in the Expected Answer Type Identication step, so it is no longer needed while analysing the question. The simplication is applied to prevent noise in later steps. In this example, the word treatment would have been identied as a medical entity during the Medical Entity Recognition step. This extra entity would have caused interference when searching for relations, since the Relation Extraction step does not need to know the relation between treatment and pneumonia. The fourth step, Medical Entity Recognition, focuses on detecting and classifying medical terms into seven different categories, which also match the possible Expected Answer Types: P ROBLEM, T REATMENT , T EST, S IGN OR S YMPTOM, D RUG, F OOD and PATIENT. To nd medical entities, a rule-based method called MetaMap+ and a machine learning method that uses Conditional Random Fields are used. The MetaMap+ method uses a tool called MetaMap [Aronson, 2001]. MetaMap is an online tool1 to nd and classify concepts in input text by mapping them to concepts from the Unied Medical Language System2 (UMLS). UMLS is a very large resource for biomedical science. It contains a Metathesaurus with millions of concepts and concept names, a semantic network of semantic types and relationships, and a Specialist Lexicon which contains syntactic, morphological and orthographic information. The authors of MEANS identied some limitations on MetaMap [Ben Abacha and Zweigenbaum, 2011a]. For example, some words were mistakenly treated as medical concepts (such as best, normal, take and reduce). Another limitation is that MetaMap can propose more than one answer for a single word. To deal with these issues, MetaMap+ was proposed [Ben Abacha and Zweigenbaum, 2011a]. The machine learning method used for Medical Entity Recognition in MEANS is called BIO-CRF-H. It uses a Conditional Random Field classier to classify the concepts. Among others, it uses as features POS tags from TreeTagger, the semantic category of the word resulting from the MetaMap+ method, and B-I-O tags that indicate what words of a sentence are the Beginning of a concept (B tag), Inside of a concept (I tag) or Outside of a concept (O tag). For example, for a Treatment (T) concept in a given 1 http://metamap.nlm.nih.gov/. 2 http://www.nlm.nih.gov/research/umls/. 8 sentence, the rst (and possibly the only) word of that concept would be tagged B-T. Any other words that follow that same concept would be tagged I-T. Any words not inside this treatment concept or any other concepts are tagged O. The BIO-CRF-H approach uses the annotated corpus provided for the i2b2/VA 20103 challenge [Uzuner et al., 2011] for training. The fth step is Relation Extraction. A relation identies what is the relationship between two medical entities in a sentence and is very important to determine answers to medical questions. Seven relations are extracted: P H TREATS , COMPLICATES , PREVENTS , CAUSES , DIAGNOSES , D H D (Drug has dose) and SS (Problem has signs or symptoms). 
In the example Aspirin cures headaches, we have the relation TREATS between the concepts aspirin and headache. The relation extraction step uses a machine learning method with SVM. A rule-based method is used whenever there are not enough examples in the training corpus to properly train the machine learning method [Ben Abacha and Zweigenbaum, 2011b]. The rule-based method to extract relations uses a set of patterns that were manually built from analysing abstracts in MEDLINE4 (Medical Literature Analysis and Retrieval System Online). MEDLINE is a database of medical articles and information about them. The machine learning based method uses SVM (using LIBSVM) which was trained with the same i2b2/VA 2010 corpus that was used for the Conditional Random Field method to extract medical entities. For each pair of words, it determines if it has one of the seven relations previously mentioned (TREATS, DIAGNOSES , D H D and P H COMPLICATES , PREVENTS , CAUSES , SS) or no relation (thus eight possible categories). Finally, the sixth step, SPARQL Query Construction, constructs a SPARQL query [Ben Abacha and Zweigenbaum, 2012]. Before MEANS can answer any question, an RDF graph is built, through a separate process. A medical corpus is analysed and, similarly to how questions are analysed, medical entities and relations are extracted from the medical texts. The extracted information is annotated in RDF and inserted into an online RDF database. This RDF graph includes the medical entities and their relations. In this step, SPARQL Query Construction, questions are translated into a SPARQL query that is then posed to the RDF database. MEANS has some similarities with MedicineAsk. MedicineAsk answers questions about certain medicines, active substances and what they treat. It can answer questions like What medicine is good for pneumonia? and What active substances may cause headaches?. MEANS can answer this kind of questions as well but it also answers questions outside the scope of MedicineAsk. It can answer questions such as Who should get hepatitis A vaccination? and What is the best way to distinguish type 1 and 2 diabetes?. Furthermore, for MEANS, a set of documents were annotated in RDF format to build a graph which is then traversed to answer queries using SPARQL. MedicineAsk extracts information from the Infarmed website and stores it in a relational database which is then queried using SQL. Also, MEANS was made for English while MedicineAsk answers questions in Portuguese. Finally, the combined use of rule-based methods and machine learning methods to answer medical questions has some similarities to our goal of using both rule-based and machine learning techniques on MedicineAsk 3 http://www.i2b2.org/NLP/Relations/. 4 http://www.nlm.nih.gov/pubs/factsheets/medline.html. 9 for the same purpose. 2.2 General Domain Question Answering systems NaLIR (Natural Language Interface for Relational databases) [Li and Jagadish, 2014] is an interactive natural language query interface for relational databases. To better understand a user's natural language, NaLIR interacts with the user to solve ambiguities. NaLIR explains to the user how it interprets a query so that the user better understands any errors made by NaLIR. For any ambiguity, the system chooses a default solution, but provides the user with a multiple choice with other possible solutions. The user can then resolve this ambiguity, if NaLIR's default option was not the correct one. 
When NaLIR explains to the user how it understood the user's query, it must use a representation that both the user and NaLIR can easily understand. For that purpose, the developers proposed a data structure called a Query Tree. A query tree is between a linguistic parse tree and an SQL query. The developers claim users can understand query trees better than SQL queries. They also claim that NaLIR can almost always translate a user-veried query tree into an SQL statement. NaLIR has three components. The rst component transforms a user's natural language question into a query tree. The second component interacts with the user in order to verify and make any necessary corrections to the query tree. The third component transforms the query tree into an SQL query. NaLIR was implemented as a stand-alone interface that can work on top of any relational database. The goal of NaLIR was to enable common users to obtain correct information from a database, using complex queries without the need of knowing SQL or the schema of the database. A validation with real users was performed on the system where both the quality of the results and the usability of the system were measured. The data set used was the data set of Microsoft Academic Search (MAS). The MAS website has an interface to this data set. NaLIR was tested against this MAS website. Because the interaction with the user was the central innovation in NaLIR, the developers also experimented with a version of the system which did not interact with the user. Of the 98 queries per system, users were able to obtain correct answers for 66 queries using NaLIR without user interaction, 88 queries for NaLIR with user interaction and 56 queries using the MAS website. They reported that users found the MAS website useful for browsing data but hard to use in complex query scenarios. Users found NaLIR to be easy to use, giving high levels of satisfaction, but the system showed to be weaker when handling complex scenarios. Both NaLIR and MedicineAsk are NLIDBs (natural language interfaces to databases). NaLIR was designed to work for multiple domains, while MedicineAsk is solely for information about medicines and active substances. We nd the user interaction to be noteworthy, and possibly useful for the future of MedicineAsk. 10 2.3 Medical Text Information Extraction cTAKES [Savova et al., 2010] is a tool that extracts information from free text in electronic medical records using Natural Language Processing techniques. It is an extensible and modular system. It uses the Unstructured Information Management Architecture (UIMA)5 and the OpenNLP6 natural language processing toolkit to annotate the extracted information. cTAKES has recently become a top level Apache project7 . cTAKES encloses several different rule-based and machine learning methods. In cTAKES, several components are executed in a pipeline, some of them being optional. Its components include sentence boundary detection, rule-based tokenization, morphologic normalization, POS tagging, shallow parsing and Named Entity Recognition. For example, to extract information from a document, the text is rst split into sentences and then a tokenizer is used to split the sentences into tokens. These tokens can then be analysed by other components. More modules have been added over time such as a Semantic Role Labeller, Coreference resolver, Relation extractor and others. cTAKES is not a question answering system, but it is related to MedicineAsk. 
cTAKES extracts structured information from unstructured data, sharing the goal of MedicineAsk Information Extraction module. Both MEANS, cTAKES and MedicineAsk use a combination of rule-based and machine learning methods for their respective purposes. 2.4 Web-based Systems When practising medicine, doctors and medical staff need to quickly access medical information. Several web-based systems have been developed to ease the user's search of medical information so that answers can be found quickly. Some of these systems also have a mobile version, combining the vast amount of information found on the web with the portability of paper charts, allowing the discovery of answers without leaving the patient's side. The systems Epocrates Online8 , eMedicine9 and Drugs.com10 , further detailed in [Mendes, 2011], are similar medicine and disease reference systems. They offer a web-based user interface through which users can search a vast amount of medical information by disease and by medicine (by name or class). Searching medicines by class means the user navigates through several medicine classes and subclasses until he/she nds the one he/she wants. This functionality is useful in cases where the user does not know the name of a particular medicine. When searching medicines by name, the user inserts the name of a medicine and then receives all the relevant available information about that medicine. The three systems also have a medicine interaction check. Using this feature, a user can insert two or more medicines and nd interactions between them. There are many possible interactions between 5 http://uima.apache.org/. 6 http://opennlp.apache.org/index.html. 7 http://ctakes.apache.org/. 8 https://online.epocrates.com/home. 9 http://emedicine.medscape.com/. 10 http://www.drugs.com/. 11 medicines, and these interactions can be very dangerous, making this feature very valuable. Personal Health Records (PHRs) are systems that enable a patient to create a prole where he/she can directly add information about himself (such as allergies, chronic diseases, family history, etc.), results from laboratory analysis or even readings from devices that monitor the patient's health like pedometers, heart rate monitors and even smart phones. Examples of Personal Health Records include Google Health (now retired), Microsoft HealthVault11 , Dossia12 and Indivo13 . In the previously mentioned web-based systems, medical staff add large amounts of data about general medicine that users can later access. In PHRs, the users are the ones who insert data about themselves. PHRs also include a vast amount of health information that users can access and they can also have extra services that can be used with the patient's information. For example, a PHR can indicate all the patient's interactions between the medicines he/she is currently using. It is also possible to manage doctor appointments and reminders, or to track the progress of a patient's diet. Having all of a patient's information in a centralized location makes it easier to consult this information, compared to having it spread through several paper les. Patients can then allow doctors to access this information so they can make better and more informed decisions. 2.5 Existing Algorithms for Anaphora Resolution An anaphora occurs when the interpretation of a given expression depends on another expression in the same context. For example in John nished the race. 
He came in second place., He is called the anaphor, and the expression it refers to, John, is called the antecedent. Automatically understanding that John came in second place from analysing this text requires a relationship between two words in different sentences to be established. Also, if the second sentence is analysed but the rst one is not then the second sentence will not be understood. The rst sentence is required to understand who he is. An ellipsis is a specic case of anaphora (also called zero anaphora) where there is no anaphor and the reference is implicit. For example in They went on a trip and enjoyed themselves, they is omitted between and and enjoyed. Using anaphora is useful to avoid repeating a term very often or to avoid writing a long and/or complex name many times, referring to it through simpler terms. There are many different kinds of anaphora that we will not detail in this work such as pronominal anaphora, noun anaphora, verb anaphora and adverb anaphora. As an example of pronomial anaphora we have Mary gave Joe a chocolate. He found it very tasty. There are also other types, such as denite description anaphora, where the anaphora contains additional information (e.g. Michael won the championship. It was Smith's fourth victory, the anaphor Smith tells us Michael's last name.). An indirect anaphora requires background knowledge to resolve (e.g. John knocked the chess board sending the pieces ying away, the anaphor is pieces and the antecedent is chess board, it requires 11 https://www.healthvault.com/. 12 http://www.dossia.com/. 13 http://indivohealth.org/. 12 the information that chess is played with pieces). Anaphora resolution consists of mapping the anaphor to the corresponding correct antecedent. This task is not trivial and even humans can have trouble resolving anaphora. For example in Michael likes being tall but his girlfriend doesn't. it is unclear whether Michael's girlfriend dislikes being tall herself or dislikes that her boyfriend is tall. For machines, due to the ambiguity of natural language, it is even more difcult to nd the correct mapping between an anaphor and its antecedent. Anaphora resolution is a topic that has been studied for many years. Hobbs' algorithm [Hobbs, 1978] is one of the most known algorithms, and it intends to resolve pronoun anaphora. Hobbs' algorithm traverses the syntactic tree of the sentence the anaphora is located on, in a particular order, searching for possible antecedents using a breadth-rst strategy. Antecedents that do not match the gender and number of the anaphor are not considered. Mitkov's algorithm [Mitkov, 2002] is an algorithm to solve pronominal anaphora. After nding an anaphora, but before resolving it, the algorithm analyses the current sentence and the two sentences before the current one, to extract any noun phrases preceding the anaphora. A list of antecedent candidates is built in this step. Afterwards, a lter is applied to this list of candidates, removing any antecedents that do not match the gender and number of the anaphora, similarly to Hobbs' algorithm. Finally, a list of antecedent indicators are applied to the candidates. These antecedent indicators consist of several different kinds of heuristics that raise or lower the score of each individual antecedent candidate. For example, the rst antecedent noun phrase (NP) in a sentence receives a score of +2, while any indenite NP (an NP that is not specic, such as the article an in English) receives a score of -1. 
Candidates with a higher score are more likely to be the correct antecedent for the anaphora. In the end, the candidate with the highest score is picked as the correct answer. An adaptation of Mitkov's algorithm for Brazilian Portuguese was proposed in [Chaves and Rino, 2008]. It works similarly to Mitkov's algorithm, changing only the antecedent indicators used to score the antecedent candidates. This version uses ve antecedent indicators from Mitkov's algorithm and adds three new ones. These new antecedent indicators were added so that the algorithm would be better suited for the Portuguese language. The new antecedent indicators are: Syntactic Parallelism, A score of +1 is given to the NP that has the same syntactic function as the corresponding anaphor. Nearest NP - A score of +1 is given to the nearest NP in relation to the anaphor. The authors state the nearest candidate is often the correct antecedent. Proper Noun - A score of +1 is given to proper nouns. The authors observed that proper noun candidates tended to be chosen as the correct antecedent. STRING [Mamede et al., 2012] is a Natural Language Processing chain for the Portuguese Language. It can process input text through basic NLP tasks and it includes modules that enable the resolution of more complex tasks. Among the modules of STRING is an Anaphora Resolution module (ARM 2.0) [Marques, 2013]. The Anaphora Resolution Module uses a hybrid method. It uses a rule-based approach to nd each anaphor and its antecedent candidates. Then it selects the most likely antecedent 13 candidate through a machine learning method that receives as input the anaphors and their respective list of candidates. The machine learning algorithm used for this task was the Expectation-Maximization (EM) algorithm [Dempster et al., 1977]. It requires a corpus annotated with anaphoric relations. First the anaphors have to be identied. To identify anaphors in a question, ARM 2.0 uses syntactic information in the question obtained from STRING. It then observes the POS tags of each token and where they are located in the sentence, following a set of rules to discover each anaphor (e.g. articles not incorporated in NPs or PPs are classied as anaphors). Second, the previous questions are analysed to build an antecedent candidate list. A similar method is applied to construct the candidate list (e.g. nouns that are heads of NPs or PPs can be candidates for antecedents). ARM 2.0 only looks for antecedent candidates in the expression with the anaphor and in a two sentence window before the anaphor. Third, the antecedent candidate list is ordered by rank by the machine learning algorithm, from most likely to least likely. The most likely candidate is chosen as the antecedent of the identied anaphor. Each candidate in the list is an anaphor-antecedent pair. The EM algorithm uses several features to dene each anaphor-antecedent pair, namely is antecedent which determines if the antecedent is the correct one for this anaphor. Other features include the gender and the number of both the anaphor and the antecedent, the distance in sentences between the anaphor and the candidate, the POS of the anaphor and the antecedent, and others. The EM algorithm provides the likelihood of an anaphor-candidate pair being part of a given cluster. The system runs the EM algorithm against two clusters. One cluster represents the candidate being the antecedent for the anaphor, and the other stating the opposite. 
Using this method, it is possible to obtain the likelihood of each antecedent being the correct antecedent for the anaphor, and thus build a ranking to determine the most likely answer. Comparing the results of the systems detailed above is not possible because each system uses a different test corpus. Each system also focuses on different types of anaphora and some of them aim to be automatic while others rely on man-made data. As an example, Hobbs achieves a success rate of 91.7% while ARM 2.0 achieves a recall of 58.98%, a precision of 50.30% and an f-measure of 54.30% [Marques, 2013]. 14 Chapter 3 Background This section describes the systems used for the execution of this thesis. Section 3.1 describes the previous version of MedicineAsk. Section 3.2 describes the LUP system which is a platform that can be used to run different Natural Language Understanding techniques (including SVMs) and compare the results of those techniques. 3.1 The MedicineAsk Prototype MedicineAsk is a prototype capable of answering Natural Language medical questions about medicines and active substances. MedicineAsk is divided into two main modules: information extraction and natural language interface. The information extraction module extracts data from the Infarmed website. This extracted data is processed and stored in a relational database. Then, a web-interface enables a user to pose a natural language question. The natural language interface module analyses the question, translates it into an SQL query and gets an answer from the relational database. The answer is then delivered to the user via the same web-interface. This section details each of MedicineAsk modules. 3.1.1 Information Extraction The Information Extraction module aims at extracting data from the Infarmed website, processing this data and storing it in a database so that the Natural Language Interface module can access it to answer questions posed by the users. Figure 3.1 shows the architecture of this module. The Information Extraction module is divided into four components: web data extraction, processing of entity references, annotation, and the database. The web data extraction component navigates through the Infarmed website and extracts data. While some types of data can be directly added to the database (because they are already structured), other kinds of data require pre-processing. The processing of entity references component and the annotation component handle this kind of data so that it can then be inserted in the database. 15 Figure 3.1: Architecture of MedicineAsk Information Extraction module. Image taken from [Mendes, 2011]. Web data extraction The data available in the Infarmed website is structured like the index of a book. The data is organized into several chapters and sub-chapters where each one corresponds to a type or sub-type of active substance. Inside these chapters and sub-chapters is the information about those specic substances. The extracted information consists of the Infarmed hierarchy structure, i.e. the chapter data, the substance data and the medication data for each substance. The MedicineAsk web data extraction component recursively traverses all chapters, sub-chapters and active substances published on the Infarmed website. It uses XPath1 and XQuery2 expressions to lter, extract and store the data. In order to keep the Infarmed website hierarchical structure, the extracted chapters are represented as folders. 
For instance, chapter 1.1.1 and chapter 1.1.2 are folders inside another folder called chapter 1.1. Chapters and sub-chapters can contain text describing their contents. This text includes non-structured information about the chapter's substances (historical information, molecular information, etc.) which is stored inside the chapter's folder in an XML le with the name chapter name info.XML. Another kind of non-structured information concerns indications, precautions, etc., shared by all the substances inside that chapter. This information is stored in another XML le entitled chapter name indicacoes.XML. Active substances have two different kinds of data associated. The non-structured data consists of data about indications, precautions, etc. and the structured data contains information about medicines that contain this substance. Both are kept in XML les, the non-structured data in a le named active substance name Substancia.XML and the structured data in a le which is called active substance name Medicamento.XML. The web data extraction component creates an auxiliary dictionary le that is used by other components of the Information Extraction module. This dictionary contains all the chapter and active substance 1 http://www.w3.org/TR/xpath20/. 2 http://www.w3.org/TR/xquery/. 16 names. This module inserts chapter and substance names into data structures that map them to the location of the corresponding chapter and substance les stored on disk. The auxiliary dictionary le created by the web data extraction component is also further enriched by extracting medical conditions from the Médicos de Portugal website 3 . In the context of the previous version of MedicineAsk, this website contained approximately 12000 names of medical terms. These medical terms were extracted through XPath and XQuery expressions. Processing of entity references In the Infarmed website, some active substances make references to other active substances or chapters. For example, a sub-chapter may contain common data about the active substances of that subchapter. If every active substance mentioned in that sub-chapter has, for example, the same adverse reactions then these adverse reactions are detailed only once on the sub-chapter data, instead of being detailed for each individual active substance. In these cases, the active substance makes a reference to where the actual information is. In the previous example each active substance would have, on the adverse reactions section, a reference to their sub-chapter, so that a user knows what sub-chapter to access if he wants to know the reactions of that active substance. This kind of situation is called an entity reference. Entity references always come in the format V. sub-chapter name. In the Infarmed website a user can use the site's index to navigate to the section that contains the required information. As an example, if a user needs to access the information of an active substance and nds the entity reference V. (1.1.7) on the adverse reactions section, then the user can use the index to navigate to chapter 1.1.7 and read the missing information. However, MedicineAsk is a question answering system and the user cannot freely navigate to a given chapter. For that reason these entity references must be solved. The solution is to replace a reference with the missing information. For example, if we nd the reference V. 
(1.1.7) on the adverse reactions section of an active substance, MedicineAsk must go to chapter 1.1.7, copy the text on the adverse reactions section of that chapter and place it on the original active substance, replacing the entity reference that was there. MedicineAsk performs the reference replacement by scanning all the extracted les and analysing the text of each active substance using regular expressions with the goal of nding these entity references. When an entity reference is found, the name of the chapter or substance that has to be visited in order to nd the necessary information has to be extracted from the entity reference. To do this, the dictionary with all the chapter and substance names created in the web data extraction component is used with a dictionary based annotator available in e-txt2db [Simoes, 2009]. When the chapter/substance name is found, the next step is to nd the le of that chapter/substance on disk, in order to extract the required information. To do this, the map that was made in the web data extraction component is used. This map contains all of the chapter/substance names and their respective locations on disk. Those les are then accessed and the referenced information is extracted and replaced in the original active substance that contained the entity references. 3 http://medicosdeportugal.saude.sapo.pt/glossario. 17 Annotation After the Web-data extraction and Processing of entity references steps, MedicineAsk can answer questions such as What are the indications of paracetamol? but it cannot answer questions like What medicines treat fever?. This is because the elds that detail an active substance (e.g., indications, adverse reactions, etc.) often contain free text rather than a list of words (i.e., The texts are Indications: This medicine should be taken in cases of fever and also in case of headaches instead of Indications: fever, headache). If this type of text was annotated then questions such as What medicines treat fever? could be answered. To annotate the text, a dictionary based method, a part-of-speech tagger and a regular expression method were used. The dictionary-based method uses the dictionary with substance and chapter names produced by the web data extraction component. To identify the part-of-speech in a sentence, the part-of-speech tagger, TreeTagger4 was used. To nd medical terms the developers found and used patterns of part-of-speech classication. Regular expressions were used for dosage extraction because Infarmed represents this information in a consistent way, with child and adult dosages always separated by identical tags across every active substance. Database The extracted data is then inserted into a relational database. The database stores the following four main types of data: (i)the Infarmed website's hierarchy structure, (ii)the chapter data for each chapter of the Infarmed website, (iii) the data for each active substance, and (iv) the medication data for each active substance. The insertion of the data was performed in two steps. First, the Chapter, ActiveSubstance, Medicine and MarketingForms tables were populated by a Java application traversing the folder hierarchy with all the chapters, inserting any chapters and active substances found, along with the information on the corresponding XML les in the same folder. 
Second, the remaining tables were populated with the annotated information produced by the annotation component, which had been stored in files (for example, a table for indications stores the annotated indications text, another table stores adverse reactions, etc.). To better answer questions by finding synonyms of the words the user writes, a synonym table was also created. This table was populated with only a few hand-made synonyms, because no source of medical term synonyms in Portuguese had been found at the time.

3.1.2 Natural Language Interface

When a user poses a question in natural language, it is necessary to analyse and process that question in order to answer it. For that purpose, MedicineAsk includes a Natural Language Interface (NLI) module. Figure 3.2 shows the architecture of this module.

Figure 3.2: Architecture of the MedicineAsk Natural Language Interface module. Image taken from [Mendes, 2011].

The NLI module consists of three components. The first component, question type identification, identifies what the question is about (e.g., whether it is a question about adverse reactions or about the correct dosage of a medicine). The second component, question decomposition, determines which are the important entities that the question targets (e.g., which active substance we want to know the adverse reactions of). The third component, question translation, translates the natural language question into an SQL query to be posed to the database that stores the Infarmed data. These three components are further described in this section.

Question type identification

The output of this component is a question type expression represented as predicate(parameters). For example, the questions "What are the indications of paracetamol?" and "What are paracetamol's indications?" would both be mapped to the question type expression Get Indications(ActiveSubstance). MedicineAsk identifies question types through regular expressions and keyword spotting techniques. These techniques correspond to a strict and a free execution mode of this component, respectively. The strict mode is used first. It uses regular expressions to match the question against one of several regular expression patterns. If there is a match, the question is assigned the question type associated with that pattern. This mode is reliable when the user writes a question according to a pattern, but that does not always happen. The free mode is activated if the strict mode fails. It uses keyword spotting [Jacquemin, 2001] to determine the question's type. The question is annotated by matching its words against the dictionaries built during the Information Extraction phase. For example, the word Indications is annotated with an indicationsTag tag. Other words, such as medical entities, are also annotated. Then, by looking at the tags, the question type can be inferred: a question marked with an indicationsTag is likely asking for the indications of a given active substance and is thus represented by the question type expression Get Indications(ActiveSubstance). If the free mode is used, the user is notified of what the question was mapped into, so that he/she can be sure that the question was not misunderstood.
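The strict mode can be pictured as a small table of compiled patterns tried in order, as in the Java sketch below. The two patterns, the question type names and the QuestionType record are assumptions made for the example; they are not the actual rules used by MedicineAsk.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StrictModeClassifier {

    /** A question type plus the entity captured from the matched pattern. */
    public record QuestionType(String predicate, String entity) { }

    // Hypothetical patterns; each capture group holds the medical entity.
    private final Map<Pattern, String> patterns = new LinkedHashMap<>();

    public StrictModeClassifier() {
        patterns.put(Pattern.compile("(?i)quais (?:as|são as) indicações d[oa] (.+?)\\??$"),
                     "Get Indications");
        patterns.put(Pattern.compile("(?i)quais (?:as|são as) reacções adversas d[oa] (.+?)\\??$"),
                     "Get AdverseReactions");
    }

    /** Returns the matched question type, or empty if the free mode should be tried. */
    public Optional<QuestionType> classify(String question) {
        for (Map.Entry<Pattern, String> entry : patterns.entrySet()) {
            Matcher m = entry.getKey().matcher(question.trim());
            if (m.matches()) {
                return Optional.of(new QuestionType(entry.getValue(), m.group(1)));
            }
        }
        return Optional.empty();
    }
}

For instance, classify("Quais as indicações do paracetamol?") would, under these assumed patterns, yield the predicate Get Indications with the entity paracetamol, while an unmatched question would fall through to the free mode.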
Question decomposition

The objective of this component is to find the focus of the question. In the question type identification component, the question "What are the indications of paracetamol?" led to the question type expression Get Indications(ActiveSubstance). Now, the goal is to find the medical entity paracetamol in the question in order to replace ActiveSubstance with paracetamol, so that Get Indications(paracetamol) is the resulting question type expression. This component also has a strict and a free mode. The strict mode is only used if the strict mode was chosen in the question type identification phase. In that case, the question was matched against a pattern, so we know the location of the medical entities in the question: if the question matched the pattern "What are the indications of SUBSTANCE?", then the medical entity is located where SUBSTANCE is in the pattern. The free mode is used if the question type identification component also used the free mode. In this case, the location of the medical entities is unknown. However, the question type identification component annotated the user question, and the question decomposition component can analyse those annotations to know exactly which medical entities are present in the user question. In the question "What are the indications of paracetamol?", paracetamol is annotated as a medical entity. Using this information the question type expression can be built.

Question translation

The last component poses a query to the database in order to answer the question. Following the previous example, the Get Indications(paracetamol) question type expression produced by the question decomposition component is mapped into the corresponding SQL query. Each question type expression is mapped to a different SQL query and the parameters (e.g., paracetamol) are added to the query's WHERE clause. The results of the query are converted into HTML and sent to the user interface.

User Interface

The user interface is implemented in JSP (http://java.sun.com/products/jsp/) and published in a Tomcat (http://tomcat.apache.org/) server. The interface contains a text box where the user can pose the question. The answers are shown below the text box, as text and/or tables depending on the type of question. Several help mechanisms are implemented in MedicineAsk. If the user mistypes a medicine or an active substance (not uncommon given some of the complicated words in the medical field), the Soundex algorithm (http://en.wikipedia.org/wiki/Soundex) is used. This algorithm uses phonetic comparison to search for words in the database that sound similar to the word the user wrote. A user may also remember only part of a medical term, such as the first word of a medicine's name. The SQL LIKE condition is used in these cases to find terms in the database that contain the word the user typed, with added prefixes or suffixes. Finally, an auto-complete script in jQuery (http://jquery.com/) is available. It shows suggestions of possible words as the user types the question, similarly to web search systems like Google. The Soundex and LIKE help mechanisms can only be used when answering questions with the rule-based technique. This is because the rule-based technique knows where the medical entities are: questions are matched against a pattern, and the portion of the question that does not fit the pattern is the entity. The keyword spotting technique does not know the location of a misspelled entity, so it cannot use these help mechanisms.
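A minimal JDBC sketch of these two fallbacks is shown below. The table and column names (active_substance, name) and the use of a SOUNDEX SQL function (available, for example, in MySQL) are assumptions made for the example; the actual MedicineAsk schema and SQL dialect may differ.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class EntityLookup {

    private final Connection connection;

    public EntityLookup(Connection connection) {
        this.connection = connection;
    }

    /** Phonetic fallback: names that sound like the (possibly misspelled) user input. */
    public List<String> soundsLike(String userInput) throws SQLException {
        String sql = "SELECT name FROM active_substance WHERE SOUNDEX(name) = SOUNDEX(?)";
        return query(sql, userInput);
    }

    /** Partial-name fallback: names containing the word the user typed. */
    public List<String> contains(String userInput) throws SQLException {
        String sql = "SELECT name FROM active_substance WHERE name LIKE ?";
        return query(sql, "%" + userInput + "%");
    }

    private List<String> query(String sql, String parameter) throws SQLException {
        List<String> names = new ArrayList<>();
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setString(1, parameter);
            try (ResultSet results = statement.executeQuery()) {
                while (results.next()) {
                    names.add(results.getString("name"));
                }
            }
        }
        return names;
    }
}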
3.1.3 Validation of the MedicineAsk prototype

MedicineAsk was evaluated with real users in order to validate the whole prototype: verifying its usability, whether the answers retrieved were correct, and whether MedicineAsk was preferred over the Infarmed website. The tests were divided into a developer evaluation and a user evaluation. The goal of the developer evaluation was to have someone experienced with both the Infarmed website and MedicineAsk test both systems, trying to answer questions as fast as each system allows. To achieve this, the developers of MedicineAsk, who were familiar with both systems at the time of the evaluation, answered a set of questions using both the Infarmed website and MedicineAsk. They concluded that both systems could answer any of the test questions, but that it was faster to get answers through MedicineAsk. The user evaluation used real, potential end-users of these applications, including both common users and medical staff. This evaluation collected quantitative and qualitative measures. Quantitative measures included the number of clicks and the time required to answer a question, while qualitative measures were user satisfaction and ease of use of both systems. A five-point scale was used for the qualitative measures. The tests showed that, both quantitatively and qualitatively, MedicineAsk outperformed the Infarmed website.

3.2 LUP: A Language Understanding Platform

LUP [dos Reis Mota, 2012] is a platform that enables developers to run different Natural Language Understanding (NLU) techniques and compare their results. NLU is a subtopic of NLP. It focuses on how a machine understands a question, with less concern for what to do with that question: in this case, NLU involves understanding a user question, but not answering it. LUP enables a user to find out the best technique and the best parameters for a given NLU problem. We propose to use it to test different NLU techniques that will then be integrated in the NLP module of MedicineAsk. The NLU techniques that LUP supports are:

Support Vector Machine (SVM) - It uses the LIBSVM library [Chih-Chung Chang, 2001] (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and a one-versus-all strategy, meaning that instead of classifying data into multiple classes at once, it classifies each instance as belonging to class X versus not belonging to class X, for each individual class.

Cross Language Model (CLM) [Leuski and Traum, 2010] - CLM is a statistical language classifier that maps user questions into answers from a virtual agent. The authors of that technique developed an implementation of CLM called the NPCEditor toolkit, which LUP invokes in order to use CLM.

A classification algorithm based on string similarity - This technique was developed by the authors of LUP [dos Reis Mota, 2012]. In the training corpus, each training example is associated with a semantic category. The algorithm uses string similarity to compare a user's question to each training example, thus finding the semantic category with the highest similarity score to the user's question. LUP supports three string similarity measures: Jaccard, Overlap and Dice. It is also possible to use two weighted combinations of Jaccard with Overlap. A minimal sketch of this kind of classifier is given below.
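The sketch below illustrates the string-similarity idea with the Jaccard measure over word sets: the training example most similar to the user's question lends it its category. It is an illustration only; LUP's actual implementation, tokenization and weighting schemes are not reproduced here.

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class JaccardClassifier {

    /** A training question labelled with its semantic category (question type). */
    public record Example(String question, String category) { }

    private final List<Example> trainingExamples;

    public JaccardClassifier(List<Example> trainingExamples) {
        this.trainingExamples = trainingExamples;
    }

    /** Returns the category of the training example most similar to the user question. */
    public String classify(String userQuestion) {
        Set<String> questionTokens = tokens(userQuestion);
        Example best = null;
        double bestScore = -1.0;
        for (Example example : trainingExamples) {
            double score = jaccard(questionTokens, tokens(example.question()));
            if (score > bestScore) {
                bestScore = score;
                best = example;
            }
        }
        return best == null ? null : best.category();
    }

    /** Jaccard similarity: |A ∩ B| / |A ∪ B|. */
    private static double jaccard(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    private static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(
                text.toLowerCase(Locale.ROOT).replaceAll("[?.,!]", " ").trim().split("\\s+")));
    }
}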
Mapping the user's input into a representation that a machine can understand can be performed in several ways. The semantic representations that LUP supports are categories, frames and logical forms. A category is associated with each question type: for example, category 1 could be associated with greetings, while category 2 could be associated with questions regarding the weather, and so on. A frame is a set of attributes that are filled in with different values. For a given question, the task consists of finding the correct frame and the correct values to insert into the frame's attributes. An example is a frame for controlling vehicles, where an attribute Action can have accelerate or use brakes as possible values and another attribute Vehicle can have values such as cars, trains or boats. Finally, logical forms represent questions through simple formulas with N-ary predicates and N arguments. For example, for the question "Who made Mona Lisa?" the logical form is WhoMade(Mona Lisa).

Figure 3.3: LUP architecture.

The LUP architecture is composed of several modules, as represented in Figure 3.3. The Front End module acts as an interface through which the user supplies information: the training corpora and some configurable parameters, namely which NLU techniques are going to be used. The Corpora Parser module is in charge of preprocessing. Preprocessing techniques include stop word removal, normalization, POS tagging and named entity recognition. Furthermore, a configurable number of partitions are randomly applied to the corpora in order to generate training and test sets for later use. The preprocessing techniques to apply and the number of test partitions are parameters obtained from the Front End module. The Tester module validates the NLU techniques through cross-validation, using the training and test partitions created by the Corpora Parser module. The training partitions are sent to the Classifier Trainer module, which returns the classifiers. The Tester module creates log files with information about which NLU techniques failed and why. This information shows each test example that was wrongly classified, along with the name of the wrong category, the correct category, and examples of questions from the training partition that belong to the wrongly predicted category and to the correct category. The Deployer module is used to create a final instance of LUP for a given NLU problem. After using LUP to run and test several techniques simultaneously, developers can choose the single best configuration for their NLU problem. The technique that achieved the best results can then be deployed in an instance of LUP. This instance uses that technique as a classifier trained on all the previously supplied corpora and evaluates previously unseen examples supplied to it. Because the NLU techniques used can be computationally expensive, the data generated by these techniques can be stored; the authors exemplify by stating that we can store the models generated by SVM and load them later, instead of re-running the necessary algorithms every time.

Chapter 4

Improvements to the Natural Language Interface module

The goal of this thesis is to improve the MedicineAsk NLI module by increasing the number and variety of user questions that the system is able to answer. User questions can be too complex (e.g. contain irrelevant information, such as the name of the patient requiring a specific medicine) or too abstract (e.g. "What's the price?") to be answered by previous versions of MedicineAsk.
While it is impossible to cover every possible question a user could pose, in this thesis we seek to cover enough questions that the users of MedicineAsk can obtain the information they need by posing a natural language question in Portuguese. This chapter is structured as follows: Section 4.1 details a new technique for answering questions in MedicineAsk, based on machine learning methods. Section 4.2 presents our approach to handle questions containing anaphora. Finally, Section 4.3 explains how we expanded the synonym detection module of MedicineAsk.

4.1 Automatic question classification

The NLI module of the previous version of the MedicineAsk system uses rule-based and keyword spotting techniques [Galhardas et al., September 2012]. Rule-based methods require a user's question to exactly match a certain pattern; for this reason, they usually have low flexibility. Keyword spotting techniques require certain keywords to be present in the question, as well as dictionaries of those keywords to be built. In this thesis, we integrated a machine learning approach for the question type classification and question decomposition steps using Support Vector Machines (SVMs) [Zhang, 2003]. The question type classification step consists of identifying the question type of a question (e.g. whether it is a question regarding indications or adverse reactions). The question decomposition step consists of identifying the entities in a question (e.g. whether the question is about the indications of paracetamol or the indications of aspirin). We compared several features on SVM and report the obtained results, in order to determine whether machine learning brings any improvements over the previous MedicineAsk NLI. In this section we explain the concept of SVM and how it was implemented.

4.1.1 Answering MedicineAsk questions with SVM

SVM is a supervised learning technique that classifies data into one of several different classes. The training data is composed of data instances previously assigned to each class. The training data is analysed by the SVM, which constructs a model that is then used to assign each new data instance to one of the classes. The model is represented as a space and each data instance is a point in this space. The SVM creates a hyperplane that divides the space into two sides: all points on one side of the hyperplane belong to one class, while the points on the other side belong to the other class. New data instances are represented as new points in the space and their class is determined by which side of the hyperplane they fall on. For MedicineAsk we want questions to be classified into one of several different classes, where each class is a specific question type (e.g. a class for questions about indications, another for questions about adverse reactions, etc.). An SVM that directly handles more than two classes is very complex, so it is best to use a one-versus-all strategy. This strategy divides a multi-class classification problem into smaller binary classifications, one per class: each smaller classification decides between belonging to class X and not belonging to class X, for every class we want to evaluate. For example, for the question "What are the adverse reactions of paracetamol?" the SVM first classifies it as either belonging or not belonging to the class "question about indications". Then it classifies it as either belonging or not belonging to the class "question about adverse reactions", and so on. After the question has been tested against every binary classification problem, the class with the highest likelihood score is chosen as the class of the question.
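The following sketch shows the one-versus-all decision just described: one binary scorer per question type, with the highest-scoring class winning. The BinaryScorer interface and the unigram feature extraction are simplifications assumed for the example; they stand in for the binary SVM classifiers that LUP trains with LIBSVM and are not LUP's actual API.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class OneVersusAllClassifier {

    /** Stand-in for a trained binary SVM: returns a confidence that the features belong to its class. */
    public interface BinaryScorer {
        double score(Set<String> features);
    }

    // One binary scorer per question type, e.g. "QT INDICACOES" -> its trained classifier.
    private final Map<String, BinaryScorer> scorersByQuestionType;

    public OneVersusAllClassifier(Map<String, BinaryScorer> scorersByQuestionType) {
        this.scorersByQuestionType = scorersByQuestionType;
    }

    /** Runs every binary classifier and keeps the question type with the highest score. */
    public String classify(String question) {
        Set<String> unigrams = unigramFeatures(question);
        String bestType = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, BinaryScorer> entry : scorersByQuestionType.entrySet()) {
            double score = entry.getValue().score(unigrams);
            if (score > bestScore) {
                bestScore = score;
                bestType = entry.getKey();
            }
        }
        return bestType;
    }

    /** Unigram features: the set of lower-cased words in the question. */
    private static Set<String> unigramFeatures(String question) {
        return new HashSet<>(Arrays.asList(
                question.toLowerCase(Locale.ROOT).replaceAll("[?.,!]", " ").trim().split("\\s+")));
    }
}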
We used several different features with the SVM. A feature is a type of variable used to determine how the hyperplane is defined. We used unigrams, bigrams and trigrams, which give relevance to individual words, pairs of consecutive words, or triples of consecutive words, respectively. We also used binary unigrams and binary bigrams, which work similarly to regular n-grams but only take into account whether the n-gram is present in the training corpus. We also tested length and word shape. The length feature takes into account the length of the text being analysed. Word shape takes into account the type of characters in a word, such as numbers, capital letters and symbols like hyphens; this feature does not work well by itself but can improve the results when paired with other features [Loni et al., 2011].

To understand how SVM handles questions, we can compare the three techniques available in MedicineAsk. The rule-based technique answers questions by matching a question against a pattern; each pattern is associated with a question type, and each word that is not part of the pattern is part of the named entities. The keyword spotting technique analyses the words in the question and, depending on the type of words present, determines the question type (e.g. if the word indications is present, then it is a question about indications); it uses the same method to discover any named entities in the question. The SVM has each question type mapped to a class and attempts to classify each question into one of these classes, thus determining the question type; the named entities are discovered through a dictionary-based annotator. An excerpt of the dictionary used to identify named entities is shown in Appendix B.

4.1.2 Adding SVM to MedicineAsk

To answer a question using SVM, MedicineAsk must use LUP. Using the model corpus and the dictionary of named medical entities, LUP applies SVM to determine the question type and the named entities present in the question. This information is then sent back to MedicineAsk, which builds an SQL query based on the question type and the named entities of the user question. Some question types include more than one entity, such as "Quais são os medicamentos semelhantes a Mizollen que não provocam sonolência?" ("What are the medicines similar to Mizollen that do not cause sleepiness?"), which contains the entities Mizollen and sonolência. This kind of question needs additional processing in order to know which entity is which. If this processing were not performed, the SQL query would be malformed and it would not be possible to answer the question: in this example, the SQL query might end up asking "What are the medicines similar to sleepiness that do not cause Mizollen?", which leads to no results. When analysing one of these questions, the SVM detects these entities and tags them with the corresponding class: in this case, Mizollen is tagged as a medicine and sonolência is tagged as a medical condition. These tags are then used to build the SQL query correctly. For the example question, since we know Mizollen is a medicine and sonolência is a medical condition, we can be sure that the question posed to the database will be "What are the medicines similar to Mizollen that do not cause sleepiness?"
and not "What are the medicines similar to sleepiness that do not cause Mizollen?".

We adopted three different strategies for combining question classification techniques in MedicineAsk, in order to determine which is the most effective. These strategies apply the NLP techniques available in MedicineAsk sequentially: rule-based, keyword spotting and SVM. We considered the following strategies for integrating SVM into MedicineAsk:

Strategy 1: Rule-based, falling back to keyword spotting if no match is found.

Strategy 2: Rule-based, falling back to SVM classification if no match is found.

Strategy 3: Rule-based, falling back to keyword spotting if no match is found, and then falling back to SVM classification if keyword spotting fails as well.

Strategy 1 (see Figure 4.1) was featured in the previous version of MedicineAsk. It tries to answer a question using rules and, if no match is found, it falls back on the keyword spotting method. This strategy was already detailed in Chapter 3. Strategy 2 (see Figure 4.2) is similar to Strategy 1: MedicineAsk first tries to match a question with the rule-based method and, if no match is found, falls back on SVM. Finally, Strategy 3 (see Figure 4.3) attempts to use every available method: if a question does not match any pattern of the rule-based method, the keyword spotting technique is used; if no question type can be determined with this technique either, MedicineAsk attempts to answer the question with the SVM method.

Figure 4.1: Strategy 1.
Figure 4.2: Strategy 2.
Figure 4.3: Strategy 3.

The goal of these strategies is to evaluate the difference between answering questions using a single technique and answering questions using combinations of those techniques. We only considered a few combinations among all the possible ones. The rule-based method should come first, as it is the most reliable when the user's question exactly matches a pattern. SVM is the last technique to be used, because SVM never fails to assign a class to a question. We evaluated the three strategies and report the results obtained in Section 5.4.

4.2 Anaphora and Ellipsis

This thesis aims at supporting questions featuring anaphora and ellipsis. Extensive work and research has been devoted to these topics, and a solution that resolves every possible case of anaphora and ellipsis has yet to be found, as shown in Section 2.5. For this reason we do not intend to give full support to these types of questions. The goal is to answer some simple questions that contain these special cases; in the future, these features can be extended to support a larger number of questions.

4.2.1 Proposed Solution

Since anaphora resolution is a complex topic, our objective is to resolve some basic anaphora cases and to build a solution that is extensible, so that more cases can be handled in the future. The systems described in Section 2.5 follow, in general, three steps to resolve anaphora:

Identify the anaphora and find the anaphor;
Search for and list possible candidate antecedents;
Choose an antecedent from the candidates and use it to replace the anaphor.

In these systems the process of searching for an antecedent is complex: several questions have to be analysed and any word in those questions is a potential antecedent. However, in MedicineAsk the possible cases of anaphora are more limited, because we are dealing with a more restricted environment. Rather than free text, the user only inputs questions.
There are two types of anaphora that can occur in a user question: medical entity anaphora and question type anaphora. In medical entity anaphora, the anaphora concerns a medical entity, such as a medicine or an active substance. For example, in "What are the indications of paracetamol? And what about the adverse reactions of that substance?" the anaphor is that substance and the antecedent is paracetamol. In question type anaphora, the anaphora concerns the question type: for example, in "What are the indications of paracetamol? And those of mizolastina?" the anaphor is those and the antecedent is indications. What makes anaphora resolution in MedicineAsk easier than in other systems is that both the question type and the named entities are obtained as part of the process of answering a question. By storing the question type and the entities every time a question is successfully answered, there is no need to search for possible antecedents, because we already possess a list of them. We do not need to re-analyse past questions and examine words such as indications to determine that the previous question type was about indications; we only need to consult the information stored beforehand. Choosing the correct antecedent is similar to Hobbs' method, but instead of ignoring antecedents that do not match in gender or number, we ignore antecedents that would not make sense in the new question. For example, the question "What are the indications?" can apply to a medicine or an active substance but not to a medical condition (the question "What are the indications of headache?" does not make sense). For this reason, if every possible antecedent were an entity of type medical condition, this particular case of anaphora would not be solved. Another difference of MedicineAsk's anaphora resolution compared to the systems described previously is that MedicineAsk does not need to find the anaphor. This is because we do not want to return the question with the anaphora resolved; we only provide the answer to the question. As long as we know the antecedent, we can provide an answer to the user. For example, suppose a user inputs the question "What are the indications of that active substance?" and we know the question is referring to paracetamol. Our objective is not to turn the user's question into "What are the indications of paracetamol?" by replacing that active substance with paracetamol; our objective is simply to return the answer the user seeks, by displaying the indications of paracetamol.

In this thesis we focus on medical entity anaphora. The strategy to resolve medical entity anaphora is as follows. The first part consists of analysing regular questions with no anaphora. If the question is successfully answered, the question's type and named entities are stored in a data structure which we call the Antecedent Storage. Figure 4.4 shows an example of this process for the question "What are the indications of paracetamol?". Note that the question type Indications and the entity Paracetamol are stored in the Antecedent Storage.

Figure 4.4: Answering a question with no anaphora and storing its information in the Antecedent Storage.

Afterwards it is possible to analyse questions with medical entity anaphora. If a question is analysed and no entities are found, a case of anaphora is detected. In this case, we send the information of the question with medical entity anaphora to the Anaphora Resolver.
The Anaphora Resolver then looks at the information of the question with anaphora and compares it to the information in the Antecedent Storage. If a possible antecedent for the question with anaphora is found, that antecedent is returned as the entity, which will be used to answer the question. Figure 4.5 shows an example that continues the example of Figure 4.4. This time the question is "What are the generics of that medicine?". MedicineAsk identifies that this question has no medical entities, and thus we have a case of medical entity anaphora. The Anaphora Resolver is then in charge of finding a possible antecedent. The question contains the word medicine, so we know that we are looking for a previous entity that is also a medicine. The latest entry in the Antecedent Storage is paracetamol, which is an active substance, so paracetamol cannot be the antecedent of this question. The second entry is Mizollen, which is indeed a medicine, so Mizollen can be the antecedent for this question. The entity Mizollen is then combined with the question type Get Generics and sent to the Question Translation step.

Figure 4.5: Solving anaphora in a question. Paracetamol is ignored because it is an active substance.

4.2.2 Implementation

After successfully answering a question, the information about that question, namely the question type and any entities found, is stored in the Antecedent Storage data structure. Only the information of the last two questions is stored, although this number is configurable. The reason is that the older a question is, the less weight we give to it when resolving anaphora, as the chance of it still being relevant decreases over time [Hobbs, 1978]. Then, when a new question without medical entities is received, we can use this stored information to resolve the anaphora. Anaphora are resolved in different ways depending on the technique being used to answer the question (rule-based, keyword spotting or machine learning).

With the rule-based method, the question is matched against a pattern and thus we know the question type. However, in some of these cases, no entities are present in the part of the question that does not match the pattern (e.g. "Quais as indicações do medicamento?" ("What are the indications of the medicine?") matches a pattern about indications, but medicamento is not a recognized entity). In this way we identify that no entity is present. Since we know the question type, we can access the Antecedent Storage to find a suitable antecedent for the question.

With the keyword spotting method, because it is searching for keywords, the method cannot determine the question type if no medical entity is present: the entity is a crucial keyword for determining the question type. For this reason, if the keyword spotting method fails, we assume there is an anaphora, take whichever antecedent was stored last in the Antecedent Storage and append it to the question. Afterwards, we re-analyse the question with the keyword spotting technique. This time, with an entity present in the question, the method has a higher chance of answering it. For example, if the keyword spotting technique attempts to analyse the question "What are the indications?" it will fail. Suppose the Antecedent Storage contains only the entity paracetamol. In this case this entity is appended to the question, resulting in "What are the indications? paracetamol". The keyword spotting technique then tries to answer this new question; in this case it will succeed. If this technique fails even so, the question was not a case of anaphora, but was simply malformed.
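A minimal sketch of the Antecedent Storage and of the compatibility check performed by the Anaphora Resolver is given below. The Antecedent record, the entity type names and the isCompatible rule are illustrative assumptions; in MedicineAsk the compatibility rules are read from an XML file, which is not reproduced here.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;
import java.util.Set;

public class AnaphoraResolver {

    /** An entity stored from a previously answered question, e.g. ("paracetamol", "ACTIVE_SUBSTANCE"). */
    public record Antecedent(String entity, String entityType) { }

    private final Deque<Antecedent> antecedentStorage = new ArrayDeque<>();
    private final int maxStoredQuestions;

    public AnaphoraResolver(int maxStoredQuestions) {
        this.maxStoredQuestions = maxStoredQuestions;
    }

    /** Called after a question is successfully answered, to remember its entity. */
    public void store(Antecedent antecedent) {
        antecedentStorage.addFirst(antecedent);
        while (antecedentStorage.size() > maxStoredQuestions) {
            antecedentStorage.removeLast();
        }
    }

    /** Returns the most recent stored entity whose type is acceptable for the new question type. */
    public Optional<Antecedent> resolve(String questionType) {
        return antecedentStorage.stream()
                .filter(candidate -> isCompatible(questionType, candidate.entityType()))
                .findFirst();
    }

    // Illustrative rule; in MedicineAsk these compatibility rules come from a configurable XML file.
    private boolean isCompatible(String questionType, String entityType) {
        if (questionType.equals("Get Indications")) {
            // Indications make sense for medicines and active substances, not for medical conditions.
            return Set.of("MEDICINE", "ACTIVE_SUBSTANCE").contains(entityType);
        }
        return true;
    }
}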
The machine learning method can determine a question type without the presence of entities, but without an entity there is a higher failure rate. This is because the type and location of the entities in a question are part of the training data, and thus help (but are not required) with the question's classification. There are two ways to resolve anaphora with SVM. Since SVM returns a question type but no entity, we can use the question type to search the Antecedent Storage for a suitable antecedent, and then use the question type and the antecedent to answer the question. However, this method uses the question type that SVM found when no entities were present; as stated above, entities help classify the question correctly, so there is a higher chance that this question type is incorrect. There is an alternative method that avoids this problem: if SVM fails to find an entity, we append the latest antecedent stored in the Antecedent Storage to the question and re-analyse the question with SVM. This method is slower, because the question has to be answered twice, but there is a higher chance that the question will be correctly classified when the entity is present. We tested both methods and the results are shown in Section 5.6.

In either case, the user is alerted to the fact that an anaphora was detected and another entity was used, and shown which entity it was. This prevents accidents where a user either forgot or misspelled the medical entity he/she wanted to query about, leading the system to think it was dealing with anaphora. For example, if the user asked about the indications of paracetamol and then asked about the adverse reactions of mizolastina, but misspelled mizolastina, the user would incorrectly receive the adverse reactions of paracetamol as an answer.

This implementation was designed to be extensible. The rules used to determine whether a given antecedent is compatible with the current question's anaphora are stored in an XML file. By editing this XML file, it is possible to easily add new rules. It is also possible to change all of the rules to fit a different environment, which makes it possible to use this anaphora resolver in other environments, with the only changes required being in the XML file. The restriction is that the new environment must be somewhat similar to MedicineAsk: the new system must deal with question answering, and a question's information should be a question type plus a list of detected entities.

4.3 Synonyms

As mentioned in Section 3.1.1, a feature to support synonyms was implemented in the previous version of MedicineAsk. This section describes our efforts to expand this feature in the present version of MedicineAsk.

4.3.1 Motivation

Users can use many different words for the same medical entities. The data extracted from the Infarmed website uses a more formal type of speech: for example, febre (fever) is known as pirexia in the Infarmed website. If a user asks a question about fever but the database only knows the term pirexia, then the question will not be answered correctly. Furthermore, users cannot be expected to know the complex terminology used in medicine, or they may simply use a less common term for a given expression. A synonym system can bridge the gap between the terminology of a common user and the terminology of the medical information. Furthermore, it is desirable that this synonym system be extensible, so that support for new synonyms can easily be added.
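Such a synonym table can be used at query time by expanding the term found in the question with its stored synonyms, as sketched below. The table and column names (indication, medical_condition, synonym, medical_term, synonym_term) are assumptions made for the example and do not reflect the actual MedicineAsk schema; the implementation adopted in this thesis is described in the next subsection.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class SynonymAwareQuery {

    private final Connection connection;

    public SynonymAwareQuery(Connection connection) {
        this.connection = connection;
    }

    /**
     * Finds active substances indicated for a condition, matching either the term the user
     * wrote (e.g. "febre") or any formal synonym stored for it (e.g. "pirexia").
     */
    public List<String> substancesIndicatedFor(String userTerm) throws SQLException {
        String sql =
            "SELECT DISTINCT i.active_substance " +
            "FROM indication i " +
            "WHERE i.medical_condition = ? " +
            "   OR i.medical_condition IN " +
            "      (SELECT s.medical_term FROM synonym s WHERE s.synonym_term = ?)";
        List<String> substances = new ArrayList<>();
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setString(1, userTerm);
            statement.setString(2, userTerm);
            try (ResultSet results = statement.executeQuery()) {
                while (results.next()) {
                    substances.add(results.getString("active_substance"));
                }
            }
        }
        return substances;
    }
}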
4.3.2 Implementation

We implemented a small Java program that extracts synonym information from Priberam (http://www.priberam.pt/DLPO/), an online Portuguese dictionary. This synonym information consists of synonyms of various medical terms. The medical terms used to search for synonyms were taken from the dictionary of named medical entities, which MedicineAsk uses to find entities in questions. Each of these medical terms was used as a keyword in the Priberam website's keyword-based search. For every result, we extracted the synonyms of that medical term found in the Priberam dictionary (when available) and stored them in text documents. This information was then inserted into the synonym table of the MedicineAsk database. When a user poses a question, the queries to the database take into consideration both the entities found in the question and their respective synonyms.

Chapter 5

Validation

This chapter describes the evaluation of the version of MedicineAsk produced as a result of this thesis, which aimed to improve the MedicineAsk NLI module. We performed various experiments to test how each of the new features of MedicineAsk improves upon the previous version. Section 5.1 describes the initial setup used for the experiments and lists the experiments performed. Sections 5.2 to 5.7 detail each of these experiments. Section 5.8 summarises the results of the entire validation.

5.1 Experimental Setup

The patterns that the rule-based method uses to determine question types were manually built from a previously collected set of questions, as described in [Bastos, 2009]. There are 22 regular expression patterns. To train the SVM we used a training corpus named Training Corpus A, built from 425 questions collected during the execution of the previous thesis. Part of these questions are the questions that were used to create the rule-based patterns. The remaining questions came from an experiment similar to this one, performed to test the previous version of MedicineAsk; this second group of questions was used to fine-tune the rule-based patterns during the previous thesis. In Training Corpus A, questions are divided into 18 question types. Table 5.1 presents the number of questions per question type. For the named entity recognition task, we used a dictionary built with medical terms extracted from the Infarmed website and the Médicos de Portugal website. The dictionary includes 17982 terms: 2127 medicines, 1212 active substances and 13228 medical conditions. This is the same dictionary used by the MedicineAsk keyword spotting technique.

We collected a test set, the questionnaire test set, to compare the rule-based approach with the SVM approach. To this end, an online questionnaire composed of 9 different scenarios was distributed over the internet, using Facebook (https://www.facebook.com/). Appendix A shows this questionnaire. Each scenario consists of the description of a problem related to medicines (e.g. John needs to know the adverse reactions of Efferalgan; what kind of question should he ask?). The participants were invited to propose one or more natural language questions for each scenario.
Question Type | Number of Questions | Example Question
QT INDICACOES | 53 | Quais as indicações do paracetamol?
QT REAC ADVERSAS | 69 | Quais as reacções adversas do paracetamol?
QT PRECAUCOES | 35 | Quais as precauções do paracetamol?
QT INTERACCOES | 14 | Quais os medicamentos que interagem com o paracetamol?
QT POSOLOGIA | 48 | Qual a dosagem do paracetamol?
QT MEDICAMENTO | 15 | Quais os medicamentos para a asma?
QT MEDICAMENTOCONTRA | 7 | Quais os medicamentos contra indicados para a asma?
QT MEDICAMENTONAO | 6 | Quais os medicamentos para a asma que não provocam febre?
QT PRECO BARATO | 15 | Quais os medicamentos mais baratos do paracetamol?
QT PRECO | 5 | Qual o preço do Efferalgan®?
QT MEDICAMENTOSEMELHANTES NAO | 38 | Quais os medicamentos semelhantes ao Efferalgan® que não provoquem febre?
QT MEDICAMENTOSEM PRECAUCOES | 54 | Quais os medicamentos para a asma que não exijam precauções com a febre?
QT MEDICAMENTOSEMELHANTES SEM PRECAUCAO | 7 | Quais os medicamentos semelhantes ao Efferalgan® que não exijam precauções com a asma?
QT MEDICAMENTOSEM INTERACCAO | 19 | Quais os medicamentos para a asma que não tenham interacções com o paracetamol?
QT MEDICAMENTODA SUBSTANCIA | 4 | Quais os medicamentos existentes do paracetamol?
QT MEDICAMENTOCOMPARTICIPADOS | 1 | Quais são os medicamentos comparticipados do paracetamol?
QT MEDICAMENTOGENERICOS | 30 | Quais os medicamentos genéricos do paracetamol?
QT MEDICAMENTOINFORMACAO | 5 | Quais as informações do Efferalgan®?

Table 5.1: Number of questions for each question type in Training Corpus A.

We collected questions from 61 users: 19 medical professionals and 42 common users. For the first experiments we created a subset of the questionnaire test set, which we named Test Corpus A, using the questions from 30 randomly chosen users: 15 medical professionals and 15 common users. For this test set we test the questions of medical professionals and common users separately, because we believed the difference in vocabulary used by medical staff and common users would influence the results. Test Corpus A includes a total of 296 questions divided into 9 scenarios. Table 5.2 shows the details for each scenario. The questions were not pre-processed in any way, so any errors or typos present in the questions were not removed.

Scenario | Common user questions | Medical user questions | Total questions | Expected question type
1 | 18 | 17 | 35 | QT INDICACOES
2 | 18 | 15 | 33 | QT REAC ADVERSAS
3 | 16 | 19 | 35 | QT PRECAUCOES
4 | 16 | 19 | 35 | QT POSOLOGIA
5 | 15 | 18 | 33 | QT MEDICAMENTOGENERICOS
6 | 15 | 16 | 31 | QT PRECO BARATO
7 | 15 | 17 | 32 | QT MEDICAMENTOSEM INTERACCAO
8 | 15 | 15 | 30 | QT MEDICAMENTOSEMELHANTES NAO
9 | 15 | 17 | 32 | QT MEDICAMENTOSEM PRECAUCOES

Table 5.2: Number of user questions and expected answer type for each scenario for Test Corpus A.

We also created a second training corpus, called Training Corpus B, which is the result of merging Training Corpus A and Test Corpus A. By enriching the corpus, we expected SVM to be able to answer more questions using this extra information. Training Corpus B has 886 questions. Once again the corpus questions are divided into 18 question types, but only 9 of the question types were enriched (since Test Corpus A only covers 9 question types). Table 5.3 shows the number of questions for each question type. Appendix C shows an excerpt of Training Corpus B. A new test set is required to evaluate Training Corpus B, because Training Corpus B is the result of merging Test Corpus A and Training Corpus A.
We built a second test set from the remainder of the questionnaire test set, which we call Test Corpus B.

Question Type | Questions in First Corpus | Questions in Second Corpus | Example Question
QT INDICACOES | 53 | 129 | Quais as indicações do paracetamol?
QT REAC ADVERSAS | 69 | 168 | Quais as reacções adversas do paracetamol?
QT PRECAUCOES | 35 | 122 | Quais as precauções do paracetamol?
QT POSOLOGIA | 48 | 112 | Qual a dosagem do paracetamol?
QT PRECO BARATO | 15 | 42 | Quais os medicamentos mais baratos do paracetamol?
QT MEDICAMENTOSEMELHANTES NAO | 38 | 46 | Quais os medicamentos semelhantes ao Efferalgan® que não provoquem febre?
QT MEDICAMENTOSEM PRECAUCOES | 54 | 81 | Quais os medicamentos para a asma que não exijam precauções com a febre?
QT MEDICAMENTOSEM INTERACCAO | 19 | 48 | Quais os medicamentos para a asma que não tenham interacções com o paracetamol?
QT MEDICAMENTOGENERICOS | 30 | 60 | Quais os medicamentos genéricos do paracetamol?

Table 5.3: Number of questions for each question type in Training Corpus B.

This test set includes 31 users, of which only 4 are in the medical field. This low number does not justify distinguishing between common users and medical staff in this test set. Test Corpus B includes a total of 322 questions divided into 9 scenarios. Table 5.4 shows the details for each scenario. The questions were not pre-processed in any way; any errors and typos present in the questions were not removed.

Scenario | Total questions | Expected question type
1 | 40 | QT INDICACOES
2 | 39 | QT REAC ADVERSAS
3 | 39 | QT PRECAUCOES
4 | 40 | QT POSOLOGIA
5 | 38 | QT MEDICAMENTOGENERICOS
6 | 32 | QT PRECO BARATO
7 | 32 | QT MEDICAMENTOSEM INTERACCAO
8 | 31 | QT MEDICAMENTOSEMELHANTES NAO
9 | 31 | QT MEDICAMENTOSEM PRECAUCOES

Table 5.4: Number of user questions and expected answer type for each scenario for Test Corpus B.

Finally, we have another subset of the questionnaire test set, called Test Corpus C, used to test anaphora resolution in MedicineAsk. The online questionnaire described above contained additional scenarios intended to test other question types. We wanted users to pose a question with an anaphora, but we did not want to specifically instruct them to use anaphora, as we thought that would influence their answers too much. For this reason we created a scenario that only tried to encourage users to use an anaphora; Figure 5.1 shows this scenario. In order to stimulate the usage of anaphora we simulated a previous interaction between the scenario's subject (John) and MedicineAsk. For Test Corpus C we took all the questions we received from all 61 users and filtered out those which contained any reference to the medical entity. An example of a filtered-out question is "Quanto custa o ácido acetilsalicílico?" ("How much does acetylsalicylic acid cost?"). After filtering, 33 questions remained, such as "Quanto custa?" ("How much is it?").

Figure 5.1: Scenario used to encourage users to use anaphora.

The following sections detail each experiment performed in this thesis. Section 5.2 details a preliminary experiment: we tested several different feature sets for SVM to find out whether SVM brings any improvements over the previous version of MedicineAsk. In this experiment, SVM used Training Corpus A and the questions tested were from Test Corpus A. Section 5.3 documents the errors that emerged from integrating SVM into MedicineAsk. Section 5.4 tests different question answering strategies for MedicineAsk, using all the currently available techniques. In this experiment, SVM used Training Corpus A.
The questions tested were from Test Corpus A. Section 5.5 tests the different strategies once more, after enriching the training corpus used by SVM; in this experiment SVM uses Training Corpus B and the questions tested are from Test Corpus B. Section 5.6 tests the effectiveness of anaphora resolution; SVM uses Training Corpus B and the questions tested are from Test Corpus C. Section 5.7 re-runs the experiment described in Section 5.5 with anaphora resolution, in order to find out whether anaphora resolution improves the results of Section 5.5; SVM uses Training Corpus B and the questions tested are from Test Corpus B.

For these experiments, the tests of the techniques available from LUP were performed by a simple Java script which sent each question to LUP and stored the corresponding answer in a text file. To execute the experiments on the MedicineAsk website, a Java program was created using Selenium (http://www.seleniumhq.org/) together with TestNG (http://testng.org/doc/index.html). TestNG is a testing framework similar to JUnit; Selenium is a library that allows testing websites directly. Using this program, questions were automatically inserted into the MedicineAsk website and each answer was automatically evaluated.

5.2 Rule-based versus Automatic Question Classification

Before applying machine learning techniques directly to MedicineAsk, we first verified whether SVM had the potential to bring any improvements over the rule-based MedicineAsk NLI. The rule-based MedicineAsk NLI is the version from the previous thesis, which uses both rule-based and keyword spotting techniques to answer questions. To do this, we compared both methods on Test Corpus A and observed the percentage of correctly classified questions. We tested a variety of feature sets on SVM to observe whether one set of features had any major advantage over the others.

5.2.1 Results

Figures 5.2, 5.3 and 5.4 show the percentage of questions correctly classified, for each scenario (1 to 9) and for the sum of all scenarios (Total). For each scenario, we show the percentage of questions correctly classified for each feature set used by the SVM and for the rule-based NLI. The rule-based MedicineAsk NLI successfully answers a question if the interface returns the correct answer. An answer by SVM is successful if both the question type and the named entities of the user question are correctly identified. Figure 5.2 shows the results for common user questions, Figure 5.3 shows the results for users in the medical field and Figure 5.4 shows the total results for all users combined. In the figures, the features are abbreviated as follows: u - unigram, b - bigram, x - word shape, l - length, bu - binary unigrams, bb - binary bigrams. Each feature is explained in Section 4.1.1.

Figure 5.2: Percentage of correctly classified questions by scenario for common users after improvements.
Figure 5.3: Percentage of correctly classified questions by scenario for users in the medical field after improvements.
Figure 5.4: Percentage of correctly classified questions by scenario for all users after improvements.

In addition to SVM, LUP can also classify questions through string similarity techniques. These techniques compare a user's question to each training corpus question using string similarity measures.
Each question in the training corpus is mapped to a question type, so if a user question is very similar to a question in the training corpus, then they will share the same question type. String similarity techniques measure how similar two strings are to one another; in other words, they measure the distance between two strings. Two identical strings have a distance of 0, and differences such as different letters or words between the strings increase the distance. We ran the experiment of this section with each of the available string similarity techniques. Figures 5.5 and 5.6 compare each string similarity technique, MedicineAsk using rule-based methods and two different combinations of SVM features. Figure 5.5 shows the results for common user questions and Figure 5.6 shows the results for users in the medical field.

Figure 5.5: Percentage of correctly classified questions by scenario for common users, using string similarity techniques.
Figure 5.6: Percentage of correctly classified questions by scenario for users in the medical field, using string similarity techniques.

5.2.2 Discussion and Error Analysis

Observing the total scores over all scenarios, we conclude that SVM has an advantage over the original rule-based and keyword spotting NLI. This is due to SVM being more flexible than rule-based techniques: SVMs learn how to classify questions by themselves, through flexible algorithms of much higher complexity than what hand-made patterns can achieve. Machine learning techniques also do not need to rely heavily on single keywords; instead, they take into account every word of every question in the training corpus to understand a question. On the other hand, a user question must match a pattern for the rule-based method to succeed, and keyword spotting methods require certain keywords to be present in the question, keywords which must be manually compiled into a dictionary. If the dictionary is poor, the questions cannot be answered. The string similarity techniques, namely Dice and Jaccard, also improve on the previous version of MedicineAsk, but their improvements are not as large as those brought by SVM.

None of the methods is robust to errors made by the user. In some instances, the user misspelled words such as medicine names. In other cases, the user posed a question that was not what was expected for that scenario (e.g. asking for adverse reactions when he/she was supposed to ask for indications); even if these questions were correctly classified, they were wrong with respect to the scenario. The majority of the cases in which the SVM failed were due to some words in the user's requests not being present in the corpus. For example, in the question "Quais as doses pediátricas recomendadas da mizolastina?" ("What are the recommended dosages of mizolastina?"), the words doses, pediátricas and recomendadas are not present in the corpus. Also, some words in the training corpus were more frequently associated with a category other than the correct one, misleading the classifier. Scenario 9 leads to very long questions, and all methods have a great deal of trouble classifying them: since the questions are very long, users can express themselves in many different ways, which misleads the machine learning methods. The number of questions in the training corpus for this category is also relatively low, as seen in Table 5.1 under QT MEDICAMENTOSEM INTERACCAO.
We see that, in general, a greater number of questions were correctly classified for common users than for medical staff. This can be explained by the terminology used by each type of user. Most common users use the entity names provided in the scenario; medical staff, having greater medical knowledge, sometimes use other, more complex terms for those entities. For example, in a scenario regarding vitamin A, some medical staff used the term retinoids instead. While some of these questions were correctly classified, some of these terms were too complex or returned results different from those expected. We also see that simply using the unigram feature already provides very favourable results. In addition, by using only one feature we decrease the complexity of the classification step, which allows questions to be classified more quickly. For this reason, SVM with the unigram feature is the technique we decided to use for the remainder of the experiments.

5.3 Integrating SVM into the MedicineAsk NLI

Observing the preliminary results, we determined that machine learning methods have a high potential to improve MedicineAsk, so we integrated the machine learning method into MedicineAsk itself. We ran the same experiment as detailed in Section 5.2, with SVM trained on Training Corpus A to classify Test Corpus A, in order to verify that the integration had been successful. The results were mostly the same, but some issues were detected. When testing SVM separately, a question is classified as correct if the question type and the named entities present in the question are correctly identified. When integrated with MedicineAsk, however, the information obtained by SVM must still be sent to the database in order for the actual answer to be obtained. The issue was that Scenario 7 concerned a medical condition called acne nodular, and in the medical entity dictionaries both acne and nodular are also valid entities. The SQL query building process of MedicineAsk does not expect this many entities in the question, and thus cannot retrieve an answer from the database. By inspecting the results manually we saw that the entities were correctly extracted, but the answer was not actually obtained, so several questions were not answerable; this caused SVM to fail to answer any questions from Scenario 7. To fix this entity issue, we added a simple piece of code that ignores named entities contained in larger named entities. For example, in the case of acne nodular the detected entities are acne nodular, acne and nodular; with this code, both acne and nodular are ignored as entities, and the entity used to answer the question is acne nodular, as intended.

5.4 First Experiment - Combining Question Answering approaches

As described in Section 4.1.2, we have three different strategies for combining question answering techniques:

Strategy 1: Rule-based, falling back to keyword spotting if no match is found.

Strategy 2: Rule-based, falling back to machine learning if no match is found.

Strategy 3: Rule-based, falling back to keyword spotting if no match is found, and then falling back to machine learning if keyword spotting fails as well.

The following experiments test all these strategies, measuring the percentage of questions from Test Corpus A that are correctly answered. In this experiment, SVM is trained with Training Corpus A.

5.4.1 Results

The following figures show the percentage of questions correctly classified by the different integration strategies.
An answer is correctly classified if the NLI returns the expected answer through the MedicineAsk website. Figure 5.7 shows the results for common user questions, Figure 5.8 shows the results for users in the medical field and Figure 5.9 shows the total results for all users combined.

Figure 5.7: Percentage of correctly classified questions by scenario for common users for the First Experiment.
Figure 5.8: Percentage of correctly classified questions by scenario for users in the medical field for the First Experiment.
Figure 5.9: Percentage of correctly classified questions by scenario for all users for the First Experiment.

5.4.2 Discussion and Error Analysis

The results and errors of this test were similar to those described in Section 5.2.2, because both the training and the test corpus were the same in both experiments. Once again, one of the primary reasons for failure was errors made by the users. Misspelling words like medicine names or other important keywords such as dosagens (doses) often leads to a question being wrongly classified. There were also instances of users asking questions that were not expected in that scenario (e.g. asking for adverse reactions when they were supposed to ask for indications). SVM was still unable to answer certain queries because the corpus used for training was not rich enough. Of the three strategies, Strategy 3 yields the best results. This is because both the keyword spotting and the machine learning techniques are capable of answering questions that the other technique cannot answer; by combining them, a greater number of questions can be answered.

5.5 Second Experiment - Increasing the training data

The purpose of this experiment was to evaluate whether the MedicineAsk NLI would be able to answer more questions if SVM's training corpus were enriched. As explained in Section 5.1, we enriched Training Corpus A with Test Corpus A, creating Training Corpus B. We also removed from the named entity dictionary some words that were too generic and conflicted with the question answering process (e.g. contra indicações (contraindications) and compatível (compatible)).

5.5.1 Results

Since Test Corpus A is now part of the training corpus, we must use a different test corpus. For this experiment we measure the percentage of correctly classified questions from Test Corpus B. An answer is correctly classified if the NLI returns the expected answer through the MedicineAsk website. Figure 5.10 shows the results of this experiment.

Figure 5.10: Percentage of correctly classified questions by scenario for all users for the Second Experiment.

5.5.2 Discussion and Error Analysis

Some of the errors in this experiment are the same as those seen in the previous experiments, so we will not discuss them in detail. These errors include:

The user misspelling a word, namely medical entities (e.g. "Para que serve o Effermalgan?", What is Effermalgan for?);

The user omitting a medical entity (e.g. "para que serve este medicamento?", what is this medicine for?);

The user asking the wrong question (e.g. "Quais são as indicações para tomar o Efferalgan?", What are the indications for taking Efferalgan?, when the scenario was about precautions);

The presence of words in the medical entity dictionary that are too common (e.g. doença, disease).

We once again see that the addition of SVM to the answering process of the MedicineAsk NLI brings improvements. Strategy 3 still shows the best results, due to making use of all available techniques to answer questions.
The reason why Strategy 3 is superior to Strategy 2 is that the keyword spotting technique can answer some questions that SVM fails to answer. For example, the question about indications Que doenças se trata com Efferalgan? (What diseases are treated with Efferalgan?) is classified as a question about interactions by SVM, so Strategy 2 fails to answer it. The keyword spotting technique, however, successfully classifies it as a question about indications, so Strategy 3 answers it correctly. On the other hand, any question that cannot be answered by the keyword spotting technique is sent to SVM, which may answer it correctly. This is how Strategy 3 improves on Strategy 2.

However, Strategy 3 does not beat the other strategies in every scenario. This is because the keyword spotting and the machine learning techniques work differently. Strategy 3 fails to answer some questions that Strategy 2 answers successfully. This happens when the keyword spotting technique answers a question but misinterprets it. In some cases SVM would have answered the question correctly but, because the keyword spotting technique already answered it, SVM never gets a chance. For example, the question about precautions Como tomar Efferalgan? (How to take Efferalgan?) is classified by the keyword spotting technique as a question about dosages, while SVM classifies it as precautions. Note that this question is ambiguous and neither method was fully wrong in its analysis, but because the scenario expected the answer to be about precautions, only Strategy 2 succeeds. This issue affects Scenario 3 the most.

Strategy 2 and Strategy 3 improved on Strategy 1's percentage of correctly classified questions by 15% and 17%, respectively. Both show good results, but Strategy 3's are better. Another reason to use Strategy 3 is speed. SVM has the downside of being slower at answering questions than the other techniques. During the experiments detailed above, we observed that answering a question using the keyword spotting technique took, on average, approximately 1 second, while answering a question with SVM took approximately 5 seconds on average. While times may vary depending on the machine running MedicineAsk, every second matters when dealing with user interfaces, because slow interfaces can decrease user satisfaction. For these reasons we decided to use Strategy 3 as the final strategy in MedicineAsk.

5.6 Anaphora Evaluation

In this section we attempt to measure the improvements brought by the implemented anaphora resolution techniques to MedicineAsk's question answering. We start with the validation of a scenario specifically designed for anaphora. We ran Test Corpus C through Strategies 1, 2 and 3, plus the previous version of MedicineAsk, which does not include anaphora resolution. As mentioned in Section 4.2.2, there are two ways to handle anaphora with SVM answers. We tested Strategies 2 and 3 with this alternate anaphora resolution method and dubbed these new strategies Strategy 2.5 and Strategy 3.5, respectively. Thus we have:

Previous MedicineAsk - Strategy 1 with no anaphora resolution.
Strategy 1 - Strategy 1 with anaphora resolution.
Strategy 2 - Strategy 2 where SVM answers a question once and the resulting question type is paired with the antecedent found for the current anaphora.
Strategy 2.5 - Strategy 2 where SVM answers a question with an anaphora, finds an antecedent, appends it to the original question and re-answers that question.
Strategy 3 - Strategy 3 where SVM answers a question once and the resulting question type is paired with the antecedent found for the current anaphora.
Strategy 3.5 - Strategy 3 where SVM answers a question with an anaphora, finds an antecedent, appends it to the original question and re-answers that question.

For all strategies, if SVM answers a question and no entities are found, an entity is fetched from the anaphora resolver. The difference is that Strategies 2 and 3 use the question type found in the first analysis of the question, while Strategies 2.5 and 3.5 append the antecedent entity to the question and have SVM classify the question once again. The idea is that the first method is faster but, because there was no entity in the question, there is a higher chance the classification of the question type will be wrong. The second method re-analyses the question with an entity present, so there is a higher chance the classification will be correct, but SVM must analyse the question twice, which is significantly slower.

5.6.1 Results

Table 5.5 shows the results. The percentage of correctly identified anaphora counts the questions where the anaphora was resolved, even if the question was ultimately answered incorrectly because the question type was wrong. The percentage of questions correctly answered counts the questions with both a correct resolution of the anaphora and a correct answer.

Strategy               | Correctly identified anaphora | Questions correctly answered
Previous MedicineAsk   | 0%                            | 0%
Strategy 1             | 67%                           | 67%
Strategy 2             | 64%                           | 15%
Strategy 2.5           | 70%                           | 0%
Strategy 3             | 88%                           | 79%
Strategy 3.5           | 94%                           | 79%

Table 5.5: Percentage of questions with anaphora correctly classified.

5.6.2 Discussion and Error Analysis

The previous version of MedicineAsk is unable to answer any questions, as it has no anaphora resolution. When resolving anaphora from questions answered by SVM, the anaphora were correctly resolved but the answers were not retrieved properly. The reason for this is that the questions in this test were of the type QT_PRECO (questions about the prices of medicines) and there are only 5 questions of this type in the training corpus. We find, however, that Strategy 2 answers more questions correctly than Strategy 2.5. This means that analysing the question without the entity being present provided, in this case, better results.

Both Strategies 2.5 and 3.5 identify more anaphora than Strategies 2 and 3, respectively, but do not answer a greater number of questions. The reason for this is that a question can lead to no results if it does not make sense. For example, a question such as What are the medicines that can cure vitamin A? is not possible. Since SVM in Strategies 2 and 3 analyses the question without any entities present, there is a chance the question type will not be compatible with any of the anaphora antecedents currently stored. If SVM classifies a question as asking about medicines for a given medical condition, but only medicines and active substances are present in the list of possible antecedents, then the anaphora cannot be resolved. On the other hand, Strategies 2.5 and 3.5 take the first antecedent they find and append it to the question. The question is then re-analysed with the entity present, so there is a higher chance it will make sense.
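To make the difference between the two ways of resolving these anaphora concrete, a rough sketch follows. The names below (svm_classify, antecedent_storage and its methods) are hypothetical stand-ins; the real MedicineAsk interfaces may differ.

# Sketch of the two ways of resolving a medical entity anaphora for SVM answers.

def resolve_fast(question, svm_classify, antecedent_storage):
    """Strategies 2 and 3: classify once, then pair the resulting question type
    with a compatible antecedent from the stored history."""
    question_type, entities = svm_classify(question)
    if not entities:  # anaphora: no entity was found in the question
        entities = [antecedent_storage.latest_compatible(question_type)]
    return question_type, entities

def resolve_reclassify(question, svm_classify, antecedent_storage):
    """Strategies 2.5 and 3.5: append the antecedent to the question and let SVM
    classify the enriched question a second time."""
    question_type, entities = svm_classify(question)
    if not entities:
        antecedent = antecedent_storage.latest()
        question_type, entities = svm_classify(question + " " + antecedent)
    return question_type, entities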
Some of the questions with an incorrect question type in Strategy 3 failed due to certain keywords. The user's wording sometimes caused MedicineAsk to classify a question incorrectly (e.g. the presence of the word genérico (generic) in the question caused it to be classified as a question about the generics of a medicine rather than its price). The only anaphora that were not identified by any method contained other words that were unintentionally classified as an entity. Since an entity was present, no anaphora could be detected.

5.7 Testing Test Corpus B with anaphora resolution

To demonstrate anaphora resolution in a different way, we performed the same experiment as in Section 5.5, but with anaphora resolution. We compared the results of Strategies 1 and 3 from that experiment with the results of running the same experiment with anaphora resolution enabled.

5.7.1 Results

Figure 5.11 shows the percentage of questions correctly classified for the questions in Test Corpus B. An answer is correctly classified if the NLI returns the expected answer through the MedicineAsk website.

Figure 5.11: Percentage of correctly classified questions by scenario for all users for the Second Experiment with Anaphora Resolution.

5.7.2 Discussion and Error Analysis

Note that these improvements may not be accurate, because the anaphora present in Test Corpus B may not have been intentional. If a user posed a question such as What are the adverse reactions of the medicine? because he/she assumed the system would still remember which medicine was being discussed, then it is indeed a case of anaphora and the results are correct. However, some users may have forgotten to type the entity, or simply misspelled it. In that case, resolving the anaphora could lead to mistakes by the system. For example, the user may have made an initial question about the indications of paracetamol, and then a question about the adverse reactions of mizolastina. If he/she forgets to type mizolastina, or misspells that word, an anaphora will be detected. In this case MedicineAsk will return the adverse reactions of paracetamol, which is not the answer the user was interested in.

We see that, in total, Strategy 3 with anaphora resolution correctly answered 5% more questions than regular Strategy 3. Most of the questions missing an entity were correctly classified. The only exception occurred when a question was marked with an unintended entity (such as pré-, pre-), which was then stored as a possible antecedent. When a question with no entities was posed immediately afterwards, the unintended entity was used to answer it, which often led to no results.

Anaphora resolution for medical entities brings improvements to MedicineAsk. It can also serve as a tool to prevent spelling errors by users, as long as they are querying the same entity across questions. To avoid errors, a warning is shown telling the user that no entity was detected and that a previous one was used. Unintended entities in a question with anaphora cause issues with both anaphora detection (since an entity is present, no anaphora can be detected) and anaphora resolution (the unintended entity is stored and possibly used in a later question). A thorough cleaning and processing of the named entity dictionary should be performed in the future to avoid these types of errors.
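On the subject of dictionary-related problems, the containment filter introduced in Section 5.3 handles the related case of nested entities (acne and nodular inside acne nodular). A minimal sketch of that filter, for illustration only, could look like this:

# Sketch of the containment filter from Section 5.3: entities that appear inside
# a longer detected entity are discarded. Illustrative only, not the real code.

def drop_contained_entities(entities):
    """Keep only entities that are not contained in another, longer entity."""
    kept = []
    for entity in entities:
        contained = any(
            entity != other and entity.lower() in other.lower()
            for other in entities
        )
        if not contained:
            kept.append(entity)
    return kept

print(drop_contained_entities(["acne nodular", "acne", "nodular"]))  # ['acne nodular']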
Strategy 3 with anaphora resolution still shows the best results, and will be the strategy used by this version of MedicineAsk. Strategy 3.5 does not improve the results enough to justify answering each anaphora question twice using SVM.

5.8 Conclusions

Strategy 3 shows an increase of 17% in question answering in the experiment described in Section 5.5. The experiment in Section 5.6 shows that the only medical entity anaphora that cannot be resolved are the ones where an unexpected entity was detected. From the experiment described in Section 5.7 we see that Strategy 3 with anaphora resolution answers 5% more questions than regular Strategy 3 and 18% more questions than Strategy 1 with anaphora resolution.

The 5% increase in correctly answered questions obtained just by using anaphora resolution shows that the misspelling of medical entities is one of the biggest problems in MedicineAsk. The rule-based method has techniques to handle misspelled terms (as described in Section 3.1.2). However, these techniques only work because the rule-based method knows exactly where the medical entity should be, and thus can obtain the misspelled word and process it in order to try to obtain an answer. The keyword spotting and machine learning techniques do not require entities to be in specific locations, so special techniques would be required to identify the presence of a misspelled word. If, somehow, the misspelled word could be found, then we could apply the same technique that the rule-based method uses to obtain an answer. One possible solution would be to remove some words from a question using a stop-word list, and then use string similarity techniques to compare every remaining word to the named entity dictionary (a rough sketch of this idea is given below).

Omission of words is another difficult topic to handle. We do not know for sure why a user would not include the medical entity in his/her question, but there are some possibilities:

1. The user forgot to include the entity;
2. The user did not know he/she was supposed to include an entity in the question;
3. The previous question was about the same entity, so the user assumed the system would still remember which entity was being discussed.

Point 3 is a case of anaphora. Points 1 and 2 should not be handled by the system because they are user errors. The issue is that it is difficult to distinguish between the three points. Our anaphora method assumes there is an anaphora in every one of these situations and handles it accordingly. Our way of handling the issue is to warn the user that we detected an anaphora and for that reason used the previous entity to answer the question.

Users asking the wrong question also caused failures in the validation, and this can have a variety of reasons. If the user made the mistake because he/she misunderstood the scenario's description, then this problem would not occur in the real world, as users would not be following someone else's scenario. If the user typed the wrong question because he/she does not know the difference between terms like interactions and precautions, then it could be interesting to have a mechanism of related questions, where topics deemed similar show up on the user's screen after a question is posed.

Another issue commonly found was related to the presence of words in the corpus that are too common. This issue requires the dictionary of named entities to be analysed and processed. Each word that is too common and could interfere with question answering should be removed. The dictionary's size makes this task difficult.
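Returning to the misspelling problem discussed above, the stop-word plus string-similarity idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the stop-word list and the similarity threshold are made up, and difflib's SequenceMatcher stands in for whichever similarity measure would actually be chosen.

# Rough sketch of misspelling recovery: drop stop words, then fuzzy-match the
# remaining words against the named entity dictionary. The stop words and the
# threshold are illustrative assumptions, not values used in MedicineAsk.
from difflib import SequenceMatcher

STOP_WORDS = {"quais", "as", "do", "da", "de", "o", "a", "para", "que", "serve"}

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def recover_entities(question, entity_dictionary, threshold=0.85):
    candidates = []
    for word in question.split():
        if word.lower() in STOP_WORDS:
            continue
        # keep the dictionary entry most similar to this word, if it is close enough
        best = max(entity_dictionary, key=lambda entry: similarity(word, entry))
        if similarity(word, best) >= threshold:
            candidates.append(best)
    return candidates

# A misspelled entity ("Effermalgan") is mapped back to the dictionary entry.
print(recover_entities("Para que serve o Effermalgan?", ["Efferalgan", "Mizollen"]))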
Another issue is that removing certain words we think are common might actually hurt the system's ability to answer questions.

To fix the issue where keyword spotting does not let SVM answer a question successfully, the ideal solution would be to involve the user. If, while using Strategy 3, an answer is obtained through the keyword spotting technique, a prompt could appear on screen asking the user whether that answer was correct. If it was not, SVM would then attempt to answer the question, possibly with success.

Chapter 6 Conclusions

This chapter describes the main conclusions obtained from the research and development of this thesis and of the new version of the MedicineAsk NLI module. Section 6.1 presents a summary of this thesis, as well as its contributions. Section 6.2 presents some limitations of our version of MedicineAsk, as well as possible solutions to those limitations. We also mention some ideas on how MedicineAsk could be further improved.

6.1 Summary and Contributions

This thesis presents an improvement to the NLI module of MedicineAsk, a system that answers Portuguese Natural Language questions about active substances and medicines. Data was extracted from the Infarmed website, processed and added to a database. MedicineAsk then allows users to access this information through an NLI. This means users can access information about medicines and active substances using everyday language, without having to learn any new, specific means of communication with the system.

This thesis focused on improving the NLI module of MedicineAsk. The main objective was to increase the quantity of user questions that MedicineAsk can answer. We also aimed to test several different configurations and strategies of the question answering techniques, to determine which ones brought better results. The work developed in this thesis resulted in the following contributions:

- A state of the art on question answering systems for the medical field, as well as on information retrieval systems for the medical domain. We also present a state of the art on different web-based systems that provide medical information.
- An improvement to the MedicineAsk NLI module, which namely includes:
  - The addition of machine learning techniques to MedicineAsk. We added SVM to the techniques MedicineAsk uses to answer questions. We also tested several different SVM features to determine which one is most useful for our problem.
  - Added support for questions with anaphora and ellipsis. Anaphora and ellipsis occur when a previously mentioned entity is referred to without its actual name. For example: What are the indications of Mizollen? And the adverse reactions of that medicine?. In the second question, of that medicine refers to Mizollen. This type of question can occur when, for example, a user wants to avoid typing an entity several times in a row. Because questions with anaphora and ellipsis have no medical entities present in them, they could not be answered previously. We keep a short history of questions; in case of anaphora, MedicineAsk analyses this history and chooses a previous entity to answer the present question. This part of the work was made portable and can be used in other question answering environments that represent questions in a similar way to MedicineAsk.
  - The extension of the synonym detection implemented in the previous version of MedicineAsk. This feature had been implemented, but a comprehensive list of synonyms had not been gathered due to time constraints.
We collected such a list and inserted it into the database.

- A validation of every new addition to this version of MedicineAsk, where we test several different question answering strategies to measure which retrieves the best results. We also identified the current issues of MedicineAsk and possible solutions.

This thesis resulted in a paper and poster published at the 10th International Conference on Data Integration in the Life Sciences.

6.2 Future Work

We have identified several limitations of the new version of MedicineAsk. In this section we propose some ideas for future solutions to these issues. Sections 6.2.1 to 6.2.5 detail these limitations. During the research performed for this thesis we also identified some other possibly interesting areas of development for MedicineAsk, as described in Sections 6.2.6 to 6.2.8.

6.2.1 Additional question types

Currently our system answers questions that cover only one topic, for example What are the indications of paracetamol?. However, a user may want to pose a more complex question, asking for more than one type of information about a given substance or medicine, such as What are the indications and adverse reactions of paracetamol?. Currently, to do this, the user would have to pose two different questions to obtain all the information he/she seeks. To support these types of questions, a question would have to be analysed for more than one question type, and all the identified question types would have to be stored. This could be achieved with the keyword spotting and SVM techniques. The keyword spotting method would have to be changed to take into account every keyword found in the question. For example, if it finds a keyword for indications, it should not stop and assume the question is about indications, but instead further analyse the question in search of other keywords, such as a keyword for adverse reactions. The SVM method currently analyses the question and returns the most likely class for that question. To support questions with more than one topic, SVM would need to return all of the classes whose score is above a given threshold, which may not be trivial. After storing every question type found in the question, the NLP module would build an SQL query to retrieve all the necessary information from the database. In the previous example, it would return both the indications and the adverse reactions of paracetamol at once.

Other types of complex questions include questions that contain negation. Detecting negation can be important if, for example, a user wants to know about a medicine that does not cause a given reaction. The current version of MedicineAsk can answer some questions with negation, for example What medicines cure fever but do not cause sleepiness?. There are, however, still several cases where the negation will not be found.

6.2.2 Additional strategies for question answering technique combinations

Several strategies were tested in this version of MedicineAsk, combining rule-based, keyword spotting and SVM techniques in different sequential orders. While these strategies proved to be effective, there are other types of strategies that would be interesting to test. One strategy consists of using a voting model: all available techniques provide their own answer for a given question, then all of the answers are judged and the one considered correct is returned by the system. The techniques used to judge which answer is best would have to be studied further.
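As a rough illustration of what such a voting model could look like, the sketch below uses a simple majority vote over the proposed question types; this judging criterion is only one of many possibilities, and the helper functions are hypothetical.

# Illustrative voting combiner: every available technique proposes an
# interpretation and the most frequent question type wins. Hypothetical helpers.
from collections import Counter

def answer_by_vote(question, techniques):
    """techniques: callables that return (question_type, entities) or None."""
    proposals = [technique(question) for technique in techniques]
    proposals = [p for p in proposals if p is not None]
    if not proposals:
        return None
    # majority vote over the proposed question types
    winning_type, _ = Counter(qt for qt, _ in proposals).most_common(1)[0]
    # return the first proposal that agrees with the winning question type
    return next(p for p in proposals if p[0] == winning_type)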
Another possibility is a human-in-the-loop model. In this model the techniques are still used sequentially, but the user can state whether he/she was satisfied with the system's answer. If the user replies negatively, then the next technique in the sequence is used to attempt to answer the question once again. For example, in Strategy 3, which consists of rule-based, keyword spotting and SVM techniques used sequentially, the keyword spotting technique may return an answer to a question. Afterwards, a prompt appears asking if the user is happy with the answer. If the user chooses No, then MedicineAsk attempts to answer the question using SVM. This would solve one of the issues identified in Section 5.5.2, where the keyword spotting method would successfully find an answer to the question but the answer was incorrect. Since, technically, the keyword spotting method did not fail, SVM had no opportunity to answer the question. By involving the user, SVM can attempt to answer a question even if the keyword spotting method initially prevents this. It would also be interesting to use the same mechanism to collect user feedback. The user feedback could then be used to automatically learn ways of improving MedicineAsk.

6.2.3 Addressing common user mistakes

The 5% increase in correctly answered questions just from using anaphora resolution, seen in Section 5.7, shows us that the misspelling of medical entities is one of the biggest problems in MedicineAsk. The rule-based method has techniques to handle misspelled terms (as described in Section 3.1.2). However, these techniques only work because the rule-based method knows exactly where the medical entity should be, and thus can obtain the misspelled word and process it in order to try to obtain an answer. The keyword spotting and machine learning techniques do not require entities to be in specific locations, so special techniques would be required to identify the presence of a misspelled word. If, somehow, the misspelled word could be found, then we could apply the same technique that the rule-based method uses to obtain an answer. One possible solution would be to remove some words from a question using a stop-word list, and then use string similarity techniques to compare every remaining word to the named entity dictionary.

Users asking the wrong question also caused failures in the validation, and this can have a variety of reasons. If the user made the mistake because he/she misunderstood the scenario's description, then this problem would not occur in the real world, as users would not be following someone else's scenario. If the user typed the wrong question because he/she does not know the difference between terms like interactions and precautions, then it could be interesting to have a mechanism of related questions, where topics deemed similar show up on the user's screen after a question is posed.

Another issue commonly found was related to the presence of words in the corpus that are too common. This issue requires the dictionary of named entities to be analysed and processed. Each word that is too common and could interfere with question answering should be removed. The dictionary's size makes this task difficult. Another issue is that removing certain words that we think are common might actually hurt the system's ability to answer questions.

6.2.4 User evaluation

Due to time constraints, a user evaluation was not performed on this version of MedicineAsk. This kind of evaluation is important to measure how MedicineAsk compares to the Infarmed website.
It is important to know how satisfied MedicineAsk's users are and how easily they obtain their answers. Ideally this evaluation would be performed with both common users and medical staff, to identify whether both types of users are equally satisfied with MedicineAsk.

6.2.5 Question type anaphora

We have a question type anaphora when MedicineAsk cannot find a suitable question type for a given question (e.g. What are the indications of paracetamol? And those of mizolastina?; the second question has no question type, as it references the first one). This type of anaphora is possible in MedicineAsk, but support for it was not added, because we did not find a suitable solution for this problem. MedicineAsk's current rule-based methods are not able to answer these questions, as each pattern is based on a question type and there is currently no pattern for the absence of a question type. Creating such patterns is not trivial because, having no question type, these questions carry very little information. The keyword spotting techniques also do not have enough information to answer the question. One possible method is as follows; consider that the previous input question was What are the indications of paracetamol?:

1. Analyse a question using the keyword spotting technique (e.g., And those of mizolastina?).
2. If no question type is found, assume it is a medical entity anaphora and treat it as one (as detailed in Section 4.2.2). Concatenate the latest entity from the Antecedent Storage to the original question (resulting in And those of mizolastina? paracetamol).
3. Re-analyse this new question with the keyword spotting technique.
4. If no question type is found, extract the question type of the latest question in the Antecedent Storage (indications).
5. Convert the question type into a template question (e.g., What are the indications of).
6. Append the original question to this template question (e.g., What are the indications of And those of mizolastina?).
7. Re-analyse this new question once again.

Through this method, it might be possible for the keyword spotting techniques to answer some of these questions. The issue with this solution is that it might become impossible for the keyword spotting technique to fail, because in the worst case it will simply return the answer to the previous question. This becomes a problem with Strategy 3: since SVM is only used when the keyword spotting technique fails, no questions would ever be answered using SVM. Another possible solution for keyword spotting would be to observe whether a question has keywords related to entities but no keywords for question types. For example, in the question And those of mizolastina?, the keyword spotting technique would detect a keyword of the type active substance due to mizolastina. However, no keywords would be found for question types such as indications or adverse reactions. Using this information, it could be possible to determine the existence of a question type anaphora.

Machine learning techniques have the problem of attempting to classify a question even if no question type is present. This can be solved either by creating a class of questions for SVM specifically for cases of anaphora, or by having a threshold score, where, if no question type candidate is found with a match above a given score, MedicineAsk assumes that the question has no question type. SVM uses a score to determine how likely it is for a question to belong to a class.
The idea would be to assume there is no question type if the score given to the chosen class is too low. The issue with a threshold score is finding the correct value. If the threshold is too high, some questions that were correctly classified, but with a low score, will now be wrongly classified; on the other hand, if the threshold is too low, no anaphora will ever be detected. A class of questions with no question type would consist of a class in the training corpus made up entirely of questions with no question type (e.g., questions such as And those of SUBSTANCIA_ACTIVA?). The issue here is the very low quantity of information in each such question of the training corpus; it would be difficult to classify a question using this method.

6.2.6 Updates to Information Extraction

Some years have passed since the last time information was extracted from the Infarmed website, and its structure has since changed. If the MedicineAsk Information Extraction module were used now, it would likely not succeed in extracting any new information from the Infarmed website. It is therefore necessary to update the MedicineAsk Information Extraction module. It would also be interesting to have an automatic information extraction system. This would make it possible to extract information from the Infarmed website even if the website suffered further changes, without having to alter any code in the information extractor.

There are also other sources of information that could be explored. For example, it was suggested to us by an acquaintance involved in the medical field that MedicineAsk should be able to distinguish adverse reactions by their different frequency levels. Information leaflets that come with medicines usually list the adverse reactions divided by frequency, such as rare adverse reactions and frequent adverse reactions. It would be useful for a user to know if the adverse reactions he or she is experiencing are common and expected, or if they are of a very rare kind that might warrant a visit to the doctor. Unfortunately, the Infarmed website does not distinguish between different degrees of frequency of adverse reactions. While there are some cases where frequency is mentioned, they are few and poorly structured. Due to the small amount of information available, adding support for questions that distinguish the frequency of adverse reactions would yield weak results. However, Infarmed also has a public database called Infomed (http://www.infarmed.pt/infomed/inicio.php). This database lists several different kinds of information about medicines, including links to files with scans of the information leaflets that come with each medicine. These information leaflets do distinguish adverse reactions based on how frequent they are. It would be interesting to extract this information and add it to the database, while also adding support for questions regarding the frequency of adverse reactions and any other type of information available in these leaflets.

6.2.7 MedicineAsk on mobile platforms

As mentioned in Chapter 1, medical information must sometimes be accessed quickly. For example, a doctor might need to confirm a diagnosis for an emergency patient, or a common user may be suffering an emergency and need to know more about one of his/her medicines. However, the current MedicineAsk system was made for traditional web browser interfaces. In emergency situations it may not be possible to access a computer in time. A mobile application of MedicineAsk could solve this issue.
Making the information on the Infarmed website available through a portable device such as a smartphone would make this information much more accessible.

6.2.8 Analysing Portuguese NL questions with information in other languages

The goal of MedicineAsk is to answer questions in the Portuguese language, and so far we have been using Portuguese resources to answer Portuguese questions. The big disadvantage is that the amount of medical information resources for the Portuguese language is very limited. Other languages, such as English, have a much greater quantity and variety of medical information. There is also much more research on English NLP than on Portuguese NLP.

There are, however, ways of answering Portuguese questions while using resources in other languages, such as English. There has been research on this topic, such as systems that answer French Natural Language questions using English information [Grau et al., 2007]. In these works two noteworthy strategies are described; we illustrate them with the example of answering a Portuguese question with English information:

The first strategy consists of translating the Portuguese question posed by the user into English, analysing it with English NLP techniques and obtaining an answer using English resources. The advantage is the ability to analyse the question with English NLP techniques (which may be superior). The disadvantage is that the question may be mistranslated, in which case no information will be extracted.

The second strategy consists of analysing the Portuguese question with Portuguese NLP techniques and then translating the extracted information (e.g. the question type and medical entities) into English, using that information to query the English resources. The advantage is that the question is analysed in the original language, meaning it will not suffer from syntactic or semantic errors caused by the translation. The disadvantage is that the translated information may lose context, because each expression is translated one by one without the help of the original question.

It would also be possible to run both methods in a hybrid fashion and use the one with better results.

Bibliography

A. R. Aronson. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. AMIA Annu Symp Proc, pages 17-21, 2001.

J. P. V. Bastos. Prontuário terapêutico: Medicine.ask (in Portuguese). Master's thesis, Instituto Superior Técnico, 2009.

A. Ben Abacha. Recherche de réponses précises à des questions médicales: le système de questions-réponses MEANS. PhD thesis, Université Paris-Sud, 2012.

A. Ben Abacha and P. Zweigenbaum. A hybrid approach for the extraction of semantic relations from MEDLINE abstracts. Computational Linguistics and Intelligent Text Processing (CICLing), 6608:139-150, 2011a.

A. Ben Abacha and P. Zweigenbaum. Medical entity recognition: A comparison of semantic and statistical methods. BioNLP, pages 56-64, 2011b.

A. Ben Abacha and P. Zweigenbaum. Automatic extraction of semantic relations between medical entities: a rule based approach. Journal of Biomedical Semantics, 2011c.

A. Ben Abacha and P. Zweigenbaum. Medical question answering: Translating medical questions into SPARQL queries. ACM SIGHIT International Health Informatics Symposium (IHI 2012), 2012.

A. R. Chaves and L. H. Rino. The Mitkov Algorithm for Anaphora Resolution in Portuguese.
Proceedings of the 8th International Conference on Computational Processing of the Portuguese Language (PROPOR '08), pages 51-60, 2008.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for Support Vector Machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.

D. Damljanovic, M. Agatonovic, and H. Cunningham. Natural language interfaces to ontologies: Combining syntactic analysis and ontology-based lookup through the user interaction. Extended Semantic Web Conference (ESWC 2010), June 2010.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 1977.

P. J. dos Reis Mota. LUP: A language understanding platform. Master's thesis, Instituto Superior Técnico, 2012.

H. Galhardas, L. Coheur, and V. D. Mendes. Medicine.Ask: a Natural Language search system for medicine information. INFORUM 2012 - Simpósio de Informática, September 2012.

B. Grau, A. Ligozat, I. Robba, A. Vilnat, M. Bagur, and K. Séjourné. The bilingual system MUSCLEF at QA@CLEF 2006. In Carol Peters, Paul Clough, Fredric C. Gey, Jussi Karlgren, Bernardo Magnini, Douglas W. Oard, Maarten de Rijke, and Maximilian Stempfhuber, editors, Evaluation of Multilingual and Multi-modal Information Retrieval, volume 4730 of Lecture Notes in Computer Science, pages 454-462. Springer Berlin Heidelberg, 2007. ISBN 978-3-540-74998-1. doi: 10.1007/978-3-540-74999-8_54. URL http://dx.doi.org/10.1007/978-3-540-74999-8_54.

J. R. Hobbs. Resolving pronoun references. Lingua, 44:311-338, 1978.

C. Jacquemin. Spotting and discovering terms through natural language processing. The MIT Press, 2001.

E. Kaufmann and A. Bernstein. How useful are natural language interfaces to the semantic web for casual end-users? European Semantic Web Conference (ESWC 2007), 2007.

A. Leuski and D. R. Traum. Practical language processing for virtual humans. Innovative Applications of Artificial Intelligence (IAAI-10), pages 1740-1747, 2010.

F. Li and H. V. Jagadish. Constructing an interactive natural language interface for relational databases. Proceedings of the VLDB Endowment, 8, 2014.

B. Loni, G. van Tulder, P. Wiggers, D. M. J. Tax, and M. Loog. Question classification by weighted combination of lexical, syntactic and semantic features. In Text, Speech and Dialogue. Springer Berlin Heidelberg, 2011. ISBN 978-3-642-23537-5.

N. Mamede, J. Baptista, C. Diniz, and V. Cabarrão. STRING: an hybrid statistical and rule-based Natural Language Processing Chain for Portuguese. PROPOR '12 (Demo Session), Coimbra, Portugal, 2012.

I. Marcelino, G. Dias, J. Casteleiro, and J. Martinez. Semi-controlled construction of the European Portuguese Unified Medical Language System. Workshop on Finding the Hidden Knowledge: Text Mining for Biology and Medicine (FTHK 2008), 2008.

J. S. Marques. Anaphora resolution in Portuguese: An hybrid approach. Master's thesis, Instituto Superior Técnico, 2013.

V. D. Mendes. Medicine.ask: an extraction and search system for medicine information. Master's thesis, Instituto Superior Técnico, 2011.

R. Mitkov. Anaphora Resolution. Pearson Prentice Hall, 2002.

G. Savova, J. Masanz, P. Ogren, J. Zheng, S. Sohn, K. Kipper-Schuler, and C. Chute. Mayo clinic clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association (JAMIA), 17:507-513, 2010.

G. F. Simões. e-txt2db: Giving structure to unstructured data. Master's thesis, Instituto Superior Técnico, 2009.
O. Uzuner, B. R. South, S. Shen, and S. L. DuVall. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association (JAMIA), June 16, 2011.

D. Zhang and W. S. Lee. Question classification using support vector machines. In SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 26-32, 2003.

Appendix A Questionnaire used to obtain test corpora

This appendix shows the questionnaire that was sent through Facebook to collect questions to test MedicineAsk. It starts with a brief explanation of MedicineAsk, as well as of what was expected from the user. It is then followed by 12 scenarios for which the user should propose one or more question formulations. Scenarios 5 and 6 were not used in any experiment. Scenario 9 was the scenario that encouraged users to use anaphora.

MedicineAsk

O MedicineAsk é um sistema que permite a pesquisa de informação sobre medicamentos em língua natural, isto é, a linguagem que se usa no dia a dia. Por exemplo, para saber quais os medicamentos para a febre, basta perguntar ao MedicineAsk: Quais os medicamentos para a febre?. No quadro de uma tese de mestrado, precisamos da sua ajuda na recolha de diferentes formulações de perguntas sobre medicamentos. Para isso, gostaríamos que lesse os cenários apresentados abaixo e, de seguida, escrevesse uma ou mais perguntas que formularia ao MedicineAsk para resolver cada um dos cenários. As perguntas não necessitam de ser muito complexas, ou de ter muitos detalhes. Estimamos que necessite de 5 minutos para completar o questionário. Agradecemos desde já a sua participação.

Cenário 1
O João encontrou uma caixa de Efferalgan® sem o panfleto informativo e gostaria de saber para que serve este medicamento. Sugira uma pergunta a submeter ao MedicineAsk para obter essa informação.

Cenário 2
Também relativamente ao medicamento Efferalgan®, o João gostava de saber que efeitos indesejáveis poderão ocorrer quando tomar esse medicamento. Qual seria uma possível pergunta a fazer ao sistema, para responder a esta questão?

Cenário 3
O João pretende saber quais os cuidados que deve ter antes de tomar o medicamento Efferalgan®. Qual seria uma possível pergunta a fazer ao sistema, para responder a esta questão?

Cenário 4
Foi receitada ao filho do João a substância activa mizolastina. O João pretende saber qual a dosagem indicada da mizolastina para crianças. Qual seria uma possível pergunta a fazer ao sistema, para responder a esta questão?

Cenário 5
O João pretende anotar a dosagem recomendada de Mizollen® e Efferalgan®, num bloco de notas em sua casa. Decide então fazer uma única pergunta para obter as dosagens recomendadas de ambos os medicamentos. Qual seria uma possível pergunta a fazer ao sistema, para responder a esta questão?

Cenário 6
O João quer ter a certeza de que os medicamentos que toma, Mizollen® e Efferalgan®, não têm interacções perigosas entre si. Qual seria uma possível pergunta a fazer ao sistema, para responder a esta questão?

Cenário 7
O João toma paracetamol, comprando marcas como o Panadol®. No entanto, deseja passar a tomar medicamentos genéricos da substância activa paracetamol. Qual seria uma possível pergunta a fazer ao sistema, para responder a esta questão?

Cenário 8
O João pretende saber quais são os medicamentos mais baratos da substância sinvastatina.
Qual seria uma possível pergunta a fazer ao sistema, para responder a esta questão?

Cenário 9
O João pretende saber o preço de medicamentos com ácido acetilsalicílico. Sabendo que o João já fez uma pergunta sobre as indicações do ácido acetilsalicílico e já obteve a sua resposta, ambas mostradas em baixo, como formularia a pergunta sobre o preço dos medicamentos com ácido acetilsalicílico?
João: Quais as indicações do ácido acetilsalicílico?
MedicineAsk: Indicações: Dor ligeira a moderada; pirexia. Profilaxia secundária de acidentes cardio e cerebrovasculares isquémicos.
João:

Cenário 10
O João pediu ao seu médico que lhe receitasse medicamentos para a acne nodular que não interajam com a Vitamina A. Qual seria uma possível pergunta que o médico do João podia fazer ao sistema, para responder a esta questão?

Cenário 11
O João pediu ao seu médico para receitar um medicamento semelhante ao Mizollen® que não provoque sonolência como efeito secundário. Qual seria uma possível pergunta que o médico do João podia fazer ao sistema, para responder a esta questão?

Cenário 12
O João perguntou ao seu médico que medicamentos pode tomar para a hipertensão, que não exijam cuidados com a asma. Qual seria uma possível pergunta que o médico do João podia fazer ao sistema, para responder a esta questão?

Appendix B Dictionary used to identify named medical entities in a user question (Excerpt)

This appendix shows an excerpt of the dictionary of named medical entities. Both the keyword spotting technique and SVM use this dictionary to find entities in user questions.

SUBSTANCIA_ACTIVA ALFACALCIDOL
SUBSTANCIA_ACTIVA PIRITIONA ZINCO
SUBSTANCIA_ACTIVA PIMECROLIMUS
SUBSTANCIA_ACTIVA TROPISSETROM
SUBSTANCIA_ACTIVA CIAMEMAZINA
SUBSTANCIA_ACTIVA FLUFENAZINA
SUBSTANCIA_ACTIVA ISONIAZIDA + PIRAZINAMIDA + RIFAMPICINA
SUBSTANCIA_ACTIVA CICLOBENZAPRINA
SUBSTANCIA_ACTIVA AZATIOPRINA
SUBSTANCIA_ACTIVA PRULIFLOXACINA
SUBSTANCIA_ACTIVA DARIFENACINA
SUBSTANCIA_ACTIVA PARACETAMOL
SUBSTANCIA_ACTIVA TERBINAFINA
SUBSTANCIA_ACTIVA NICERGOLINA
SUBSTANCIA_ACTIVA ACIDO ACETILSALICILICO + ACIDO ASCORBICO
SUBSTANCIA_ACTIVA LINCOMICINA
SUBSTANCIA_ACTIVA BETAMETASONA
SUBSTANCIA_ACTIVA MEBENDAZOL
SUBSTANCIA_ACTIVA SITAGLIPTINA
MEDICAMENTO Falcitrim
MEDICAMENTO Malarone
MEDICAMENTO Resochina
MEDICAMENTO Halfan
MEDICAMENTO Plaquinol
MEDICAMENTO Mephaquin Lactab
MEDICAMENTO Flagentyl
MEDICAMENTO Tavegyl
MEDICAMENTO Viternum
MEDICAMENTO Drenoflux
MEDICAMENTO Fenistil
MEDICAMENTO Atarax
MEDICAMENTO Primalan
MEDICAMENTO Tinset
MEDICAMENTO Fenergan
MEDICAMENTO Dinaxil
MEDICAMENTO Actifed
MEDICAMENTO Zyrtec
MEDICAMENTO Azomyr
MEDICAMENTO Aerius
CONDICAO_MEDICA acne nodular
CONDICAO_MEDICA sonolência
CONDICAO_MEDICA sonolencia
CONDICAO_MEDICA sono
CONDICAO_MEDICA hipertensão
CONDICAO_MEDICA hipertensao
CONDICAO_MEDICA hipertensor
CONDICAO_MEDICA tensão arterial
CONDICAO_MEDICA tensao arterial
CONDICAO_MEDICA antihipertensores
CONDICAO_MEDICA asmáticos
CONDICAO_MEDICA asmaticos
CONDICAO_MEDICA anti-hipertensores

Appendix C Training Corpus B Excerpt

This appendix shows an excerpt of Training Corpus B. This is the training corpus which MedicineAsk currently uses to train SVM.

QT_INDICACOES Quais as indicações da NE_SUBSTANCIA_ACTIVA?
QT_INDICACOES As indicações da NE_SUBSTANCIA_ACTIVA são quais?
QT_INDICACOES O NE_SUBSTANCIA_ACTIVA é indicado para quê?
QT_INDICACOES Para que é indicado o NE_SUBSTANCIA_ACTIVA?
QT_INDICACOES O NE_SUBSTANCIA_ACTIVA é indicado em que casos?
QT_INDICACOES Para que serve NE_MEDICAMENTO?
QT_INDICACOES Qual o objectivo do NE_MEDICAMENTO?
QT_INDICACOES Para que serve o NE_MEDICAMENTO?
QT_INDICACOES Qual e a bula do NE_MEDICAMENTO?
QT_INDICACOES O que e NE_MEDICAMENTO e para que e utilizado?
QT_PRECO_BARATO Quais os medicamentos mais baratos da NE_SUBSTANCIA_ACTIVA?
QT_PRECO_BARATO Quais os medicamentos mais económicos da NE_SUBSTANCIA_ACTIVA?
QT_PRECO_BARATO Quais os medicamentos mais em conta da NE_SUBSTANCIA_ACTIVA?
QT_PRECO_BARATO Quais os medicamentos mais acessíveis da NE_SUBSTANCIA_ACTIVA?
QT_PRECO_BARATO Quais os medicamentos de preço mais baixo da NE_SUBSTANCIA_ACTIVA?
QT_MEDICAMENTOSEM_PRECAUCOES Quais os medicamentos para o NE_CONDICAO_MEDICA que não exijam precauções com o NE_CONDICAO_MEDICA?
QT_MEDICAMENTOSEM_PRECAUCOES Quais os medicamentos para o NE_CONDICAO_MEDICA que não tenham precauções com o NE_CONDICAO_MEDICA?
QT_MEDICAMENTOSEM_PRECAUCOES Quais os medicamentos para o NE_CONDICAO_MEDICA que não exijam cuidados com o NE_CONDICAO_MEDICA?
QT_MEDICAMENTOSEM_PRECAUCOES Quais os medicamentos que não exijam cuidados com o NE_CONDICAO_MEDICA para o NE_CONDICAO_MEDICA?
QT_MEDICAMENTOSEM_PRECAUCOES Quais os medicamentos para o NE_CONDICAO_MEDICA que são adequados em caso de NE_CONDICAO_MEDICA?
QT_MEDICAMENTOGENERICOS Que medicamentos de marca generica possuem o NE_MEDICAMENTO como substancia activa?
QT_MEDICAMENTOGENERICOS Qual o generico que substitui o NE_MEDICAMENTO?
QT_MEDICAMENTOGENERICOS Quais as marcas de genericos de NE_MEDICAMENTO é que existem?
QT_MEDICAMENTOGENERICOS Quais os genericos do NE_MEDICAMENTO?
QT_MEDICAMENTOGENERICOS Quais sao os genericos do NE_MEDICAMENTO?