
International Journal of Engineering Trends and Technology (IJETT) – Volume 28 Number 4 - October 2015
Web-Based Automatic Health Information Guide
Premchand#1, Yogesh Rai#2
#1 M.Tech. Student, #2 Assistant Professor
Department of Computer Science, Shree Institute of Science and Technology, Bhopal, India
Abstract: A question answering system is
characterized by its efficiency in producing
specific answers to natural language questions. With
internet access now continuously available,
generating answers through a question answering
system can be very effective: the system extracts
the required information from the webpages
returned by a web search on the question.
Because existing systems perform poorly on complex
medical-domain questions, an efficient system is
highly required. Websites exist that
provide information for physicians and for end
users (patients, internet users), but these systems
return not only the required, specific answer
but also much irrelevant information and
unneeded explanation, which may not be
helpful. The paper illustrates how these limitations
can be overcome and select only specific
Keywords: Question answering System, natural
language, Medical domain
I. Introduction
In the last few decades, a lot of work has been done on
the development of question answering systems, but
there is still room for further development, as
they are limited to short queries. They fail to
answer questions made of more than two
sentences, and performance starts to decrease
drastically when query length grows beyond 20-25
words. Most of the available systems take their input
in a structured or semi-structured way. Existing
health information systems often have difficulty
supporting long user queries, which can also be
In recent times, the number of internet users has
increased drastically. Nowadays, many ordinary users
who access web-based health information systems
are unable to write formal English. They
give their information as informal text, so the system
does not identify the correct question and fails to
answer it. Also, the system returns a lot of
irrelevant information that may not be
required by the user, and sometimes, if the system fails
to provide an answer, it gives an additional link,
which again needs to be accessed and
may or may not be relevant.
Usually, when a patient falls ill, it may take
some time to reach a doctor. Meanwhile, the
patient's condition can worsen, which
could be avoided if he/she gets the correct information.
ISSN: 2231-5381
If the user learns, before meeting the doctor, about the
medical tests that may be required, the patient can
save precious time, and treatment can start at the
first meeting with the doctor.
Today, users are well informed: they gather
all the information about a problem before going
to the doctor, and so can get better treatment.
In the clinic, moreover, a patient gets only the medicine
for the problem concerned. Normally, the physician gives
medicine to the patient but not information,
although this information can save the patient a lot of
time and helps to prevent the same problem
In this work, we aim to develop a system
for end users that is able to generate the most
specific information the user is searching for. The
system is capable of handling very long user queries
(around 100 words). Apart from the disease
description, we also provide the tests required,
probable symptoms and common home remedies.
First we analyze the user's (very long) query
and, from the information in the question, generate a
template. The template contains the important words,
such as disease name, symptoms, age, sex and
disease duration. We use this information to form the
query that is used for web search. From the
web, we retrieve the information. Unlike other
systems, we use the user's question not only at
the beginning but also for the final selection
of information. The health information retrieved
from various web pages is compared again with
the information from the user's
question, to get the most specific answer.
The following sections describe the system
in detail. Section 2 presents a study of existing
systems. Section 3 discusses our
system. Section 4 presents the results and section 5
presents the discussion.
II. Related Works
Question answering can be considered an
advanced form of information retrieval. A
question answering system is mainly focused on
the extraction of information. Related work on
document generation includes generic question answering
with summaries (Amigo et al., 2004), question answering
with query-focused summaries (Dang, 2005), and
clustering-based information retrieval (Hearst and
Pedersen, 1996; Dumais et al., 2001; Lawrie and
Croft, 2003).
The application of machine learning to
medical diagnosis started a few decades
ago. An example of an early medical diagnosis
system is MYCIN. MYCIN was developed in the
early 1970s for the identification of
infection-causing bacteria and to recommend
antibiotics based on the patient's body weight, but it
was never used in practice. In the 1980s several attempts
were made to develop automatic
medical diagnosis systems (AMDS), but
these systems mostly focused on a particular type of disease:
craniostenosis syndrome diagnosis (Baim, 1988),
dermatoglyphic diagnosis (Chan & Wong, 1989), and
cardiology (Bratko et al., 1989). In the 1990s a number of
works were carried out on automatic diagnosis system (DS)
development. Cestnik (1990) developed a system
for medical problems using the m-estimate of
probabilities in a naive Bayesian classifier.
Kononenko (1991) used a semi-naive Bayesian
classifier to detect dependencies between
attributes. Spiegelhalter et al. (1993) developed a
system that diagnoses heart disease in newborn
babies, using a naive Bayesian classifier with
an efficiency of 67.3%.
Work on question answering related to patient
care started in the early 1990s. Doctors used to
collect patient information in ways that
were helpful to them. In 1993, Cimino et al.
proposed a framework that analyses ad hoc
medical questions such as “What is the treatment of X
disease”, which can be mapped to “How to treat X
disease”. In 1999, the Text REtrieval Conference
(TREC) introduced the QA track, focused on answering
factoid questions. Since 2003, TREC has
looked at definitional questions, which need
complex and long answers. TREC also introduced
passage retrieval for question answering in the
genomics domain. Automatically analyzing clinical
questions is an important step towards answering
them. Physicians often ask complex
and verbose questions comprising a wide variety of
types. Following the objectives of the QA track at
TREC, question answering development has mainly
focused on improving answer extraction
performance. Some well-performing systems are
PubMed, MedQA and AskHERMES. They work with
definitional questions like “What is X?”, which
require long and complex answers. These
systems are made mainly for experts/physicians,
who get related informational answers
(possible tests, symptoms, precautions, etc.) about
the disease, which help them give better treatment
to the patient.
Many systems are available in the field of
medical question answering. The number of internet users
is increasing rapidly, but the available systems
were not developed for a general user/patient, to
provide medical help for every health
problem. These systems perform well but are
limited to short queries (normally up to 30-35 words). So there is a need for a system
that can work on a general user's/patient's query,
which may be informal and very long, and
provide results for any disease.
III. Our System
We have built a system that provides relevant
health information in answer to user-submitted
diagnosis-related queries. The long queries are
preprocessed and searched on the web, and the retrieved
pages are processed to extract the relevant information.
The overall architecture of the system is given in
Figure 1. The system consists of five core modules:
question identification, question analysis, query
generation, web search, and information extraction
together with finalization of the coherent answer. The
individual modules are discussed below.
Fig 1: System Architecture
A. Question Identification:
The system is designed to handle health questions
posed by any user, and there is a possibility that a
question does not belong to that type. The system
is a closed-domain system that works on medical
or health-related questions, so it is
important to identify the nature of the question. This
step detects whether a question posed by a user is a
health question. The system classifies each question into
one of two categories: medical and non-medical. For
example, “What are the symptoms of seizures” is
classified as a medical question and “What
medications are prescribed for seizures” is classified
as a non-medical question.
To achieve this task, we have collected a
large number of questions (posted by real users)
from different websites, belonging to both
categories: medical and non-medical. From these
questions, we have extracted some specific n-grams.
Each n-gram carries a value representing the probability of
belonging to the medical or the non-medical domain. When a user submits a question,
the system searches it for these specific n-grams and
sums up their values. Based on the sum, it
decides whether the question falls under the
medical domain and whether it should be accepted.
This module accepts only medical questions and
forwards them for further processing; all others are rejected.
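The scoring step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper does not state how the n-gram values are computed, so the smoothed log-odds weight, the threshold of 0, the use of unigrams and bigrams, and the four sample questions are all assumptions made for the example.

```python
from collections import Counter
import math
import re

def ngrams(text, n_max=2):
    """Return all word n-grams (n = 1..n_max) of a lower-cased question."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]

def train_weights(medical, non_medical, n_max=2):
    """Give each n-gram a smoothed log-odds weight (positive = medical).

    Assumption: the paper only says each n-gram has a probability-like value.
    """
    med = Counter(g for q in medical for g in ngrams(q, n_max))
    non = Counter(g for q in non_medical for g in ngrams(q, n_max))
    med_total, non_total = sum(med.values()), sum(non.values())
    vocab = set(med) | set(non)
    return {g: math.log((med[g] + 1) / (med_total + len(vocab)))
               - math.log((non[g] + 1) / (non_total + len(vocab)))
            for g in vocab}

def is_medical(question, weights, threshold=0.0, n_max=2):
    """Sum the weights of the known n-grams and compare with a threshold."""
    score = sum(weights.get(g, 0.0) for g in ngrams(question, n_max))
    return score > threshold

# Tiny hypothetical training sets; a real system would use many more questions.
MEDICAL = ["what are the symptoms of seizures",
           "i have fever and headache since two days"]
NON_MEDICAL = ["what is the best laptop for gaming",
               "how do i reset my email password"]
weights = train_weights(MEDICAL, NON_MEDICAL)

print(is_medical("symptoms of fever", weights))
print(is_medical("best gaming laptop", weights))
```

Unknown n-grams contribute nothing to the sum, so a question made only of unseen words is rejected under the assumed threshold.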
B. Question Analysis:
Once the question is accepted, it is passed on
for analysis. This part is the soul of
the system, because the medical terms on which
the final result depends are identified here. We
extract the important data and classify it according to
its type, such as disease, symptoms, sex or
duration; this is the toughest job of the project.
We have observed that many questions are
quite lengthy and that users often write informal
English with many grammatical and spelling errors.
So we run a set of preprocessing phases that
correct the question and extract specific
information such as disease name, sex, age and symptoms.
This information plays a vital role in later steps, so
it is represented as a template.
Every word in the question is important, because
the information search and the answer
are based on these words. So there should not be any
spelling errors. For spelling error identification and
correction we have used a dictionary of common English
words and a dictionary of disease names. We have
used various relevant web sources to prepare the
disease and medicine name dictionaries. If a word
is not in any of the dictionaries, it is detected
as a misspelled word. The spelling correction is
done using the edit distance method.
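A dictionary lookup followed by edit-distance correction could look like the sketch below. It assumes plain Levenshtein distance, a maximum correction distance of 2, and two tiny hypothetical dictionaries; the paper does not specify the distance variant, the cut-off, or how ties are broken.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def correct_word(word, dictionary, max_distance=2):
    """Return the word itself if known, else the closest dictionary entry.

    Assumption: words farther than max_distance from every entry are left
    unchanged, and ties are broken alphabetically.
    """
    w = word.lower()
    if w in dictionary:
        return w
    best = min(dictionary, key=lambda entry: (edit_distance(w, entry), entry))
    return best if edit_distance(w, best) <= max_distance else word

# Hypothetical miniature dictionaries standing in for the real ones.
COMMON_WORDS = {"i", "have", "from", "suffering", "since"}
DISEASE_NAMES = {"diabetes", "asthma", "depression"}
DICTIONARY = COMMON_WORDS | DISEASE_NAMES

print(correct_word("diabetis", DICTIONARY))
print(correct_word("asthama", DICTIONARY))
```

Scanning the whole dictionary for every unknown word is simple but slow for large vocabularies; an index over candidate words would be a natural refinement.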
Next we aim to fill a template by extracting the
appropriate entries from the user question. The
template contains the following fields: disease
name, symptom, sex, age, duration of suffering, and
past medicine history (if any). The task of template
filling, i.e. extraction of the corresponding n-grams, is
done by defining a set of rules. Here we take the help of
the NLTK toolkit, which provides basic NLP analysis
of the text such as part-of-speech tags, parse information
and named entity recognition. Examples of the rules
used for template extraction are given below.
suffering from 20 years (yr, yrs, months): in
this sequence, the combination of the 'CD' and
'NN' or 'NNS' tagged words is selected as
disease duration.
from over 20 years (yr, yrs, months): in this
sequence, the combination of the 'CD' and
'NN' or 'NNS' tagged words is selected as
disease duration.
5 years old: in this sequence, the
combination of the 'CD' and 'NN' or 'NNS' tagged
words is selected as age.
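The three example rules above can be sketched over part-of-speech tagged tokens. The input is a list of (token, tag) pairs with Penn Treebank tags, which is the shape that NLTK's `pos_tag` returns; the tokens here are tagged by hand so that the example runs without NLTK data files. The time-unit list, the two-token look-back for "from", and the function name are assumptions for illustration.

```python
# Time units mentioned in the example rules (plus singular forms, an assumption).
TIME_UNITS = {"year", "years", "yr", "yrs", "month", "months"}

def extract_age_and_duration(tagged):
    """Apply the CD + NN/NNS rules to a list of (token, tag) pairs."""
    template = {"age": None, "duration": None}
    for i in range(len(tagged) - 1):
        (num, num_tag), (unit, unit_tag) = tagged[i], tagged[i + 1]
        if num_tag != "CD" or unit_tag not in ("NN", "NNS"):
            continue
        if unit.lower() not in TIME_UNITS:
            continue
        phrase = f"{num} {unit}"
        following = tagged[i + 2][0].lower() if i + 2 < len(tagged) else ""
        preceding = [tok.lower() for tok, _ in tagged[max(0, i - 2):i]]
        if following == "old":
            template["age"] = phrase          # "5 years old"
        elif "from" in preceding:
            template["duration"] = phrase     # "from 20 years", "from over 20 years"
    return template

# Hand-tagged tokens for: "My son is 5 years old and suffering from asthma
# from over 2 years"
tagged_question = [
    ("My", "PRP$"), ("son", "NN"), ("is", "VBZ"), ("5", "CD"),
    ("years", "NNS"), ("old", "JJ"), ("and", "CC"), ("suffering", "VBG"),
    ("from", "IN"), ("asthma", "NN"), ("from", "IN"), ("over", "IN"),
    ("2", "CD"), ("years", "NNS"),
]
print(extract_age_and_duration(tagged_question))
```

Keeping the rules as small, ordered checks makes it easy to add further patterns (for example "since 2000") without touching the existing ones.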
C. Web Search:
When a website gets a sentence to
answer, it selects only some specific words from it,
based on its own strategy, and generates the answer
from those selected words. So in order to get
the best result for the user's question, we generate a
query that condenses the question. The query can be
formed in two ways: from the most frequent and
relevant words/phrases, or from the template
information. We generate the query from the
template information.
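Query formation from the template might be sketched as below. The field names, the two facets ("treatment" and "medicine"), and the "with <symptoms>" suffix are assumptions modelled on the example queries reported in the results section; the paper does not give the actual query patterns.

```python
def build_queries(template, facets=("treatment", "medicine")):
    """Turn a filled template into short search queries.

    Assumption: the disease name is mandatory, and symptoms are appended
    only to the treatment query.
    """
    disease = template.get("disease")
    if not disease:
        return []  # without a disease name no query can be formed
    symptoms = template.get("symptoms") or []
    queries = []
    for facet in facets:
        query = f"{disease} {facet}"
        if facet == "treatment" and symptoms:
            query += " with " + " ".join(symptoms)
        queries.append(query)
    return queries

# Hypothetical template produced by the question analysis step.
template = {"disease": "bipolar disorder",
            "symptoms": ["depression"],
            "duration": "15 years"}
print(build_queries(template))
```

Returning an empty list when the disease field is missing mirrors the limitation noted later in the paper: if no disease name is identified, no result is produced.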
Now we search the web using the generated
query. Our objective is to extract the most
important information that is relevant to the disease.
Webpages contain not only
information about the disease as text,
but also various advertisements, forms and
irrelevant links, and the required information may be
spread over various inner links or tabs. There is no
uniformity in page structure, so it is not easy to
extract the required information from
webpages of such varied types. Additionally, the
information that we aim to extract must be
authentic, so that a user can rely on it. So we cannot
search just any website (or any medical website).
Medical information should be authentic and
regularly updated.
To achieve this, we have made a list of trusted
websites from which the information will be
extracted. In order to compile the list we use
various web sources and web reviews like,
Our trusted website list for this domain contains
36 websites. Some of them are given in Table 1.
Table 1: Trusted Websites
Now from the web search results we pick only the
pages that belong to the trusted website list.
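Filtering search results against the trusted list can be sketched with the standard library. The two domains below are placeholders, since the actual 36 trusted websites are not reproduced in this text; matching on the host name (including subdomains) is also an assumption.

```python
from urllib.parse import urlparse

# Placeholder domains; the paper's real trusted list is not shown here.
TRUSTED_DOMAINS = {"trusted-health.example", "clinic.example"}

def is_trusted(url, trusted=TRUSTED_DOMAINS):
    """True if the URL's host is a trusted domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in trusted)

def filter_trusted(urls, trusted=TRUSTED_DOMAINS):
    """Keep only the search results hosted on trusted websites, in order."""
    return [u for u in urls if is_trusted(u, trusted)]

search_results = [
    "https://www.trusted-health.example/asthma",
    "http://random-blog.example/asthma",
    "https://clinic.example/a?b=1",
    "https://nottrusted-health.example/x",
]
print(filter_trusted(search_results))
```

Comparing on a leading dot avoids accepting look-alike hosts such as `nottrusted-health.example`.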
D. Information Extraction:
The system accesses each webpage as HTML,
and in order to select any sentence or word(s), their
related tags, with their information, are selected. At
the start, however, certain tags that hold irrelevant
information are identified and removed. This reduces the
webpage content to a great extent without
removing any important information, and makes
the identification of relevant information easier.
By selecting specific tags, the system then picks
the most appropriate sentences from the webpage.
The system extracts information such as dosage of use, etc.
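The tag-removal step can be illustrated with Python's built-in `html.parser`. The set of tags treated as irrelevant is an assumption, since the paper does not list the tags it removes, and a real page would usually need more selective rules than this.

```python
from html.parser import HTMLParser

# Assumed list of tags whose content is treated as irrelevant.
SKIP_TAGS = {"script", "style", "nav", "footer", "form", "aside"}

class TextExtractor(HTMLParser):
    """Collect visible text while ignoring everything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text and self.skip_depth == 0:
            self.chunks.append(text)

def extract_text(html):
    """Return the text chunks that remain after dropping skipped tags."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks

page = ("<html><head><style>p{color:red}</style></head><body>"
        "<nav><a href='/'>Home</a></nav><p>Asthma causes wheezing.</p>"
        "<script>track();</script>"
        "<form><input name='q'>Subscribe</form>"
        "<p>Use an inhaler as prescribed.</p></body></html>")
print(extract_text(page))
```

The remaining chunks are the candidates from which the most appropriate sentences would then be selected.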
In order to achieve better results,
we currently search the query on
two websites, whose results can be combined
to get more specific results. Searching
the query on more than one website gives better
results because a website sometimes fails to provide
any information on a disease, and in that case the other
website's result can be helpful. In the selected
webpages, there may be a lot of content other than
the required information.
The same type of information can be available on
more than one webpage. The system should not
present all information of the same type as the result,
but only unique data. So we keep an internal list
of the names of the required information types, and after
a piece of information is extracted, its name is
removed from the list. This helps to search for the
remaining information on other websites in less time.
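The internal list of required information names can be sketched as follows. Each page is represented as a dictionary from information type to extracted text, which is a hypothetical output format for the extraction step; the four category names follow the ones listed in the results section.

```python
def collect_unique(pages, required=("tests", "precautions",
                                    "symptoms", "home remedies")):
    """Fill each required information type once, from the first page that has it.

    Returns the collected answer and the list of types still missing.
    """
    remaining = list(required)
    answer = {}
    for page in pages:            # pages are visited in ranked order
        if not remaining:
            break                 # everything found, stop early
        for name in list(remaining):
            text = page.get(name)
            if text:
                answer[name] = text
                remaining.remove(name)  # never extract this type again
    return answer, remaining

# Hypothetical per-page extraction results from two websites.
pages = [
    {"symptoms": "Wheezing and shortness of breath.",
     "tests": "Spirometry."},
    {"symptoms": "Coughing at night.",
     "precautions": "Avoid known triggers."},
]
answer, missing = collect_unique(pages)
print(answer)
print(missing)
```

Because a name is removed as soon as it is filled, later pages are only checked for the types that are still missing, which is what saves time.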
IV. Result and Discussion
We have worked on a set of complex questions,
collected from online users and
some common users. We ran the system on these
questions to get the relevant health information
with the related instructions.
For example, I am suffering from bipolar
disorder since year 2000 ,at that time i got a
maniac attack ,after that i did not get any episodes
for next 3-4 years then in year 2006 i got
depression ,and in year 2007 again depression
,then in year 2012 out of the blue moon i got
maniac attack. since then i am on medication
depakote + licab XL daily at night .I would like to
know whether continuing the tablets for long will
have any side effects ,and whether this is the best
treatment ?
The system accepts this question as a medical
diagnosis question and processes it. During the analysis
phase, the system identifies the important words, such as
bipolar disorder as disease name, 15 years as
disease duration, and maniac attack and depression as
additional symptoms. In the next phase, the system creates
short queries based on this information, such as bipolar
disorder treatment with headache and bipolar disorder
medicine. These queries are searched on different
medical websites to get the health information.
Currently, we are working with two websites: and System checks
many pages of web results but selects only the
specific web links. From the results of the different
webpages, the system retrieves only health information
like tests required, precautions, disease
symptoms and home remedies. Our system works
well, but with some limitations. If the system fails
to identify the disease name, then there will be no
result. For example, I have massive body swelling
in my legs feet and stomach, I'm not on any current
medication that could be side effects. I'm weighing
5lbs more then my regular weight. This has been
going on for approximately 6 days. Should I seek
medical attention? The system accepts this question but
fails to produce a result, because it is unable to
identify the disease name.
We have evaluated our system based on the
relevance of its answers. We are not
considering any other measure. Currently, the
development of our system is at an early stage, and
the data resources it has indexed are very limited.
Selection of the final prescribed medicine would be
more efficient with the addition of more data
resources. Moreover, the system works well on
complex medical questions. Currently, the system
gives limited results, providing only health
information but not the medicines (which may be
very best prescribed medicine or supplement
References
[1] Cao Y., Liu F., Simpson P., Antieau L., Bennett A., Cimino
J., Ely J., Yu H. (2011). AskHERMES: An Online Question
Answering System for Complex Clinical Questions. Journal
of Biomedical Informatics 44 (2011), 277–288.
[2] Spiegelhalter D.J., Dawid A.P., Lauritzen S.L. and
Cowell R.G. 1993. Bayesian analysis in expert systems.
Statistical Science, 8(3), 219-283.
[3] Voorhees EM. 1999. The TREC-8 question answering track
report. In: Proceedings of TREC.
[4] J. Ely, J. Osheroff, M. Ebell, G. Bergus, B. Levy, M.
Chambliss, and E. Evans. 1999. Analysis of questions asked
by family doctors regarding patient care. BMJ, 319, 358–
[5] Yu H., and Yong-gang Cao. 2008. Automatically Extracting
Information Needs from Ad Hoc Clinical Questions. AMIA
Annual Symposium Proceedings. 2008. 96–100.
[6] Igor Kononenko. 2001. Machine learning for medical
diagnosis: history, state of the art and perspective. Artificial
Intelligence in Medicine. August 2001. Volume 23, Issue
1, Pages 89–109.
[7] Blair-Goldensohn S., McKeown K.R., Schlaikjer A.H.
2003. A Hybrid Approach for QA Track Definitional
Questions. In Proceeding of the 12th Annual Text Retrieval
[8] Hersh W, Cohen A, Ruslen L, Roberts P. 2007. In the TREC
genomics track conference 2007.
[9] Hersh W, Cohen A, Roberts P, Rekapalli H. 2006. In the
TREC genomics track conference 2006.
[10] Yu H, Lee M, Kaufman D, Ely J, Osheroff JA, Hripcsak G,
Cimino J. 2007. Development, implementation, and a
cognitive evaluation of a definitional question answering
system for physicians. Journal of Biomedical Informatics.
June 2007. 40(3), 236–251.
[11] Cao Y.G., James J. Cimino, Ely J., Yu H. 2010.
Automatically extracting information needs from complex
clinical questions. Journal of Biomedical Informatics 43
(2010) 962–971.