
An Efficient Approach for Extraction of
Disease-Treatment Relation
Ancy Sudhakar
M-tech Computer Science and Engineering,
KMCT College of Engineering Calicut, Calicut University
Calicut City, India
aachasudhakar@gmail.com
Abstract: Machine learning and natural language processing techniques are used in various
medical domains to extract relevant information. In this paper we describe an efficient
health care system that identifies relations between diseases and treatments. It also
identifies the symptoms associated with a particular disease. The Multinomial Naive Bayes
algorithm is used for relation extraction, preceded by stop word removal and a stemming
algorithm. The proposed system supports efficient medical decision making.
Keywords: Health care system, Machine learning, Medline, Natural Language Processing,
Stop Word Removal.
I. INTRODUCTION
Health is the level of functional or metabolic efficiency of a living organism, and people are
increasingly concerned about it. Nowadays many tools, such as Google Health and Microsoft
HealthVault, help us manage our health. These tools have drawbacks, however: they have
difficulty extracting information from a collection of data, and their classification
performance is weak. The proposed system addresses these problems and provides more reliable
access to information.
In this work we make use of medical databases such as Medline, in which relevant medical
articles are published. Most people have no time to read a complete article to learn about a
particular disease and its treatment, so they need a system that extracts the relevant
information from these articles easily and efficiently. The proposed system does this using
a text mining approach. It extracts information about diseases, treatments and symptoms, and
three semantic relations between diseases and treatments: cure, prevent and side effect.
To extract this information, all unwanted content in the article must first be eliminated.
Methods such as stop word removal, repeated word removal and the Multinomial Naive Bayes
algorithm are used for this purpose.
The methods used in this system present only informative sentences to the users, so that
users, especially doctors, need not spend much time reading these articles to learn about a
particular disease. In addition, the classification performance is improved.
II. LITERATURE SURVEY
The most important prior work is that of Rosario and Hearst [1], who distributed the data set
used in this work. The data set consists of sentences from Medline abstracts annotated with
diseases and treatments. Their work uses Hidden Markov Models and deals with mapping
biomedical information into a structured format. They consider seven relations: cure,
prevent, side effect, only disease, only treatment, no cure and vague. For extracting the
semantic relations they make use of the Naive Bayes algorithm.
Information extraction is costly, and a method for decreasing its cost is described in the
work of M. Craven [2]. That work deals with information extraction via relational learning
and via text classification. A Naive Bayes classifier with a bag of words representation is
used for classification, which is a statistical approach. Relations between different
entities are addressed by Ray and Craven [3]. In their work they find relations between genes
and disorders and also the location of a particular protein within a cell.
Various relations are extracted via Medical Subject Headings and via semantic representation
in the work of Srinivasan and Rindflesch [4], which explores some of the text mining
opportunities offered by the Medline database. In some cases an AdaBoost classifier [5] with a
bag of words representation is used. In these cases the whole abstract, rather than a single
sentence, is considered as the source of information. Other methods such as pre-processing,
parsing and error recovery [6] are used for identifying the relation between two entities.
Representation techniques such as bag of words and classification algorithms such as Naive
Bayes, Complement Naive Bayes and Support Vector Machines are described in the work of Oana
Frunza and Diana Inkpen [7]. They measure the performance of different classifiers using
precision, recall and F-measure. Classifiers are compared with one another both directly and
indirectly, and each is compared with a baseline classifier.
III. PROPOSED APPROACH
Medical databases such as Medline contain a large number of medical articles. The main
problem with these databases is that research discoveries enter the repository at a high
rate, making the process of identifying and disseminating healthcare information a
difficult task [7]. The proposed approach addresses this problem with a method for easily
extracting relevant medical information from these articles. The proposed system displays
to the user information about diseases, treatments and symptoms, and it also identifies three
semantic relations between diseases and treatments.
An article always contains both relevant and irrelevant information. We therefore first
discard the irrelevant information and then find the required information in the remaining
relevant data. Removing irrelevant information makes the text mining process much easier.
Several methods are used for this removal.
The first step in removing irrelevant information is stop word removal, in which all stop
words such as a, an and is are removed from the article. After stop word removal, repeated
words (different inflected forms of the same word) are removed with the help of a stemming
algorithm. Both steps reduce the content of the document but increase the quality of the
result. Figure 1 shows the system architecture of the proposed system:
Figure 1. The System Architecture of Proposed System
All the processes employed here are performed in a pipelined manner. The first is stop word
removal, in which all stop words such as a, an, is and the are removed. There are about 174
English stop words, and all are removed from the text file. Although stop word removal
reduces the size of the document, the quality of the result is improved.
The document obtained after stop word removal is then the input for repeated word removal,
for which we use a stemming algorithm. Of the many stemming algorithms available, we use a
suffix stripping algorithm. For example, the document may contain words such as connected
and connecting, and the stemming algorithm reduces these to their stem, connect.
After the removal of repeated words, the next step is to apply the Multinomial Naive Bayes
algorithm to find the relations between diseases and treatments. In addition to the three
relations cure, prevent and side effect, this step also identifies the symptoms associated
with a particular disease.
Earlier work used the Naive Bayes algorithm to find semantic relations between two
entities. For text classification we use Multinomial Naive Bayes, a specialized version of
the Naive Bayes algorithm. The main drawback of Naive Bayes is its assumption that the
attributes are all independent of each other given the class. For example, consider a
classifier for assessing the risk of issuing a credit card. Naive Bayes assumes that there is
no dependency between a customer’s age, income, education and salary, provided the customer
is classified as a worthy customer. The multinomial variant retains this independence
assumption, but it models how many times each word occurs rather than only whether it
occurs, which makes it better suited to text. For this reason we make use of the Multinomial
Naive Bayes algorithm.
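For reference, the standard textbook form of the multinomial Naive Bayes decision rule is
sketched below. The notation is an assumption introduced here for illustration and does not
appear in the original system description: d is a document, V the vocabulary, n_{w,d} the
number of times word w occurs in d, N_{w,c} the count of w in training documents of class c,
N_c the total word count of class c, and \alpha a smoothing constant (commonly 1).

```latex
\hat{c} = \arg\max_{c} \Big[ \log P(c) + \sum_{w \in V} n_{w,d} \, \log P(w \mid c) \Big],
\qquad
P(w \mid c) = \frac{N_{w,c} + \alpha}{N_{c} + \alpha \lvert V \rvert}
```

The factor n_{w,d} is where the word counts enter the rule: a word that occurs twice in the
document contributes twice to the score of each class.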
The proposed approach saves time for users, especially doctors, who need to learn about a
particular disease. The methods used in this system can also be used to find relations
between other kinds of entities and in various other applications. The results can be
evaluated using precision, recall and F-measure values. The important modules of the system
are described below in detail.
A. Relevant Information Extraction from Medical Articles
There are many medical databases in which medical articles are published, for example
the Medline database. From these databases we extract only the relevant information.
For this purpose we make use of the bag of words representation, in which each word is
considered as a feature for training the classifier.
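A minimal sketch of a bag of words representation follows. The function name bag_of_words,
the tokenization rule and the example sentence are assumptions made for illustration; they
are not taken from the system or from the Medline data set.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Hypothetical example sentence, not taken from the Medline data set.
features = bag_of_words("Aspirin relieves headache; aspirin may cause stomach upset.")
print(features["aspirin"])  # 2
```

Each distinct word becomes one feature, and its count is the feature value. Word order is
discarded, which is the defining property of this representation.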
1. Removal of stop words:
Stop words are words that are filtered out prior to or after processing of
natural language data. There are about 174 English stop words, for example a,
an and the. These stop words are removed from the document. Removing them
reduces the content of the document but improves the quality of the result.
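The stop word removal step can be sketched as follows. The ten-word STOP_WORDS set is a small
illustrative subset chosen here, not the list of about 174 words used by the system, and the
function name remove_stop_words is hypothetical.

```python
import re

# Small illustrative subset; the list described above has about 174 entries.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "and", "to", "for"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Tokenize the text and drop every token that appears in the stop word set."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stop_words]


filtered = remove_stop_words("The drug is used for the treatment of an infection")
print(filtered)  # ['drug', 'used', 'treatment', 'infection']
```

Using a set for the stop words keeps each membership test constant time, so the filter
runs in time proportional to the length of the document.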
2. Repeated words removal:
After the removal of stop words, repeated word removal is performed using a
stemming algorithm. The stemming algorithm used here is a suffix stripping
algorithm, which strips away some of the word endings and produces the base or
stem form of the word. In some cases suffix substitution is used. For example,
the words expressed and expressing are reduced to their base form, express.
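The idea of suffix stripping can be illustrated with the deliberately simplified sketch
below. It is not the stemmer used by the system: it handles only the endings ing, ed and a
plural s, performs no suffix substitution, and the function name strip_suffix and the
minimum stem length are assumptions made for this example.

```python
def strip_suffix(word, min_stem_len=3):
    """Strip one common ending if the remaining stem is long enough."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    # Strip a plural "s", but leave words such as "express" untouched.
    if word.endswith("s") and not word.endswith("ss") and len(word) - 1 >= min_stem_len:
        return word[:-1]
    return word


words = ["expressed", "expressing", "express", "connected", "connecting"]
stems = sorted({strip_suffix(word) for word in words})
print(stems)  # ['connect', 'express']
```

Collecting the stems in a set is what removes the repetition: the five input words collapse
to two distinct stems.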
B. Sentence Identification and Relation Identification Process
In the sentence identification process, the sentences containing information about
diseases, treatments and symptoms are identified. The next step is to find the relation
that exists between diseases and treatments. Three semantic relations are identified:
cure, prevent and side effect. For this we make use of the Multinomial Naive Bayes
algorithm, which takes into account the number of times a word occurs in a document or
text file. A training set and a test set are required for the relation identification
task: the training set is used to train the system and the test set to test the
performance of the algorithm.
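A minimal from-scratch sketch of a multinomial Naive Bayes classifier for this relation
identification task is given below. It is an illustration under stated assumptions, not the
system's implementation: the class name MultinomialNB, the add-one smoothing constant alpha
and the six training sentences are all invented for this example, and the sentences are
assumed to be already tokenized, stop word filtered and stemmed.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Multinomial Naive Bayes over token lists, with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant (assumed value, 1.0 = add-one)

    def fit(self, docs, labels):
        self.class_docs = Counter(labels)        # number of documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def log_scores(self, tokens):
        """Return the log prior plus the count-weighted log likelihoods per class."""
        scores = {}
        vocab_size = len(self.vocab)
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            total = sum(counts.values())
            score = math.log(n_docs / self.total_docs)
            for word, freq in Counter(tokens).items():
                if word not in self.vocab:
                    continue  # words never seen in training are ignored
                likelihood = (counts[word] + self.alpha) / (total + self.alpha * vocab_size)
                score += freq * math.log(likelihood)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Hypothetical, already pre-processed training sentences.
train_docs = [
    ["drug", "cure", "infection"],
    ["therapy", "cure", "disease"],
    ["vaccine", "prevent", "infection"],
    ["vaccine", "prevent", "disease"],
    ["drug", "cause", "nausea"],
    ["therapy", "cause", "rash"],
]
train_labels = ["cure", "cure", "prevent", "prevent", "side_effect", "side_effect"]

model = MultinomialNB().fit(train_docs, train_labels)
print(model.predict(["vaccine", "prevent", "flu"]))  # prevent
print(model.predict(["drug", "cause", "rash"]))      # side_effect
```

Scores are accumulated as logarithms to avoid numerical underflow when many small
probabilities are multiplied. In practice the labelled sentences would be split into a
training set and a held-out test set, as described above.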
C. Output Performance Evaluation
The output performance of the system is evaluated on different documents. The
result obtained shows information about diseases, treatments and symptoms and
the three semantic relations between diseases and treatments. The result we obtain is
more efficient and accurate.
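The precision, recall and F-measure values mentioned in Section III can be computed per
relation as in the sketch below. The function name precision_recall_f1 and the label lists
are hypothetical and serve only to show the arithmetic; they are not results of the system.

```python
def precision_recall_f1(true_labels, predicted_labels, target):
    """Compute precision, recall and F-measure for one target relation."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)  # true positives
    fp = sum(1 for t, p in pairs if t != target and p == target)  # false positives
    fn = sum(1 for t, p in pairs if t == target and p != target)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical gold labels and predictions for five test sentences.
true_labels = ["cure", "cure", "prevent", "side_effect", "cure"]
predicted_labels = ["cure", "prevent", "prevent", "side_effect", "cure"]

precision, recall, f_measure = precision_recall_f1(true_labels, predicted_labels, "cure")
print(round(precision, 2), round(recall, 2), round(f_measure, 2))  # 1.0 0.67 0.8
```

Here two of the three cure sentences are found and no other sentence is labelled cure, so
precision is 1.0, recall is 2/3 and the F-measure, their harmonic mean, is 0.8.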
IV. CONCLUSION
The proposed approach provides only informative sentences by discarding uninformative
content. The system helps users, especially doctors, in making medical decisions and in
learning about a particular disease and its treatments. The same approach can be used to
find relationships between other kinds of entities, and the system can be used in various
healthcare domains and other medical applications.
REFERENCES
1. B. Rosario and M.A. Hearst, “Semantic Relations in Bioscience Text,” Proc. 42nd
Ann. Meeting on Assoc. for Computational Linguistics, vol. 430, 2004.
2. M. Craven, “Learning to Extract Relations from Medline,” Proc. Assoc. for the
Advancement of Artificial Intelligence, 1999.
3. S. Ray and M. Craven, “Representing Sentence Structure in Hidden Markov Models
for Information Extraction,” Proc. Int’l Joint Conf. Artificial Intelligence (IJCAI ’01),
2001.
4. P. Srinivasan and T. Rindflesch, “Exploring Text Mining from Medline,” Proc. Am.
Medical Informatics Assoc. (AMIA) Symp., 2002.
5. O. Frunza and D. Inkpen, “Textual Information in Predicting Functional Properties of
the Genes,” Proc. Workshop Current Trends in Biomedical Natural Language
Processing (BioNLP) in conjunction with Assoc. for Computational Linguistics
(ACL ’08), 2008.
6. C. Friedman, P. Kra, H. Yu, M. Krauthammer, and A. Rzhetsky, “GENIES: A Natural
Language Processing System for the Extraction of Molecular Pathways from Journal
Articles,” Bioinformatics, vol. 17, pp. S74-S82, 2001.
7. O. Frunza et al., “A Machine Learning Approach for Identifying Disease-Treatment
Relations in Short Texts,” May 2011.
Ancy Sudhakar received her B-tech degree in Computer Science and Engineering from
College of Engineering Vadakara, Cochin University, Kerala, India and is presently in the
final year of her Master of Technology at KMCT College of Engineering Calicut, Calicut
University, Kerala, India. Her research interests include Data Mining and Knowledge
Engineering.