Research Proposal - University of South Australia

University of South Australia
School of Computer and Information Science
Bachelor of Software Engineering
Research Proposal
Discover Patterns in Adverse
Drug Reaction
Name: Ernst J Joham
ID Number: 10005126
SUPERVISOR: DR JIUYONG LI
: DR JAN STANEK
ABSTRACT
This research will apply data mining to medical data in order to find patterns in adverse
drug reactions. Wilson, Thabane and Holbrook (2003) define data mining as the extraction of
valid, previously unknown and actionable information from databases.
According to Furey (2005) ‘each year 2.2 million Americans suffer serious adverse
reactions to drugs which are referred to as Adverse Drug Reaction (ADR)’. The World
Health Organization (2002) overview of adverse events highlights the importance of the
problem and describes these events as fatal, life-threatening, permanently or significantly
disabling, or requiring or prolonging hospitalization. Data mining can reveal patterns in
which factors such as age, height and weight, combined with certain conditions or with
different drugs taken together, lead to adverse events. The purpose of the
research is to discover such patterns in a far from ideal data set
that contains noise and missing values. Two core questions are explored: (1) is it possible
to discover patterns in sparse datasets? and (2) what patterns can be identified through
data mining for ADR? This research project will seek answers to these questions using
pre-recorded data, which will provide real-world evidence for detecting adverse
drug reactions. An interpretative quantitative methodology will be used. The research will
involve sorting through approximately twelve thousand existing records and selecting the
relevant information. The R statistical package will be used to find patterns and
interpret commonalities. R (the R Project for Statistical Computing) is an open source
package with functional language capabilities that supports graphical display and statistical
exploration of datasets. Once the results are obtained, an in-depth analysis and
interpretation of the data will take place. The conclusion of the research will determine
whether a far from ideal data set can be mined, and which techniques are more suitable for
medical datasets.
DECLARATION
I declare the following to be my own work, unless otherwise referenced, as
defined by the University’s policy on plagiarism.
Ernst J Joham
TABLE OF CONTENTS
1. INTRODUCTION
1.1 Background
1.2 Motivation
1.3 Research Objective and Study Questions
1.4 Thesis Structure
2. LITERATURE REVIEW
3. METHODOLOGY
3.1. Dataset
3.2. Research Process
3.3. Data Mining Tool
3.4. Algorithms
4. SCHEDULE
REFERENCE
1. INTRODUCTION
1.1 Background
Discovering patterns in medical datasets is still difficult and challenging,
but very rewarding (Roddick & Graco 2003). Medical data is among the hardest
data to mine, so a technique that works on medical datasets is likely to work
on datasets from other fields as well. Many more constraints and issues limit
the way data mining is undertaken for medical datasets. These include the way
the data is collected, the accuracy of the data, and the ethical, legal and
social issues that come with patient records (Cios & Moore 2002).
The World Health Organization (2002) reports that in some countries more than
10% of hospital admissions are due to ADRs. This growing morbidity and
mortality places a high financial burden on hospitals, and needs to be
addressed through monitoring systems and other alternatives.
Data mining can be one of these alternatives. By following a data mining
process and applying suitable techniques to extract patterns from medical
datasets, it can help detect ADRs and identify the causes of adverse events
that are life-threatening or prolong hospitalization.
Data mining techniques have improved since data mining began and since the
introduction of databases, but a database does not benefit health
professionals until the data it holds is turned into useful information. By
using effective data mining tools and algorithms and a step-by-step data mining
process, it is possible to produce useful and new information from the dataset
(Wilson, Thabane & Holbrook 2003).
This thesis explores the use of data mining techniques to discover
patterns in medical data. Many issues make medical data difficult to mine,
and it is important to overcome this complexity. Medical datasets push data
mining techniques and technologies to their limits (Roddick & Graco 2003),
which makes them a good test of the effectiveness of the various algorithms
evaluated in this study.
1.2 Motivation
The motivation for this project is my personal interest in data mining and the
challenges involved in today’s knowledge discovery in databases. With
the project I hope to discover patterns of interest using low quality medical
data. There is a clear need for more research into data mining for medical
applications, as little research has been published so far. Data quality and
other issues with medical datasets affect the patterns that are finally discovered.
Many techniques now have built-in mechanisms for handling
noise and missing values. In this research a number of algorithms will be tested
to see whether they can handle a data set that is far from ideal for data mining.
The R statistical tool will be used for the data mining process. R was chosen
because it is open source, offers the benefit of a programming language, and
is widely used by data mining professionals. The
patterns discovered will be interpreted and a conclusion drawn on the
soundness of the algorithms.
1.3 Research Objective and Study Questions
The aim of this research is to use data mining methods in an attempt to produce
relevant results from real-world data. The interpretation of the results
will determine whether data sets that suffer from issues and constraints such as
noise, incompleteness and a limited number of attributes can still produce patterns of
interest.
This thesis will address the following research questions:
(1) Is it possible to discover patterns in sparse datasets?
(2) What patterns can be identified through data mining for ADR?
1.4 Thesis Structure
The layout for the thesis is as follows:
Section 2 is an overview of the literature. It reviews current studies
in the area of data mining that deal with noisy and incomplete data, and with
data from which patterns are generally hard to extract because of quality issues.
The best techniques for this kind of data are also reviewed.
Section 3 describes the methodology used for this research. It includes an
overview of the data used for the project, the data mining tools used to analyse
the dataset, and the techniques used to produce the models and results.
Section 4 provides an overview of data mining and the process involved in a
data mining project. It also looks at some of the techniques likely to be used
for data mining.
Section 5 answers the research questions. The models are interpreted
and the results discussed.
Section 6 summarises the entire study, the limitations
that affected it, and suggestions for future research.
2. LITERATURE REVIEW
With the growth of data mining and the search for useful information in datasets, it is not
surprising that more research is needed into data quality and into data mining
algorithms that can detect interesting relationships within a dataset. There are still
relatively few publications on data mining, especially for medical
datasets with noise and missing values. Several studies have focused on the problems
encountered with datasets and on the best techniques to use when mining medical
data. For example, Cios & Moore (2002) address the difficulty and constraints
of collecting medical data to mine, and the technical and social reasons behind missing
values in the data set. Brown & Kros (2003) focus further on the impact of
missing data and on how existing methods can help with the problems it causes.
They categorise methods for dealing with missing data into:
• Use complete data only
• Delete selected cases or variables
• Data imputation
• Model-based approaches
Before any of these methods can be applied to the data set, the analyst must understand
each type of missing value; only then can a decision be made on how to address it
(Brown & Kros 2003). Missing values can be data missing at random,
data missing completely at random, non-ignorable missing data, or outliers treated as
missing data (Brown & Kros 2003).
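As a rough illustration of the first and third categories listed above, the sketch below contrasts complete-case analysis with simple mean imputation. It is written in Python purely for illustration (the project itself uses R); the record layout, the field names and the use of `None` to mark a missing value are assumptions and do not come from the project dataset.

```python
from statistics import mean

# Hypothetical records; None marks a missing value (an assumption for this sketch).
records = [
    {"age_days": 400, "weight_kg": 9.5},
    {"age_days": None, "weight_kg": 20.0},
    {"age_days": 3000, "weight_kg": None},
    {"age_days": 800, "weight_kg": 12.5},
]


def complete_cases(rows):
    """Keep only rows with no missing values (use complete data only)."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_mean(rows, field):
    """Return copies of the rows with missing values of one field
    replaced by the mean of the observed values of that field."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = mean(observed)
    return [dict(r, **{field: fill if r[field] is None else r[field]}) for r in rows]


complete = complete_cases(records)
imputed = impute_mean(records, "age_days")
print(len(complete), "complete rows out of", len(records))
print([r["age_days"] for r in imputed])
```

Complete-case analysis discards half of this small example, while imputation keeps every row at the cost of inventing a value, which is the trade-off the analyst has to weigh for each type of missing data.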
An alternative approach to handling missing values is conceptual reconstruction,
where only the conceptual aspects of the data are mined from the incomplete data set
(Aggarwal & Parthasarathy 2001). The authors further argue that methods such as
data imputation are prone to errors. Aggarwal & Parthasarathy (2001) give an example,
reproduced in Table 1, in which 20% to 40% of the entries in each data set are missing.
With the conceptual reconstruction method, the first three data sets remained at least
92% as accurate as the original data sets.
Dataset              Cao     CAM(20%)   CAM(40%)
BUPA                 62.4    0.963      0.927
Musk (1)             76.2    0.943      0.92
Musk (2)             95.0    0.96       0.945
Letter Recognition   84.9    0.825      0.62
Table 1 Conceptual reconstructed data sets (Aggarwal & Parthasarathy 2001)
Other studies have gone beyond missing values to explore the impact
of noise and how it can influence the output of models. Zhu & Wu (2004) divide noise
into class noise and attribute noise. Their research concentrated on attribute noise,
because class information proved to be much cleaner than first thought (Zhu & Wu 2004).
Attribute noise is more difficult to handle and includes:
(1) Incorrect attribute values
(2) Missing or don’t know attribute values
(3) Incomplete attributes or don’t care values
Some researchers have focused on data cleansing tools to help eliminate noise, but these
can only achieve a reasonable result (Zhu & Wu 2004). Noise handling methods can
also help to eliminate noise in data sets. Hulse et al (2007) introduce the Pairwise
Attribute Noise Detection Algorithm (PANDA), which detects attribute noise within datasets
and allows noisy data to be removed only if required. They compare it with a
distance-based outlier detection technique (DM), which is similar to PANDA but not as good
at detecting attribute noise. Once noise is detected it can be removed;
if it is left in, it may lead to a low quality set of hypotheses. Table 2 displays the
results of applying PANDA and DM to a dataset. PANDA identifies more noise instances.
Instance category   1–10          11–20         21–30         1–30
                    PANDA   DM    PANDA   DM    PANDA   DM    PANDA   DM
Noise               6       6     7       4     8       8     21      18
Outliers            2       4     2       6     1       2     5       12
Exceptions          2       0     1       0     1       0     4       0
Typical             0       0     0       0     0       0     0       0
Table 2 10% of a dataset of 30 most suspicious instances (Hulse et al 2007)
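The general idea behind distance-based outlier detection can be sketched as follows. This is not PANDA, nor the DM technique as defined by Hulse et al (2007); it is a generic Python illustration that scores each instance by the distance to its k-th nearest neighbour. The data points and the value of k are assumptions chosen for the example.

```python
import math


def kth_neighbour_distance(points, k):
    """Score each point by the Euclidean distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Hypothetical two-attribute instances; the last one sits far from the rest.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (8.0, 9.0)]
scores = kth_neighbour_distance(points, k=2)

# Rank instances from most to least suspicious.
ranking = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
print("most suspicious instance:", ranking[0])
```

Instances at the top of such a ranking are candidates for inspection rather than automatic removal, since an unusual record may be a genuine exception and not noise.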
Several researchers have focused on techniques that have built-in mechanisms for
handling noise and missing values, and on which of them are more appropriate for medical
applications. Lavrač (1999) reviews a number of techniques that have been applied to, and
are well suited to, medical data sets. These include decision trees, logic programs,
k-nearest neighbour, and Bayesian classifiers. Lavrač (1999) describes these as
‘intelligent data analysis techniques in the extraction of knowledge, regularities, trend
and representation cases from patients data stored in medical records’. Lee et al (2000)
believe that techniques from which users can easily extract specific knowledge are the key
to making medical decisions, and that studies have concluded that Bayesian networks and
decision trees are the primary techniques applied in medical information systems.
Fayyad et al (cited in Lee et al 2000, p. 85) indicate that the diverse fields for knowledge
discovery draw upon the main components and methods shown in figure 1.
Figure 1 Main components of KDD and DM and their relationship (Lee et al 2000)
In a study on drug discovery reported by Obenshain (2004), neural networks performed
better than logistic regression, but the decision tree did better at identifying the active
compounds most likely to have biological activity.
Other research into data mining for medical datasets has focused on the data mining
process, which includes dealing with missing values and noise and choosing the techniques
for knowledge discovery. Cios & Moore (2002) acknowledge that it is important for
medical data mining to follow a procedure if knowledge discovery is to succeed. Such a
procedure can be, for example, a nine-step process or the DMKD process, which adds
several steps to the CRISP-DM model and has been applied to several medical problem
domains. Figure 2 shows how the process model works; it can be semi-automated
for medical applications (Cios & Moore 2002).
Figure 2 DMKD process model (Cios & Moore 2002)
Wang (2008) argues that most process models focus on the results rather than on gaining
new knowledge. Medical data mining applications are expected to discover new
knowledge and should follow a five-stage data mining development cycle: planning
tasks, developing data mining hypotheses, preparing data, selecting data mining tools,
and evaluating data mining results.
Current literature has focused on ways to improve data sets by applying methods for
missing values and noise, but few of these methods have been applied to medical data sets.
The same is true of techniques: tests have been done, but there is still room for further
research into techniques for mining real-world medical data sets.
This study will further investigate ways of successfully discovering
patterns in a medical data set. The CRISP-DM data mining process will be used, with the R
statistical package as the tool for handling noise and missing values. Zhu & Wu (2004)
indicate that powerful, cost-effective tools are necessary, can greatly assist the data
cleansing process, and may help to achieve the level of data quality needed for data mining.
A number of algorithms will also be tested on the medical data set to see how well they
perform on data that contains noise.
3. METHODOLOGY
3.1. Dataset
The dataset for the project is a pre-recorded dataset provided by external clients, who are
kept anonymous. Because of the confidentiality, ethical and legal issues surrounding the
dataset, sensitive information had to be removed before we were able to
view and use the data. A total of 1286 records of patients with ADR
will be used for the data mining project.
The dataset includes characteristics of the patients and of the drugs involved in
adverse drug reactions. The information made available in the dataset
includes:
• Date when the patient was admitted for ADR
• Age, recorded in days
• Brand, the generic drug for the main drug
• Drug that was given to the patient
• Route of administration
• Probability of the drug being the cause of the ADR
• Severity of the ADR
• Recovered or not
• UR number, which includes the patient’s details
• ATC (Anatomical Therapeutic Chemical), a classification system for drugs
It is worth noting that, because of the limited attributes and the incomplete and missing
information, only a few attributes were chosen for use.
3.2. Research Process
The project follows CRISP-DM, the six-step data mining process defined by the
CRISP-DM consortium and shown in figure 1.
Figure 1: CRISP-DM – six step process model (CRISP-DM, 2000)
Understand the business: in this phase the project was reviewed by the client,
supervisor and team members to decide which direction to take and what
the goal of the project was. The main aim of this research is to test techniques to see
whether patterns can be found in a sparse dataset.
Understand the dataset: in this phase the dataset was reviewed using the Rattle tool,
first to summarise the attributes as a whole and then to query each attribute separately
and visualise the data in various formats, to help decide which attributes to keep
for further analysis. Since the dataset had a limited number of attributes, a few
stood out and were considered for the next phase.
Data preparation: in this phase the data went through two extra processes, Data
Cleaning and Data Transformation. Both were done in R, because scripts make
data cleansing and transformation easy to carry out. The objective of
this phase was to decide on the structure of the data for the next phase. Five
attributes were chosen: Date, Age in Days, Route, Recovered, and
ATC code for the drug. These attributes were chosen because they were expected to give
better results in modelling. Table 1 shows the abbreviated name of each attribute and the
values it was given.
Abbreviation   Variable
ADRDATE        Date when the patient was admitted to hospital for ADRs
               (October-March = 1, April-September = 0)
AGE            How old the patient is, categorised into equal number of records
               (0-2 years old = 1, 2-5 years old = 2, 5-11 years old = 3,
               11-16 years old = 4, and above 16 years of age = 5)
ROUTE          The administration of the medication that caused the ADR is
               either oral or intravenous (Oral = 1, Intravenous = 0)
RECOV          Recovered from ADRs or not (Recovered = 0, Not recovered = 1)
ATC            The drugs given to the patient are classified either as
               antibiotics or not (Antibiotics = 1, Not Antibiotics = 0)
Table 1 Coded values assigned to each attribute.
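The coding scheme in Table 1 can be sketched as a small transformation function. The project carried out this step with R scripts; the Python version below is only an illustration. The function names, the input format, the conversion of 365.25 days per year and the treatment of ages that fall exactly on a group boundary are all assumptions, since the proposal does not specify them.

```python
from datetime import date

DAYS_PER_YEAR = 365.25  # assumption: used to convert age in days to years


def code_adrdate(admitted):
    """October-March = 1, April-September = 0."""
    return 1 if admitted.month >= 10 or admitted.month <= 3 else 0


def code_age(age_days):
    """Age groups 1-5; a boundary age is placed in the older group (assumption)."""
    years = age_days / DAYS_PER_YEAR
    for code, upper in enumerate((2, 5, 11, 16), start=1):
        if years < upper:
            return code
    return 5


def code_record(admitted, age_days, route, recovered, antibiotic):
    """Return the five coded attributes for one hypothetical patient record."""
    return {
        "ADRDATE": code_adrdate(admitted),
        "AGE": code_age(age_days),
        # Oral = 1; any other route is treated as intravenous (0) in this sketch.
        "ROUTE": 1 if route == "oral" else 0,
        # Recovered = 0, Not recovered = 1.
        "RECOV": 0 if recovered else 1,
        # Antibiotics = 1, Not antibiotics = 0.
        "ATC": 1 if antibiotic else 0,
    }


example = code_record(date(2008, 11, 7), 1500, "oral", True, False)
print(example)
```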
Modelling: this phase involved selecting the most appropriate algorithms for the
research, which for this study were logistic
regression, decision tree, and risk pattern mining.
Evaluation: this was the last phase of the project, in which the models were
interpreted and the results used to determine whether the project objectives were met.
Because of time constraints, the results of the three techniques were used to answer the
project objectives, and the first three phases were completed only once.
3.3. Data Mining Tool
The data mining tools chosen for the project are R, a package for statistical computing
and graphics with programming capabilities, and Rattle, a user interface that works
with R. These tools run on a variety of platforms
including UNIX, Windows, and MacOS, and R also provides interfaces to other
languages and technologies such as Python, Perl, XML, and SOAP. Both packages are
free software and provide a sophisticated way of performing data
mining. A screenshot of the R and Rattle tools is shown in figure 2.
Figure 2: R and Rattle tool for data mining screenshot.
Rattle is used by many government and private organisations around the world,
including the Australian Taxation Office, and is being adopted by a number of
colleges and universities for teaching data mining.
Together, R and Rattle provide a good selection of data mining algorithms for
modelling. These include clustering, association rules, linear models, trees, and
neural networks. Besides the models, there are various ways of visualising the
data, such as histograms and plots. Data from almost any source can also be loaded and
used.
Most of the data preparation was done in R using its scripting language, and the
decision tree and logistic regression models were built using Rattle. The only other
algorithm used for the project was Mining Risk Patterns. The software for this
algorithm was run on a Linux 9.0 platform.
3.4. Algorithms
The data mining techniques adopted for the project are logistic regression,
decision tree, and the risk pattern mining algorithm. Each of these techniques provides
its own way of analysing the medical dataset that was provided.
Decision trees and logistic regression have been applied across a wide range
of applications, including medical applications. Ji et al (2009, p. 2), in reporting
the Andrews study, emphasize the benefits of the logistic regression and decision tree
methods for ‘identifying commonalities and differences in medical databases
variables’. The risk pattern algorithm has also been applied to medical data for
patients on ACE inhibitors who have an allergic event (Li et al, 2005). As this
project explores the use of a medical dataset to detect adverse drug reactions, it was
important to use techniques that are reliable and have been proven to work in similar
studies.
The techniques differ as follows. Logistic regression is appropriate for
variables with two possible values (0, 1) as well as variables with multiple categories.
This makes logistic regression useful in this study for determining whether the
patient details provided are associated with the patient not recovering
from adverse drug reactions. The decision tree is also well suited to
binary values but can be modelled with more than two values as well, and it is easily
understood because of its tree-like structure, whose leaf nodes can be
analysed to determine the patterns found. The last algorithm ‘makes use of
anti-monotone property to efficiently prune searching space’ (Li et al, 2005). Optimal
risk pattern mining returns the pattern with the highest relative risk among the patterns
discovered. This model is easily interpreted and shows the odds ratio, the risk ratio and
the fields associated with the pattern.
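The risk ratio and odds ratio reported for a pattern can be computed from a two-by-two table of counts. The Python sketch below shows the standard calculations only; it is not the risk pattern mining software used in the project, and the counts and the example pattern are invented for illustration.

```python
def risk_measures(a, b, c, d):
    """Relative risk and odds ratio from a 2x2 table of counts.

    a: pattern present, adverse outcome    b: pattern present, no adverse outcome
    c: pattern absent, adverse outcome     d: pattern absent, no adverse outcome
    """
    risk_with = a / (a + b)        # proportion with the outcome when the pattern holds
    risk_without = c / (c + d)     # proportion with the outcome when it does not
    relative_risk = risk_with / risk_without
    odds_ratio = (a * d) / (b * c)
    return relative_risk, odds_ratio


# Hypothetical counts for a pattern such as ROUTE = 0 and AGE = 1,
# with "not recovered" as the adverse outcome.
rr, odds = risk_measures(a=30, b=70, c=20, d=180)
print(round(rr, 2), round(odds, 2))
```

A relative risk well above 1 indicates that records matching the pattern show the outcome more often than records that do not, which is the sense in which a pattern is ranked as high risk.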
4. SCHEDULE
Activities               Date                       Description
Project Plan             August 25, 2008            Ongoing until end of first semester
SRS Document             August 25, 2008            Ongoing until end of first semester
Test Plan                August 25, 2008            Ongoing until end of first semester
Data preparation         August-October 2008        Clean and preparation of dataset
Thesis proposal          November 7, 2008           Research presentation
Modelling                November-December 2008     Modelling of dataset
Proof of concept         March 30, 2009             Produce framework and process description
User documentation       May 12, 2009               User guide for the project
Test Results             May 15, 2009               Tests results for the techniques used on the dataset
Final Technical Report   May 30, 2009               Deployment (final report for the data mining project)
Research proposal        June 16, 2009              Final written proposal
Research paper           August 7, 2009             Final
Research presentation    September 4, 2009          Final
REFERENCE
Aggarwal, CC & Parthasarathy, S 2001, Mining massively incomplete data sets by
conceptual reconstruction, ACM, San Francisco, California.
Brown, ML & Kros, JF 2003, 'Data mining and the impact of missing data', Industrial
Management & Data Systems, vol. 103, pp. 611-621.
Cios, KJ & Moore, GW 2002, 'Uniqueness of medical data mining', Artificial Intelligence
in Medicine, vol. 26, no. 1-2, pp. 1-24.
CRISP-DM 2000, Cross Industry Standard Process for Data Mining, viewed 27 August
2008, <http://www.crisp-dm.org/Partners/index.htm>.
Lavrač, N 1999, 'Selected techniques for data mining in medicine', Artificial Intelligence
in Medicine, vol. 16, no. 1, pp. 3-23.
Lee, I-N, Liao, S-C & Embrechts, M 2000, 'Data mining techniques applied to medical
information', Medical Informatics & the Internet in Medicine, vol. 25, no. 2, pp. 81-102.
Li, J, Fu, AW-c, He, H, Chen, J, Jin, H, McAullay, D, Williams, G, Sparks, R &
Kelman, C 2005, Mining risk patterns in medical data, ACM, Chicago, Illinois, USA.
Obenshain, MK 2004, 'Application of Data Mining Techniques to Healthcare Data',
Infection Control and Hospital Epidemiology, vol. 25, no. 8, pp. 690-695.
Roddick, JF, Fule, P & Graco, WJ 2003, 'Exploratory medical knowledge discovery:
experiences and issues', SIGKDD Explor. Newsl., vol. 5, no. 1, pp. 94-99.
Safety of Medicines 2002, A Guide to Detecting and Reporting Adverse Drug
Reaction Why Health Professionals Need to Take Action, WHO publications, viewed
15 April 2008, <http://whqlibdoc.who.int/hq/2002/WHO_EDM_QSM_2002.2.pdf>.
Wang, H & Wang, S 2008, 'Medical knowledge acquisition through data mining', paper
presented at the IT in Medicine and Education, 2008. ITME 2008. IEEE International
Symposium on, Xiamen.
Wilson, AM, Thabane, L & Holbrook, A 2003, 'Application of data mining techniques in
pharmacovigilance', British Journal of Clinical Pharmacology, vol. 57, no. 2, pp. 127-134.
Zhu, X, Khoshgoftaar, T, Davidson, I & Zhang, S 2007, 'Editorial: Special issue on
mining low-quality data', Knowledge and Information Systems, vol. 11, no. 2, pp. 131-136.