University of South Australia
School of Computer and Information Science
Bachelor of Software Engineering

Research Proposal
Discover Patterns in Adverse Drug Reaction

Name: Ernst J Joham
ID Number: 10005126
SUPERVISOR: DR JIUYONG LI
          : DR JAN STANEK

ABSTRACT

This research will apply data mining to medical data in order to find patterns in adverse drug reactions. Wilson, Thabane and Holbrook (2003) define data mining as extracting valid, unknown and actionable information from databases. According to Furey (2005) ‘each year 2.2 million Americans suffer serious adverse reactions to drugs which are referred to as Adverse Drug Reaction (ADR)’. The World Health Organization (2002) overview of adverse events highlights the importance of the problem and describes these events as fatal, life-threatening, permanently or significantly disabling, or requiring or prolonging hospitalization. Data mining may reveal how factors such as age, height and weight, combined with certain conditions or with different drugs taken together, lead to adverse events. The purpose of the research is to try to discover such patterns in a far from ideal data set that contains noise and missing values. Two core questions are explored: (1) is it possible to discover patterns in sparse datasets? and (2) what patterns can be identified through data mining for ADR? This research project will seek answers to these questions using prerecorded data, which provides real-world evidence for detecting adverse drug reactions. An interpretative quantitative methodology will be used. The research will involve sorting through approximately twelve thousand existing records and selecting the relevant information. The R statistical package will be used to find patterns and interpret commonalities.
R (the R Project for Statistical Computing) is an open source package with functional language capabilities that supports graphical display and statistical exploration of datasets. Once the results are obtained, an in-depth analysis and interpretation of the data will take place. The conclusion of the research will determine whether a far from ideal data set can be mined using techniques that are better suited to medical datasets.

DECLARATION

I declare the following to be my own work, unless otherwise referenced, as defined by the University’s policy on plagiarism.

Ernst J Joham

TABLE OF CONTENTS

1. INTRODUCTION
   1.1 BACKGROUND
   1.2 MOTIVATION
   1.3 RESEARCH OBJECTIVE AND STUDY QUESTIONS
   1.4 THESIS STRUCTURE
2. LITERATURE REVIEW
3. METHODOLOGY
   3.1. DATASET
   3.2. RESEARCH PROCESS
   3.3. DATA MINING TOOL
   3.4. ALGORITHMS
4. SCHEDULE
REFERENCE

1. INTRODUCTION

1.1 Background

Discovering patterns in medical datasets is still very difficult and challenging but very rewarding (Roddick, Fule & Graco 2003). Medical data is among the hardest kinds of data to mine, so an approach that works on medical datasets is likely to work on other datasets as well. Many more constraints and issues limit the way data mining is undertaken for medical datasets. These include the way the data is collected, the accuracy of the data, and the ethical, legal and social issues that come with patient records (Cios & Moore 2002).

The World Health Organization (2002) reports that in some countries admissions due to ADRs exceed 10%. The resulting morbidity and mortality place a high financial burden on hospitals, and this growing problem needs to be addressed through monitoring systems and other alternatives. Data mining can be one of these alternatives. By following a data mining process and using suitable techniques to extract patterns from medical datasets, it can help detect ADRs and identify the causes of adverse events that are life-threatening or prolong hospitalization.

Data mining techniques have improved since the field began and since databases were introduced, but a database does not benefit health professionals until its contents are turned into useful information. By using effective data mining tools and algorithms and a step by step data mining process it is possible to produce useful and new information from the dataset (Wilson, Thabane & Holbrook 2003).

This thesis explores the use of data mining techniques to discover patterns in medical data. Many issues make medical data difficult to mine, and it is important to overcome this complexity. Medical datasets push data mining techniques and technologies to their limits (Roddick, Fule & Graco 2003). This aspect will test the effectiveness of the various algorithms used in producing and evaluating the results.
1.2 Motivation

The motivation for this project is my personal interest in data mining and the challenges involved in today’s knowledge discovery in databases. With the project I hope to discover patterns of interest using low quality medical data. There is a clear need for more research into data mining for medical applications, as little research has been published so far. Data quality and other issues with medical datasets affect the patterns that are ultimately discovered. Many current techniques already have built-in mechanisms to help with noise and missing values. In this research a number of algorithms will be tested to see whether they can handle a data set that is far from ideal for data mining.

The R statistical tool will be used for the data mining process. R was chosen because it is an open source tool, has the benefit of a programming language, and is widely used by data mining professionals. The patterns discovered will be interpreted and a conclusion will be drawn on the soundness of the algorithms.

1.3 Research Objective and Study Questions

The aim of this research is to use data mining methods in an attempt to produce relevant results from real world data. The interpretation of the results will determine whether data sets that suffer from issues and constraints such as noise, incompleteness and a limited number of attributes can still produce patterns of interest. The following research questions will be addressed:

(1) Is it possible to discover patterns in sparse datasets?
(2) What patterns can be identified through data mining for ADR?

1.4 Thesis Structure

The layout for the thesis is as follows:

Section 2 is an overview of the literature. It reviews current studies of data mining on data that is noisy, incomplete, or otherwise hard to extract patterns from.
The best techniques for this kind of data are also reviewed.

Section 3 describes the methodology used for this research. It includes an overview of the data used for the project, the data mining tools used to analyse the dataset, and the techniques used to produce the models and results.

Section 4 provides an overview of data mining and the process involved in a data mining project. It also looks at some of the techniques likely to be used.

Section 5 answers the research questions. The models are interpreted and the results are discussed.

Section 6 summarises the entire study, the limitations that affected it, and suggestions for future research.

2. LITERATURE REVIEW

With the growth of data mining and the search for useful information in datasets, it is not surprising that more research is needed into data quality and into data mining algorithms that can detect interesting relationships within a dataset. There are still relatively few publications on data mining of medical datasets, especially those with noise and missing values.

Several studies have focused on the problems encountered with datasets and on the best techniques to use when mining data for medical applications. For example, Cios & Moore (2002) address the difficulty and constraints of collecting medical data to mine, and the technical and social reasons behind missing values in the data set. A study by Brown & Kros (2003) focuses further on the impact of missing data and on how existing methods can help with the problems it causes.
They categorise methods for dealing with missing data into:

- Use complete data only
- Delete selected cases or variables
- Data imputation
- Model-based approaches

Before any of these methods can be applied to the data set, the analyst must understand each type of missing value; only then can a decision be made on how to address them (Brown & Kros 2003). The types are data missing at random, data missing completely at random, non-ignorable missing data, and outliers treated as missing data (Brown & Kros 2003).

An alternative approach to handling missing values is conceptual reconstruction, where only conceptual aspects of the data are mined from the incomplete data set (Aggarwal & Parthasarathy 2001). They further argue that some methods, such as data imputation, are prone to errors. Aggarwal & Parthasarathy (2001) give an example, shown in Table 1, in which 20% to 40% of the entries in each data set are missing. With the conceptual reconstruction method, results on the first three data sets were at least 92% as accurate as on the original data sets.

Dataset             Cao    CAM(20%)  CAM(40%)
BUPA                62.4   0.963     0.927
Musk (1)            76.2   0.943     0.92
Musk (2)            95.0   0.96      0.945
Letter Recognition  84.9   0.825     0.62

Table 1 Conceptual reconstructed data sets (Aggarwal & Parthasarathy 2001)

Other studies have gone beyond missing values to explore the impact of noise and how it can influence the output of models. Zhu & Wu (2004) divide noise into class noise and attribute noise. Their research concentrated on attribute noise, as class noise turned out to be much cleaner than first thought (Zhu & Wu 2004). Attribute noise is more difficult to handle and includes:

(1) Incorrect attribute values
(2) Missing or don’t know attribute values
(3) Incomplete attributes or don’t care values

Some researchers have focused on data cleansing tools to help eliminate noise, but these can only achieve a reasonable result (Zhu & Wu 2004). Noise handling methods can help to eliminate noise in data sets.
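Both strands above, the missing-data categories of Brown & Kros (2003) and the ‘missing or don’t know’ form of attribute noise, come down to deciding what to do with an absent value. The sketch below illustrates the first three categories on a handful of invented records. It is written in Python purely for illustration (the project itself uses R), the attribute names and the 50% threshold are assumptions, and mode imputation is only one of many possible imputation rules. Model-based approaches are not shown.

```python
from collections import Counter

# Hypothetical toy records, not taken from the study data; None marks a missing value.
records = [
    {"age_group": 1, "route": "oral", "recovered": True},
    {"age_group": 3, "route": None, "recovered": True},
    {"age_group": None, "route": "intravenous", "recovered": False},
    {"age_group": 3, "route": "oral", "recovered": True},
]


def complete_cases(rows):
    """Use complete data only: keep the rows that have no missing value."""
    return [row for row in rows if all(value is not None for value in row.values())]


def drop_sparse_variables(rows, max_missing_share=0.5):
    """Delete variables whose share of missing values exceeds max_missing_share."""
    keep = []
    for name in rows[0]:
        missing = sum(1 for row in rows if row[name] is None)
        if missing / len(rows) <= max_missing_share:
            keep.append(name)
    return [{name: row[name] for name in keep} for row in rows]


def impute_mode(rows):
    """Data imputation: replace each gap with the most common observed value."""
    modes = {}
    for name in rows[0]:
        observed = [row[name] for row in rows if row[name] is not None]
        # A variable with no observed values at all stays missing.
        modes[name] = Counter(observed).most_common(1)[0][0] if observed else None
    return [
        {name: (modes[name] if value is None else value) for name, value in row.items()}
        for row in rows
    ]


print(len(complete_cases(records)), "of", len(records), "records are complete")
print(impute_mode(records))
```

Complete-case analysis shrinks an already small data set, while imputation keeps every record at the cost of inventing values, which is the error-proneness that Aggarwal & Parthasarathy (2001) warn about.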
Hulse et al (2007) introduce the Pairwise Attribute Noise Detection Algorithm (PANDA), which can detect attribute noise within datasets, allowing noisy data to be removed only if required. The other algorithm considered is the distance-based outlier detection technique (DM), which is similar but not as good as PANDA at detecting attribute noise. Once noise is detected it can be removed; if it is left in place it may lead to a low quality set of hypotheses. Table 2 displays the results for a dataset using PANDA and DM. PANDA identifies more noise instances.

                    1-10        11-20       21-30       1-30
Instance category   PANDA  DM   PANDA  DM   PANDA  DM   PANDA  DM
Noise               6      6    7      4    8      8    21     18
Outliers            2      4    2      6    1      2    5      12
Exceptions          2      0    1      0    1      0    4      0
Typical             0      0    0      0    0      0    0      0

Table 2 10% of a dataset of 30 most suspicious instances (Hulse et al 2007)

Several studies have focused on techniques that have built-in mechanisms to handle noise and missing values and that are more appropriate for medical applications. Lavrač (1999) reviews a number of techniques that have been applied and are better suited to medical data sets. These include decision trees, logic programs, k-nearest neighbour, and Bayesian classifiers. Lavrač (1999) describes these as ‘intelligent data analysis techniques in the extraction of knowledge, regularities, trend and representation cases from patients data stored in medical records’. Lee et al (2000) argue that techniques from which users can easily extract specific knowledge are the key to making medical decisions, and studies have concluded that Bayesian networks and decision trees are the primary techniques applied in medical information systems. Fayyad et al (cited in Lee et al 2000, p. 85) indicate that the diverse fields of knowledge discovery draw upon the main components and methods shown in figure 1.
Figure 1 Main components of KDD and DM and their relationship (Lee et al 2000)

A study on drug discovery reported by Obenshain (2004) showed that neural networks performed better than logistic regression, but that the decision tree did better at identifying the active compounds most likely to have biological activity.

Other researchers working on data mining for medical datasets have focused on the data mining process, which includes dealing with missing values and noise and choosing the techniques for knowledge discovery. Cios & Moore (2002) acknowledge that it is important for medical data mining to follow a procedure if knowledge discovery is to succeed. Such a procedure can be a nine-step process or the DMKD process, which adds several steps to the CRISP-DM model and has been applied to several medical problem domains. Figure 2 shows how the process model works; it can be semi-automated for medical applications (Cios & Moore 2002).

Figure 2 DMKD process model (Cios & Moore 2002)

Wang & Wang (2008) argue that most process models focus on the results but not on gaining new knowledge. Medical data mining applications are expected to discover new knowledge and should follow a five stage data mining development cycle: planning tasks, developing data mining hypotheses, preparing data, selecting data mining tools, and evaluating data mining results.

Current literature has focused on ways to improve data sets by applying methods for missing values and noise, but few of these methods have been applied to medical data sets. The same holds for techniques: tests have been done, but there is still room for further research into how techniques perform when mining real-world medical data sets. This study will further investigate ways to achieve a successful outcome in discovering patterns in a medical data set. The CRISP-DM data mining process will be used, together with the R statistical package for handling noise and missing values.
Zhu & Wu (2004) indicate that powerful, cost-effective tools are necessary to assist the data cleansing process and may help to achieve the level of data quality needed for data mining. A number of algorithms will also be tested on the medical data set to see how well they perform on data that contains noise.

3. METHODOLOGY

3.1. Dataset

The dataset for the project is a pre-recorded dataset provided by external clients, who are kept anonymous. Because of the confidentiality, ethical and legal issues surrounding the dataset, sensitive information had to be removed before we were able to view and use the data. A total of 1286 records of patients with ADR will be used for the data mining project. The dataset includes characteristics of the patients and of the drugs involved in the adverse drug reactions. The information made available in the dataset includes:

- Date when the patient was admitted for ADR
- Age, recorded in days
- Brand, which is the generic drug for the main drug
- Drug that was given to the patient
- Route of administration
- Probability of the drug being the cause of ADR
- Severity of the ADR
- Recovered or not
- UR number, which includes the patient’s details
- ATC (Anatomical Therapeutic Chemical), a classification system for drugs

It is worth noting that, because the attributes are limited and the information is incomplete or missing, only a few attributes were chosen for use.

3.2. Research Process

The project follows the CRISP-DM data mining method, whose consortium defines a six step data mining process as shown in figure 1.

Figure 1: CRISP-DM six step process model (CRISP-DM 2000)

Understand the business: this is where the client, supervisor and team member reviewed the project to decide which direction to take and what the goal of the project was. The main aim of this research is to test techniques to see if patterns are formed using a sparse dataset.
Understand the dataset: for this stage the dataset was reviewed using the Rattle tool, first to give a summary of the attributes as a whole and then to query each attribute separately and visualise the data in various formats. This aided the decision on which attributes to keep for further analysis. Since the attributes in this dataset were limited, a few stood out and were considered for the next phase.

Data preparation: this is where the data went through two extra processes, Data Cleaning and Data Transformation. Both were done in the R tool because scripts make data cleansing and transformation easy to carry out. The objective of this phase was to decide on the structure of the data for the next phase. Five attributes were chosen: Date, Age in Days, Route, Recovered, and ATC code for the drug. These attributes were chosen because they were expected to give better modelling results. Table 1 shows each attribute's abbreviated name and the values it was given.

Abbreviation  Variable and coding
ADRDATE       Date when the patient was admitted to hospital for ADRs (October-March = 1, April-September = 0)
AGE           How old the patient is, categorised into equal numbers of records (0-2 years old = 1, 2-5 years old = 2, 5-11 years old = 3, 11-16 years old = 4, above 16 years of age = 5)
ROUTE         The administration of the medication that caused the ADR, either oral or intravenous (Oral = 1, Intravenous = 0)
RECOV         Recovered from ADRs or not (Recovered = 0, Not recovered = 1)
ATC           The drugs given to the patient, classified as antibiotics or not (Antibiotics = 1, Not antibiotics = 0)

Table 1 shows the coded values that the attributes were given.

Modelling: this phase involved selecting the most appropriate algorithms for the research, which for this study were logistic regression, decision tree, and risk pattern.

Evaluation: this was the last phase of the project, where the models were interpreted and the results determined whether the project objectives were met.
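The coding in Table 1 can be sketched as a small set of functions. This is only an illustration in Python, not the R scripts used in the project, and several details are assumptions that the source does not state: the conversion of 365.25 days per year, the handling of ages that fall exactly on a band boundary, treating any route other than oral or intravenous as missing, and identifying antibiotics by the ATC group J01 (antibacterials for systemic use). The raw field names and the example record are hypothetical.

```python
from datetime import date
from typing import Optional

DAYS_PER_YEAR = 365.25  # assumption: the source does not state the conversion used


def encode_adrdate(admitted: date) -> int:
    """October-March = 1, April-September = 0."""
    return 1 if admitted.month in (10, 11, 12, 1, 2, 3) else 0


def encode_age(age_days: float) -> int:
    """Map age in days to the five age bands (1 to 5) of Table 1."""
    years = age_days / DAYS_PER_YEAR
    # Assumption: each band includes its lower bound and excludes its upper bound.
    for code, upper_bound in enumerate((2, 5, 11, 16), start=1):
        if years < upper_bound:
            return code
    return 5


def encode_route(route: Optional[str]) -> Optional[int]:
    """Oral = 1, Intravenous = 0; anything else is treated as missing (assumption)."""
    if route is None:
        return None
    return {"oral": 1, "intravenous": 0}.get(route.strip().lower())


def encode_recov(recovered: bool) -> int:
    """Recovered = 0, Not recovered = 1."""
    return 0 if recovered else 1


def encode_atc(atc_code: str) -> int:
    """Antibiotics = 1, otherwise 0 (assumption: antibiotic means ATC group J01)."""
    return 1 if atc_code.strip().upper().startswith("J01") else 0


def encode_record(raw: dict) -> dict:
    """Apply the Table 1 coding to one raw record with hypothetical field names."""
    return {
        "ADRDATE": encode_adrdate(raw["date"]),
        "AGE": encode_age(raw["age_days"]),
        "ROUTE": encode_route(raw["route"]),
        "RECOV": encode_recov(raw["recovered"]),
        "ATC": encode_atc(raw["atc"]),
    }


# Hypothetical record for illustration only.
example = {"date": date(2007, 11, 3), "age_days": 900, "route": "Oral",
           "recovered": True, "atc": "J01CA04"}
print(encode_record(example))
```

Keeping each attribute's rule in its own function makes the coding easy to check against Table 1 and easy to change if a different banding or drug grouping is chosen.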
Due to time constraints the results of the three techniques were used to answer the project objectives, and the first three phases were completed only once.

3.3. Data Mining Tool

The data mining tools chosen for the project are R, a package for statistical computing and graphics with programming capabilities, and Rattle, a user interface that can be combined with R. These tools run on a variety of platforms including UNIX, Windows, and MacOS, and R also offers interfaces to other languages and technologies such as Python, XML, SOAP, and Perl. Both packages are free software and provide a sophisticated way of performing data mining. A screenshot of the R and Rattle tools is shown in figure 2.

Figure 2: R and Rattle tool for data mining screenshot.

Rattle is used by many government and private organisations around the world, including the Australian Taxation Office, and is being adopted by a number of colleges and universities for teaching data mining. Combined, R and Rattle provide a good set of data mining algorithms to choose from for modelling. They include clustering, association rules, linear models, trees, and neural network models. Besides the models there are a variety of ways of visualizing the data, such as histograms and plots. Data from almost any source can also be loaded and used.

Most of the data preparation was done in R using its scripting language, and the decision tree and logistic regression were modelled using Rattle. The only other algorithm used for the project was Mining Risk Patterns. The software for this algorithm was run on the Linux 9.0 platform.

3.4. Algorithms

The data mining techniques adopted for the project are logistic regression, decision tree, and the risk pattern mining algorithm. Each of these techniques provides its own way of analysing the medical dataset that was provided. Decision trees and logistic regression have been applied across a wide range of applications, including medical applications.
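To make the techniques named above more concrete, the sketch below computes two measures on which they rest: the relative risk, which risk pattern mining uses to rank patterns, and the odds ratio, which for a logistic regression with a single binary predictor equals the exponentiated coefficient. It is written in Python for illustration only and is not the software used in the project. The counts are invented, and reading the exposure as ATC = 1 and the outcome as RECOV = 1 is merely one possible interpretation using the coding of Table 1.

```python
def relative_risk(a: int, b: int, c: int, d: int) -> float:
    """Risk of the outcome among the exposed divided by the risk among the unexposed.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    """
    if a + b == 0 or c == 0:
        raise ValueError("relative risk is undefined for these counts")
    # Same as (a / (a + b)) / (c / (c + d)), rearranged to use a single division.
    return (a * (c + d)) / (c * (a + b))


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Odds of the outcome among the exposed divided by the odds among the unexposed."""
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return (a * d) / (b * c)


# Invented counts for illustration: 100 exposed and 100 unexposed patients.
exposed_not_recovered, exposed_recovered = 30, 70
unexposed_not_recovered, unexposed_recovered = 10, 90

counts = (exposed_not_recovered, exposed_recovered,
          unexposed_not_recovered, unexposed_recovered)
print("relative risk:", relative_risk(*counts))
print("odds ratio:", round(odds_ratio(*counts), 3))
```

A relative risk or odds ratio well above 1 suggests that the exposed group is more likely to have the outcome, although with a small and noisy data set such a figure would need a confidence interval before any conclusion is drawn.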
Ji et al (2009, p. 2), in reporting Andrews’ study, emphasize the benefits of the logistic regression and decision tree methods for ‘identifying commonalities and differences in medical databases variables’. The risk pattern algorithm has also been applied to medical data for patients on ACE inhibitors who have an allergic event (Li et al 2005). As this project explores the use of a medical dataset to detect adverse drug reactions, it was important to use techniques that are reliable and have been proven to work in similar studies.

Logistic regression is appropriate when variables take two possible values (0, 1), and it can also handle variables with multiple categories. This makes the method useful in this study for determining whether the patient details provided are associated with the patient not recovering from adverse drug reactions. The decision tree is also well suited to binary values and can likewise be modelled with more than two values. It is easily understood because of its tree-like structure, whose leaf nodes can be analysed to determine the patterns found. The last algorithm ‘makes use of antimonotone property to efficiently prune searching space’ (Li et al 2005). Optimal risk pattern mining returns the pattern with the highest relative risk among the patterns discovered. This model is easily interpreted and shows the odds ratio, the risk ratio and the fields associated with the pattern.

4. SCHEDULE

Activities              Date                      Description
Project Plan            August 25, 2008           Ongoing until end of first semester
SRS Document            August 25, 2008           Ongoing until end of first semester
Test Plan               August 25, 2008           Ongoing until end of first semester
Data preparation        August - October 2008     Cleaning and preparation of dataset
Thesis proposal         November 07, 2008         Research presentation
Modelling               November - December 2008  Modelling of dataset
Proof of concept        March 30, 2009            Produce framework and process description
User documentation      May 12, 2009              User guide for the project
Test Results            May 15, 2009              Test results for the techniques used on the dataset
Final Technical Report  May 30, 2009              Deployment (final report for the data mining project)
Research proposal       June 16, 2009             Final written proposal
Research paper          August 7, 2009            Final
Research presentation   September 4, 2009         Final

REFERENCE

Aggarwal, CC & Parthasarathy, S 2001, Mining massively incomplete data sets by conceptual reconstruction, ACM, San Francisco, California.

Brown, ML & Kros, JF 2003, 'Data mining and the impact of missing data', Industrial Management & Data Systems, vol. 103, pp. 611-621.

Cios, KJ & Moore, GW 2002, 'Uniqueness of medical data mining', Artificial Intelligence in Medicine, vol. 26, no. 1-2, pp. 1-24.

CRISP-DM 2000, Cross Industry Standard Process for Data Mining, viewed 27 August 2008, <http://www.crisp-dm.org/Partners/index.htm>.

Li, J, Fu, AW-c, He, H, Chen, J, Jin, H, McAullay, D, Williams, G, Sparks, R & Kelman, C 2005, Mining risk patterns in medical data, ACM, Chicago, Illinois, USA.

Lavrač, N 1999, 'Selected techniques for data mining in medicine', Artificial Intelligence in Medicine, vol. 16, no. 1, pp. 3-23.

Lee, I-N, Liao, S-C & Embrechts, M 2000, 'Data mining techniques applied to medical information', Medical Informatics & the Internet in Medicine, vol. 25, no. 2, pp. 81-102.

Obenshain, MK 2004, 'Application of Data Mining Techniques to Healthcare Data', Infection Control and Hospital Epidemiology, vol. 25, no. 8, pp. 690-695.
Roddick, JF, Fule, P & Graco, WJ 2003, 'Exploratory medical knowledge discovery: experiences and issues', SIGKDD Explor. Newsl., vol. 5, no. 1, pp. 94-99.

Safety of Medicines 2002, A Guide to Detecting and Reporting Adverse Drug Reaction: Why Health Professionals Need to Take Action, WHO publications, viewed 15 April 2008, <http://whqlibdoc.who.int/hq/2002/WHO_EDM_QSM_2002.2.pdf>.

Wang, H & Wang, S 2008, 'Medical knowledge acquisition through data mining', paper presented at the IEEE International Symposium on IT in Medicine and Education (ITME 2008), Xiamen.

Wilson, AM, Thabane, L & Holbrook, A 2003, 'Application of data mining techniques in pharmacovigilance', British Journal of Clinical Pharmacology, vol. 57, no. 2, pp. 127-134.

Zhu, X, Khoshgoftaar, T, Davidson, I & Zhang, S 2007, 'Editorial: Special issue on mining low-quality data', Knowledge and Information Systems, vol. 11, no. 2, pp. 131-136.