Survey on Association Rule Mining Algorithms in Medical Domain

Archana Sasi, Vidhu Kiran G & Sony P
Dept. of Computer Science, College of Engineering Cherthala, Cherthala, India
archana.sasi2k8@gmail.com, vidhukiran5690@gmail.com, spsony@gmail.com

Abstract: Medicine is an ever-growing and evolving field of study in which a great deal of research takes place. Knowledge discovery from medical databases can play a vital role in effective medical diagnosis, and the availability and extraction of useful medical information is a crucial factor in this field. Association rule mining, one of the data mining techniques, helps in finding relevant relationships among the terms present in diverse documents. Over the years, many ideas have been put forward for the generation of frequent itemsets, and the main concern is how efficiently an algorithm can generate them. This study analyzes several frequent itemset generation algorithms that are useful in the medical domain. The majority of medical data is static in nature, so the algorithms concerned need only handle static data. The study reviews a few algorithms that handle medical data efficiently, including Apriori, DHP, Eclat and FP-growth.

Keywords: Association Rule, Apriori, DHP, Eclat, Frequent Itemsets mining, FP-growth

I. INTRODUCTION

Database sizes have recently grown at a great rate. This has led to growing interest in tools capable of automatically extracting knowledge from data [6]. Data mining is a collection of techniques for the efficient, automated discovery of previously unknown patterns and associations in databases, which may hold semi-structured or unstructured data. The problem of mining frequent itemsets first arose as a sub-problem of mining association rules.
Frequent itemsets play a vital role in many data mining applications that try to find interesting patterns in databases, such as association rules [6], clusters, classifiers and correlations.

Medicine has a long history of discovery and invention. To help biomedical researchers, many efficient and important resources such as MEDLINE, PubMed and SNOMED CT are available today. MEDLINE [7] contains citations and summaries of biomedical literature from around the world. PubMed [7] provides free access to the MEDLINE [15] database of references and abstracts on biomedical topics. SNOMED CT [16] offers codes, synonyms, terms and definitions from a collection of medical terms that is systematically organized and processable by computers [8]. Alongside these resources, information technology continues to advance, so it is essential to produce relevant association rules between the medical terms in medical documents. In this study, we make use of a domain ontology called the Unified Medical Language System (UMLS) to recognize medical terms. The documents are mined to produce association rules between medical terms such as diseases, medications and symptoms. Here, we explain the basic frequent itemset and association rule mining problems that occur in medical data sets, and we describe the main techniques used to solve these problems.

II. MEDICAL DATA FORMATS

Clinical medical records contain plenty of information that can prove indispensable for clinical research. These reports are preserved in free-text form. They include Electronic Medical Records (EMR), Electronic Health Records (EHR) and Electronic Patient Records (EPR). Though these terms are often used interchangeably, they can be defined differently. An EMR is a systematic collection of electronic health information about a single patient or population.
It is the digital format of the patient records created at hospitals and similar environments [1], and it can serve as a data source for the EHR. An EHR is a systematic collection of electronic information generated and maintained within an institution such as a hospital so that a patient's medical records can be accessed across facilities. An EPR is an organizational record of treatment generated when a patient is treated.

Since the available medical data are not always structured, pre-processing is essential to reduce the number of inconsistencies in the data and to improve its quality [9]. The available medical reports are either unstructured or semi-structured, so it is important to convert them into a structured format before applying association rule mining algorithms. The data set of concern here consists of Electronic Medical Records (EMR) [9], that is, medical reports stored in electronic file format. EMRs normally follow a defined structure. If the EMRs are properly structured, applying association rules is a much easier task; when an EMR is not properly structured, it becomes very difficult to analyze [1]. Hence it is of great importance to obtain the medical data set in a structured format.

Although structured medical reports are required for efficient data mining, this is not yet completely achievable. Well-structured medical reports are not currently available in bulk, because generating and maintaining clinical reports represents a major cost to the health care enterprise. So rather than requiring completely structured data sets, it is easier to use the semi-structured medical reports that are readily available.
III. TEXT MINING IN MEDICAL FIELD

With the growth of the web, technical documentation, digital libraries, medical data and so on, access to a huge amount of textual documents has become easier [5]. The richness and ambiguity of natural language make text mining more challenging. Mining and examining useful information from documents written in natural language is very hard [6][8]. Knowledge Discovery from Databases (KDD) denotes extracting interesting and non-trivial relations and patterns from data in large databases [8].

Data mining activities mainly consist of description and visualization, finding associations between data elements, clustering of similar records, classification of data [9], regression, and prediction based on trends that can be extracted from data [18]. The medical field has enough sources to generate large and dynamic data warehouses, which are hard for professionals to analyze without the aid of computers [4]. There are numerous well-established data mining techniques in the medical field concerning healthcare management, evaluating the effectiveness of treatments, clinical trials, electronic patient records, etc. Data mining approaches such as clustering, classification, regression, association rule mining and CART are widely used in the health care domain [9].

In spite of the clear benefits, data mining analysis techniques have many limitations. Because data are distributed across different settings, accessing them is difficult. The available data may be imperfect, incomplete, degraded or inconsistent [18].

Data mining techniques have shown significant improvement in the medical industry in terms of prediction and decision making for various diseases such as cancer, renal abnormalities, diabetes and others [9]. Some of the data mining techniques used in the medical domain include decision tree algorithms [18] such as ID3 [19], C4.5 and CART for decision support in medical databases [17][18], an outlier prediction technique for improving classification accuracy [9], the combined use of K-means, SOM and Naïve Bayes for accurate classification of medical data [12], and Fuzzy Cognitive Maps [11] for classification of drugs and health effects.

Data mining applications can greatly help all parties involved in the medical field. For example, data mining can help medical insurers spot fraud and abuse, help physicians recognize effective treatments and best practices, and help patients get better and more affordable medical services. The vast amount of data generated by medical transactions is too large and complex to be processed and analyzed by conventional methods. Data mining provides the technologies and methods to turn this mass of data into helpful information for decision making [9]. Data mining applications can also be developed to estimate the effectiveness of medical treatments: by examining causes, symptoms and modes of treatment, data mining can indicate which courses of action prove effective.

IV. ROLE OF ONTOLOGY IN MEDICAL DOMAIN

In recent years, various ontologies have been developed in the biomedical field with the intent of representing biomedical terms in a common vocabulary [16]. Ontologies used in biomedical domains include OpenCyc, WordNet [10], GALEN, UMLS, SNOMED-CT [8], FMA and Gene Ontology. Since most medical terms are multi-word phrases, it is not easy to explore all combinations of sequential words in a sentence against a fixed domain ontology [5][8]. Instead, we use a POS tagger and then apply ordered patterns to get a list of candidate terms, and then check whether they exist in the ontology [5][6].
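The candidate-term step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the input is already POS-tagged, the pattern "(ADJ or NOUN)* NOUN" stands in for the ordered patterns, and the small `ONTOLOGY` set is a hypothetical stand-in for a real ontology lookup. None of the names below come from the study itself.

```python
from typing import List, Set, Tuple

Tagged = Tuple[str, str]  # (word, POS tag); the tag names are assumed for this sketch


def candidate_terms(tagged: List[Tagged], max_len: int = 4) -> List[str]:
    """Collect word sequences matching (ADJ|NOUN)* NOUN, longest first per start position."""
    candidates: List[str] = []
    n = len(tagged)
    for start in range(n):
        for end in range(min(n, start + max_len), start, -1):
            span = tagged[start:end]
            ends_in_noun = span[-1][1] == "NOUN"
            only_adj_noun = all(tag in {"ADJ", "NOUN"} for _, tag in span)
            if ends_in_noun and only_adj_noun:
                candidates.append(" ".join(word.lower() for word, _ in span))
    return candidates


def known_terms(tagged: List[Tagged], ontology: Set[str]) -> List[str]:
    """Keep only the candidates that exist in the ontology."""
    return [term for term in candidate_terms(tagged) if term in ontology]


# Hypothetical toy ontology and pre-tagged sentence.
ONTOLOGY = {"chronic kidney disease", "kidney disease", "anemia"}
SENTENCE = [
    ("Patient", "NOUN"), ("has", "VERB"), ("chronic", "ADJ"),
    ("kidney", "NOUN"), ("disease", "NOUN"), ("and", "CONJ"), ("anemia", "NOUN"),
]

found = known_terms(SENTENCE, ONTOLOGY)
```

Restricting candidates to a part-of-speech pattern keeps the number of ontology lookups far below what testing every word sequence in the sentence would require.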
In this study we use a medical ontology called the Unified Medical Language System (UMLS), produced by the United States National Library of Medicine (NLM) [20]. UMLS, customized using MetamorphoSys, contains a large number of vocabularies that define a vast number of medical terms. The Concept Unique Identifier (CUI), present in a UMLS data file called MRCONSO, is used to discover the semantic type of a term [22]. One of the UMLS Knowledge Sources, the SPECIALIST Lexicon, provides the word usage information needed by the SPECIALIST NLP system. These tools are designed to help users manage lexical variation in biomedical text.

The relations that hold among lexemes and their senses include homonymy, hyponymy, hypernymy, polysemy and meronymy [21]. Homonymy is a relation that holds between words that have the same form but unrelated meanings. Hypernymy is the semantic relation in which one word is the more general term (the hypernym) of another; it captures an "is-a" relation, and hyponymy is its converse. Meronymy is a semantic relation that represents a constituent part of a whole. Finally, if a term is present in the UMLS database, we keep it and continue to search for other terms from the current position.

V. PROBLEM STUDY

Advances in digital libraries, medical data, the web and technical documentation have resulted in the availability of a large amount of textual documents [5]. These textual data contain useful and important information that is worth harnessing, but manual analysis and extraction of this information is not feasible. It is therefore essential to discover valuable relationships between the various terms in the textual documents. Association rule mining [10][24] helps to find these relationships. Medical diagnosis is one of the major beneficiaries of data mining, especially association rule mining [3].
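Association rules are usually ranked by two standard measures, support and confidence. The minimal sketch below makes them concrete. The toy reports and the function names are illustrative assumptions, not data or code from this study.

```python
from typing import FrozenSet, Iterable, List

Document = FrozenSet[str]  # the set of medical terms found in one report


def support(itemset: Iterable[str], documents: List[Document]) -> float:
    """Fraction of documents that contain every term of `itemset`."""
    items = frozenset(itemset)
    if not documents:
        return 0.0
    hits = sum(1 for doc in documents if items <= doc)
    return hits / len(documents)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               documents: List[Document]) -> float:
    """support(antecedent and consequent together) divided by support(antecedent)."""
    base = support(antecedent, documents)
    if base == 0.0:
        return 0.0
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, documents) / base


# Hypothetical toy corpus of four reports.
DOCS = [
    frozenset({"hypertension", "proteinuria", "chronic kidney disease"}),
    frozenset({"hypertension", "chronic kidney disease"}),
    frozenset({"hypertension", "diabetes"}),
    frozenset({"diabetes", "proteinuria"}),
]

# Rule: hypertension => chronic kidney disease
rule_support = support({"hypertension", "chronic kidney disease"}, DOCS)
rule_confidence = confidence({"hypertension"}, {"chronic kidney disease"}, DOCS)
```

In this toy corpus the rule holds in two of the four reports (support 0.5) and in two of the three reports that mention hypertension (confidence 2/3).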
Even though the medical sector is loaded with information, the knowledge that can be obtained from this information is still very limited and of poor quality. Data mining applications can considerably change the situation. The general diagnosis process normally involves many hypothetical and probabilistic assumptions. Association rules can turn these assumptions into patterns that assist physicians in easier disease diagnosis and more accurate treatment.

The task of mining frequent itemsets [16] arose first as a sub-problem of mining association rules, but it is challenging. The search space is exponential in the number of terms occurring in the database. Such databases can also be massive, involving millions of documents, which makes counting support an expensive task, since computing the supports of a collection of itemsets requires access to the database. An important consideration in most algorithms is therefore the representation of the document database [5]. The two most commonly used layouts for representing the database matrix are the horizontal data layout and the vertical data layout. Breadth-first search algorithms typically use the horizontal data layout, and depth-first search algorithms the vertical data layout [23].

A. Breadth-First Search Algorithms

A1. Apriori Algorithm

One of the most popular algorithms that solves the complete association rule mining problem is the Apriori algorithm [4]. Frequent itemset mining is a sub-problem of association rule mining, and it is the difficult phase of the Apriori algorithm [18]. The algorithm performs a breadth-first search through the search space of all sets by iteratively generating and counting a group of candidate sets [4]. An itemset is treated as a candidate only if all of its subsets are frequent.
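The level-wise search can be illustrated with a compact sketch. It is a simplified rendering of the general Apriori idea (join, prune, count) and not the implementation evaluated in this paper. The function name `apriori`, the absolute threshold `min_count` and the toy reports are assumptions made for the example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set

Itemset = FrozenSet[str]


def apriori(documents: List[Set[str]], min_count: int) -> Dict[Itemset, int]:
    """Return every itemset contained in at least `min_count` documents."""
    # Pass 1: count single terms.
    counts: Dict[Itemset, int] = {}
    for doc in documents:
        for term in doc:
            key = frozenset([term])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 2
    while frequent:
        candidates: Set[Itemset] = set()
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: a candidate needs all of its (k-1)-subsets to be frequent.
            if all(frozenset(sub) in frequent for sub in combinations(union, k - 1)):
                candidates.add(union)
        # Counting step: one pass over the documents per level.
        counts = {c: 0 for c in candidates}
        for doc in documents:
            for c in candidates:
                if c <= doc:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result


# Hypothetical toy corpus: the terms recognized in five reports.
REPORTS = [
    {"hypertension", "anemia", "edema"},
    {"hypertension", "anemia"},
    {"hypertension", "edema"},
    {"anemia", "edema"},
    {"hypertension", "anemia", "edema"},
]

frequent_sets = apriori(REPORTS, min_count=3)
```

With a threshold of three documents, all single terms and all pairs are frequent in this toy corpus. The triple survives the prune step and becomes a candidate, but it is rejected in the counting step because only two reports contain it.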
Since frequent itemset mining is the challenging phase of this algorithm, the confident association rules can be generated easily once all frequent itemsets have been found. This algorithm does not appear to fully exploit the monotonicity of confidence [13]. Fortunately, if the number of frequent and confident association rules is not too large, the time required to find all such rules is dominated by the time needed to find all frequent sets.

Optimizations

After the introduction of the Apriori algorithm, a number of other algorithms were proposed that keep the same general structure but add techniques to optimize certain steps. Since the performance of this algorithm is almost completely determined by the support counting procedure, most research has concentrated on that aspect. To count the occurrences of the candidate itemsets, there is no need to examine the whole database each time: documents (transactions) that can no longer contribute can be pruned. This reduces the time required to count the support of candidate itemsets and thereby improves performance. Transaction pruning is used in the following two algorithms: AprioriTid and DHP.

A2. AprioriTid, AprioriHybrid

Along with the Apriori algorithm, Agrawal et al. introduced two further algorithms, AprioriTid and AprioriHybrid. The AprioriTid algorithm reduces the time required for the support counting procedure by replacing every document in the database with the set of candidate itemsets that occur in that document. The main difference from Apriori is that it does not use the whole database to count the support of candidate itemsets after the first pass. Although this algorithm is much faster in later iterations, it is much slower in early iterations. In later iterations, each entry may be smaller than the corresponding document because very few candidates may be present in that document. In early iterations, however, each entry may be larger than its corresponding document.
This trade-off led the authors to propose another algorithm that joins Apriori and AprioriTid into a single hybrid called AprioriHybrid. This algorithm uses Apriori for the first passes and then switches to AprioriTid when document pruning becomes more effective. Exactly when to switch between the algorithms is not obvious, but the authors showed that it can be decided heuristically. AprioriHybrid almost always performs better than Apriori.

A3. DHP Algorithm

The Direct Hashing and Pruning (DHP) algorithm aims to overcome some of the drawbacks of the Apriori algorithm by reducing the number of candidate k-itemsets, especially the 2-itemsets, since that is the key to improving performance. DHP is designed to be efficient in generating frequent itemsets and effective in trimming the document database, either by discarding terms from documents or by eliminating whole documents that need not be scanned. The algorithm uses a hash-based technique to reduce the number of candidate itemsets generated in the initial pass. It can be divided into three parts. The first part finds all the frequent 1-itemsets and all the candidate 2-itemsets. The second part is the more general part, which includes hashing, and the third part omits hashing. Both of these parts include pruning. Part two is used for early iterations and part three for later iterations. A major factor in the effectiveness of the algorithm is the reduction of the database: because of this efficient pruning, the algorithm can attain a significantly shorter execution time than the Apriori algorithm.

B. Depth-First Search Algorithms

B1. Eclat Algorithm

The Eclat algorithm is essentially a depth-first search algorithm. Instead of explicitly listing all documents [4], it uses a vertical database layout.
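To make the vertical layout concrete, the sketch below stores, for each term, the set of ids of the documents that contain it (its cover) and obtains the support of larger itemsets by intersecting covers during a depth-first traversal. It is a minimal illustration of the Eclat idea and not the implementation evaluated in this paper. The names `to_vertical`, `eclat` and `min_count` and the toy reports are assumptions.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def to_vertical(documents: List[Set[str]]) -> Dict[str, Set[int]]:
    """Map each term to its cover: the ids of the documents containing it."""
    covers: Dict[str, Set[int]] = {}
    for doc_id, doc in enumerate(documents):
        for term in doc:
            covers.setdefault(term, set()).add(doc_id)
    return covers


def eclat(documents: List[Set[str]], min_count: int) -> Dict[FrozenSet[str], int]:
    """Depth-first frequent itemset mining by cover intersection."""
    result: Dict[FrozenSet[str], int] = {}

    def grow(prefix: FrozenSet[str], items: List[Tuple[str, Set[int]]]) -> None:
        # Visit terms in ascending order of support (ties broken by name).
        items = sorted(items, key=lambda pair: (len(pair[1]), pair[0]))
        for i, (term, cover) in enumerate(items):
            itemset = prefix | {term}
            result[itemset] = len(cover)
            # Extend only with terms later in the order; the intersection of
            # the two covers is the cover of the extended itemset.
            extensions = []
            for other, other_cover in items[i + 1:]:
                shared = cover & other_cover
                if len(shared) >= min_count:
                    extensions.append((other, shared))
            if extensions:
                grow(itemset, extensions)

    covers = to_vertical(documents)
    grow(frozenset(), [(t, c) for t, c in covers.items() if len(c) >= min_count])
    return result


# Hypothetical toy corpus: the terms recognized in five reports.
REPORTS = [
    {"hypertension", "anemia", "edema"},
    {"hypertension", "anemia"},
    {"hypertension", "edema"},
    {"anemia", "edema"},
    {"hypertension", "anemia", "edema"},
]

vertical = to_vertical(REPORTS)
frequent_sets = eclat(REPORTS, min_count=3)
```

Once the covers are built, the original documents are never consulted again: every support value is simply the size of an intersection.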
That is, each term is stored together with its cover, the set of documents that contain it, and the support of an itemset [5] is computed with an intersection-based approach. The main advantage of the vertical data layout is that the support of a set can be computed much more easily by intersecting the covers of two of its subsets that together give the set itself. Each frequent itemset generated is added to the output set. Eclat does not fully exploit the monotonicity property [4], so the number of candidate itemsets it generates is much larger than for the breadth-first search algorithms. Compared with the previously described algorithms, Eclat generates candidate itemsets using only the join step of Apriori, because the itemsets necessary for the prune step are not available. To reduce the number of candidate itemsets generated, the terms in the document database can be reordered in ascending order of support. This reduces the number of intersections that need to be computed and thus the total size of the covers of all generated itemsets. The reordering is performed at every recursion step of the algorithm. Compared with Apriori, the supports of all itemsets can be calculated more efficiently.

B2. FP-GROWTH

One of the most popular algorithms that mines frequent patterns [14] without candidate generation is the FP-growth algorithm. It uses an approach that is entirely different from that of methods based on the Apriori algorithm. To store the database in main memory, FP-growth uses a combination of the horizontal and vertical database layouts to compute the supports of all generated itemsets [5]. The main goal of this algorithm was to remove the bottlenecks of the Apriori algorithm in generating and testing candidate itemsets.
Instead of storing the cover for every term in the database, it stores the actual documents from the database in a tree structure, in which every term has a linked list going through all documents that contain that term [3]. This data structure is called the FP-tree, or Frequent-Pattern tree [15]. The major difference between frequent pattern growth (FP-growth) and the other algorithms is that FP-growth only tests itemsets instead of generating candidates, whereas the Apriori algorithm generates the candidate itemsets and then tests them. The motivation for the FP-tree method is as follows: to find the association rules, only the frequent terms are required, so it is best to find the frequent terms and ignore the others; and if a compact structure is used to store the frequent items, the original document database does not need to be accessed repeatedly.

The FP-growth algorithm is somewhat similar to the Eclat algorithm; it adds some steps to the recursion step of Eclat in order to maintain the FP-tree structure. One advantage of the FP-tree approach is that it avoids scanning the database more than twice to calculate the support counts. Another benefit is that it completely eliminates candidate generation, which can be particularly costly for the Apriori algorithm for the candidate set C2. The FP-growth algorithm performs better than the Apriori algorithm when the document database is large and the minimum support count is low. A low minimum support count lets many terms satisfy the support threshold, and hence the candidate sets for Apriori become large. FP-growth uses a more efficient structure to mine patterns as the database grows. The main difference between Eclat and FP-growth is the way they count the supports of every candidate itemset and how they represent and maintain the i-projected database.

VI. PERFORMANCE EVALUATION

A data set of 400 patient medical transcription reports from the domain of kidney disease was collected for the study. The reports were in semi-structured format. The task was to find the number of rules that contain Chronic Kidney Disease.

The general architecture of the system is shown below:

[Figure: general architecture of the system]

The input to the system consists of both structured and semi-structured documents. In the initial phase these are converted into structured documents. In the second phase, the association rules are generated by first generating the frequent itemsets using the Eclat algorithm. We used precision and recall to evaluate system performance. The table below shows the precision and recall values obtained for finding the association rules using the above-mentioned algorithms, and the graph shows these values.

ARM ALGORITHMS | PRECISION | RECALL
APRIORI | 83.98% | 82.14%
APRIORI TID | 85.11% | 83.24%
DHP | 79.16% | 78.29%
FP GROWTH | 88.45% | 86.26%
ECLAT | 91.34% | 89.83%

[Figure: PRECISION-RECALL GRAPH]

Analysis of these results shows that, on this data set, the depth-first search algorithms consistently outperform the breadth-first search algorithms in the number of rules generated. Among the breadth-first search algorithms, AprioriTid produces the most rules. Even though DHP performs well during the initial stages, it generates fewer rules later because of hash collisions. Among the depth-first search algorithms, Eclat produces more rules than FP-growth.

VII. CONCLUSIONS

The main aim of the paper was to examine various algorithms that perform association rule mining. The medical data was subjected to association rule mining using various algorithms, and it was observed that the result of mining differed between algorithms. Though the variation was not large, the result was based on the performance of each algorithm on the same set of semi-structured medical reports.
Hence the results might vary according to the type of medical reports provided as input to these algorithms. In this paper we presented a comparative performance study of breadth-first search and depth-first search algorithms. The study includes a detailed analysis of these algorithms, which contributes to the effort to improve the efficiency of frequent itemset mining. It can also be used for comparing other algorithms, which may lead to ideas for optimizations.

VIII. REFERENCES

[1] Xu Yabin and Peng Hongyu, "Disease Association Discovery Based on XML EMR", Second International Workshop on Education Technology and Computer Science (ETCS), Volume 3, March 2010, ISBN: 978-1-4244-6389-3
[2] Patil B.M., Joshi R.C. and Toshniwal D., "Association Rule for classification of Type-2 Diabetic Patients", Second IEEE International Conference on Machine Learning and Computing (ICMLC), Feb 2010, ISBN: 978-1-4244-6007-6
[3] Xiaohua Zhou, Hyoil Han, Isaac Chankai, Ann A. Prestrud and Ari D. Brooks, "Converting Semi-structured Clinical Medical Records into Information and Knowledge", Proceedings of the 21st International Conference on Data Engineering, April 2005, ISBN: 0-7695-2657-8
[4] Tianxia Gong, Chew Lim Tan, Tze Yun Leong, Cheng Kiang Lee, Boon Chuan Pang, C. C. Tchoyoson Lim, Qi Tian, Suisheng Tang, Zhuo Zhang, "Text Mining in Radiology Reports", Eighth IEEE International Conference on Data Mining, 2008, ISSN: 1550-4786
[5] Hany Mahgoub, Dietmar Rosner, Nabil Ismail and Fawzy Torkey, "A Text Mining Technique Using Association Rules Extraction", International Journal of Information and Mathematical Sciences, 2008, ISBN: 978-1-4577-2033-8
[6] Vaishali Bhujade and N.J. Janwe, "Knowledge Discovery in Text Mining Technique Using Association Rules Extraction", International Conference on Computational Intelligence and Communication Networks, Oct 2011, ISBN: 978-1-4577-2033-8
[7] K. Bretonnel Cohen, Wendy W. Chapman, "Current issues in biomedical text mining and natural language processing", Journal of Biomedical Informatics 42 (2009)
[8] Tasha R. Inniss, John R. Lee, Mare Light, Michael A. Grassi, George Thomas, Andrew B. Williams, "Towards Applying Text Mining and Natural Language Processing for Biomedical Ontology Acquisition", Proceedings of the First International Workshop on Text Mining in Bioinformatics, ISBN: 1-59593-526-6
[9] Vili Podgorelec, Marjan Heriko, Maribor (n.d.), "Improving Mining of Medical Data by Outliers Prediction", Proceedings of 18th IEEE Symposium on June 2005, ISSN: 1063-7125
[10] George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross and Katherine Miller, "Introduction to WordNet: An On-line Lexical Database", 2010 Second International Conference on Machine Learning and Computing
[11] Wojciech Froelich, Alicja Wakulicz-Deja (2009), "Mining Temporal Medical Data Using Adaptive Fuzzy Cognitive Maps", Human System Interactions, 2009, HSI'09, 2nd Conference on, May 2009, E-ISBN: 978-1-4244-3960-7
[12] Syed Zahid Hassan and Brijesh Verma, "A Hybrid Data Mining Approach for Knowledge Extraction and Classification in Medical Databases", IEEE Seventh International Conference on Intelligent Systems Design and Applications, Oct 2007, ISBN: 978-0-7695-2976-9
[13] Umair Abdullah (2008), "Analysis of Effectiveness of Apriori Algorithm in Medical Billing Data Mining", ICET 2008 Fourth International Conference, Oct 2008, E-ISBN: 978-1-4244-2211-1
[14] Xiaoyong Lin and Qunxiong Zhu (2010), "Share-Inherit: A novel approach for mining frequent patterns", IEEE Eighth World Congress on Intelligent Control and Automation (WCICA), July 2010, ISBN: 978-1-4244-6712-9
[15] Carson Kai-Sang Leung, Christopher L. Carmichael and Boyu Hao (2007), "Efficient Mining of Frequent Patterns from Uncertain Data", Seventh IEEE International Conference on Data Mining Workshops (ICDM), Oct 2007, E-ISBN: 978-0-7695-3033-8
[16] Thanh-Trung Nguyen (2010), "An Improved Algorithm for Frequent Patterns Mining Problem", International Symposium on Computer Communication Control and Automation, May 2010, ISBN: 978-1-4244-5565-2
[17] Jenn-Lung Su, Guo-Zhen Wu, I-Pin Chao (2001), "The Approach Of Data Mining Methods For Medical Database", Proceedings of the 23rd Annual International Conference of the IEEE on Engineering in Medicine and Biology Society, Vol 4, 2001, ISSN: 1094-687X
[18] Sam Chao, Fai Wong, "An Incremental Decision Tree Learning Methodology regarding Attributes In Medical Data Mining", Proceedings of the Eighth International Conference on Machine Learning and Cybernetics, Baoding, 12-15 July 2009, E-ISBN: 978-1-4244-3703-0
[19] Lehnert, W., Soderland S., Aronow, ., Feng, F., and Shmueli, A., "Inductive Text Classification for Medical Applications", Journal for Experimental and Theoretical Artificial Intelligence, 1994, 7(1), pp. 49-80
[20] UMLS Reference Manual, Bethesda (MD): National Library of Medicine, September 2009
[21] Danushka Bollegala, Yutaka Matsuo and Mitusuru Ishizuka, "A Relational Model of Semantic Similarity between Words using Automatically Extracted Lexical Pattern Clusters from the web", Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing
[22] Alan R. Aronson, "Effective Mapping of Biomedical Text to the UMLS Metathesaurus: The MetaMap Program", Proceedings of the American Medical Informatics Association (AMIA) Symposium, 2001, pp. 17-21
[23] Jochen Hipp, Ulrich Güntzer and Gholamreza, "Algorithms for Association Rule Mining: A General Survey and Comparison", ACM SIGKDD, July 2000, Volume 2
[24] Haiwei Pan, Qilong Han, Guishng Yin and Wei Zhang, "Association rule mining with domain knowledge constraint", Third International Conference on Intelligent System and Knowledge Engineering, Volume 1, Nov 2008, E-ISBN: 978-1-4244-2197-8
[25] R.K. Taira, V. Bashyam and H. Kangarloo, "A field theoretical approach to medical natural language processing", IEEE Transactions on Information Technology in Biomedicine, 11(4), July 2007, ISSN: 1089-7771
[26] Li-Shiang T, Seunghyun I, "Mining Generalized Actionable Rules Using Concept Hierarchies", Fifth International Joint Conference on INC, IMS and IDC, August 2009, E-ISBN: 978-0-7695-3769-6