APPLYING ARTIFICIAL IMMUNE RECOGNITION SYSTEM TO ENHANCE THE QUALITY OF DIABETES DIAGNOSING

Hung-Chun Lin1, Pa-Chun Wang2, MD, MSc, and Chao-Ton Su3

1 Department of Industry Engineering and Engineering Management, National Tsing Hua University, Hsinchu, Taiwan. E-mail: d9534801@oz.nthu.edu.tw
2 Department of Otolaryngology, Cathay General Hospital, Fu Jen Catholic University School of Medicine, Taipei, Taiwan. E-mail: drtony@tpts4.seed.net.tw
3 Department of Industry Engineering and Engineering Management, National Tsing Hua University, Hsinchu, Taiwan. E-mail: ctsu@mx.nthu.edu.tw

ABSTRACT

Diabetes is a disease of civilization that is not easily diagnosed in its initial stage but may affect a patient very seriously in later stages. According to World Health Organization (WHO) estimates, there will be 370 million diabetics in 2030, or 5.4% of the global population, so it is increasingly important to diagnose whether a person has or is likely to acquire diabetes. This study uses a machine learning method, the Artificial Immune Recognition System (AIRS), to diagnose pregnant women who have symptoms of type 2 diabetes. AIRS was proposed by Andrew Watkins (2001). It draws on the metaphor of the vertebrate immune system: antigen recognition, memory cells, and clonal selection. AIRS also includes a limited-resource mechanism that controls the number of memory cells, and it has shown positive results on the problems to which it has been applied. This study employs AIRS for the classification of type 2 diabetes. Although the diabetes dataset is imbalanced, the overall classification recall still reaches 62.8%, which is better than that of the traditional method, logistic regression.

Keywords: diseases of civilization, AIRS, type 2 diabetes, vertebrate immune system, imbalanced data

INTRODUCTION

With the progress of modern civilization, improved nutrition and changes in lifestyle have increased the number of diabetics in recent years. Diabetes is a disease of civilization that affects other organs. It also requires costly medical treatment. Twenty years ago, there were about 30 million diabetics worldwide. The World Health Organization (WHO) projects 370 million diabetics in 2030, which is 5.4% of the projected population at that time. More than 90% of this growing number of diabetics have type 2 diabetes. In Taiwan, diabetes was the fourth leading cause of death in 2002 and the second leading cause of death among middle-aged women.

Fortunately, some symptoms help doctors detect diabetes in women. For example, gestational diabetes mellitus (GDM) is regarded as a pre-diabetes state and a very dangerous type of diabetes. Ho et al. (2006) investigated the relationship between type 2 diabetes and GDM. The information on these signs, however, is abundant and complicated. Although many doctors can rely on experience when diagnosing patients, diagnosing patients systematically requires a more in-depth understanding. The purpose of this study is to investigate the factors that affect women with GDM and to develop a model that predicts whether women will acquire diabetes.

BACKGROUND

Synopsis of diabetes

Generally, the body converts consumed starch into glucose, the body's fuel. Insulin, a hormone produced by the pancreas, lets glucose enter the cells, where it is used to produce energy.
When a person is afflicted with diabetes, the pancreas cannot produce enough insulin to handle the glucose adequately, so the blood sugar concentration rises (Department of Health, Executive Yuan, R.O.C. (Taiwan)). According to current research, the occurrence of diabetes is related to heredity to some degree, and stress, depression, obesity, pregnancy, drugs, and lack of nutrition can cause diabetes. Diabetics generally have no obvious symptoms in the initial stage and become aware of their condition only after an examination. As blood sugar gradually increases, diabetics may observe symptoms such as hunger, thirst, frequent urination, tiredness, weight loss, blurry vision, or slower wound healing (Formosan Diabetes Care Foundation). In 1979, the American Diabetes Association classified diabetes into four types: type 1 diabetes (Insulin-Dependent Diabetes Mellitus), type 2 diabetes (Non-Insulin-Dependent Diabetes Mellitus), impaired glucose tolerance and GDM, and diabetes caused by other situations (Taipei Veterans General Hospital Endocrinology & Metabolism). From here on, "diabetes" means type 2 diabetes.

Diabetes classification problem

The dataset used in this study was obtained from the medical cases of pregnant women at a particular medical center and was divided into three groups. It is a so-called imbalanced dataset, in which most of the instances come from one or two classes and only a few from the remaining classes. The first group consists of non-diabetic women and is labeled Normal. The second is Pre-DM, which indicates that the women are in a pre-diabetes state. The third is DM, which indicates that the women have diabetes. The dataset has 12 attributes: age, screen, ap_ac, pc100g_1, pc100g_2, pc100g_3, BMIprep, weight_before, increased_weight, NB_weight, DM_history, and birth_order. There are 152 instances.
Ten of these instances are DM, 32 are Pre-DM, and 110 are Normal. In this study, missing values are replaced with the average of the known values of the same attribute.

Previous research

There has been much research in the field of classification. Kaieda and Abe (2004) used kernel principal component analysis (KPCA) to develop a fuzzy classifier with ellipsoidal regions and applied it to datasets from the UCI databank and to datasets used by other researchers, such as a blood cell dataset. In Cao et al. (2006), rough sets were used to predict protein structural classes. In Reaz, Hussain, and Mohd-Yasin (2006), electromyography (EMG) signals were classified with an autoregressive model, an artificial neural network (ANN), and a higher-order statistical (HOS) model. In Boriboonhirunsarn and Sunsaneevithayakul (2008), logistic regression was used to determine the independent risk factors related to normality or abnormality for four sub-groups of a diabetes dataset. These related works indicate that the use of different classification methods is becoming more popular in medical science.

Natural and artificial immune systems

The natural immune system is a distributed pattern detection system. It consists of several functional elements throughout the body. The immune system defends the body through innate and adaptive immune responses. The adaptive immune response is especially important here because it provides abilities such as memory acquisition, diversity, and recognition. Adaptive immunity thus becomes the main line of defense in the body and has three key properties. First, it responds only if an invader is present. Second, it remembers a previous contact with an invader and therefore responds faster after the initial recognition. Third, it can differentiate between self and non-self. The immune system relies on lymphocytes, which are classified into two types, B-cells and T-cells.
T-cells mature in the thymus, while B-cells mature in the bone marrow. Both types take part in the response to invading antigens, and B-cells produce the antibodies that bind to them. Antigens can be thought of as foreign substances, such as viruses, that trigger an immune response. In this simplified view, antigens and antibodies correspond one to one: when foreign antigens invade the body, B-cells produce the corresponding antibodies that bind to them.

The Artificial Immune System (AIS) imitates this mechanism. When a datum enters the system as an antigen, the model generates a corresponding datum as the antibody and stores it in the memory cells so that it can respond immediately when the same situation recurs. The AIS can also produce similar antibodies through mutation in order to respond faster to similar data.

METHOD

Artificial Immune Recognition System (AIRS) classification algorithm

AIRS is a resource-limited supervised learning algorithm. The immune mechanisms it uses are resource competition, clonal selection, maturation, mutation, and memory cell generation. The training and test data items are viewed as antigens in the system. These antigens induce the B-cells in the system to produce artificial recognition balls (ARBs). The ARBs compete with each other for a fixed number of resources. ARBs with more resources get more chances to produce mutated offspring that improve the system. The memory cells generated after all training antigens have been introduced are used to classify the test data items.

The algorithm is composed of five stages: initialization, memory cell identification and ARB generation, competition for resources and development of a candidate memory cell, memory cell introduction, and classification. Figure 1 shows the flowchart of the AIRS algorithm.

Initialization

The first stage of the algorithm is the data pre-processing stage.
In this stage, all items in the dataset are normalized so that the Euclidean distance between any two data items lies in the range [0, 1]. The Euclidean distance is calculated as in (1),

$$\text{Euclidean\_distance}(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2} \tag{1}$$

where $x$ and $y$ are feature vectors and $m$ is the number of attributes in the data. After normalization, the affinity threshold is calculated as in (2),

$$\text{affinity\_threshold} = \frac{\sum_{i=1}^{n} \sum_{j=i+1}^{n} \text{affinity}(ag_i, ag_j)}{n(n-1)/2} \tag{2}$$

where $n$ is the number of training data items, $ag_i$ and $ag_j$ are the ith and jth training antigens in the training vector, and $\text{affinity}(ag_i, ag_j)$ is the Euclidean distance between the two antigens' feature vectors.

Figure 1 – Flowchart of AIRS algorithm. The flowchart runs as follows: input data; initialization (normalization of the data); memory cell identification and ARB generation (cloning the most stimulated memory cells and adding the clones to the population of pre-existing ARBs); competition for resources and development of a candidate memory cell (allocating resources according to the stimulation level and defining the candidate memory cell); memory cell introduction (adding the candidate memory cell to the memory cell pool if some value is achieved); a check whether all training data have been trained, returning to memory cell identification if not; classification (classifying test data according to the memory cells with kNN).

Memory cell identification and ARB generation

In this stage, the first step is to find the memory cell $mc_{match}$ for a given training antigen $ag$. The $mc_{match}$ is defined as $\arg\max_{mc \in MC} \text{stimulation}(ag, mc)$, where $\text{stimulation}(x, y)$ is defined as in (3).

$$\text{stimulation}(x, y) = 1 - \text{Euclidean\_distance}(x, y) \tag{3}$$

Given this definition, the antibody $mc_{match}$ is the memory cell most stimulated by the given antigen among the memory cells of the same category. Once $mc_{match}$ is identified, it is used to create new ARBs, which are introduced into the population of pre-existing ARBs.
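
Equations (1) to (3) and the search for the matching memory cell can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the representation of a memory cell as a (vector, label) pair, and the normalization scheme (min-max scaling of each attribute to [0, 1/sqrt(m)], chosen so that the distance of Eq. (1) cannot exceed 1) are assumptions made only for this illustration; the paper does not state how the normalization is done.

```python
import math
from itertools import combinations


def normalize(rows):
    """Min-max scale each attribute to [0, 1/sqrt(m)].

    Assumed scheme (not specified in the paper): with m attributes, this
    keeps the Euclidean distance of Eq. (1) inside [0, 1].
    """
    m = len(rows[0])
    lows = [min(col) for col in zip(*rows)]
    highs = [max(col) for col in zip(*rows)]
    return [
        [0.0 if hi == lo else (v - lo) / (hi - lo) / math.sqrt(m)
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def euclidean_distance(x, y):
    """Eq. (1): Euclidean distance between two feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def affinity_threshold(antigens):
    """Eq. (2): mean distance over all n(n-1)/2 pairs of training antigens."""
    pairs = list(combinations(antigens, 2))
    return sum(euclidean_distance(a, b) for a, b in pairs) / len(pairs)


def stimulation(x, y):
    """Eq. (3): stimulation is one minus the distance."""
    return 1.0 - euclidean_distance(x, y)


def find_mc_match(antigen, antigen_class, memory_cells):
    """Most stimulated memory cell of the antigen's own class.

    memory_cells is a list of hypothetical (vector, label) pairs.
    Returns None when no memory cell of that class exists yet.
    """
    same_class = [mc for mc in memory_cells if mc[1] == antigen_class]
    return max(same_class,
               key=lambda mc: stimulation(antigen, mc[0]),
               default=None)
```

For example, `affinity_threshold(normalize(training_rows))` would give the threshold for a list of raw attribute rows, and `find_mc_match` only ever compares an antigen with memory cells that carry its own class label.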
The number of new ARBs created from $mc_{match}$ depends on the stimulation value between the memory cell and the antigen.

Competition for resources and development of a candidate memory cell

In this stage, all ARBs currently in the system are awarded resources according to their affinity values. ARBs with higher affinity values receive more resources than those with lower affinity values. If the total number of resources held by all ARBs exceeds the number allowed, the system removes ARBs together with their awarded resources, beginning with the ARBs that hold the fewest resources, until the total is lower than the number allowed in the system. After this process, the stimulation values of all remaining ARBs are calculated, and the maximal and minimal stimulation values are recorded as $maxStim$ and $minStim$, respectively. The stimulation level of each ARB is then recalculated as in (4).

$$arb.stim = \begin{cases} \dfrac{arb.stim - minStim}{maxStim - minStim}, & \text{if class of } arb = \text{class of antigen} \\[2ex] 1 - \dfrac{arb.stim - minStim}{maxStim - minStim}, & \text{otherwise} \end{cases} \tag{4}$$

Then (5) is used to calculate the average of these levels for each class, and each average is compared with a given stimulation threshold. If any average is lower than the threshold, the ARBs belonging to that class are mutated and the generated clones are added to the ARB pool. This process is repeated until the average stimulation levels of all classes are larger than the stimulation threshold.

$$s_i = \frac{\sum_{j=1}^{|ARB_i|} arb_j.stim}{|ARB_i|}, \quad arb_j \in ARB_i, \quad i = 1, 2, \ldots, nc, \quad s = \{s_1, s_2, \ldots, s_{nc}\} \tag{5}$$

where $ARB_i$ is the set of ARBs belonging to the ith class, $|ARB_i|$ is the number of ARBs belonging to the ith class, $arb_j.stim$ is the stimulation level of the jth ARB of the ith class, and $nc$ is the number of classes.

Memory cell introduction

After the criterion described above is met, the ARB with the highest stimulation value in the same class as the presented training antigen is taken as the candidate memory cell, $mc_{candidate}$. If the stimulation value of $mc_{candidate}$ with respect to the training antigen is higher than that of $mc_{match}$, the candidate memory cell is added to the set of memory cells. If this test is passed, the affinity between $mc_{candidate}$ and $mc_{match}$ is then calculated. If the affinity between these two memory cells is lower than the product of the affinity threshold and the affinity threshold scalar, $mc_{match}$ is removed from the set of memory cells, so that $mc_{candidate}$ replaces it.

Classification

After stages 2 to 4 have been repeated for every training antigen, the developed memory cells are ready to be used for classification. Classification follows a k-nearest neighbor approach: the class of a datum is determined by a vote among the k memory cells most stimulated by it.

Used parameters

One advantage of AIRS is that it is not necessary to try all parameter combinations to find the best one, because the features of its architecture make AIRS self-adjusting. According to the experience of Goodman, Boggess & Watkins (2002), a fixed setting of AIRS's parameters yielded a classifier whose accuracy was only a few percentage points below that of the optimal parameter combination. Therefore, we set the parameters as in Table 1.
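
To make the roles of the affinity threshold scalar and of k concrete, here is a minimal Python sketch of the memory cell introduction test and the k-nearest-neighbor vote. It is an illustration under stated assumptions rather than the authors' code: the function names and the (vector, label) representation of a memory cell are hypothetical, ties in the vote are broken by whichever label is encountered first, and the "affinity" between two memory cells is taken to be their Euclidean distance, following the definition given with Eq. (2).

```python
import math
from collections import Counter


def euclidean_distance(x, y):
    """Eq. (1): Euclidean distance between two feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def stimulation(x, y):
    """Eq. (3): stimulation is one minus the distance."""
    return 1.0 - euclidean_distance(x, y)


def introduce_memory_cell(memory_cells, mc_match, mc_candidate, antigen,
                          affinity_threshold, ats=0.2):
    """Return the memory cell set after presenting one training antigen.

    Cells are hypothetical (vector, label) pairs; mc_match is assumed to
    be a member of memory_cells. ats is the affinity threshold scalar.
    """
    cells = list(memory_cells)
    # The candidate enters only if the antigen stimulates it more than mc_match.
    if stimulation(antigen, mc_candidate[0]) > stimulation(antigen, mc_match[0]):
        cells.append(mc_candidate)
        # If the two cells are very close, the candidate replaces mc_match.
        if euclidean_distance(mc_candidate[0], mc_match[0]) < affinity_threshold * ats:
            cells.remove(mc_match)
    return cells


def classify(item, memory_cells, k=1):
    """Vote among the k memory cells most stimulated by the item."""
    ranked = sorted(memory_cells,
                    key=lambda mc: stimulation(item, mc[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

With k = 1, the value used in this study, `classify` simply returns the label of the single most stimulated memory cell. A smaller affinity threshold scalar makes replacement of $mc_{match}$ rarer, so more memory cells are retained.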

Table 1 – Used parameters in AIRS for Diabetes dataset

Parameter                              Value
Number of resources in system          200
Stimulation threshold                  0.9
Mutation rate                          0.1
Clone rate                             10
Hyper mutation rate                    2
Affinity threshold scalar              0.2
k value for k-nearest neighbour        1

Measure for performance evaluation

Because the dataset is imbalanced, performance is measured with the classification recall, defined as in (6),

$$recall(T_j) = \frac{\sum_{i=1}^{|T_j|} assess(t_i)}{|T_j|}, \quad t_i \in T_j, \qquad assess(t) = \begin{cases} 1, & \text{iff } classify(t) = t.c \\ 0, & \text{otherwise} \end{cases} \tag{6}$$

where $T_j$ is the set of data items belonging to class j that are to be classified (the test set), $|T_j|$ is the number of items in $T_j$, $t \in T_j$, $t.c$ is the class of item $t$, and $classify(t)$ returns the classification of $t$ by the developed classifier. The classification recall of each class is calculated, and the overall recall is the average of the three per-class recalls. Furthermore, to make the test results more credible, 5-fold cross-validation is used in this application.

RESULTS AND DISCUSSION

Table 2 compares AIRS with other classifiers. Although AIRS does not have the highest recall on the Normal class, that class is not the focus of the diabetes dataset: the doctors want to use this dataset to find potential diabetics. AIRS achieves the highest recall on the DM and Pre-DM classes, and the highest overall recall, among the three algorithms in the table, so its results can help doctors diagnose whether a pregnant woman is likely to have diabetes. The Overall column is the average of the three per-class recalls; for AIRS, (0.600 + 0.538 + 0.745) / 3 ≈ 0.628.

Table 2 – Classification recall of AIRS for diabetes classification problem with classification recalls obtained by other methods

Algorithm                    DM       Pre-DM   Normal   Overall
Logistic Regression          0.100    0.033    0.882    0.338
Decision Tree with C4.5      0.500    0.252    0.736    0.496
AIRS                         0.600    0.538    0.745    0.628

CONCLUSION

In an era of information explosion, deciding quickly and correctly is a difficult problem, especially for doctors diagnosing patients' symptoms.
More and more people are troubled by this disease. Many diseases should be diagnosed and treated early, and diabetes is no exception. We use AIRS to determine which people are likely to acquire diabetes and to help doctors diagnose patients accurately. According to the results above, AIRS achieved a higher overall recall (62.8%) than the traditional methods, logistic regression and the C4.5 decision tree. It is hoped that future research will yield further interesting results.

REFERENCES

A. Watkins, AIRS: A resource limited artificial immune classifier. Master thesis, Mississippi State University, 2001.

D. Boriboonhirunsarn and P. Sunsaneevithayakul, "Abnormal results on a second testing and risk of gestational diabetes in women with normal baseline glucose levels," International Journal of Gynecology and Obstetrics, 2008, vol. 100, issue 2, pp. 147-153.

Department of Health, Executive Yuan, R.O.C. (Taiwan). URL: http://www.doh.gov.tw/CHT2006/DM/SEARCH_MAIN.aspx?keyword=%u7cd6%u5c3f%u75c5

Formosan Diabetes Care Foundation. URL: http://www.dmcare.org.tw/

H.H. Ho, P.C. Hsieh, C.Y. Li, and H.F. Su, "A relational study of gestational diabetes and type 2 diabetes," Taiwan Journal of Public Health, 2006, vol. 25, no. 2, pp. 143-151.

K. Kaieda and S. Abe, "KPCA-based training of a kernel fuzzy classifier with ellipsoidal regions," International Journal of Approximate Reasoning, 2004, vol. 37, pp. 189-217.

M. B. I. Reaz, M. S. Hussain, and F. Mohd-Yasin, "Techniques of EMG signal analysis: detection, processing, classification and applications," Biological Procedures Online, 2006, vol. 8, pp. 11-35.

Taipei Veterans General Hospital Endocrinology & Metabolism. URL: http://homepage.vghtpe.gov.tw/~meta/dm.htm#DM_basic

Y. Cao, S. Liu, L. Zhang, J. Qin, J. Wang, and K. Tang, "Prediction of protein structural class with Rough Sets," BMC Bioinformatics, 2006, vol. 7, Article Number 20.