APPLYING ARTIFICIAL IMMUNE RECOGNITION SYSTEM TO
ENHANCE THE QUALITY OF DIABETES DIAGNOSING
Hung-Chun Lin1, Pa-Chun Wang, MD, MSc2, and Chao-Ton Su3
1 Department of Industrial Engineering and Engineering Management, National Tsing Hua University, Hsinchu, Taiwan
E-mail: d9534801@oz.nthu.edu.tw
2 Department of Otolaryngology, Cathay General Hospital, Fu Jen Catholic University School of Medicine, Taipei, Taiwan
E-mail: drtony@tpts4.seed.net.tw
3 Department of Industrial Engineering and Engineering Management, National Tsing Hua University, Hsinchu, Taiwan
E-mail: ctsu@mx.nthu.edu.tw
ABSTRACT
Diabetes is a disease of civilization that is not easily diagnosed in its initial stage but may
affect a patient very seriously in later stages. The World Health Organization (WHO) estimates that
there will be 370 million diabetics, 5.4% of the global population, in 2030, so it is becoming
increasingly important to diagnose whether a person has or is likely to acquire diabetes. This study
uses a machine learning method, the Artificial Immune Recognition System (AIRS), to diagnose pregnant
women who have symptoms of type 2 diabetes. AIRS was proposed by Andrew Watkins (2001). It uses the
metaphor of the vertebrate immune system, drawing on antigen recognition, memory cells, and clonal
selection. Additionally, AIRS includes a limited-resource mechanism to control the number of memory
cells. It has also shown positive results on the problems to which it has been applied. This study
applies AIRS to the classification of type 2 diabetes. Although the diabetes dataset is imbalanced, the
overall classification recall still reaches 62.8%, which is better than that of the traditional method,
logistic regression.
Keywords: diseases of civilization, AIRS, type 2 diabetes, vertebrate immune system, imbalanced data
INTRODUCTION
With the progress of modern civilization, improved nutrition and changes in lifestyle have increased
the number of diabetics in recent years. Diabetes is a disease of civilization that affects other
organs. It is also an illness that requires costly medical treatment. Twenty years ago, there were
about 30 million diabetics worldwide. The World Health Organization (WHO) projects 370 million
diabetics in 2030, which is 5.4% of the population at that time. Among the rising number of diabetics,
more than 90% have type 2 diabetes. In Taiwan, diabetes was the fourth leading cause of death in 2002
and the second leading cause of death among middle-aged women. Fortunately, some symptoms help doctors
detect diabetes in women. For example, gestational diabetes mellitus (GDM) is regarded as a
pre-diabetes state and a very dangerous type of diabetes. Ho et al. (2006) sought to identify the
relationship between type 2 diabetes and GDM. The information on these signs, however, is extensive
and complicated. Although many doctors can rely on their experience when diagnosing patients,
diagnosing patients systematically requires a more in-depth understanding. The purpose of this study
is to investigate the factors that affect women with GDM and to develop a model to predict whether
women will acquire diabetes.
BACKGROUND
Synopsis of diabetes
Generally, the body converts consumed starch into glucose, which serves as its fuel. Insulin, a
hormone produced by the pancreas, enables glucose to enter the cells, where it is used to produce
energy. In a person with diabetes, the pancreas cannot produce enough insulin for glucose to enter the
cells adequately, which makes the blood sugar concentration rise (Department of Health, Executive
Yuan, R.O.C. (Taiwan)).
According to current research, the occurrence of diabetes is hereditary to some degree, and stress,
depression, obesity, pregnancy, drugs, and malnutrition can cause diabetes. Diabetics generally have
no obvious symptoms in the initial stage and become aware of their condition only after an
examination. As blood sugar gradually increases, diabetics may observe symptoms such as hunger,
thirst, frequent urination, tiredness, weight loss, blurry vision, or slower wound healing (Formosan
Diabetes Care Foundation).
In 1979, the American Diabetes Association classified diabetes into four types: type 1 diabetes
(Insulin-Dependent Diabetes Mellitus), type 2 diabetes (Non-Insulin-Dependent Diabetes Mellitus),
glucose tolerance impairment and GDM, and diabetes caused by other situations (Taipei Veterans
General Hospital Endocrinology & Metabolism). Hereafter, "diabetes" refers to type 2 diabetes.
Diabetes classification problem
The dataset used in this study was obtained from the medical cases of pregnant women at a particular
medical center and was divided into three groups. It is a so-called imbalanced dataset, in which the
majority of the instances come from one or two classes and the rest from the other classes. The first
group consists of non-diabetic women and is labeled Normal. The second is Pre-DM, which indicates that
the women are in a pre-diabetes state. The third is DM, which indicates that the women have diabetes.
The dataset has 12 attributes: age, screen, ap_ac, pc100g_1, pc100g_2, pc100g_3, BMIprep,
weight_before, increased_weight, NB_weight, DM_history, and birth_order. There are 152 instances: 10
are DM, 32 are Pre-DM, and 110 are Normal. In this study, missing values are replaced with the average
of the known values for the same attribute.
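The mean-imputation step described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `impute_with_mean`, the use of `None` to mark a missing value, and the toy numbers are all assumptions.

```python
def impute_with_mean(rows):
    """Replace None entries with the mean of the known values in the same column."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        known = [r[c] for r in rows if r[c] is not None]
        means.append(sum(known) / len(known))
    # Keep known values; substitute the column mean where a value is missing.
    return [[means[c] if v is None else v for c, v in enumerate(r)] for r in rows]

# Toy data with two attributes (hypothetical values, not taken from the study's dataset).
toy = [[30.0, 22.5],
       [None, 24.0],
       [34.0, None]]
filled = impute_with_mean(toy)
```

Here the missing first attribute becomes 32.0 (the mean of 30.0 and 34.0) and the missing second attribute becomes 23.25.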
Previous research
There has been much research in the field of classification. Kaieda and Abe (2004) used kernel
principal component analysis (KPCA) to develop a fuzzy classifier with ellipsoidal regions and applied
it to datasets from the UCI databank and to datasets used by other researchers, such as a blood cell
dataset. In Cao et al. (2006), rough sets were used to predict protein structural classes. In Reaz,
Hussain, and Mohd-Yasin (2006), electromyography (EMG) signals were classified with autoregressive
models, artificial neural networks (ANN), and higher-order statistical (HOS) models. In
Boriboonhirunsarn and Sunsaneevithayakul (2008), logistic regression was used to determine the
independent risk factors related to normal or abnormal results for four sub-groups of a diabetes
dataset. As these works show, a variety of classification methods are increasingly used in medical
science.
Natural and artificial immune systems
The natural immune system is a distributed pattern detection system. It consists of several
functional elements located throughout the body. The immune system defends the body through innate
and adaptive immune responses. The adaptive immune response is especially important here because it
provides abilities such as memory acquisition, diversity, and recognition. Adaptive immunity serves
as the main line of defense in the body and has three key properties. First, it responds only if an
invader is present. Second, it remembers previous contact with an invader and therefore responds
faster after the initial recognition. Third, it can differentiate between self and non-self. The
immune system is composed of lymphocytes, which are classified into two types, B-cells and T-cells.
T-cells mature in the thymus, while B-cells mature in the bone marrow. B-cells produce antibodies
that bind to invading antigens. Antigens can be thought of as foreign agents, such as viruses, that
trigger an immune response. Antigens and antibodies have a one-to-one correspondence. When foreign
antigens invade the body, B-cells produce the corresponding antibodies, which bind to the antigens.
The Artificial Immune System (AIS) imitates this mechanism. When a data item enters the system as an
antigen, the model generates the corresponding datum as the antibody and stores it in the memory
cells so that it can respond immediately when the same situation recurs. The AIS can also produce
similar antibodies through mutation in order to respond faster to similar data.
METHOD
Artificial Immune Recognition System (AIRS) classification algorithm
AIRS is a resource-limited supervised learning algorithm. The immune mechanisms it uses are resource
competition, clonal selection, maturation, mutation, and memory cell generation. The training and
test data items are viewed as antigens in the system. These antigens induce the B-cells in the system
to produce artificial recognition balls (ARBs). The ARBs compete with each other for a fixed number
of resources. ARBs with more resources get more chances to produce mutated offspring that improve the
system. The memory cells generated after all training antigens have been introduced are used to
classify the test data items. The algorithm is composed of five stages: initialization, memory cell
identification and ARB generation, competition for resources and development of a candidate memory
cell, memory cell introduction, and classification. Figure 1 shows the flowchart of the AIRS
algorithm.
Initialization
The first stage of the algorithm is data pre-processing. In this stage, all items in the dataset are
normalized so that the Euclidean distance between any two data items lies in the range [0, 1]. The
Euclidean distance is calculated as in (1):
$$\text{Euclidean\_distance}(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2} \tag{1}$$
where x and y represent feature vectors and m is the number of attributes in data.
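A minimal sketch of the normalization and distance computation follows. The paper does not state which normalization is used, so two assumptions are made here and should be read as such: each attribute is min-max scaled to [0, 1], and the distance is divided by the square root of m so that it cannot exceed 1. The function names are illustrative.

```python
import math

def min_max_normalize(rows):
    """Scale every attribute to [0, 1] using its observed minimum and maximum."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def euclidean_distance(x, y):
    """Equation (1): square root of the summed squared attribute differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def scaled_distance(x, y):
    """Assumed rescaling: divide by sqrt(m) so the result stays within [0, 1]."""
    return euclidean_distance(x, y) / math.sqrt(len(x))

rows = [[20.0, 50.0], [30.0, 70.0], [40.0, 90.0]]  # hypothetical attribute values
norm = min_max_normalize(rows)
```

With these toy rows, the two extreme items map to opposite corners of the unit square, so their scaled distance is 1, the largest value the rescaled measure can take.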
After normalization, the affinity threshold is calculated as in (2):
$$\text{affinity\_threshold} = \frac{\sum_{i=1}^{n} \sum_{j=i+1}^{n} \text{affinity}(ag_i, ag_j)}{n(n-1)/2} \tag{2}$$
where n is the number of training data items, $ag_i$ and $ag_j$ are the ith and jth training antigens
in the training vector, and $\text{affinity}(ag_i, ag_j)$ represents the Euclidean distance between
the two antigens' feature vectors.
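The affinity threshold is simply the mean distance over all n(n-1)/2 distinct pairs of training antigens. A small sketch, assuming antigens are stored as lists of already normalized attribute values (the function names and the toy antigens are illustrative):

```python
import math
from itertools import combinations

def euclidean_distance(x, y):
    """Equation (1)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def affinity_threshold(antigens):
    """Equation (2): mean pairwise distance over all n(n-1)/2 antigen pairs."""
    n = len(antigens)
    total = sum(euclidean_distance(a, b) for a, b in combinations(antigens, 2))
    return total / (n * (n - 1) / 2)

antigens = [[0.0, 0.0], [0.0, 0.6], [0.8, 0.0]]  # hypothetical normalized antigens
at = affinity_threshold(antigens)
```

For these three antigens the pairwise distances are 0.6, 0.8, and 1.0, so the threshold is their mean, 0.8.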
[Flowchart, in order: Input data; Initialization (normalization of the data); Memory cell
identification and ARB generation (cloning the most stimulated memory cells and adding them to the
population of pre-existing ARBs); Competition for resources and development of a candidate memory
cell (allocating resources according to the stimulation level and defining the candidate memory
cell); Memory cell introduction (adding the candidate memory cell to the memory cell pool if some
value is achieved); decision "Is all training data trained?" (No: return to memory cell
identification; Yes: continue); Classification (classifying test data according to the memory cells
with kNN).]
Figure 1 – Flowchart of AIRS algorithm
Memory cell identification and ARB generation
In this stage, the first step is to find the memory cell $mc_{match}$ for a given training antigen
$ag$. $mc_{match}$ is defined as $\arg\max_{mc \in MC} \text{stimulation}(ag, mc)$, where
$\text{stimulation}(x, y)$ is defined as in (3):
$$\text{stimulation}(x, y) = 1 - \text{Euclidean\_distance}(x, y) \tag{3}$$
Given this definition, the antibody $mc_{match}$ is the memory cell most stimulated by the given
antigen among the memory cells of the same category. Once $mc_{match}$ is identified, it is used to
create new ARBs, which are introduced into the population of pre-existing ARBs. The number of new
ARBs depends on the stimulation value between the memory cell and the antigen.
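The search for the most stimulated memory cell can be sketched as below. Representing a memory cell as a dictionary with `vector` and `label` keys is an assumption made for illustration, as is returning `None` when no memory cell of the antigen's class exists yet.

```python
import math

def euclidean_distance(x, y):
    """Equation (1)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def stimulation(x, y):
    """Equation (3): stimulation is one minus the distance."""
    return 1.0 - euclidean_distance(x, y)

def find_mc_match(antigen, antigen_class, memory_cells):
    """Return the most stimulated memory cell of the antigen's class, or None."""
    same_class = [mc for mc in memory_cells if mc["label"] == antigen_class]
    if not same_class:
        return None
    return max(same_class, key=lambda mc: stimulation(antigen, mc["vector"]))

memory_cells = [  # hypothetical memory cell pool
    {"vector": [0.1, 0.1], "label": "Normal"},
    {"vector": [0.8, 0.9], "label": "DM"},
    {"vector": [0.5, 0.6], "label": "DM"},
]
match = find_mc_match([0.7, 0.9], "DM", memory_cells)
```

Only the two DM cells are candidates here, and the one at [0.8, 0.9] is closer to the antigen, so it is returned.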
Competition for resources and development of a candidate memory cell
In this stage, all ARBs currently in the system are allocated resources according to their
stimulation levels. ARBs with higher stimulation receive more resources than those with lower
stimulation. If the total number of resources held by all ARBs exceeds the allowed number, the system
removes ARBs, together with their allocated resources, beginning with those holding the fewest
resources, until the total no longer exceeds the number allowed in the system. After this process,
the stimulation values of all remaining ARBs are calculated, and the maximal and minimal stimulation
values are recorded as maxStim and minStim, respectively. The stimulation level of each ARB is then
recalculated as in (4):
$$arb.stim = \begin{cases} \dfrac{arb.stim - minStim}{maxStim - minStim}, & \text{if class of } arb = \text{class of antigen} \\[1ex] 1 - \dfrac{arb.stim - minStim}{maxStim - minStim}, & \text{otherwise} \end{cases} \tag{4}$$
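The rescaling in (4) can be sketched as follows. ARBs are represented as dictionaries with `stim` and `label` keys, which is an illustrative choice; the handling of the degenerate case where all ARBs have the same stimulation is also an assumption, since the paper does not discuss it.

```python
def normalize_stimulation(arbs, antigen_class):
    """Equation (4): min-max scale stimulation; invert it for ARBs of other classes."""
    stims = [arb["stim"] for arb in arbs]
    min_stim, max_stim = min(stims), max(stims)
    span = max_stim - min_stim
    for arb in arbs:
        # Assumption: if every ARB is equally stimulated, treat the scaled value as 1.
        scaled = (arb["stim"] - min_stim) / span if span > 0 else 1.0
        arb["stim"] = scaled if arb["label"] == antigen_class else 1.0 - scaled
    return arbs

arbs = [  # hypothetical ARB pool, antigen of class "DM"
    {"stim": 0.2, "label": "DM"},
    {"stim": 0.6, "label": "DM"},
    {"stim": 1.0, "label": "Normal"},
]
normalize_stimulation(arbs, "DM")
```

After the call, the two DM ARBs hold roughly 0.0 and 0.5, while the Normal ARB, although it was the closest to the antigen, drops to 0.0 because it belongs to a different class.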
Then, (5) is used to calculate the average of these levels for each class, and each average is
checked against a given stimulation threshold. If any of the averages is lower than the threshold,
the ARBs belonging to that class are mutated and the generated clones are added to the ARB pool. This
process is repeated until the average stimulation levels of all classes are larger than the
stimulation threshold.
$$s_i = \frac{\sum_{j=1}^{|ARB_i|} arb_j.stim}{|ARB_i|}, \quad arb_j \in ARB_i \tag{5}$$
where $i = 1, 2, \ldots, nc$ (nc being the number of classes), $s = \{s_1, s_2, \ldots, s_{nc}\}$,
$|ARB_i|$ is the number of ARBs belonging to the ith class, and $arb_j.stim$ is the stimulation level
of the jth ARB of the ith class.
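The per-class average in (5) and the comparison with the stimulation threshold can be sketched as below. The ARB representation (dictionaries with `stim` and `label` keys) and the helper names are assumptions made for illustration; the mutation step itself is not shown.

```python
def class_average_stimulation(arbs):
    """Equation (5): mean stimulation level of the ARBs in each class."""
    totals, counts = {}, {}
    for arb in arbs:
        totals[arb["label"]] = totals.get(arb["label"], 0.0) + arb["stim"]
        counts[arb["label"]] = counts.get(arb["label"], 0) + 1
    return {label: totals[label] / counts[label] for label in totals}

def classes_below_threshold(arbs, stimulation_threshold):
    """Classes whose average stimulation is still under the threshold."""
    averages = class_average_stimulation(arbs)
    return sorted(label for label, s in averages.items() if s < stimulation_threshold)

arbs = [  # hypothetical ARB pool with already normalized stimulation levels
    {"stim": 1.0, "label": "DM"},
    {"stim": 0.5, "label": "DM"},
    {"stim": 1.0, "label": "Normal"},
]
averages = class_average_stimulation(arbs)
needs_mutation = classes_below_threshold(arbs, stimulation_threshold=0.9)
```

With a threshold of 0.9, the DM class (average 0.75) would go through another round of mutation, while the Normal class (average 1.0) would not.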
Memory cell introduction
After the criterion described above is met, the ARB with the highest stimulation value in the same
class as the presented training antigen is taken as the candidate memory cell, $mc_{candidate}$. If
the stimulation value of $mc_{candidate}$ with respect to the training antigen is higher than the
stimulation value of $mc_{match}$, the candidate memory cell is added to the set of memory cells. If
this test is passed, the affinity between $mc_{candidate}$ and $mc_{match}$ must also be calculated.
If the affinity between these two memory cells is lower than the product of the affinity threshold
and the affinity threshold scalar, $mc_{candidate}$ takes the place of $mc_{match}$, which is removed
from the set of memory cells.
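A sketch of the memory cell introduction step is given below. It reads the replacement step as the candidate taking the place of the matched memory cell in the pool, which is an interpretation of the text; memory cells are plain feature vectors here, and all names and numbers are illustrative.

```python
import math

def euclidean_distance(x, y):
    """Equation (1); also used as the affinity between two cells."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def introduce_memory_cell(memory_cells, mc_match, mc_candidate, antigen,
                          affinity_threshold, affinity_threshold_scalar):
    """Add the candidate if it is more stimulated than mc_match; drop mc_match if too close."""
    stim_candidate = 1.0 - euclidean_distance(antigen, mc_candidate)
    stim_match = 1.0 - euclidean_distance(antigen, mc_match)
    pool = list(memory_cells)  # work on a copy of the pool
    if stim_candidate > stim_match:
        pool.append(mc_candidate)
        cutoff = affinity_threshold * affinity_threshold_scalar
        if euclidean_distance(mc_candidate, mc_match) < cutoff:
            pool.remove(mc_match)  # the candidate takes the place of mc_match
    return pool

antigen = [0.5, 0.5]
mc_match = [0.5, 0.6]
cells = [[0.1, 0.1], mc_match]  # hypothetical memory cell pool
pool = introduce_memory_cell(cells, mc_match, [0.5, 0.52], antigen, 0.8, 0.2)
```

In this example the candidate is both more stimulated than the matched cell and closer to it than the cutoff of 0.16, so it is added and the matched cell is dropped.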
Classification
After steps 2 to 4 have been repeated for each training antigen, the developed memory cells are ready
to be used for classification. Classification follows a k-nearest neighbor approach: the class of a
datum is determined by a majority vote among the k memory cells most stimulated by it.
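The voting step can be sketched as follows, assuming memory cells are dictionaries with `vector` and `label` keys (an illustrative representation). Since stimulation is one minus the distance, the k most stimulated memory cells are the k nearest ones. Ties are resolved here in favor of the label encountered first among the nearest cells, which is a choice of this sketch rather than something the paper specifies.

```python
import math
from collections import Counter

def euclidean_distance(x, y):
    """Equation (1)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def classify(item, memory_cells, k=1):
    """Majority vote among the k most stimulated (nearest) memory cells."""
    ranked = sorted(memory_cells, key=lambda mc: euclidean_distance(item, mc["vector"]))
    votes = Counter(mc["label"] for mc in ranked[:k])
    return votes.most_common(1)[0][0]

memory_cells = [  # hypothetical memory cell pool
    {"vector": [0.1, 0.1], "label": "Normal"},
    {"vector": [0.2, 0.1], "label": "Normal"},
    {"vector": [0.9, 0.8], "label": "DM"},
]
label_k1 = classify([0.8, 0.8], memory_cells, k=1)
label_k3 = classify([0.8, 0.8], memory_cells, k=3)
```

With k = 1 the item is assigned to DM, its nearest memory cell. With k = 3 the two Normal cells outvote the single DM cell, which illustrates why a small k can matter when the minority class has few memory cells.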
Used parameters
One advantage of AIRS is that it is not necessary to try all combinations of all parameters to find
the best one, because AIRS is self-adjusting by virtue of its architecture. According to the
experience of Goodman, Boggess & Watkins (2002), a single reasonable setting of AIRS's parameters
yields a classifier whose accuracy is only a few percentage points below that obtained with the
optimal combination of parameters. Therefore, we set the parameters as in Table 1.
Table 1 – Used parameters in AIRS for Diabetes dataset
Parameter                          Value
Number of resources in system      200
Stimulation threshold              0.9
Mutation rate                      0.1
Clone rate                         10
Hyper mutation rate                2
Affinity threshold scalar          0.2
k value for k-nearest neighbour    1
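For readers who want to reproduce the setup, the parameter values of Table 1 can be collected in a small configuration object. The sketch below pairs the values with the parameter names in the order in which they are listed; the class and field names are hypothetical and do not correspond to any particular AIRS library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AirsParameters:
    """Parameter values from Table 1 (field names are illustrative)."""
    num_resources: int = 200
    stimulation_threshold: float = 0.9
    mutation_rate: float = 0.1
    clone_rate: int = 10
    hyper_mutation_rate: int = 2
    affinity_threshold_scalar: float = 0.2
    k: int = 1

params = AirsParameters()
```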
Measure for performance evaluation
Because the dataset is imbalanced, performance is measured with the classification recall, defined
as in (6):
$$\text{recall}(T_j) = \frac{\sum_{i=1}^{|T_j|} \text{assess}(t_i)}{|T_j|}, \quad t_i \in T_j \tag{6}$$
$$\text{assess}(t) = \begin{cases} 1, & \text{iff } \text{classify}(t) = t.c \\ 0, & \text{otherwise} \end{cases}$$
where $T_j$ is the set of data items belonging to class j that are to be classified (the test set),
$|T_j|$ is the number of items in $T_j$, $t \in T_j$, $t.c$ is the class of item t, and
$\text{classify}(t)$ returns the classification of t by the developed classifier. The classification
recall of each class is calculated, and the overall classification recall is the average of the three
per-class recalls.
Furthermore, for test results to be more credible, 5-fold cross-validation is used in this application.
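The recall measure can be sketched as follows: the recall of a class is the fraction of its items that are classified correctly, and the overall recall is the unweighted mean over the classes. The function names and the toy labels are illustrative, and the sketch assumes every class has at least one item in the test set.

```python
def class_recall(true_labels, predicted_labels, target):
    """Equation (6): fraction of items of class `target` that were classified correctly."""
    hits = [p == t for t, p in zip(true_labels, predicted_labels) if t == target]
    return sum(hits) / len(hits)

def overall_recall(true_labels, predicted_labels, classes):
    """Unweighted mean of the per-class recalls."""
    recalls = [class_recall(true_labels, predicted_labels, c) for c in classes]
    return sum(recalls) / len(recalls)

# Hypothetical test fold, not taken from the study's data.
truth = ["DM", "DM", "Pre-DM", "Pre-DM", "Normal", "Normal", "Normal", "Normal"]
preds = ["DM", "Normal", "Pre-DM", "Normal", "Normal", "Normal", "Normal", "DM"]
overall = overall_recall(truth, preds, ["DM", "Pre-DM", "Normal"])
```

Here the per-class recalls are 0.5 (DM), 0.5 (Pre-DM), and 0.75 (Normal), giving an overall recall of about 0.583. Because each class counts equally, a classifier that labels everything Normal cannot score well on this measure.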
RESULTS AND DISCUSSION
Table 2 compares AIRS with other classifiers. Although the classification recall of AIRS for the
Normal class is not the highest, this class is not the focus of the diabetes dataset. Doctors want to
use this dataset to identify potential diabetics. The results indicate that AIRS can help doctors
diagnose whether a pregnant woman is likely to have diabetes, because AIRS achieves the highest
recall for the DM and Pre-DM classes, and the highest overall recall, among the three algorithms in
the table.
Table 2 – Classification recall of AIRS for the diabetes classification problem, with the
classification recall obtained by other methods
Algorithm                  DM       Pre-DM   Normal   Overall
Logistic Regression        0.100    0.033    0.882    0.338
Decision Tree with C4.5    0.500    0.252    0.736    0.496
AIRS                       0.600    0.538    0.745    0.628
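The Overall values in Table 2 can be checked against the per-class values, since the overall recall is defined as the unweighted mean of the DM, Pre-DM, and Normal recalls. The short check below uses only the numbers reported in the table; the variable names are illustrative.

```python
# Per-class recalls from Table 2, in the order (DM, Pre-DM, Normal).
table2 = {
    "Logistic Regression": (0.100, 0.033, 0.882),
    "Decision Tree with C4.5": (0.500, 0.252, 0.736),
    "AIRS": (0.600, 0.538, 0.745),
}

# Overall recall: mean of the three class recalls, rounded to three decimals.
overall = {name: round(sum(recalls) / 3, 3) for name, recalls in table2.items()}
best = max(overall, key=overall.get)
```

The recomputed values agree with the reported Overall column (0.338, 0.496, and 0.628), and AIRS has the highest overall recall.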
CONCLUSION
In an era of information explosion, making decisions quickly and correctly is a difficult problem,
especially for doctors diagnosing patients' symptoms. More and more people are troubled by diabetes.
Many diseases should be diagnosed and treated early, and diabetes is no exception. We use AIRS to
determine what type of people are likely to acquire diabetes and to help doctors diagnose patients
accurately. According to the results above, AIRS achieves a higher overall recall than the
traditional methods it was compared with. It is hoped that future research will yield more
interesting results.
REFERENCES
A. Watkins, AIRS: A resource limited artificial immune classifier. Master's thesis, Mississippi State University, 2001.
D. Boriboonhirunsarn and P. Sunsaneevithayakul, “Abnormal results on a second testing and risk of gestational
diabetes in women with normal baseline glucose levels,” International Journal of Gynecology and Obstetrics, 2008,
vol. 100, issue 2, pp. 147-153.
Department of Health, Executive Yuan, R.O.C. (Taiwan). URL:
http://www.doh.gov.tw/CHT2006/DM/SEARCH_MAIN.aspx?keyword=%u7cd6%u5c3f%u75c5
Formosan Diabetes Care Foundation. URL: http://www.dmcare.org.tw/
H.H. Ho, P.C. Hsieh, C.Y. Li, H.F. Su, “A relational study of gestational diabetes and type 2 diabetes,” Taiwan Journal
of Public Health, 2006, vol. 25, no. 2, pp. 143-151.
K. Kaieda and S. Abe, “KPCA-based training of a kernel fuzzy classifier with ellipsoidal regions,” International
Journal of Approximate Reasoning, 2004, vol. 37, pp. 189-217.
M. B. I. Reaz, M. S. Hussain, and F. Mohd-Yasin, “Techniques of EMG signal analysis: detection, processing,
classification and applications,” Biological Procedures Online, 2006, vol. 8, pp. 11-35.
Taipei Veterans General Hospital Endocrinology & Metabolism. URL:
http://homepage.vghtpe.gov.tw/~meta/dm.htm#DM_basic
Y. Cao, S. Liu, L. Zhang, J. Qin, J. Wang, and K. Tang, “Prediction of protein structural class with Rough Sets,” BMC
Bioinformatics, 2006, vol. 7, Article Number 20.