International Journal of Emerging
Technology & Research
Volume 1, Issue 4, May-June, 2014
(www.ijetr.org)
ISSN (E): 2347-5900 ISSN (P): 2347-6079
Hybrid Genetic Algorithm-Intelligent Water Drops Based Feature
Selection for Breast Cancer Diagnosis
S. Sindiya¹, S. Gunasundari²
¹PG Scholar, Department of Computer Science and Engineering, Velammal Engineering College, Chennai, India
²Assistant Professor, Department of Computer Science and Engineering, Velammal Engineering College, Chennai, India
Abstract— Clinical diagnosis depends largely on a doctor's skill and experience, yet cases of incorrect diagnosis and treatment are still reported, and patients are often asked to undergo a number of tests. Breast cancer is one of the most serious problems in medical diagnosis and the second largest cause of cancer deaths among women. The objective of a breast cancer identification system is to support the radiologist in classifying a tumor as benign or malignant. This paper proposes a system that selects the best features, i.e., detects and chooses an appropriate subset of features from a larger set. The aim is to predict the occurrence of breast cancer more precisely with a reduced number of features and to increase the classification accuracy rate. A hybrid of the Genetic Algorithm (GA) and Intelligent Water Drops (IWD) is used to identify the attributes that contribute most to the diagnosis of breast cancer, which indirectly decreases the number of tests a patient must take. A Support Vector Machine (SVM) classifier is used to decide whether breast cancer is present or not.
Keywords— Breast Cancer, Genetic Algorithm (GA), Intelligent
Water Drops (IWD), Support Vector Machine (SVM).
1. Introduction
Advances in computer and database technologies have made data accumulation outpace the human capacity for data processing. A multidisciplinary joint effort from databases, machine learning and statistics is helping to turn big data into nuggets of knowledge. Most researchers and practitioners have realized that the key to using data mining tools effectively lies in data preprocessing.
Feature selection is one of the essential and commonly used methods in data preprocessing [1], [2]. A "feature", "attribute" or "variable" refers to an aspect of the data.
© Copyright reserved by IJETR
Generally, features are identified or selected before data are gathered. Features can be discrete, continuous or nominal. Normally, features are categorized as relevant (features that have an impact on the output and whose role cannot be taken over by the rest), irrelevant (features that have no impact on the output and whose values could as well be generated at random), and redundant (a feature is redundant whenever it can take the role of another feature). Computing the result from the entire feature set may not give the best outcome because of unnecessary and irrelevant features, also referred to as noisy features. To remove them, a feature selection algorithm is needed that selects a subset of important features from the parent set by discarding the irrelevant ones, yielding simpler and more accurate data. Reducing the unwanted and irrelevant features shrinks the feature set, which increases classification accuracy, reduces computational cost and lowers the risk of overfitting.
Holland [3] developed the GA as a generalization of natural evolution and provided a theoretical framework for variation under the GA. The GA is fundamentally used as a problem-solving strategy that provides an optimal solution [4]. It belongs to the family of evolution-based optimization methods, applying the GA operators selection, mutation and recombination to a population of competing problem solutions. IWD is an optimization algorithm inspired by natural water drops, which alter their environment to discover a near-optimal or optimal path to their target: the memory is the river's bed, and what the water drops change is the quantity of soil on that bed.
The rest of the paper is organized as follows. Section 2 reviews related work on GA and IWD. Section 3.2.1 discusses reducing the number of attributes using GA, and Section 3.2.2 using IWD. Section 3.3 reviews basic SVM concepts. Section 3.4 describes the hybridization of GA and IWD. Section 3.5 discusses classification performance metrics. Section 4 presents experimental results from applying the proposed method to diagnose breast cancer. Finally, Section 5 concludes the paper and outlines future work.
2. Related Works
2.1 Overview of Genetic Algorithm for Different
Applications
Akin Ozcift et al. [5] developed a GA-wrapped Bayesian network feature selection method for the diagnosis of erythemato-squamous diseases. It achieved 99.20% accuracy with a Bayesian Network (BN), and was further tested with other classifiers: Support Vector Machine (SVM), Multi-Layer Perceptron (MLP), Simple Logistics (SL) and Functional Decision Tree (FT) obtained classification accuracies of 98.36%, 97.00%, 98.36% and 97.81% respectively.
Hossein Ghayoumi Zadeh [6] examined a method for the diagnosis of breast cancer whose input is 8 parameters, designed using an artificial neural network and a genetic algorithm. The model attained a sensitivity of 50%, a specificity of 75% and an accuracy of 70%.
Nidhi Bhatia et al. [7] presented a study of various data mining methods for the prediction of heart disease; several such methods can be used in automated heart disease prediction systems. The results show that a Neural Network using 15 attributes gave the highest accuracy. A Decision Tree also performed well, with 99.62% accuracy using 15 attributes. The GA combined with a Decision Tree reduced the number of attributes from 15 to 6, with an accuracy of 99.2%.
Kerry Seitz et al. [8] examined a GA-based model for learning lung nodule similarity. To improve the effectiveness of content-based image retrieval (CBIR), an optimal combination of image features defining the relationship between images must be determined; the GA is used to optimize this combination. The accuracy of the CBIR model increased as the number of image features was reduced. The classification accuracy obtained with this model is 86.91%.
2.2 Overview of Intelligent Water Drops Algorithm for
Different Applications
Among the latest nature-inspired swarm-based optimization algorithms is the Intelligent Water Drops (IWD) algorithm. IWD mimics some of the processes that happen in nature between the water drops of a river and the soil of the river bed. The IWD algorithm was first presented in 2007, where IWDs were used to solve the Travelling Salesman Problem (TSP). It has also been effectively applied to the Multidimensional Knapsack Problem (MKP), the n-queen puzzle and robot path planning.
Yusuf Hendrawan et al. [9] implemented a study of nature-inspired feature selection methods to discover the most important set of Textural Features (TFs) for predicting the water content of cultured Sunagoke moss. The approach was compared with Neural-Simulated Annealing (N-SA), Neural-Genetic Algorithms (N-GAs) and Neural-Discrete Particle Swarm Optimization (N-DPSO); 36 features were obtained from N-IWD with a Root Mean Square Error of 1.07 *.
Chinh Hoang et al. [10] presented a model for building the best data aggregation trees for wireless sensor networks. The best total hop count achieved is 83, and the average total hop count in the data aggregation tree is 84.3.
3. System Study
3.1 Introduction
The purpose of the design phase is to plan a system that meets the requirements defined in the analysis phase; it defines the means of implementing the project solution. The design phase establishes the architecture: it starts with the requirements document delivered by the requirements phase and maps the requirements onto an architecture that defines the components, their interfaces and their behavior.
3.2 Hybrid Genetic Algorithm (GA) and Intelligent
Water Drops (IWD)
3.2.1 Genetic Algorithm (GA)
GA and IWD are both useful for feature selection. The GA is a search heuristic that mimics the process of natural selection; its basic purpose is optimization. It belongs to the larger class of evolutionary algorithms and is used in computing to find exact or approximate solutions to optimization and search problems. The procedure for the GA is:
STEP 1: Represent each possible solution as a chromosome of fixed length; choose the population size N, the crossover probability and the mutation probability.
STEP 2: Define a fitness function to measure the performance of an individual chromosome. The fitness is usually the value of the objective function of the optimization problem being solved.
STEP 3: Randomly generate an initial population of N chromosomes: x1, x2, …, xN.
STEP 4: Calculate the fitness of each individual chromosome: f(x1), f(x2), …, f(xN).
STEP 5: Create a new population by repeating the following steps until the stopping criteria are met:
• Selection — select two parent chromosomes from the population according to their fitness.
• Crossover — with the crossover probability, cross over the parents to form new offspring.
• Mutation — with the mutation probability, mutate the new offspring.
• Test — if an end condition is satisfied (the minimum solution criterion is reached, the optimal solution is found, a time or cost budget is exhausted, or a fixed number of generations is reached), stop and return the best solution in the current population.
• Loop — go to Step 4.
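The steps above can be sketched for feature selection, where a chromosome is a fixed-length bit string (1 = feature kept). This is only a minimal illustration, not the authors' implementation: the toy fitness function (rewarding a hypothetical target subset) stands in for the SVM-based fitness used later in the paper, and tournament selection stands in for fitness-proportional selection.

```python
import random

random.seed(1)

N_FEATURES = 10          # chromosome length (bits, 1 = feature selected)
POP_SIZE = 20            # population size N
P_CROSSOVER = 0.8
P_MUTATION = 0.05
TARGET = {1, 3, 7}       # hypothetical "relevant" features for the toy fitness

def fitness(chrom):
    """Toy fitness: reward selecting relevant features, penalize extras."""
    selected = {i for i, bit in enumerate(chrom) if bit}
    return len(selected & TARGET) - 0.1 * len(selected - TARGET)

def select(pop):
    """Tournament of size 3, a stand-in for fitness-based selection."""
    return max(random.sample(pop, 3), key=fitness)

def crossover(a, b):
    if random.random() < P_CROSSOVER:
        point = random.randrange(1, N_FEATURES)   # single-point crossover
        return a[:point] + b[point:]
    return a[:]

def mutate(chrom):
    return [bit ^ 1 if random.random() < P_MUTATION else bit for bit in chrom]

# STEP 3: random initial population
pop = [[random.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(POP_SIZE)]

for generation in range(30):                      # fixed number of generations
    # STEP 5: selection, crossover, mutation
    pop = [mutate(crossover(select(pop), select(pop))) for _ in range(POP_SIZE)]

best = max(pop, key=fitness)
print([i for i, bit in enumerate(best) if bit])   # indices of selected features
```

In a full feature-selection wrapper, `fitness` would train and score a classifier on the subset encoded by the chromosome.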
3.2.2 Intelligent Water Drops (IWD)
IWD is an optimization algorithm inspired by natural water drops, which change their environment to find a near-optimal or optimal path to their destination. One property of a water drop flowing in a river is its velocity, and it is assumed that each water drop can also carry an amount of soil. Soil is usually transferred from fast parts of the path to slow parts: as the fast parts get deeper through soil removal, they can hold a greater volume of water and thus may attract more water. The water drop removes an amount of soil from the river's bed, this removed soil is added to the soil of the water drop, and the speed of the water drop increases during the transition. Another property of a natural water drop is that, when facing several paths ahead, it often chooses the easier one; in other words, a water drop prefers a path with less soil over a path with more soil.
The procedure for the IWD algorithm is:
STEP 1: Initialize the static and dynamic parameters.
STEP 2: Spread the IWDs randomly on the nodes of the graph as their first visited nodes.
STEP 3: Update the visited-node list of each IWD to include the nodes just visited.
STEP 4: Repeat steps 5.1 to 5.4 for those IWDs with partial solutions.
STEP 5.1: For the IWD residing in node i, choose the next node j, which does not violate any constraints of the problem and is not in the visited-node list vc(IWD) of the IWD, with probability

p_i^IWD(j) = f(soil(i,j)) / Σ_{k ∉ vc(IWD)} f(soil(i,k))    (1)

such that

f(soil(i,j)) = 1 / (ε_s + g(soil(i,j)))    (2)

g(soil(i,j)) = soil(i,j) if min_{l ∉ vc(IWD)} soil(i,l) ≥ 0; otherwise soil(i,j) − min_{l ∉ vc(IWD)} soil(i,l)    (3)

where ε_s is a small positive number which prevents division by zero in the function f(.), and soil(i,l) represents the amount of soil on the outgoing link l of node i. If an IWD selects the link (i,j), node j is added to its set S of selected nodes.
STEP 5.2: For each IWD moving from node i to node j, update the velocity vel^IWD(t) by

vel^IWD(t+1) = vel^IWD(t) + a_v / (b_v + c_v · soil²(i,j))    (4)

where vel^IWD(t+1) is the updated velocity of the IWD, and a_v, b_v and c_v are constant parameters.
STEP 5.3: For the IWD moving on the path from node i to j, compute the soil Δsoil(i,j) that the IWD loads from the path by

Δsoil(i,j) = a_s / (b_s + c_s · time²(i,j; vel^IWD(t+1)))    (5)

time(i,j; vel^IWD(t+1)) = HUD(i,j) / vel^IWD(t+1)    (6)

where the heuristic undesirability HUD(i,j) is defined appropriately for the given problem, time(i,j; vel^IWD) is the time needed to travel from node i to node j, and a_s, b_s and c_s are constant parameters.
STEP 5.4: Update the soil soil(i,j) of the path from node i to j traversed by that IWD, and the soil soil^IWD that the IWD carries, by

soil(i,j) = (1 − ρ_n) · soil(i,j) − ρ_n · Δsoil(i,j)
soil^IWD = soil^IWD + Δsoil(i,j)    (7)

where soil^IWD is the amount of soil the IWD carries and ρ_n is a constant parameter.
STEP 6: Find the iteration-best solution T^IB from all the solutions T^IWD found by the IWDs using

T^IB = arg max_{T^IWD} q(T^IWD)    (8)

where the function q(.) gives the quality of a solution.
STEP 7: Update the soils on the paths that form the current iteration-best solution T^IB by

soil(i,j) = (1 + ρ_IWD) · soil(i,j) − ρ_IWD · (1 / (N_IB − 1)) · soil^IWD_IB,  ∀(i,j) ∈ T^IB    (9)

where N_IB is the number of nodes (selected features) in the solution T^IB and soil^IWD_IB is the soil of the IWD that found T^IB.
STEP 8: Update the total-best solution T^TB with the current iteration-best solution using

T^TB = T^TB if q(T^TB) ≥ q(T^IB); otherwise T^IB    (10)

STEP 9: The algorithm stops here with the total-best solution T^TB. The search terminates when the maximum number of iterations has been reached.
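A minimal sketch of the core IWD update rules (node-selection probability, velocity update and soil update) for a single water drop on a small fully connected graph. The static parameter values mirror one of the settings listed later in Section 4.1; the heuristic undesirability HUD and ρ_n are illustrative assumptions, not the paper's exact choices.

```python
import random

random.seed(7)

N_NODES = 5
# static parameters (one of the settings from Section 4.1 is assumed)
a_v, b_v, c_v = 1.0, 0.01, 1.0
a_s, b_s, c_s = 1.0, 0.01, 1.0
rho_n = 0.9                       # local soil-update constant (assumed)
eps = 0.0001                      # the small positive epsilon of Eq. (2)
INITIAL_SOIL, INITIAL_VEL = 10000.0, 4.0

soil = {(i, j): INITIAL_SOIL
        for i in range(N_NODES) for j in range(N_NODES) if i != j}

def hud(i, j):
    """Heuristic undesirability HUD(i, j); a toy assumption here."""
    return abs(i - j) + 1.0

def g(i, j, visited):
    m = min(soil[(i, l)] for l in range(N_NODES) if l not in visited)
    return soil[(i, j)] if m >= 0 else soil[(i, j)] - m        # Eq. (3)

def f(i, j, visited):
    return 1.0 / (eps + g(i, j, visited))                      # Eq. (2)

def next_node(i, visited):
    cand = [j for j in range(N_NODES) if j not in visited]
    total = sum(f(i, j, visited) for j in cand)
    r, acc = random.random() * total, 0.0
    for j in cand:                                             # Eq. (1): roulette wheel
        acc += f(i, j, visited)
        if acc >= r:
            return j
    return cand[-1]

# one IWD building a path over all nodes
node, visited, vel, carried = 0, {0}, INITIAL_VEL, 0.0
path = [0]
while len(visited) < N_NODES:
    j = next_node(node, visited)
    vel += a_v / (b_v + c_v * soil[(node, j)] ** 2)            # Eq. (4)
    t = hud(node, j) / vel                                     # Eq. (6)
    dsoil = a_s / (b_s + c_s * t ** 2)                         # Eq. (5)
    soil[(node, j)] = (1 - rho_n) * soil[(node, j)] - rho_n * dsoil  # Eq. (7)
    carried += dsoil
    visited.add(j)
    path.append(j)
    node = j

print(path)   # order in which the nodes (features) were selected
```

For feature selection, each node would represent a feature and HUD would encode how undesirable including that feature is.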
3.3 Support Vector Machine (SVM)
Support Vector Machines are supervised learning models with associated learning algorithms that analyze data and recognize patterns; they are used for classification and regression analysis. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary linear classifier. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other.
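To illustrate the two-class separation an SVM performs, the following is a minimal linear classifier trained with a subgradient update on the regularized hinge loss (a Pegasos-style approximation of a linear SVM). It is a sketch on synthetic data, not the kernel SVM configuration used in this paper.

```python
import random

random.seed(0)

# toy two-class data: class +1 around (2, 2), class -1 around (-2, -2)
data = ([([random.gauss(2, 0.5), random.gauss(2, 0.5)], 1) for _ in range(50)] +
        [([random.gauss(-2, 0.5), random.gauss(-2, 0.5)], -1) for _ in range(50)])

w, b = [0.0, 0.0], 0.0
lam, epochs = 0.01, 50            # regularization strength and training passes

step = 0
for _ in range(epochs):
    random.shuffle(data)
    for x, y in data:
        step += 1
        eta = 1.0 / (lam * step)  # decreasing learning rate
        margin = y * (w[0] * x[0] + w[1] * x[1] + b)
        # subgradient of the regularized hinge loss
        w = [wi * (1 - eta * lam) for wi in w]
        if margin < 1:            # point inside the margin: push it out
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
            b += eta * y

def predict(x):
    """Sign of the decision function w·x + b."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

accuracy = sum(predict(x) == y for x, y in data) / len(data)
print(accuracy)
```

On well-separated clusters like these, the learned hyperplane classifies nearly all training points correctly.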
3.4 Hybridization of GA and IWD
The process of hybridization is shown in Fig. 1.
[Step 1] Initialize all GA variables.
[Step 2] Initialize all IWD variables.
[Step 3] Find the best solution with IWD.
[Step 4] Evaluate the fitness function in the GA.
[Step 5] Perform selection and introduce the best IWD solution into the crossover pool.
[Step 6] Perform crossover in the GA.
[Step 7] Perform mutation in the GA.
[Step 8] If the GA satisfies the target condition (iteration number or target value), the reproduction procedure halts; otherwise, return to Step 3.
Fig. 1 Hybridization of GA and IWD
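The eight hybridization steps above can be summarized as a control-flow skeleton. The search operators are deliberately stubbed out: the intent is only to show where the best IWD solution is injected into the GA's crossover pool (Step 5). The fitness function and the IWD stub are simplifying assumptions standing in for the SVM-based fitness and the full IWD search.

```python
import random

random.seed(3)
N_FEATURES, POP_SIZE, MAX_ITER = 10, 12, 20
TARGET = {0, 2, 5}   # hypothetical relevant features for the stand-in fitness

def fitness(chrom):
    """Stand-in for the SVM-based fitness used in the paper."""
    s = {i for i, bit in enumerate(chrom) if bit}
    return len(s & TARGET) - 0.1 * len(s - TARGET)

def iwd_best_solution():
    """Stub for Step 3: an IWD search returning a promising feature subset."""
    chrom = [0] * N_FEATURES
    for i in random.sample(range(N_FEATURES), 4):
        chrom[i] = 1
    return chrom

def crossover(a, b):
    p = random.randrange(1, N_FEATURES)
    return a[:p] + b[p:]

def mutate(chrom, p=0.05):
    return [bit ^ 1 if random.random() < p else bit for bit in chrom]

# Steps 1-2: initialize GA population (and, implicitly, IWD variables)
pop = [[random.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(POP_SIZE)]

for _ in range(MAX_ITER):                       # Step 8: iteration bound
    iwd_sol = iwd_best_solution()               # Step 3
    pop.sort(key=fitness, reverse=True)         # Step 4: evaluate fitness
    parents = pop[:POP_SIZE // 2] + [iwd_sol]   # Step 5: inject IWD solution
    pop = [mutate(crossover(*random.sample(parents, 2)))   # Steps 6-7
           for _ in range(POP_SIZE)]

best = max(pop, key=fitness)
print([i for i, bit in enumerate(best) if bit])
```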
3.5 Classification Performance Metrics
In a classification problem, the results are labeled as positive (P) or negative (N). The possible outcomes are commonly defined in statistical learning as true positive (TP), false positive (FP), true negative (TN) and false negative (FN). These four outcomes are related to one another in a table often called the confusion matrix.
Accuracy (ACC): ACC is a widely used metric for the class discrimination ability of classifiers, evaluated as ACC = (TP + TN) / (P + N), where P is the number of positive cases and N the number of negative cases.
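The accuracy formula above can be computed directly from the confusion-matrix counts; a small self-contained example with labels matching the diagnosis classes:

```python
def confusion_counts(y_true, y_pred, positive="malignant"):
    """Count TP, FP, TN, FN for a binary label set."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn

def accuracy(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    # ACC = (TP + TN) / (P + N); P + N is the total number of cases
    return (tp + tn) / (tp + fp + tn + fn)

y_true = ["malignant", "benign", "malignant", "benign", "benign"]
y_pred = ["malignant", "benign", "benign", "benign", "malignant"]
print(accuracy(y_true, y_pred))  # 3 of 5 correct -> 0.6
```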
4. Implementation and Discussion
4.1 Dataset
For evaluating the model, the Wisconsin Diagnostic Breast Cancer (WDBC) dataset is used. Each record of this dataset is represented by 30 numerical features, computed from a digitized image of a fine needle aspirate (FNA) of a breast mass; they describe characteristics of the cell nuclei present in the image. The diagnosis of each record is "benign" or "malignant". The options used in the GA are gacreationuniform, selectionremainder, crossovertwopoint, mutationadaptfeasible, gacreationlinearfeasible, selectionuniform, selectionroulette, crossoversinglepoint and mutationuniform. The numbers of generations used are 15, 20, 30, 40 and 50. The parameters used in IWD are av, bv, cv, as, bs, cs, localsoil, globalsoil, initialsoil, initialvelocity and epsilon, with values av = 1; bv = 1, 0.01; cv = 1, 0.01; as = 1; bs = 1, 0.01; cs = 1, 0.01; localsoil = 0.3, 0.6, 0.7, 0.9; globalsoil = 0.3, 0.6, 0.7, 0.9; initialsoil = 10000, 1000; initialvelocity = 4, 5; epsilon = 0.0001, 0.0002, 0.0003. Table 1 compares accuracy rates for different options of the GA, and Table 2 for different options of IWD.
Table 1: Comparison of Accuracy Rates between Different Options of GA

Options: (a) gacreationuniform, selectionremainder, crossovertwopoint, mutationadaptfeasible; (b) gacreationuniform, selectionuniform, crossoversinglepoint, mutationuniform; (c) gacreationlinearfeasible, selectionuniform, crossoversinglepoint, mutationuniform.

Options | No. of generations | No. of features selected | Accuracy | Accuracy with 30 features (without feature selection)
(a) | 15 | 21 | 0.9545 | 0.9091
(a) | 20 | 19 | 0.9091 | 0.9545
(a) | 30 | 16 | 0.9545 | 0.8636
(a) | 40 | 19 | 1 | 0.9545
(a) | 50 | 11 | 0.9545 | 0.9545
(b) | 15 | 16 | 1 | 0.9091
(b) | 20 | 16 | 0.9091 | 1
(b) | 30 | 11 | 0.9545 | 0.8636
(b) | 40 | 15 | 0.9545 | 0.9091
(b) | 50 | 22 | 1 | 1
(c) | 15 | 18 | 0.9091 | 0.9545
(c) | 20 | 15 | 0.9091 | 0.7727
(c) | 30 | 11 | 0.8636 | 1
(c) | 40 | 16 | 0.9545 | 0.9091
(c) | 50 | 15 | 1 | 0.9091

Table 2: Comparison of Accuracy Rates between Different Options of IWD

Options: (A) av=1, bv=0.01, cv=1; as=1, bs=0.01, cs=1; localSoil=0.3; globalSoil=0.3; initialSoil=10000; initialVelocity=4; epsilon=0.0001. (B) av=1, bv=1, cv=0.01; as=1, bs=1, cs=0.01; localSoil=0.6; globalSoil=0.6; initialSoil=1000; initialVelocity=5; epsilon=0.0002. (C) av=1, bv=0.01, cv=0.01; as=1, bs=0.01, cs=0.01; localSoil=0.9; globalSoil=0.9; initialSoil=1000; initialVelocity=4; epsilon=0.0003. (D) av=1, bv=1, cv=1; as=1, bs=1, cs=1; localSoil=0.7; globalSoil=0.7; initialSoil=10000; initialVelocity=5; epsilon=0.0003.

Options | No. of generations | No. of features selected | Accuracy
(A) | 44 | 8 | 1
(A) | 16 | 10 | 1
(A) | 20 | 9 | 0.9545
(A) | 30 | 11 | 1
(A) | 40 | 7 | 0.9091
(A) | 50 | 12 | 0.9091
(B) | 44 | 12 | 0.9545
(B) | 16 | 8 | 1
(B) | 20 | 9 | 0.9091
(B) | 30 | 14 | 1
(B) | 40 | 11 | 0.9594
(B) | 50 | 7 | 1
(C) | 16 | 11 | 0.9594
(C) | 20 | 8 | 1
(C) | 30 | 9 | 0.9594
(C) | 40 | 11 | 1
(C) | 50 | 13 | 1
(C) | 44 | 10 | 0.9594
(D) | 16 | 11 | 0.9091
(D) | 20 | 9 | 1
(D) | 30 | 9 | 0.9594
(D) | 40 | 14 | 1
(D) | 50 | 8 | 1
(D) | 44 | 13 | 0.9091
References
[1] Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97:245-271, 1997.
[2] Liu. Feature Extraction, Construction and Selection: A Data Mining Perspective. Boston: Kluwer Academic Publishers, 1998; 2nd printing, 2001.
[3] Mitchell, Melanie. An Introduction to Genetic Algorithms. MIT Press, 1996.
[4] Holland. Adaptation in Natural and Artificial Systems. MIT Press, Cambridge, MA, USA, 1992.
[5] Akin Ozcift. Genetic algorithm wrapped Bayesian network feature selection applied to differential diagnosis of erythemato-squamous diseases. Elsevier, Vol. 23, Issue 1, pp. 230-237, January 2013.
[6] Hossein Ghayoumi Zadeh et al. Diagnosis of Breast Cancer using a Combination of Genetic Algorithm and Artificial Neural Network in Medical Infrared Thermal Imaging. Iranian Journal of Medical Physics, Vol. 9, No. 4, pp. 265-274, 2012.
[7] Nidhi Bhatia et al. An Analysis of Heart Disease Prediction using Different Data Mining Techniques. International Journal of Engineering Research & Technology, Vol. 1, Issue 8, October 2012.
[8] Kerry Seitz et al. Learning lung nodule similarity using a genetic algorithm. Medical Imaging 2012: Computer-Aided Diagnosis, Vol. 8315, February 2012.
[9] Yusuf Hendrawan. Neural-Intelligent Water Drops algorithm to select relevant textural features for developing precision irrigation system using machine vision. Elsevier, Vol. 77, Issue 2, pp. 214-228, July 2011.
[10] Chinh Hoang. Optimal data aggregation tree in wireless sensor networks based on intelligent water drops algorithm. IET Wireless Sensor Systems, Vol. 2, Issue 3, pp. 282-292, May 2012.
4.2 Evaluation
To calculate the accuracy rate of the proposed model, the holdout method is employed: the dataset is divided into two sets, with 70% of the data allotted to the training set and the remaining 30% to the test set.
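A minimal sketch of the 70/30 holdout split described above (shuffling before splitting is an assumption, as the paper does not state it):

```python
import random

def holdout_split(records, train_fraction=0.7, seed=42):
    """Shuffle records and split into train/test sets by the given fraction."""
    rng = random.Random(seed)
    shuffled = records[:]            # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# e.g. 100 records -> 70 for training, 30 for testing
records = list(range(100))
train, test = holdout_split(records)
print(len(train), len(test))  # 70 30
```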
5. Conclusion and Future Work
In this paper, feature selection using GA and IWD to choose the best subset of features for a breast cancer diagnosis system is proposed and implemented. GA and IWD search the problem space for potential subsets of features, and SVM is employed to evaluate the fitness value of each chromosome; at the end, the best subset of features is obtained. An additional direction of this study is the diagnosis of rare diseases using hybrid GA-IWD feature selection, raising the accuracy rate still higher.