International Journal of Emerging Technology & Research Volume 1, Issue 4, May-June, 2014 (www.ijetr.org) ISSN (E): 2347-5900 ISSN (P): 2347-6079 (Impact Factor: 0.997)

Hybrid Genetic Algorithm-Intelligent Water Drops Based Feature Selection for Breast Cancer Diagnosis

S. Sindiya¹, S. Gunasundari²
¹ PG Scholar, Department of Computer Science and Engineering, Velammal Engineering College, Chennai, India
² Assistant Professor, Department of Computer Science and Engineering, Velammal Engineering College, Chennai, India

Abstract— Clinical diagnosis commonly rests on a doctor's skill and experience, yet cases of wrong diagnosis and treatment are still reported, and patients are asked to undergo a number of tests before a diagnosis is reached. Breast cancer is one of the serious problems in medical diagnosis and the second largest cause of cancer deaths among women. The objective of a breast cancer identification system is to support the radiologist in classifying a tumor as benign or malignant. This work proposes a system that selects the best features: the task of detecting and choosing an appropriate subset of features from a larger set. The objective of this work is to predict the occurrence of breast cancer more precisely with a reduced number of features and to increase classification accuracy. A hybrid of the Genetic Algorithm (GA) and Intelligent Water Drops (IWD) is used to decide which attributes contribute most to the diagnosis of breast cancer, which indirectly decreases the number of tests a patient is required to take. A Support Vector Machine (SVM) classifier is used to classify whether breast cancer is present or not.

Keywords— Breast Cancer, Genetic Algorithm (GA), Intelligent Water Drops (IWD), Support Vector Machine (SVM).

1. Introduction

Advancements in computer and database technologies have made data accumulate at a rate that far outstrips the human capacity for data processing.
The multidisciplinary joint effort of databases, machine learning and statistics is helping to turn this flood of data into nuggets of knowledge. Most researchers and practitioners have realized that effective use of data mining tools depends on proper data preprocessing. Feature selection is one of the essential and most commonly used methods in data preprocessing [1], [2]. A "feature", "attribute" or "variable" denotes an aspect of the data. Features are generally identified or selected before data are gathered, and they can be discrete, continuous or nominal. Normally, features are categorized as relevant (they have an impact on the output and their role cannot be assumed by the rest), irrelevant (they have no impact on the output and their values might as well be generated at random), and redundant (a feature that can take the role of another feature). Computing the result from the entire feature set does not always give the best outcome, because of unnecessary and irrelevant features, also referred to as noisy features. To remove them, a feature selection algorithm is used which selects a subset of important features from the parent set, discarding the irrelevant ones for simpler and more accurate data. By reducing unwanted and irrelevant features it is possible to reduce the size of the set; this is valuable because it increases classification accuracy, reduces computational cost and lowers the risk of overfitting. Holland [4] developed GA as a generalization of natural evolution and provided a theoretical framework for variation under GA. GA is fundamentally used as a problem-solving strategy that searches for an optimal solution [3].
GA belongs to the class of evolution-based optimization methods, applying operators such as selection, mutation and recombination to a population of competing problem solutions. IWD is an optimization algorithm motivated by natural water drops, which alter their environment to discover a near-optimal or optimal path to their target: the memory is the river's bed, and what the water drops change is the quantity of soil on it.

The rest of the paper is organized as follows. Section 2 reviews the literature on GA and IWD. Section 3.2.1 discusses reducing the number of attributes using GA, and Section 3.2.2 using IWD. Section 3.3 reviews basic SVM concepts. Section 3.4 describes the hybridization of GA and IWD. Section 3.5 discusses classification performance metrics. Section 4 presents experimental results from using the proposed method to diagnose breast cancer. Finally, Section 5 concludes the paper and outlines future work.

2. Related Works

2.1 Overview of Genetic Algorithm for Different Applications

Akin Ozcift et al. [5] proposed a GA-wrapped Bayesian network feature selection for the diagnosis of erythemato-squamous diseases, which resulted in 99.20% accuracy with a Bayesian Network (BN). It was further tested with other classifiers: Support Vector Machine (SVM), Multi-Layer Perceptron (MLP), Simple Logistics (SL) and Functional Decision Tree (FT), which obtained classification accuracies of 98.36%, 97.00%, 98.36% and 97.81% respectively. Hossein Ghayoumi Zadeh [6] examined a method for the diagnosis of breast cancer.
The model takes 8 input parameters and is designed using an artificial neural network and a genetic algorithm; it attained a sensitivity of 50%, a specificity of 75% and an accuracy of 70%. Nidhi Bhatia et al. [7] presented a study of various data mining methods for the prediction of heart disease, several of which can be used in automated heart disease prediction systems. The results show that a Neural Network using 15 attributes gave the highest accuracy; a Decision Tree also performed well, with 99.62% accuracy on 15 attributes, while GA with a Decision Tree reduced the number of attributes from 15 to 6 at an accuracy of 99.2%. Kerry Seitz et al. [8] used a GA for learning lung nodule similarity. To improve the effectiveness of content-based image retrieval (CBIR), it is necessary to find an optimal combination of image features for defining the relationship between images, and the GA is used to optimize this combination. The accuracy of the CBIR model increased as the number of image features was reduced, reaching a classification accuracy of 86.91%.

2.2 Overview of Intelligent Water Drops Algorithm for Different Applications

Among the latest nature-inspired swarm-based optimization algorithms is the Intelligent Water Drops (IWD) algorithm, which mimics some of the processes that occur in nature between the water drops of a river and the soil of the river bed. The IWD algorithm was first presented in 2007, where IWDs were used to solve the Travelling Salesman Problem (TSP).
The IWD algorithm has also been applied successfully to the Multidimensional Knapsack Problem (MKP), the n-queen puzzle and robot path planning. Yusuf Hendrawan et al. [9] implemented a study of nature-inspired feature selection methods to discover the most important set of Textural Features (TFs) for predicting the water content of cultured Sunagoke moss. The approach was compared with Neural-Simulated Annealing (N-SA), Neural-Genetic Algorithms (N-GAs) and Neural-Discrete Particle Swarm Optimization (N-DPSO); 36 features obtained from N-IWD gave a Root Mean Square Error of 1.07 *. Chinh Hoang et al. [10] presented a model for building the best data aggregation trees for wireless sensor networks; the best total hop count achieved was 83, and the average total hop count in the data aggregation tree was 84.3.

3. System Study

3.1 Introduction

The purpose of the design phase is to plan a system that meets the requirements defined in the analysis phase; it defines the means of implementing the project solution. In this phase the architecture is established: the phase starts with the requirement document delivered by the requirement phase and maps the requirements onto an architecture that defines the components, their interfaces and their behavior.

3.2 Hybrid Genetic Algorithm (GA) and Intelligent Water Drops (IWD)

3.2.1 Genetic Algorithm (GA)

GA and IWD are both useful for feature selection. GA is a search heuristic that mimics the process of natural selection; its basic purpose is optimization, and it belongs to the larger class of evolutionary algorithms. A GA is a search technique used in computing to find exact or approximate solutions to optimization and search problems. The procedure for GA is:

STEP 1: Represent each possible solution as a chromosome of fixed length; choose the population size N, the crossover probability and the mutation probability.
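A common concrete choice for STEP 1 in feature selection is a binary string with one gene per feature, where a 1 keeps the feature and a 0 drops it. The sketch below illustrates this encoding; the function names and the use of Python's random module are illustrative, not taken from the paper:

```python
import random

def random_chromosome(n_features):
    """STEP 1: a candidate feature subset as a fixed-length bit string."""
    return [random.randint(0, 1) for _ in range(n_features)]

def selected_features(chromosome):
    """Decode a chromosome into the indices of the selected features."""
    return [i for i, bit in enumerate(chromosome) if bit == 1]

# The WDBC dataset used later in the paper has 30 features per record.
population = [random_chromosome(30) for _ in range(10)]  # population size N = 10
print(selected_features(population[0]))  # feature indices kept by one candidate
```

With this encoding, crossover and mutation reduce to ordinary bit-string operations, and the fitness of a chromosome can be measured as the classifier's accuracy on the decoded subset.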
STEP 2: Define a fitness function to measure the performance of an individual chromosome. The fitness is usually the value of the objective function of the optimization problem being solved.
STEP 3: Randomly generate an initial population of chromosomes of size N: x1, x2, …, xN.
STEP 4: Calculate the fitness of each individual chromosome: f(x1), f(x2), …, f(xN).
STEP 5: Create a new population by repeating the following steps until the stopping criterion is met:
Selection – select two parent chromosomes from the population according to their fitness.
Crossover – with the crossover probability, cross over the parents to form new offspring.
Mutation – with the mutation probability, mutate the new offspring.
Test – if an end condition is satisfied (an acceptable or optimal solution has been found, or a limit on generations, time or cost has been reached), stop and return the best solution in the current population.
Loop – otherwise, go to STEP 4.

3.2.2 Intelligent Water Drops (IWD)

IWD is an optimization algorithm inspired by natural water drops, which change their environment to find a near-optimal or optimal path to their destination. One property of a water drop flowing in a river is its velocity, and each water drop is assumed to carry an amount of soil. This soil is usually transferred from fast parts of the path to slow parts: as the fast parts are deepened by the removal of soil, they can hold a larger volume of water and thus may attract more water. The water drop removes an amount of soil from the river's bed and adds it to the soil it carries, and its speed increases during the transition.
Another property of a natural water drop is that, when it faces several paths ahead, it often chooses the easier one: a water drop prefers a path with less soil to a path with more soil. The procedure for IWD is:

STEP 1: Initialize the static and dynamic parameters.
STEP 2: Spread the IWDs randomly on the nodes of the graph as their first visited nodes.
STEP 3: Update the visited node list of each IWD to include the nodes just visited.
STEP 4: Repeat steps 5.1 to 5.4 for those IWDs with partial solutions.
STEP 5.1: For the IWD residing in node i, choose the next node j, which does not violate any constraint of the problem and is not in the visited node list vc(IWD) of the IWD, with probability

p(i, j) = f(soil(i, j)) / Σ_{k ∉ vc(IWD)} f(soil(i, k))    (1)

such that

f(soil(i, j)) = 1 / (εs + g(soil(i, j)))    (2)

and

g(soil(i, j)) = soil(i, j), if min_{l ∉ vc(IWD)} soil(i, l) ≥ 0; otherwise soil(i, j) − min_{l ∉ vc(IWD)} soil(i, l)    (3)

where εs is a small positive number which prevents division by zero in the function f(.), and soil(i, l) represents the amount of soil on the outgoing link l of node i. When an IWD selects a link out of node i, the node reached is added to its solution set S.

STEP 5.2: For each IWD moving from node i to node j, update the velocity vel(t) by

vel(t + 1) = vel(t) + av / (bv + cv · soil²(i, j))    (4)

where vel(t + 1) is the updated velocity of the IWD, and av, bv and cv are constant parameters.

STEP 5.3: For the IWD moving on the path from node i to j, compute the soil Δsoil(i, j) that the IWD loads from the path by

Δsoil(i, j) = as / (bs + cs · time²(i, j; vel(t + 1)))    (5)

time(i, j; vel(t + 1)) = HUD(i, j) / vel(t + 1)    (6)

where the heuristic undesirability HUD(i, j) is defined appropriately for the given problem, time(i, j; vel) represents the time needed to travel from node i to node j at the current velocity, and as, bs and cs are constant parameters.

STEP 5.4: Update the soil soil(i, j) of the path from node i to j traversed by that IWD, and the soil that the IWD carries, by

soil(i, j) = (1 − ρn) · soil(i, j) − ρn · Δsoil(i, j) and soil^IWD = soil^IWD + Δsoil(i, j)    (7)

where soil^IWD is the amount of soil that the IWD carries and ρn is a constant parameter.
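The update rules of steps 5.1 to 5.4 can be transcribed almost directly into code. The sketch below is a minimal illustration, assuming soils are stored in a dictionary keyed by links (i, j); the parameter defaults follow values listed in Section 4.1 (av = 1, bv = 0.01, cv = 1, epsilon = 0.0001), while ρn = 0.9 (one of the localsoil settings) and the function names are assumptions:

```python
EPSILON = 0.0001  # small positive number preventing division by zero (eq. 2)

def g(soil, i, j, unvisited):
    """Eq. (3): shift soils so the minimum over candidate links is non-negative."""
    m = min(soil[(i, k)] for k in unvisited)
    return soil[(i, j)] if m >= 0 else soil[(i, j)] - m

def f(soil, i, j, unvisited):
    """Eq. (2): links carrying less soil get larger values."""
    return 1.0 / (EPSILON + g(soil, i, j, unvisited))

def next_node_probabilities(soil, i, unvisited):
    """Eq. (1): probability of moving from node i to each unvisited node j."""
    total = sum(f(soil, i, k, unvisited) for k in unvisited)
    return {j: f(soil, i, j, unvisited) / total for j in unvisited}

def update_velocity(vel, soil_ij, av=1.0, bv=0.01, cv=1.0):
    """Eq. (4): the velocity gain is larger on links carrying less soil."""
    return vel + av / (bv + cv * soil_ij ** 2)

def delta_soil(hud_ij, vel, a_s=1.0, b_s=0.01, c_s=1.0):
    """Eqs. (5)-(6): soil picked up, via the travel time HUD(i, j) / velocity."""
    time_ij = hud_ij / vel
    return a_s / (b_s + c_s * time_ij ** 2)

def update_soil(soil_ij, d_soil, rho_n=0.9):
    """Eq. (7): soil is removed from the path and carried away by the IWD."""
    return (1 - rho_n) * soil_ij - rho_n * d_soil
```

Because f(.) is larger for links with less soil, the probabilities of eq. (1) realize the preference for easier paths described above.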
STEP 6: Find the iteration-best solution T_IB from all the solutions found by the IWDs using

T_IB = arg max_{T} q(T)    (8)

where the function q(.) gives the quality of the solution.

STEP 7: Update the soils on the paths that form the current iteration-best solution by

soil(i, j) = (1 + ρIWD) · soil(i, j) − ρIWD · (1 / (N_IB − 1)) · soil^IWD_IB, for all (i, j) ∈ T_IB    (9)

where N_IB is the number of nodes (selected features) in the solution T_IB, soil^IWD_IB is the soil carried by the iteration-best IWD, and ρIWD is a constant parameter.

STEP 8: Update the total-best solution T_TB with the current iteration-best solution using

T_TB = T_IB if q(T_IB) > q(T_TB); otherwise T_TB is left unchanged    (10)

STEP 9: The algorithm stops here with the total-best solution T_TB. The search terminates when the maximum number of iterations has been reached.

3.3 Support Vector Machine (SVM)

Support Vector Machines are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and analysis. The simple SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary linear classifier. Given a set of training cases, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new cases to one category or the other.

3.4 Hybridization of GA and IWD

The process of hybridization is shown in Fig. 1.

[Step 1] Initialize all GA variables.
[Step 2] Initialize all IWD variables.
[Step 3] Find the best solution in IWD.
[Step 4] Evaluate the fitness function in GA.
[Step 5] Perform selection and introduce the best solution of IWD into crossover.
[Step 6] Perform crossover in GA.
[Step 7] Perform mutation in GA.
[Step 8] If the GA termination condition (iteration number or target value) is satisfied, the reproduction procedure halts.

Fig. 1 Hybridization of GA and IWD

3.5 Classification Performance Metrics

In a classification problem, the results are labeled as positive (P) or negative (N). The possible outcomes are commonly defined in statistical learning as true positive (TP), false positive (FP), true negative (TN) and false negative (FN); these four outcomes are related to each other in a table often called the confusion matrix.

Accuracy (ACC): ACC is a widely used metric for the class discrimination ability of classifiers, evaluated as

ACC = (TP + TN) / (P + N)

where P is the number of positives, N the number of negatives, TP the true positives and TN the true negatives.

4. Implementation and Discussion

4.1 Dataset

For evaluating the model, the Wisconsin Diagnostic Breast Cancer (WDBC) dataset is used. Each record of this dataset is represented by 30 numerical features, computed from a digitized image of a fine needle aspirate (FNA) of a breast mass; they describe characteristics of the cell nuclei present in the image. The diagnosis of each record is "benign" or "malignant". The options used in GA are gacreationuniform, selectionremainder, crossovertwopoint, mutationadaptfeasible, gacreationlinearfeasible, selectionuniform, selectionroulette, crossoversinglepoint and mutationuniform. The numbers of generations used are 15, 20, 30, 40 and 50. The options used in IWD are av, bv, cv, as, bs, cs, localsoil, globalsoil, initialsoil, initialvelocity and epsilon.
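The accuracy metric of Section 3.5 is straightforward to compute from the confusion-matrix counts. In the following minimal sketch the function names and the label convention (1 = malignant, 0 = benign) are illustrative:

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def accuracy(y_true, y_pred):
    """ACC = (TP + TN) / (P + N), as defined in Section 3.5."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)

print(accuracy([1, 1, 0, 0], [1, 0, 0, 0]))  # 0.75: 3 of 4 cases correct
```

The same counts also give sensitivity and specificity, the other two metrics quoted in Section 2.1.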
The parameter values are: av = 1; bv = 1, 0.01; cv = 1, 0.01; as = 1; bs = 1, 0.01; cs = 1, 0.01; localsoil = 0.3, 0.6, 0.7, 0.9; globalsoil = 0.3, 0.6, 0.7, 0.9; initialsoil = 10000, 1000; initialvelocity = 4, 5; epsilon = 0.0001, 0.0002, 0.0003. Table 1 compares the accuracy rates for different options of GA, and Table 2 for different options of IWD.

Table 1: Comparison of Accuracy Rates between Different Options of GA

Options | Generations | Features selected | Accuracy | Accuracy with 30 features (without feature selection)

Gacreationuniform, Selectionremainder, Crossovertwopoint, Mutationadaptfeasible:
15 | 21 | 0.9545 | 0.9091
20 | 19 | 0.9091 | 0.9545
30 | 16 | 0.9545 | 0.8636
40 | 19 | 1 | 0.9545
50 | 11 | 0.9545 | 0.9545

Gacreationuniform, Selectionuniform, Crossoversinglepoint, Mutationuniform:
15 | 16 | 1 | 0.9091
20 | 16 | 0.9091 | 1
30 | 11 | 0.9545 | 0.8636
40 | 15 | 0.9545 | 0.9091
50 | 22 | 1 | 1

Gacreationlinearfeasible, Selectionuniform, Crossoversinglepoint, Mutationuniform:
15 | 18 | 0.9091 | 0.9545
20 | 15 | 0.9091 | 0.7727
30 | 11 | 0.8636 | 1
40 | 16 | 0.9545 | 0.9091
50 | 15 | 1 | 0.9091

Table 2: Comparison of Accuracy Rates between Different Options of IWD

Options | Generations | Features selected | Accuracy

av=1; bv=0.01; cv=1; as=1; bs=0.01; cs=1; localSoil=0.3; globalSoil=0.3; initialSoil=10000; initialVelocity=4; epsilon=0.0001:
44 | 8 | 1
16 | 10 | 1
20 | 9 | 0.9545
30 | 11 | 1
40 | 7 | 0.9091
50 | 12 | 0.9091
44 | 12 | 0.9545

av=1; bv=1; cv=0.01; as=1; bs=1; cs=0.01; localSoil=0.6; globalSoil=0.6; initialSoil=1000; initialVelocity=5; epsilon=0.0002:
16 | 11 | 0.9594
20 | 8 | 1
30 | 9 | 0.9594
40 | 11 | 1
50 | 13 | 1
44 | 10 | 0.9594

av=1; bv=0.01; cv=0.01; as=1; bs=0.01; cs=0.01; localSoil=0.9; globalSoil=0.9; initialSoil=1000; initialVelocity=4; epsilon=0.0003:
16 | 11 | 0.9091
20 | 9 | 1
30 | 9 | 0.9594
40 | 14 | 1
50 | 8 | 1
44 | 13 | 0.9091

av=1; bv=1; cv=1; as=1; bs=1; cs=1; localSoil=0.7; globalSoil=0.7; initialSoil=10000; initialVelocity=5; epsilon=0.0003:
16 | 8 | 1
20 | 9 | 0.9091
30 | 14 | 1
40 | 11 | 0.9594
50 | 7 | 1

4.2 Evaluation

For calculating the accuracy rate of the proposed model, the holdout method is employed: the dataset is divided into two sets, with 70% of the data allotted to the training set and the remaining 30% to the test set.

5. Conclusion and Future Work

In this paper, feature selection using GA and IWD for selecting the best subset of features for a breast cancer diagnosis system is proposed and implemented. GA and IWD search the problem space for potential subsets of features, and SVM is employed to evaluate the fitness value of each chromosome; at the end, the best subset of features is obtained. A further contribution of this study is the prospect of diagnosing rare diseases through hybrid GA-IWD feature selection and of raising the accuracy rate still higher.

References

[1] Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97:245-271, 1997.
[2] Liu. Feature Extraction, Construction and Selection: A Data Mining Perspective. Boston: Kluwer Academic Publishers, 1998; 2nd printing, 2001.
[3] Mitchell, Melanie. An Introduction to Genetic Algorithms. MIT Press, 1996.
[4] Holland. Adaptation in Natural and Artificial Systems. MIT Press, Cambridge, MA, USA, 1992.
[5] Akin Ozcift. Genetic algorithm wrapped Bayesian network feature selection applied to differential diagnosis of erythemato-squamous diseases. Elsevier, Vol. 23, Issue 1, pp. 230-237, January 2013.
[6] Hossein Ghayoumi Zadeh et al. Diagnosis of Breast Cancer using a Combination of Genetic Algorithm and Artificial Neural Network in Medical Infrared Thermal Imaging. Iranian Journal of Medical Physics, Vol. 9, No. 4, pp. 265-274, 2012.
[7] Nidhi Bhatia et al. An Analysis of Heart Disease Prediction using Different Data Mining Techniques. International Journal of Engineering Research & Technology, Vol. 1, Issue 8, October 2012.
[8] Kerry Seitz et al. Learning lung nodule similarity using a genetic algorithm. Medical Imaging 2012: Computer-Aided Diagnosis, Vol. 8315, Feb 2012.
[9] Yusuf Hendrawan. Neural-Intelligent Water Drops algorithm to select relevant textural features for developing precision irrigation system using machine vision. Elsevier, Vol. 77, Issue 2, pp. 214-228, July 2011.
[10] Chinh Hoang. Optimal data aggregation tree in wireless sensor networks based on intelligent water drops algorithm. IET Wireless Sensor Systems, Vol. 2, Issue 3, pp. 282-292, May 2012.