International Journal of Engineering Trends and Technology - Volume 3 Issue 5 - 2012

K Best Cluster Based Neighbour Classification Using Improved DDG Approaches

Basheera Shaik #1, G. Vijayadeep #2
#1 M.Tech (CSE), Gudlavalleru Engineering College, Gudlavalleru
#2 Associate Professor, Gudlavalleru Engineering College, Gudlavalleru

ABSTRACT: Data classification has attracted considerable research in computational statistics and data mining because of its wide range of applications. The k Nearest Neighbour (KNN) classifier is widely used on classification datasets, including more difficult settings such as large-scale data or data with many categories. However, KNN can run into trouble when the training samples are unevenly distributed: uneven density of the training data can decrease the precision of classification. Clustering can be applied as an initial step within each class to discover the in-class groupings in the dataset. Different clustering techniques use different similarity measures, each with its own strengths and weaknesses, so combining three cluster measures can exploit the effectiveness of each of them and avoid the problems of relying on a single measure. To address this problem, we present K Best Cluster Based Neighbour (KB-CB-N), a classification technique based on the integration of three different similarity measures for cluster-based classification. The basic principle is to apply unsupervised learning to the instances of each class in the dataset and then use the output as the input to the classification algorithm, which finds the K best neighbouring sub-clusters from the density, gravity and distance points of view. Experimental results show that the proposed algorithm gives better performance than the existing KNN approach.

I INTRODUCTION

Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. These tools can include statistical models, mathematical algorithms and machine learning methods. Consequently, data mining consists of more than collecting and managing data; it also includes analysis and prediction [3,5]. Most classification algorithms perform batch classification, that is, they use the dataset directly in memory. More intricate methods often build large data structures, which gives them a greater memory footprint. This larger footprint prevents them from being applied to many of the larger datasets. Even when such datasets can be used, the complexity of the algorithm can make the classification task take an inordinately long time. There are remedies for these problems, such as adding more memory or waiting longer for experiments to finish, but both have a cost in money and time, and neither ultimately solves the problem if the dataset cannot fit in memory once memory is at its maximum. A feature of most large datasets is the large number of data points that contain similar information, so one option is to remove redundant information from the dataset so that the classifier can focus on data points that represent a larger group of points. This allows the building of a classifier that correctly models the relationship expressed by the data in the dataset. Random sampling is a simple way to remove redundancy and can be very effective on uniformly distributed data [6].

The KDD process consists of the following steps:
- Developing an understanding of the application domain, the relevant prior knowledge, and the goals of the end user.
- Setting up a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is going to be performed.
- Data cleaning and preprocessing: removal of noise or outliers, collecting the information required to model or account for noise, strategies for handling missing data fields, and accounting for time-sequence information and known changes [4].
- Data reduction and projection: finding useful features to represent the data depending on the goal of the task, and using dimensionality reduction or transformation techniques to reduce the effective number of variables under consideration or to find invariant representations of the data.
- Choosing the data mining task: deciding whether the goal of the KDD process is classification, regression, clustering, etc.
- Choosing the data mining algorithm(s): selecting the method(s) to be used for searching for patterns in the data, deciding which models and parameters may be appropriate, and matching a particular data mining method to the overall criteria of the KDD process.
- Data mining: searching for patterns of interest in a particular representational form, or a set of such representations, such as classification rules or trees, regression, clustering, and so forth.
- Interpreting the mined patterns.
- Consolidating the discovered knowledge.

Classification techniques have attracted the interest of researchers because of the significance of their applications. Various methods, including decision trees, rule-based methods and neural networks, are employed for classification problems. KNN is a simple yet effective classification method. The basic idea is to find the K nearest instances in the training sample in order to classify an unlabelled data instance. KNN has also been chosen by the data mining community as one of the top 10 data mining algorithms. However, there are a number of problems that decrease the overall performance of KNN. One problem that has a clear negative impact on the classification performance of KNN is the use of the standard Euclidean distance for finding the nearest neighbours. In our approach, the classification decision for unlabelled samples relies on sub-clusters rather than individual instances. This directly helps to enhance the classification performance and to reduce the effect of noisy data; moreover, it helps discover hidden patterns and categories within individual classes. The second contribution lies in the development of the similarity measure in KNN. While traditional KNN classification techniques typically employ Euclidean distance to assess pattern similarity and choose the nearest neighbour, other measures can be utilised to further improve the accuracy and to identify the best neighbour among the sub-clusters of all labelled classes rather than only the nearest one. We call our novel technique K Best Cluster Based Neighbour (KB-CB-N) [1].

II BACKGROUND AND RELATED WORK

Many techniques have been applied for classification, including decision trees, neural networks (NN), support vector machines (SVM), K nearest neighbour (KNN) and many others. K nearest neighbour has been widely used as an efficient classification model; however, it has many shortcomings. A minimal sketch of this baseline is given below.
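The following is a minimal sketch of the standard KNN baseline referred to above, using plain Euclidean distance and majority voting. The function name, the toy data and the use of NumPy are illustrative assumptions and are not taken from the paper.

# Baseline KNN with Euclidean distance and majority voting (illustrative sketch).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Euclidean distance from the query point to every training instance.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training instances.
    nearest = np.argsort(dists)[:k]
    # Majority vote among the k nearest labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy two-class example.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.3, 4.7]])
y_train = np.array(["bad", "bad", "good", "good"])
print(knn_predict(X_train, y_train, np.array([5.1, 5.0]), k=3))

Because every neighbour is an individual instance and the only criterion is raw Euclidean distance, this baseline is sensitive to uneven class density, which is exactly the weakness that the cluster-based approach below targets.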
Many methods have been developed to improve the performance of KNN, including Weight Adjusted K-Nearest-Neighbour (WAKNN), Dynamic K-Nearest-Neighbour (DKNN), K-Nearest-Neighbour with Distance Weighting (KNNDW) [2,5], and K-Nearest-Neighbour Naive Bayes (KNNNB). The main contributions of these techniques are improvements to the distance similarity function, the selection of the neighbourhood size, and enhancements to the voting system. KNN is considered an Instance Based Learning (IBL) method, which predicts the label of a test sample by voting among K individual training instances. Alternatively, some techniques have been developed that replace the individual training samples with a set of clusters, as in [9], in an effort to improve the classification performance. However, the similarity measure applied in that class-based clustering algorithm is only distance; it does not consider combining other similarity measures such as density and gravity. The blending of density, gravity and distance similarity measures was first introduced in earlier work [2], but specifically for clustering purposes.

Problems with the existing system:
- The existing KNN approach takes user-defined k values in order to obtain the k nearest data objects.
- The existing algorithm does not give effective results if the data contains different data types with missing values.
- The existing KNN approach does not apply a clustering algorithm before classification.
- Existing clustering approaches do not use the three similarity measures: distance-, density- and gravity-based measures.
- Existing algorithms do not classify well when the data is unevenly distributed.

3. PROPOSED FRAMEWORK

Data Preprocessing: Incomplete, noisy and inconsistent data are commonplace properties of large real-world databases and data warehouses. Incomplete data can occur for a number of reasons. Attributes of interest may not always be available. Other data may not be included simply because they were not considered important at the time of entry. Relevant data may not be recorded because of a misunderstanding or because of equipment malfunctions. Data that were inconsistent with other recorded data may have been deleted [8-10]. Furthermore, the recording of the history of, or modifications to, the data may have been overlooked. Missing data, particularly for tuples with missing values for some attributes, may have to be inferred. There are many possible causes of noisy data (data with incorrect attribute values). The data collection instruments used may be faulty. There may have been human or computer errors during routine data entry. Errors in data transmission can also occur. There may be technology limitations, for example a limited buffer size for coordinating synchronized data transfer and consumption. Incorrect data may also result from inconsistencies in the naming conventions or data codes used, or from inconsistent formats for input fields such as dates. Duplicate tuples also require data cleaning. Data cleaning (or data cleansing) routines attempt to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies; a small preprocessing sketch is given below.
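As an illustration of the cleaning step described above, the sketch below fills missing numeric values with the column mean and missing nominal values with the most frequent category. The column names, the toy data and the use of pandas and scikit-learn are assumptions made for illustration; the paper does not prescribe a particular tool.

# Illustrative preprocessing: impute missing values before clustering and classification.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical labour-negotiations-style records with missing entries.
df = pd.DataFrame({
    "wage-increase-first-year": [2.0, np.nan, 4.5, 3.0],
    "working-hours": [38, 35, np.nan, 40],
    "pension": ["none", np.nan, "empl_contr", "none"],
})

num_cols = ["wage-increase-first-year", "working-hours"]
cat_cols = ["pension"]

# Mean imputation for numeric attributes, mode imputation for nominal ones.
df[num_cols] = SimpleImputer(strategy="mean").fit_transform(df[num_cols])
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])
print(df)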
The KB-CB-N classification algorithm is composed of two principal phases: clustering and prediction. The contributions of the proposed algorithm can be summarized as follows:
1) The notion of cluster-based classification greatly improves the accuracy of the prediction decision, since the classification decision is based on a set of instances instead of individual samples.
2) The clustering technique applied within each class makes a significant contribution to finding hidden patterns and related features among the class objects, which consequently leads to higher classification performance.
3) Applying various similarity measures in the prediction phase gives more accurate predictions by considering not only the distance metric, but also the distribution and size of the candidate class.
4) Voting among the different candidates is done by a weighted voting system which considers the rank of the candidates under the various similarity measures, in addition to the count and size of the candidate sub-clusters.

The two phases of the proposed algorithm are described separately as follows:
1) Clustering phase: In the first phase, the EM clustering technique is applied to the members of each class. The number of sub-clusters produced in each class, and the size of each, depend entirely on how distinguishable the data objects within each labelled class are. The cluster-based classification technique is very efficient at detecting hidden patterns, especially when there are clearly distinguishable features within each class.
2) Prediction phase: The second phase of the novel technique is the classification of the data objects driven by the different sets of sub-clusters produced in the first phase, each set representing a different class. In this step, the classification process is applied using a mixture of three different similarity measures (distance, density and gravity). Applying several measures for the classification purpose, rather than using only a single one, is highly significant for producing more effective classification results.

Similarity measurements among sub-clusters: KB-CB-N uses a combination of three different similarity measures for the classification purpose. For each measure, a set of candidate sub-clusters is chosen and ranked according to that measure. The ranked sub-clusters from each measure are then merged together and re-ranked according to their standing among the different measures [1]. A sketch of the per-class clustering phase is given below, followed by the pseudocode for the similarity measures and the candidate selection.
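The clustering phase can be sketched as follows, assuming a scikit-learn GaussianMixture as the EM implementation and a simple BIC heuristic for choosing the number of sub-clusters per class. The helper name cluster_each_class and the component-selection rule are illustrative assumptions, not part of the published algorithm.

# Illustrative clustering phase: run EM (Gaussian mixture) separately inside each class.
import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_each_class(X, y, max_components=5, random_state=0):
    # Returns {class_label: (fitted GaussianMixture, indices of that class's rows)}.
    sub_clusters = {}
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        Xc = X[idx]
        # Pick the number of sub-clusters by BIC (a simple illustrative heuristic).
        best_gm, best_bic = None, np.inf
        for k in range(1, min(max_components, len(Xc)) + 1):
            gm = GaussianMixture(n_components=k, random_state=random_state).fit(Xc)
            bic = gm.bic(Xc)
            if bic < best_bic:
                best_gm, best_bic = gm, bic
        sub_clusters[label] = (best_gm, idx)
    return sub_clusters

Each fitted mixture component then plays the role of a sub-cluster whose centre, size and spread feed the distance, density and gravity measures in the pseudocode that follows.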
GetDist(DataObj, SubClusters)
  for i = 1 to n SubClusters do
    Calculate the Euclidean distance ED between DataObj and SubCluster_i
    Assign ED to DistanceArray[i]
  end for

GetDensity(DataObj, SubClusters)
  for i = 1 to n SubClusters do
    Calculate the current density CurrDens of SubCluster_i
    Add DataObj to SubCluster_i
    Calculate the expected density ExpDens of SubCluster_i
    Calculate DensityGain = ExpDens - CurrDens
    Assign DensityGain to DensityArray[i]
    Remove DataObj from SubCluster_i
  end for

GetGravity(DataObj, SubClusters)
  for i = 1 to n SubClusters do
    Calculate the distance Dis between DataObj and the centre of SubCluster_i
    Calculate the gravitational force Fg of SubCluster_i
    Add Fg to ModGravityArr
  end for

BAGGING(K candidate sub-clusters)
  for i = 1 to n Classes do
    Count the J sub-clusters among the K candidates assigned to class_i
    Calculate the average rank avgRank_j of the J sub-clusters
    Calculate the weighted rank WRank = avgRank_j / J
    if WRank <= tempWeight then
      BestCandidate BCand = i
      tempWeight = WRank
    end if
  end for
  Return BestCandidate BCand
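The pseudocode above can be mapped onto roughly the following Python sketch, in which each sub-cluster is represented simply by the points assigned to it. The concrete density and gravity formulas used here (points per unit of mean radius, and a Newton-style size-over-squared-distance attraction), together with the function names, are illustrative assumptions; the paper does not give closed-form definitions in this section.

# Illustrative implementation of the three similarity measures and the weighted-rank vote.
import numpy as np

def euclid_dist(obj, cluster_points):
    # Distance from the query object to the sub-cluster centre.
    return np.linalg.norm(obj - cluster_points.mean(axis=0))

def density_gain(obj, cluster_points):
    # Density approximated as the number of points per unit of mean radius around the centre.
    def density(points):
        centre = points.mean(axis=0)
        return len(points) / (np.linalg.norm(points - centre, axis=1).mean() + 1e-9)
    return density(np.vstack([cluster_points, obj])) - density(cluster_points)

def gravity(obj, cluster_points):
    # Newton-style attraction: sub-cluster size divided by squared distance to its centre.
    return len(cluster_points) / (euclid_dist(obj, cluster_points) + 1e-9) ** 2

def classify(obj, sub_clusters):
    # sub_clusters: list of (class_label, points ndarray); returns the predicted label.
    dist_order = np.argsort([euclid_dist(obj, pts) for _, pts in sub_clusters])    # ascending
    dens_order = np.argsort([-density_gain(obj, pts) for _, pts in sub_clusters])  # descending
    grav_order = np.argsort([-gravity(obj, pts) for _, pts in sub_clusters])       # descending
    rank = np.zeros(len(sub_clusters))
    for order in (dist_order, dens_order, grav_order):
        for position, idx in enumerate(order):
            rank[idx] += position                      # lower accumulated rank = better
    best_label, best_weight = None, np.inf
    for label in {lbl for lbl, _ in sub_clusters}:
        members = [i for i, (lbl, _) in enumerate(sub_clusters) if lbl == label]
        w = rank[members].mean() / len(members)        # weighted rank, as in BAGGING above
        if w <= best_weight:
            best_label, best_weight = label, w
    return best_label

# Toy usage with one sub-cluster per class.
subs = [("bad", np.array([[1.0, 1.0], [1.2, 0.9]])),
        ("good", np.array([[5.0, 5.0], [5.2, 4.8], [4.9, 5.1]]))]
print(classify(np.array([5.0, 4.9]), subs))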
5. EXPERIMENTAL RESULTS

All experiments were performed on a machine with an Intel(R) Core(TM)2 CPU at 2.13 GHz and 2 GB of RAM, running Microsoft Windows XP Professional (SP2).

KBCBN APPROACH (nearest neighbour(s) for classification)

Time taken to build model: 0.3 seconds
Time taken to test model on training data: 0.13 seconds

=== Error on training data ===
Correctly Classified Instances        57      100 %
Incorrectly Classified Instances       0        0 %
Kappa statistic                        1
Mean absolute error                    0.0169
Root mean squared error                0.0169
Relative absolute error                3.7085 %
Root relative squared error            3.5513 %
Coverage of cases (0.95 level)       100 %
Mean rel. region size (0.95 level)    50 %
Total Number of Instances             57

=== Confusion Matrix ===
  a  b   <-- classified as
 16  4 |  a = bad
  6 31 |  b = good

Improved KBCBN

@attribute duration numeric
@attribute wage-increase-first-year numeric
@attribute wage-increase-second-year numeric
@attribute wage-increase-third-year numeric
@attribute cost-of-living-adjustment {none,tcf,tc}
@attribute working-hours numeric
@attribute pension {none,ret_allw,empl_contr}
@attribute standby-pay numeric
@attribute shift-differential numeric
@attribute education-allowance {yes,no}
@attribute statutory-holidays numeric
@attribute vacation {below_average,average,generous}
@attribute longterm-disability-assistance {yes,no}
@attribute contribution-to-dental-plan {none,half,full}
@attribute bereavement-assistance {yes,no}
@attribute contribution-to-health-plan {none,half,full}
@attribute class {bad,good}
@attribute cluster {cluster1,cluster2,cluster3,cluster4,cluster5}
@data

Classifier Model
Time taken to build model: 0.31 seconds
Time taken to test model on training data: 0.29 seconds

=== Error on training data ===
Correctly Classified Instances        57      100 %
Incorrectly Classified Instances       0        0 %
Kappa statistic                        1
Mean absolute error                    0
Root mean squared error                0
Relative absolute error                0 %
Root relative squared error            0 %
Coverage of cases (0.95 level)       100 %
Mean rel. region size (0.95 level)    50 %
Total Number of Instances             57

=== Confusion Matrix ===
  a  b   <-- classified as
 20  0 |  a = bad
  0 37 |  b = good

=== Stratified cross-validation ===
Correctly Classified Instances        47       82.4561 %
Incorrectly Classified Instances      10       17.5439 %
Kappa statistic                        0.6235
Mean absolute error                    0.1876
Root mean squared error                0.4113
Relative absolute error               41.0144 %
Root relative squared error           86.1487 %
Coverage of cases (0.95 level)        82.4561 %
Mean rel. region size (0.95 level)    50 %
Total Number of Instances             57

=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               1        0        1          1       1          1         bad
               1        0        1          1       1          1         good
Weighted Avg.  1        0        1          1       1          1

Time taken to build model: 0.14 seconds
Time taken to test model on training data: 0.03 seconds

=== Error on training data ===
Correctly Classified Instances        53       92.9825 %
Incorrectly Classified Instances       4        7.0175 %
Kappa statistic                        0.8423
Mean absolute error                    0.1386
Root mean squared error                0.2459
Relative absolute error               30.3197 %
Root relative squared error           51.5168 %
Coverage of cases (0.95 level)        98.2456 %
Mean rel. region size (0.95 level)    76.3158 %
Total Number of Instances             57

=== Confusion Matrix ===
  a  b   <-- classified as
 20  0 |  a = bad
  0 37 |  b = good

=== Stratified cross-validation ===
Correctly Classified Instances        53       92.9825 %
Incorrectly Classified Instances       4        7.0175 %
Kappa statistic                        0.8527
Mean absolute error                    0.0755
Root mean squared error                0.257
Relative absolute error               16.4975 %
Root relative squared error           53.8243 %
Coverage of cases (0.95 level)        94.7368 %
Mean rel. region size (0.95 level)    51.7544 %
Total Number of Instances             57

=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               1        0.108    0.833      1       0.909      0.974     bad
               0.892    0        1          0.892   0.943      0.974     good
Weighted Avg.  0.93     0.038    0.942      0.93    0.931      0.974

=== Confusion Matrix ===
  a  b   <-- classified as
 17  3 |  a = bad
  1 36 |  b = good

=== Stratified cross-validation ===
Correctly Classified Instances        45       78.9474 %
Incorrectly Classified Instances      12       21.0526 %
Kappa statistic                        0.5378
Mean absolute error                    0.2891
Root mean squared error                0.4433
Relative absolute error               63.2058 %
Root relative squared error           92.8518 %
Coverage of cases (0.95 level)        89.4737 %
Mean rel. region size (0.95 level)    84.2105 %
Total Number of Instances             57

=== Confusion Matrix ===
  a  b   <-- classified as
 20  0 |  a = bad
  4 33 |  b = good

=== Confusion Matrix ===
  a  b   <-- classified as
 14  6 |  a = bad
  6 31 |  b = good

REPTree
============
wage-increase-first-year < 2.9
|   working-hours < 36 : good (2.18/1) [1/0]
|   working-hours >= 36 : bad (9.82/0.82) [4.32/0.32]
wage-increase-first-year >= 2.9
|   statutory-holidays < 10.5
|   |   wage-increase-first-year < 4.25 : bad (3.5/0.5) [1/0]
|   |   wage-increase-first-year >= 4.25 : good (3/0) [2.25/1]
|   statutory-holidays >= 10.5 : good (19.5/0) [10.43/1]

Size of the tree: 9

Improved REPTree

(The data uses the same augmented attribute list as given above for the Improved KBCBN experiment, including the additional cluster attribute.)

@data

Classifier Model
REPTree
============
wage-increase-first-year < 2.9 : bad (12/2) [5.32/1.32]
wage-increase-first-year >= 2.9
|   cluster = cluster1 : good (13/0) [5/0]
|   cluster = cluster2 : bad (0/0) [1/0]
|   cluster = cluster3 : bad (1/0) [0/0]
|   cluster = cluster4 : good (8/0) [3/0]
|   cluster = cluster5
|   |   education-allowance = yes : good (2/0) [1.84/0.5]
|   |   education-allowance = no : bad (2/0) [2.84/1.34]

Size of the tree: 10

Time taken to build model: 0.28 seconds
Time taken to test model on training data: 0.04 seconds

=== Error on training data ===
Correctly Classified Instances        52       91.2281 %
Incorrectly Classified Instances       5        8.7719 %
Kappa statistic                        0.8138
Mean absolute error                    0.1434
Root mean squared error                0.2558
Relative absolute error               31.3686 %
Root relative squared error           53.6042 %
Coverage of cases (0.95 level)       100 %
Mean rel. region size (0.95 level)    72.807 %
Total Number of Instances             57

=== Confusion Matrix ===
  a  b   <-- classified as
 19  1 |  a = bad
  4 33 |  b = good

=== Stratified cross-validation ===
Correctly Classified Instances        49       85.9649 %
Incorrectly Classified Instances       8       14.0351 %
Kappa statistic                        0.6846
Mean absolute error                    0.2091
Root mean squared error                0.3559
Relative absolute error               45.72 %
Root relative squared error           74.5314 %
Coverage of cases (0.95 level)        96.4912 %
Mean rel. region size (0.95 level)    79.8246 %
Total Number of Instances             57

=== Confusion Matrix ===
  a  b   <-- classified as
 15  5 |  a = bad
  3 34 |  b = good

6. CONCLUSION AND FUTURE WORK

In this paper, we have proposed, developed and evaluated a novel cluster-based K best neighbour classification method based on three different similarity measures, namely ModDistance, ModDensity and ModGravity. Applying this unsupervised learning step to the labelled classes before applying supervised learning improves the detection of hidden features inside each class.
Experimental results show that the improved KB-CB-N and the improved REPTree achieve better classification accuracy than several efficient classification methods; the new KB-CB-N uses the clustering results to perform the classification.

REFERENCES:
[1] Z. S. Abdallah and M. M. Gaber, "KB-CB-N Classification: Towards Unsupervised Approach for Supervised Learning," IEEE, 2011.
[2] B. E. Boser, I. Guyon, and V. Vapnik, "A training algorithm for optimal margin classifiers," in Computational Learning Theory, 1992, pp. 144. Available: http://www.svms.org/training/BOGV92.pdf
[3] L. Jiang, Z. Cai, D. Wang, and S. Jiang, "Survey of improving k-nearest-neighbor for classification," in FSKD (1), J. Lei, Ed. IEEE Computer Society, 2007, pp. 679–683. [Online]. Available: http://dblp.uni-trier.de/db/conf/fskd/fskd2007-1.html#JiangCWJ07
[4] E.-H. Han, G. Karypis, and V. Kumar, "Text categorization using weight adjusted k-nearest neighbor classification," in PAKDD, ser. Lecture Notes in Computer Science, D. W.-L. Cheung, G. J. Williams, and Q. Li, Eds., vol. 2035. Springer, 2001, pp. 53–65. [Online]. Available: http://dblp.uni-trier.de/db/conf/pakdd/pakdd2001.html#
[4] H. Wang, D. A. Bell, and I. Düntsch, "A density based approach to classification," in SAC. ACM, 2003, pp. 470–474. [Online]. Available: http://dblp.uni-trier.de/db/conf/sac/sac2003.html#WangBD03
[5] T. G. Dietterich, "Machine Learning Research: Four Current Directions," AI Magazine, vol. 18, no. 4, pp. 97–136, 1997.
[6] P. Domingos, "MetaCost: A general method for making classifiers cost sensitive," in Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, 1999, pp. 155–164.
[7] A. Estabrooks, "A Combination Scheme for Inductive Learning from Imbalanced Data Sets," MCS Thesis, Faculty of Computer Science, Dalhousie University, 2000.
[8] J. Hartigan and M. Wong, "Algorithm AS136: A k-means clustering algorithm," Applied Statistics, vol. 28, pp. 100–108, 1979.
[9] J. A. Hartigan, Clustering Algorithms. New York: John Wiley and Sons, 1975.
[10] J. DeRisi, L. Penland, P. O. Brown, M. L. Bittner, P. S. Meltzer, M. Ray, Y. Chen, Y. A. Su, and J. M. Trent, "Use of a cDNA microarray to analyse gene expression patterns in human cancer," Nature Genetics.