
International Journal of Computer Engineering & Technology (IJCET), Volume 10, Issue 01, January-February 2019, pp. 253-261, Article ID: IJCET_10_01_026
Available online at http://www.iaeme.com/ijcet/issues.asp?JType=IJCET&VType=10&IType=1
Journal Impact Factor (2016): 9.3590 (Calculated by GISI), www.jifactor.com
ISSN Print: 0976-6367, ISSN Online: 0976-6375
© IAEME Publication

MULTIPLE KERNEL FUZZY CLUSTERING FOR UNCERTAIN DATA CLASSIFICATION

Nijaguna GS, Research Scholar, VTU Belagaum
Dr. Thippeswamy K, Professor & Head, Dept. of CS&E, VTU PG Centre, Regional Office, Mysuru

ABSTRACT: Traditional decision tree classifiers work with data whose values are known and precise. We extend such classifiers to handle data with uncertain information. Value uncertainty arises in many applications during the data collection process. Example sources of uncertainty include measurement/quantization errors, data staleness, and multiple repeated measurements. With uncertainty, the value of a data item is often represented not by one single value but by multiple values forming a probability distribution. Instead of abstracting uncertain data by statistical derivatives (such as mean and median), we find that the accuracy of a decision tree classifier can be much improved if the "complete information" of a data item is utilized. Since processing pdfs is computationally more costly than processing single values (e.g., averages), decision tree construction on uncertain data is more CPU demanding than on certain data. To tackle this problem, we propose a series of pruning techniques that can greatly improve construction efficiency.

Key words: Probability Distribution.

Cite this Article: Nijaguna GS and Dr.
Thippeswamy K, Multiple Kernel Fuzzy Clustering For Uncertain Data Classification, International Journal of Computer Engineering and Technology, 10(01), 2019, pp. 253-261.
http://www.iaeme.com/IJCET/issues.asp?JType=IJCET&VType=10&IType=1

1. INTRODUCTION

A. Data Mining: Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems.

http://www.iaeme.com/IJCET/index.asp 253 [email protected]

Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Most companies already collect and refine massive quantities of data. The process of applying a model to new data is known as scoring. Many forms of data mining are predictive. For example, a model might predict income based on education and other demographic factors. Predictions have an associated probability. Prediction probabilities are also known as confidence. Other forms of data mining identify natural groupings in the data. For example, a model might identify the segment of the population that has an income within a specified range, that has a good driving record, and that leases a new car on a yearly basis. Data mining can derive actionable information from large volumes of data. For example, a town planner might use a model that predicts income based on demographics to develop a plan for low-income housing.

B.
The Scope of Data Mining: Data mining derives its name from the similarities between searching for valuable business information in a large database - for example, finding linked products in gigabytes of store scanner data - and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material, or intelligently probing it to find exactly where the value resides. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:

Automated prediction of trends and behaviors. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data quickly. A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.

Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.

2. DECISION TREE

Decision tree learning is a method commonly used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. An example is shown on the right.
Each interior node corresponds to one of the input variables; there are edges to children for each of the possible values of that input variable. Each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf.

A tree can be "learned" by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when the subset at a node has all the same value of the target variable, or when splitting no longer adds value to the predictions. This process of top-down induction of decision trees (TDIDT) is an example of a greedy algorithm, and it is by far the most common strategy for learning decision trees from data, but it is not the only strategy. In fact, some approaches have been developed recently allowing tree induction to be performed in a bottom-up fashion. In data mining, decision trees can also be described as the combination of mathematical and computational techniques to aid the description, categorization and generalization of a given set of data.

Figure 1: Sample Decision Tree

3. ALGORITHM

3.1. C4.5 ALGORITHM

C4.5 is an algorithm used to generate a decision tree, developed by Ross Quinlan.[1] C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and therefore C4.5 is often referred to as a statistical classifier. C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set of already classified samples.
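To make the recursive, entropy-driven procedure concrete, here is a minimal sketch of top-down tree induction on categorical attributes. It is a simplification, not the paper's implementation: it uses plain information gain (ID3-style) rather than C4.5's gain ratio, and the function names (`build_tree`, `best_attribute`) are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Pick the attribute whose split yields the highest information gain."""
    base = entropy(labels)
    def gain(a):
        remainder = 0.0
        for v in set(row[a] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[a] == v]
            remainder += len(subset) / len(labels) * entropy(subset)
        return base - remainder
    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    """Recursive top-down induction (TDIDT) with ID3/C4.5-style base cases."""
    if len(set(labels)) == 1:        # base case: all samples in one class
        return labels[0]
    if not attributes:               # base case: no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attributes)
    tree = {}
    for v in set(row[a] for row in rows):
        sub = [(row, lab) for row, lab in zip(rows, labels) if row[a] == v]
        remaining = [x for x in attributes if x != a]
        tree[v] = build_tree([r for r, _ in sub], [l for _, l in sub], remaining)
    return {a: tree}

# Example: a single attribute that perfectly separates the classes.
rows = [{"outlook": "sunny"}, {"outlook": "sunny"}, {"outlook": "rainy"}]
print(build_tree(rows, ["play", "play", "no play"], ["outlook"]))
```

The recursion stops exactly at the conditions described above: a pure subset becomes a leaf, and exhausted attributes fall back to the majority class.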
Each sample consists of a p-dimensional vector whose components represent attributes or features of the sample, together with the class in which it falls. At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (difference in entropy). The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the smaller sublists.

This algorithm has a few base cases. All the samples in the list belong to the same class: when this happens, it simply creates a leaf node for the decision tree saying to pick that class. None of the features give any information gain: in this case, C4.5 creates a decision node higher up the tree using the expected value of the class. An instance of a previously unseen class is encountered: again, C4.5 creates a decision node higher up the tree using the expected value.

3.2. ENTROPY

Entropy is a measure of unpredictability or information content. To get an informal, intuitive understanding of the connection between these terms, consider the example of a poll on some political issue. Usually, such polls happen because the outcome of the poll isn't already known. In other words, the outcome of the poll is relatively unpredictable, and actually performing the poll and learning the results gives some new information; these are just different ways of saying that the entropy of the poll results is large. Now consider the case that the same poll is performed a second time shortly after the first poll.
Since the result of the first poll is already known, the outcome of the second poll can be predicted well and the results should not contain any new information; in this case the entropy of the second poll results is small. Now consider the example of a coin toss. When the coin is fair, that is, when the probability of heads is the same as the probability of tails, then the entropy of the coin toss is as high as it could be. This is because there is no way to predict the outcome of the coin toss ahead of time - the best we can do is predict that the coin will come up heads, and our prediction will be correct with probability 1/2. Such a coin toss has one bit of entropy, since there are two possible outcomes that occur with equal probability, and learning the actual outcome contains one bit of information. Contrarily, a coin toss with a coin that has two heads and no tails has zero entropy, since the coin will always come up heads and the outcome can be predicted perfectly. Most collections of data in the real world lie somewhere in between.

English text has fairly low entropy. In other words, it is fairly predictable. Even if we don't know exactly what is going to come next, we can be fairly certain that, for example, there will be many more e's than z's, or that the combination 'qu' will be much more common than any other combination with a 'q' in it, and the combination 'th' will be more common than 'z', 'q', or 'qu'. Uncompressed, English text has about one bit of entropy for each character (commonly encoded as eight bits) of message. If a compression scheme is lossless - that is, you can always recover the entire original message by decompressing - then a compressed message has the same quantity of information as the original, but communicated in fewer bits. That is, it has more information per bit, or a higher entropy.
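The coin-toss figures above (one bit for a fair coin, zero for a two-headed one) can be checked with a few lines of Python; the `entropy` helper here is illustrative, not code from the paper.

```python
from math import log2

def entropy(probabilities):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))        # fair coin: 1.0 bit
print(entropy([1.0, 0.0]) == 0)   # two-headed coin has zero entropy: True
```

The `if p > 0` guard uses the standard convention that a zero-probability outcome contributes nothing to the sum.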
This means a compressed message is more unpredictable, because there is no redundancy; each bit of the message is communicating a unique bit of information. Roughly speaking, Shannon's source coding theorem says that a lossless compression scheme cannot compress messages, on average, to have more than one bit of information per bit of message. The entropy of a message multiplied by the length of that message is a measure of how much information the message contains.

3.3. GAIN RATIO

Data produced by data mining methods can be represented in a wide range of ways. Decision tree structures are a typical approach to organizing classification schemes. In classification tasks, decision trees visualize what steps are taken to arrive at a classification. Each decision tree starts with what is called a root node, considered to be the "parent" of every other node. Every node in the tree evaluates an attribute in the data and determines which path it should follow. Typically, the decision test is based on comparing a value against some constant. Classification using a decision tree is performed by routing from the root node until a leaf node is reached.

The illustration given here is a canonical case in data mining: the decision to play or not play based on weather conditions. In this case, outlook is in the position of the root node. The branches of the node are attribute values. Here, the child nodes are tests of humidity and windy, leading to the leaf nodes which are the actual classifications. This illustration also includes the corresponding data, also referred to as instances. In our case, there are 9 "play" days and 5 "no play" days.

Decision trees can represent different sorts of data. The simplest and most common is numerical data.
It is often desirable to work with nominal data as well. Nominal quantities are formally described by a discrete set of symbols. For example, weather can be described in either numeric or nominal form. We can measure the temperature by saying that it is 11 degrees Celsius or 52 degrees Fahrenheit. We could likewise say that it is cold, cool, mild, warm or hot. The former is a case of numeric data, and the latter is a sort of nominal data. More precisely, the case of cold, cool, mild, warm and hot is a special sort of nominal data, described as ordinal data. Ordinal data carries an implicit assumption of ordered relationships between the values. Continuing with the weather example, we could likewise have a purely nominal description like sunny, overcast and rainy. These values have no relationships or distance measures.

The kind of data handled by a tree is essential for understanding how the tree functions at the node level. Recalling that every node is effectively a test, numeric data is usually evaluated in terms of a simple mathematical inequality. For example, numeric weather data could be tested by checking whether it is greater than 10 degrees Fahrenheit. Nominal data is tested in Boolean fashion; that is, whether or not it has a specific value. The illustration demonstrates the two sorts of tests. In the weather example, outlook is a nominal data type. The test simply asks which attribute value is present and routes accordingly. The humidity node reflects numeric tests, with an inequality of less than or equal to 70, or greater than 70.

Decision tree induction algorithms work recursively. First, an attribute must be chosen as the root node. In order to create the most efficient (i.e., smallest) tree, the root node should effectively split the data.
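The two kinds of node tests just described - an inequality on a numeric attribute and a Boolean check on a nominal one - look like this in code. The threshold and values come from the weather example above; the function names are illustrative.

```python
def numeric_test(humidity):
    """Numeric attribute test: a simple inequality, e.g. humidity <= 70."""
    return humidity <= 70

def nominal_test(outlook, value):
    """Nominal attribute test: a Boolean check for a specific value."""
    return outlook == value

# A humidity of 65 follows the "<= 70" branch; an outlook of "rainy"
# fails the test for "sunny" and is routed down another branch.
print(numeric_test(65), nominal_test("rainy", "sunny"))  # True False
```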
Each split attempts to pare down a set of instances (the actual data) until they all have the same classification. The best split is the one that provides what is termed the most information gain. Information in this setting originates from the concept of entropy in information theory, as developed by Claude Shannon. Although "information" has many distinct contexts, it has a very specific mathematical meaning relating to certainty in decision making. In a perfect world, each split in the decision tree should bring us closer to a classification. One way to conceptualize this is to see each step along the tree as removing randomness or entropy. Information, expressed as a numerical quantity, reflects this.

For instance, consider a very basic classification problem that requires making a decision tree to choose yes or no based on a few pieces of data. This is precisely the situation visualized in the decision tree. Each attribute value will have a certain number of yes or no classifications. If there are equal quantities of yeses and no's, then there is a lot of entropy in that value; in this circumstance, the entropy reaches a maximum. Conversely, if there are only yeses or only no's, the entropy is zero, and the test value is exceptionally useful for making a decision. The formula for calculating intermediate values is as follows:

info([p, n]) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Let's break this down. Consider computing the information gain for three values of one attribute. The attribute overall has a total of nine yeses and five no's. The first value has two yeses and three no's. The second has four yeses and zero no's. The last has three yeses and two no's. Our initial step is to calculate the information for each of the values.
Starting with the first, our formula leads us to info([2,3]) being equal to -2/5 × log 2/5 - 3/5 × log 3/5. This comes to 0.971 bits. Our second value is easy to calculate: it only has yeses, so it has a value of 0 bits. The final value is just the reverse of the first - its value is also 0.971 bits.

Having found the information for the values, we need to calculate the information for the attribute as a whole: 9 yeses and 5 no's. The calculation is info([9,5]) = -9/14 × log 9/14 - 5/14 × log 5/14. This comes to 0.940 bits.

In decision tree induction, our objective is to find the overall information gain. This is found by taking the weighted average of the information values of the attribute values. In our case, this is equivalent to finding the information of all the attribute values together. We would use the formula info([2,3],[4,0],[3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971. This comes to 0.693 bits.

The final step is to calculate the overall information gain. Information gain is found by subtracting the weighted-average information value from the raw total information of the attribute. Mathematically, we would calculate gain = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247.

The decision tree induction algorithm will compute this total for each attribute, select the one with the highest information gain as the root node, and proceed with the calculation recursively until the data is completely classified. This approach is one of the principal strategies used for decision tree induction. It has various possible weaknesses. One common issue arises when an attribute has a large number of uniquely identifying values. An example of this could be social security numbers, or other sorts of personal identification numbers. In this situation, there is a misleadingly high information value to the split - the ID classifies every single individual, and distorts the calculation by overfitting the data.
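The worked calculation above can be reproduced directly. Logs are base 2 throughout; `info` and `gain` are illustrative names, not the paper's code.

```python
from math import log2

def info(counts):
    """Entropy in bits of a class-count list, e.g. info([2, 3]) for 2 yes / 3 no."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def gain(total, partition):
    """Information gain of splitting `total` class counts into `partition`."""
    n = sum(total)
    remainder = sum(sum(part) / n * info(part) for part in partition)
    return info(total) - remainder

print(f"{info([2, 3]):.3f}")                             # 0.971
print(f"{info([9, 5]):.3f}")                             # 0.940
print(f"{gain([9, 5], [[2, 3], [4, 0], [3, 2]]):.3f}")   # 0.247
```

These match the hand computation: the pure [4, 0] value contributes zero to the weighted average, and the gain is 0.940 - 0.693 = 0.247 bits.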
One solution is to use an information gain ratio that biases against attributes with large numbers of distinct values.

3.4. PRUNING

One of the questions that arises in a decision tree algorithm is the optimal size of the final tree. A tree that is too large risks overfitting the training data and generalizing poorly to new samples. A small tree might not capture important structural information about the sample space. However, it is hard to tell when a tree algorithm should stop, because it is impossible to tell if the addition of a single extra node will dramatically decrease error. This problem is known as the horizon effect. A common strategy is to grow the tree until each node contains a small number of instances, then use pruning to remove nodes that do not provide additional information.[1] Pruning should reduce the size of a learning tree without reducing predictive accuracy as measured by a test set or by cross-validation. There are many techniques for tree pruning that differ in the measurement used to optimize performance.

4. SYSTEM ARCHITECTURE

Figure 2: System Architecture

4.1. Modules Specification

Modules:
Processing Dataset.
Detecting Split Points.
Decision Tree Construction.
Pruning Algorithm Implementation.
Performance Analysis.

Processing Dataset:
Selecting the numerical dataset from the UCI repository.
Loading the selected dataset into the database.
Classifying the dataset based on class attributes in the dataset.
Loading the classified dataset into the database for calculation.

Detecting Split Points:
Calculating entropy for all attributes.
Calculating Info and Gain for all the attributes in the dataset.
Selecting the best split points with maximum gain.
Sorting the values based on gain.

Decision Tree Construction:
Sorting the attributes with less probability.
Classifying the attributes based on their probability values.
Sorting the attributes with the highest mean value.
Classifying the attributes based on their mean values.
Constructing the decision tree with selected attributes.

Pruning Algorithm Implementation:
Finding the set of end-points for tuples in attribute Aj.
Computing entropy for all end-points for all k attributes.
Finding the pruning threshold.
Finding the best split points on attribute Aj.
Pruning the end-points for all k attributes.

Performance Analysis:
Calculating the execution time for the various pruning algorithms.
Finding the number of entropy calculations made by implementing the pruning algorithms.
Evaluating the effectiveness of the various pruning algorithms.
Evaluating the effects of samples and width for attributes.
Performance analysis by comparing the results of the various pruning algorithms.

4.2. Flow Chart

5. CONCLUSION

We have extended the model of decision-tree classification to accommodate data tuples having numerical attributes with uncertainty described by arbitrary pdfs. We have modified classical decision tree building algorithms (based on the framework of C4.5 [3]) to build decision trees for classifying such data. We have found empirically that, when suitable pdfs are used, exploiting data uncertainty leads to decision trees with remarkably higher accuracies. We therefore advocate that data be collected and stored with the pdf information intact. Performance is an issue, though, because of the increased amount of information to be processed, as well as the more complicated entropy computations involved. Therefore, we have devised a series of pruning techniques to improve tree construction efficiency. Our algorithms have been experimentally verified to be highly effective.
Their execution times are of an order of magnitude comparable to classical algorithms. Some of these pruning techniques are generalizations of analogous techniques for handling point-valued data. Other techniques, namely pruning by bounding and end-point sampling, are novel. Although our novel techniques are primarily designed to handle uncertain data, they are also useful for building decision trees using classical algorithms when there are tremendous amounts of data tuples.

REFERENCES

[1] M. Chau, R. Cheng, B. Kao, and J. Ng, "Uncertain data mining: An example in clustering location data," in PAKDD, ser. Lecture Notes in Computer Science, vol. 3918. Singapore: Springer, 9-12 Apr. 2006, pp. 199-204.
[2] S. D. Lee, B. Kao, and R. Cheng, "Reducing UK-means to K-means," in The 1st Workshop on Data Mining of Uncertain Data (DUNE), in conjunction with the 7th IEEE International Conference on Data Mining (ICDM), Omaha, NE, USA, 28 Oct. 2007.
[3] K. Chui, B. Kao, and E. Hung, "Mining frequent itemsets from uncertain data," in PAKDD, ser. Lecture Notes in Computer Science, vol. 4426. Nanjing, China: Springer, 22-25 May 2007, pp. 47-58.
[4] L. Breiman, "Technical note: Some properties of splitting criteria," Machine Learning, vol. 24, no. 1, pp. 41-47, 1996.
[5] A. Nierman and H. V. Jagadish, "ProTDB: Probabilistic data in XML," in VLDB. Hong Kong, China: Morgan Kaufmann, 20-23 Aug. 2002, pp. 646-657.
[6] J. Chen and R. Cheng, "Efficient evaluation of imprecise location-dependent queries," in ICDE. Istanbul, Turkey: IEEE, 15-20 Apr. 2007, pp. 586-595.
[7] M. Chau, R. Cheng, B. Kao, and J. Ng, "Uncertain data mining: An example in clustering location data," in PAKDD, ser. Lecture Notes in Computer Science, vol. 3918. Singapore: Springer, 9-12 Apr. 2006, pp. 199-204.
[8] R. Cheng, Y. Xia, S. Prabhakar, R. Shah, and J. S. Vitter, "Efficient indexing methods for probabilistic threshold queries over uncertain data," in VLDB. Toronto, Canada: Morgan Kaufmann, 31 Aug.-3 Sept. 2004, pp. 876-887.
[9] R. Cheng, D. V. Kalashnikov, and S. Prabhakar, "Querying imprecise data in moving object environments," IEEE Trans. Knowl. Data Eng., vol. 16, no. 9, pp. 1112-1127, 2004.
[10] W. K. Ngai, B. Kao, C. K. Chui, R. Cheng, M. Chau, and K. Y. Yip, "Efficient clustering of uncertain data," in ICDM. Hong Kong, China: IEEE Computer Society, 18-22 Dec. 2006, pp. 436-445.
[11] S. D. Lee, B. Kao, and R. Cheng, "Reducing UK-means to K-means," in The 1st Workshop on Data Mining of Uncertain Data (DUNE), in conjunction with the 7th IEEE International Conference on Data Mining (ICDM), Omaha, NE, USA, 28 Oct. 2007.