International Journal of Computer Engineering & Technology (IJCET)
Volume 10, Issue 01, January-February 2019, pp. 253–261, Article ID: IJCET_10_01_026
Available online at http://www.iaeme.com/ijcet/issues.asp?JType=IJCET&VType=10&IType=1
Journal Impact Factor (2016): 9.3590 (Calculated by GISI) www.jifactor.com
ISSN Print: 0976-6367 and ISSN Online: 0976-6375
© IAEME Publication
MULTIPLE KERNEL FUZZY CLUSTERING FOR
UNCERTAIN DATA CLASSIFICATION
Nijaguna GS
Research Scholar, VTU Belagaum
Dr. Thippeswamy K
Professor & Head, Dept. of CS&E, VTU PG Centre, Regional Office, Mysuru
ABSTRACT:
Traditional decision tree classifiers work with data whose values are known and precise. We extend such classifiers to handle data with uncertain information. Value uncertainty arises in many applications during the data collection process. Example sources of uncertainty include measurement/quantization errors, data staleness, and multiple repeated measurements. With uncertainty, the value of a data item is often represented not by a single value, but by multiple values forming a probability distribution. Rather than abstracting uncertain data with statistical derivatives (such as the mean and median), we find that the accuracy of a decision tree classifier can be much improved if the "complete information" of a data item, i.e., its probability density function (pdf), is utilized. Since processing pdfs is computationally more costly than processing single values (e.g., averages), decision tree construction on uncertain data is more CPU demanding than on certain data. To tackle this problem, we propose a series of pruning techniques that can greatly improve construction efficiency.
Key words: Uncertainty, Probability distribution.
Cite this Article: Nijaguna GS and Dr. Thippeswamy K, Multiple Kernel Fuzzy Clustering For Uncertain Data Classification, International Journal of Computer Engineering and Technology, 10(01), 2019, pp. 253-261.
http://www.iaeme.com/IJCET/issues.asp?JType=IJCET&VType=10&IType=1
1. INTRODUCTION
A. Data Mining: Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by the retrospective tools typical of decision support systems.
Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Most companies already collect and refine massive quantities of data. The process of applying a model to new data is known as scoring. Many forms of data mining are predictive. For example, a model might predict income based on education and other demographic factors. Predictions have an associated probability. Prediction probabilities are also known as confidence. Other forms of data mining identify natural groupings in the data. For example, a model might identify the segment of the population that has an income within a specified range, that has a good driving record, and that leases a new car on a yearly basis. Data mining can derive actionable information from large volumes of data. For example, a town planner might use a model that predicts income based on demographics to develop a plan for low-income housing.
B. The Scope of Data Mining: Data mining derives its name from the similarities between searching for valuable business information in a large database - for example, finding linked products in gigabytes of store scanner data - and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material, or intelligently probing it to find exactly where the value resides. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:
 Automated prediction of trends and behaviors. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data quickly. A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.
 Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.
2. DECISION TREE
Decision tree learning is a method commonly used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. An example is shown in Figure 1. Each interior node corresponds to one of the input variables; there are edges to children for each of the possible values of that input variable. Each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf. A tree can be "learned" by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when the subset at a node has all the same value of the target variable, or when splitting no longer adds value to the predictions. This process of top-down induction of decision trees (TDIDT) is an example of a greedy algorithm, and it is by far the most common strategy for learning decision trees from data, but it is not the only strategy. In fact, some approaches have been developed recently allowing tree induction to be performed in a bottom-up fashion.
In data mining, decision trees can also be described as the combination of mathematical and computational techniques to aid the description, categorization and generalization of a given set of data.
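The description above covers both the structure (interior nodes, edges, leaves) and the way a classification follows a path from the root to a leaf. The following is a minimal, hypothetical Python sketch of that structure, not taken from the paper and simplified to nominal attribute values; the weather attributes and labels are assumed, echoing the play/no-play example used later in Section 3.3.

```python
# A sketch of the structure described above: each interior node tests one input
# variable, each leaf holds a value of the target variable, and classification
# follows a path from the root to a leaf.
def classify(tree, sample):
    """tree is either a leaf label, or a pair (attribute, {value: subtree})."""
    while isinstance(tree, tuple):
        attribute, children = tree
        tree = children[sample[attribute]]
    return tree

# Hypothetical weather tree (the play / no-play data discussed in Section 3.3).
tree = ("outlook", {
    "sunny":    ("humidity", {"high": "no play", "normal": "play"}),
    "overcast": "play",
    "rainy":    ("windy",    {True: "no play", False: "play"}),
})
print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))  # -> play
```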
Figure 1: Sample Decision Tree
3. ALGORITHM
3.1. C4.5 ALGORITHM
C4.5 is an algorithm used to generate a decision tree, developed by Ross Quinlan [1]. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier. C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set of already classified samples. Each sample consists of a p-dimensional vector of attribute values or features, together with the class in which the sample falls. At every node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (difference in entropy). The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the smaller sublists.
This algorithm has a few base cases.
 All the samples in the list belong to the same class. When this happens, it simply creates a leaf node for the decision tree saying to choose that class.
 None of the features give any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class.
 An instance of a previously unseen class is encountered. Again, C4.5 creates a decision node higher up the tree using the expected value (a sketch of this recursion follows the list).
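The base cases above, together with gain-based attribute selection, can be summarized in a short sketch. The Python below is an assumed illustration of the C4.5-style recursion, simplified to nominal attributes and plain information gain; Quinlan's C4.5 proper additionally uses the gain ratio (Section 3.3) and handles numeric attributes, missing values, and pruning.

```python
# Hedged sketch of the recursion and base cases described above, using the same
# tree representation as the earlier sketch: a label (leaf) or (attr, children).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def build(rows, labels, attributes, default=None):
    if not labels:                       # no samples reach this node
        return default                   # -> use the parent's expected class
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:            # base case 1: one class only -> leaf
        return labels[0]
    if not attributes:                   # nothing left to test -> majority class
        return majority
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))
    if info_gain(rows, labels, best) <= 0:   # base case 2: no gain anywhere
        return majority
    children = {}
    for value in {row[best] for row in rows}:
        keep = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in keep]
        sub_labels = [y for _, y in keep]
        remaining = [a for a in attributes if a != best]
        children[value] = build(sub_rows, sub_labels, remaining, majority)
    return (best, children)              # internal node: (attribute, {value: subtree})
```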
3.2. ENTROPY
Entropy is a measure of unpredictability or information content. To get an informal, intuitive understanding of the connection between these three English terms, consider the example of a poll on some political issue. Usually, such polls happen because the outcome of the poll isn't already known. In other words, the outcome of the poll is relatively unpredictable, and actually performing the poll and learning the results gives some new information; these are just different ways of saying that the entropy of the poll results is large. Now consider the case that the same poll is performed a second time shortly after the first poll. Since the result of the first poll is already known, the outcome of the second poll can be predicted well and the results should not contain any new information; in this case the entropy of the second poll results is small. Now consider the example of a coin toss. When the coin is fair, that is, when the probability of heads is the same as the probability of tails, then the entropy of the coin toss is as high as it could be. This is because there is no way to predict the outcome of the coin toss ahead of time - the best we can do is predict that the coin will come up heads, and our prediction will be correct with probability 1/2. Such a coin toss has one bit of entropy since there are two possible outcomes that occur with equal probability, and learning the actual outcome contains one bit of information. Contrarily, a coin toss with a coin that has two heads and no tails has zero entropy since the coin will always come up heads, and the outcome can be predicted perfectly. Most collections of data in the real world lie somewhere in between.
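As a quick check of the coin-toss discussion above, a couple of lines of Python compute the Shannon entropy of the two cases (one bit for a fair coin, zero bits for a two-headed coin):

```python
from math import log2

def shannon_entropy(probs):
    """Entropy, in bits, of a discrete probability distribution."""
    return sum(-p * log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))  # 1.0 bit  : fair coin
print(shannon_entropy([1.0, 0.0]))  # 0.0 bits : two-headed coin
```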
English text has fairly low entropy. In other words, it is fairly predictable. Even if we don't know exactly what is going to come next, we can be fairly certain that, for example, there will be many more e's than z's, or that the combination 'qu' will be much more common than any other combination with a 'q' in it, and the combination 'th' will be more common than 'z', 'q', or 'qu'. Uncompressed, English text has about one bit of entropy for each character (commonly encoded as eight bits) of message.
If a compression scheme is lossless - that is, you can always recover the entire original message by decompressing - then a compressed message has the same quantity of information as the original, but communicated in fewer bits. That is, it has more information per bit, or a higher entropy. This means a compressed message is more unpredictable, because there is no redundancy; each bit of the message is communicating a unique bit of information. Roughly speaking, Shannon's source coding theorem says that a lossless compression scheme cannot compress messages, on average, to have more than one bit of information per bit of message. The entropy of a message multiplied by the length of that message is a measure of how much information the message contains.
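A rough illustration of that last statement, using the figures assumed in the text (about one bit of entropy per character of English versus eight bits per character in the raw encoding):

```python
# Entropy per character multiplied by message length estimates the information
# content; the raw encoding uses far more bits than that.
message_length = 1000                 # characters (hypothetical message)
print(message_length * 1.0)           # ~1000 bits of actual information
print(message_length * 8)             # 8000 bits in the uncompressed encoding
```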
3.3. GAIN RATIO:
Data produced by data mining methods can be represented in a wide range of ways. Decision tree structures are a typical approach to organizing classification schemes. In classification tasks, decision trees visualize what steps are taken to arrive at a classification. Each decision tree starts with what is called a root node, considered to be the "parent" of every other node. Each node in the tree evaluates an attribute in the data and determines which path it should follow. Typically, the decision test compares a value against some constant. Classification using a decision tree is performed by routing from the root node until a leaf node is reached.
The illustration given here is a canonical case in data mining: the decision to play or not play based on weather conditions. In this case, outlook is in the position of the root node. The branches of the node are the attribute values. In this example, the child nodes are tests of humidity and windy, leading to the leaf nodes, which give the actual classifications.
This example also includes the corresponding data, also referred to as instances. In our case, there are 9 "play" days and 5 "no play" days.
Decision trees can represent many different kinds of data. The simplest and most common is numerical data. It is often desirable to organize nominal data as well. Nominal quantities are formally described by a discrete set of symbols. For example, weather can be described in either numeric or nominal form: we can measure the temperature by saying that it is 11 degrees Celsius or 52 degrees Fahrenheit. We could also say that it is cold, cool, mild, warm or hot. The former is a case of numeric data, and the latter is a kind of nominal data. More precisely, the case of cold, cool, mild, warm and hot is a special kind of nominal data, described as ordinal data. Ordinal data carries an implicit assumption of ordered relationships between the values. Continuing with the weather example, we could also have a purely nominal description such as sunny, overcast and rainy. These values have no relationships or distance measures.
The kind of data organized by a tree is essential for understanding how the tree functions at the node level. Recalling that every node is effectively a test, numeric data is typically evaluated in terms of simple mathematical inequalities. For example, numeric weather data could be tested by checking whether it is greater than 10 degrees Fahrenheit. Nominal data is tested in Boolean fashion, that is, by whether or not it takes a specific value. The illustration demonstrates the two kinds of tests. In the weather example, outlook is a nominal data type. The test simply asks which attribute value is present and routes accordingly. The humidity node reflects a numeric test, with a split of less than or equal to 70, or greater than 70.
Decision tree induction algorithms work recursively. First, an attribute must be chosen as the root node. To produce the most efficient (i.e., smallest) tree, the root node should effectively split the data. Each split tries to pare down a set of instances (the actual data) until they all have the same classification. The best split is the one that gives what is called the most information gain.
Information in this setting originates from the concept of entropy in information theory, as developed by Claude Shannon. Although "information" has many connotations, it has a very specific mathematical meaning relating to certainty in decision making. Ideally, each split in the decision tree should bring us closer to a classification. One way to conceptualize this is to see each step along the tree as removing randomness, or entropy. Information, expressed as a numerical quantity, reflects this. For instance, consider a very basic classification problem that requires building a decision tree to choose yes or no based on a few pieces of data. This is precisely the situation visualized in the decision tree. Each attribute value will have a certain number of yes or no classifications. If there are equal numbers of yeses and no's, there is a great deal of entropy in that value, and the information measure reaches its maximum. Conversely, if there are only yeses or only no's, the information is zero: the entropy is low, and the test value is extremely useful for making a decision. The formula for calculating these intermediate values is as follows (with logarithms taken to base 2):
info([p, n]) = -(p/(p+n)) x log(p/(p+n)) - (n/(p+n)) x log(n/(p+n))
Let us break this down. Consider trying to compute the information gain for the three values of one attribute. The attribute overall has a total of nine yeses and five no's.
The first value has two yeses and three no's. The second has four yeses and zero no's. The last has three yeses and two no's. Our first step is to calculate the information for each of these values.
Starting with the first, our formula leads us to info([2,3]) being equal to -2/5 x log 2/5 - 3/5 x log 3/5. This comes to 0.971 bits. Our second value is easy to calculate: it only has yeses, so it has a value of 0 bits. The final value is just the reverse of the first, so its value is also 0.971 bits. Having found the information for the values, we need to calculate the information for the attribute as a whole: 9 yeses and 5 no's. The calculation is info([9,5]) = -9/14 x log 9/14 - 5/14 x log 5/14. This comes to 0.940 bits.
In decision tree induction, our objective is to find the overall information gain. This is found by taking the weighted average of the information values of the attribute values; in our case, this is equivalent to finding the information of all the attribute values together. We use the formula info([2,3],[4,0],[3,2]) = (5/14) x 0.971 + (4/14) x 0 + (5/14) x 0.971. This comes to 0.693 bits.
The final step is to calculate the overall information gain. Information gain is found by subtracting this weighted average from the raw total information of the attribute. Mathematically, we calculate gain = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247. The decision tree induction algorithm will compute this total for each attribute, select the one with the highest information gain as the root node, and proceed with the calculation recursively until the data is completely classified.
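The worked example above can be reproduced with a few lines of Python. This is only a restatement of the arithmetic in the text (class counts [2,3], [4,0] and [3,2] for the three attribute values, and [9,5] overall); note that the weighted average comes to 0.6935 bits, which the text rounds to 0.693.

```python
from math import log2

def info(counts):
    """Entropy, in bits, of a class-count list such as [2, 3]."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def info_split(*count_lists):
    """Weighted average entropy over the subsets produced by a split."""
    total = sum(sum(c) for c in count_lists)
    return sum(sum(c) / total * info(c) for c in count_lists)

print(f"{info([2, 3]):.3f}")                                        # 0.971
print(f"{info([9, 5]):.3f}")                                        # 0.940
print(f"{info_split([2, 3], [4, 0], [3, 2]):.3f}")                  # 0.694 (0.693 in the text)
print(f"{info([9, 5]) - info_split([2, 3], [4, 0], [3, 2]):.3f}")   # 0.247
```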
This approach is one of the principal strategies used for decision tree induction. It has several possible weaknesses. One common issue arises when an attribute has a large number of uniquely identifying values. An example of this could be social security numbers, or other kinds of personal identification numbers. In this situation, there is a misleadingly high information value for the split: the ID classifies every single individual, and distorts the calculation by overfitting the data. One solution is to use an information gain ratio that biases against attributes with large numbers of distinct values.
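The gain-ratio correction mentioned here divides the information gain by the split information, i.e. the entropy of the partition itself, so that an attribute with many distinct values pays a large penalty. A hedged sketch follows (the exact normalization used by a given implementation may differ):

```python
from math import log2

def split_info(subset_sizes):
    """Entropy of the partition produced by a split, ignoring class labels."""
    n = sum(subset_sizes)
    return -sum(s / n * log2(s / n) for s in subset_sizes if s > 0)

def gain_ratio(gain, subset_sizes):
    si = split_info(subset_sizes)
    return gain / si if si > 0 else 0.0

print(round(split_info([5, 4, 5]), 3))        # 1.577 : three-valued attribute over 14 rows
print(round(split_info([1] * 14), 3))         # 3.807 : ID-like attribute, 14 distinct values
print(round(gain_ratio(0.247, [5, 4, 5]), 3)) # 0.157 : gain ratio for the worked example
```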
3.4. PRUNING:
One of the questions that arises in a decision tree algorithm is the optimal size of the final tree. A tree that is too large risks overfitting the training data and generalizing poorly to new samples. A small tree might not capture important structural information about the sample space. However, it is hard to tell when a tree algorithm should stop, because it is impossible to tell whether the addition of a single extra node will dramatically decrease error. This problem is known as the horizon effect. A common strategy is to grow the tree until each node contains a small number of instances, and then use pruning to remove nodes that do not provide additional information [1]. Pruning should reduce the size of a learning tree without reducing predictive accuracy as measured by a test set or using cross-validation. There are many techniques for tree pruning that differ in the measurement that is used to optimize performance.
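The text above does not fix a particular pruning method, so the sketch below shows one common option consistent with it: reduced-error pruning against a held-out validation set, using the same tree representation as the earlier sketches. This is a generic technique, not the pruning proposed in this paper, which instead targets the cost of entropy computations during tree construction.

```python
from collections import Counter

def classify(tree, row):
    """Tree is a label (leaf) or (attr, {value: subtree}); unseen values fall to None."""
    while isinstance(tree, tuple):
        attr, children = tree
        tree = children.get(row[attr])
    return tree

def prune(tree, rows, labels):
    """Bottom-up: replace a subtree with a majority-class leaf whenever doing so
    does not increase errors on the validation rows that reach it."""
    if not isinstance(tree, tuple) or not labels:
        return tree
    attr, children = tree
    for value in list(children):
        reach = [(r, y) for r, y in zip(rows, labels) if r[attr] == value]
        sub_rows = [r for r, _ in reach]
        sub_labels = [y for _, y in reach]
        children[value] = prune(children[value], sub_rows, sub_labels)
    majority = Counter(labels).most_common(1)[0][0]
    subtree_errors = sum(classify(tree, r) != y for r, y in zip(rows, labels))
    leaf_errors = sum(y != majority for y in labels)
    return majority if leaf_errors <= subtree_errors else tree
```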
4. SYSTEM ARCHITECTURE
Figure 2: System Architecture
4.1. Modules Specification
Modules:
 Processing Dataset.
 Detecting Split Points.
 Decision tree construction.
 Pruning algorithm implementation.
 Performance analysis.
Processing Dataset:
 Selecting the numerical dataset from the UCI repository.
 Loading the selected dataset into the database.
 Classifying the dataset based on class attributes in the dataset.
 Loading the classified dataset into the database for calculation.
Detecting Split Points:
 Calculating entropy for all attributes.
 Calculating Info and gain for all the attributes in the dataset.
 Selecting the best split points with maximum gain (a sketch of the split-point search follows this list).
 Sorting the values based on gain.
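The following is an assumed sketch of split-point detection for one numeric attribute: candidate thresholds are taken between consecutive distinct sorted values and scored by information gain. The exact candidate generation and scoring used in the implementation may differ; the example values are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values()) if n else 0.0

def best_split_point(values, labels):
    """Return (threshold, gain) of the best binary split of a numeric attribute."""
    pairs = sorted(zip(values, labels))
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    base, n = entropy(ys), len(ys)
    best_gain, best_threshold = 0.0, None
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue                         # identical values cannot be separated
        threshold = (xs[i - 1] + xs[i]) / 2
        gain = base - (i / n) * entropy(ys[:i]) - ((n - i) / n) * entropy(ys[i:])
        if gain > best_gain:
            best_gain, best_threshold = gain, threshold
    return best_threshold, best_gain

# Hypothetical humidity-style attribute: perfectly separated at 72.5.
print(best_split_point([65, 70, 70, 75, 80, 90],
                       ["play", "play", "play", "no", "no", "no"]))  # (72.5, 1.0)
```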
Decision Tree Construction:
 Sorting the attributes with lower probability.
 Classifying the attributes based on their probability values.
 Sorting the attributes with the highest mean value.
 Classifying the attributes based on their mean values.
 Constructing the decision tree with the selected attributes.
Pruning Algorithm Implementation:
 Finding the set of end-points for tuples in attribute Aj.
 Computing entropy for all end-points for all k attributes.
 Finding the pruning threshold.
 Finding the best split points on attribute Aj.
 Pruning the end-points for all k attributes (a hedged sketch of the bounding idea follows this list).
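The end-point pruning steps listed above are not spelled out in this text, so the following is only a heavily hedged sketch of the general "pruning by bounding" idea named in the conclusion: skip the expensive exact entropy computation for a candidate end-point whenever a cheap optimistic lower bound already rules it out. The scoring and bound functions here are placeholders, not the paper's actual bounds.

```python
def prune_candidates(end_points, exact_entropy, lower_bound):
    """end_points: candidate thresholds for attribute Aj.
    exact_entropy(t): expensive exact score; lower_bound(t): cheap optimistic bound."""
    best_value, best_point = float("inf"), None
    for t in sorted(end_points, key=lower_bound):   # try most promising first
        if lower_bound(t) >= best_value:
            continue                                # pruned: cannot beat current best
        value = exact_entropy(t)
        if value < best_value:
            best_value, best_point = value, t
    return best_point, best_value

# Toy usage with placeholder scores (purely illustrative):
exact = {1.5: 0.9, 2.5: 0.4, 3.5: 0.7, 4.5: 0.8}
print(prune_candidates(exact.keys(), exact_entropy=lambda t: exact[t],
                       lower_bound=lambda t: exact[t] - 0.1))  # (2.5, 0.4)
```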
Performance Analysis:
 Calculating the execution time for the various pruning algorithms.
 Finding the number of entropy calculations performed by the various pruning algorithms.
 Evaluating the effectiveness of the various pruning algorithms.
 Evaluating the effects of the number of samples and the width for each attribute.
 Performance analysis by comparing the results of the various pruning algorithms.
4.2. Flow Chart
5. CONCLUSION
We have extended the model of decision-tree classification to accommodate data tuples having numerical attributes with uncertainty described by arbitrary pdfs. We have modified classical decision tree building algorithms (based on the structure of C4.5 [3]) to build decision trees for classifying such data. We have found empirically that, when suitable pdfs are used, exploiting data uncertainty leads to decision trees with remarkably higher accuracies. We therefore advocate that data be collected and stored with the pdf information intact. Performance is an issue, though, because of the increased amount of information to be processed, as well as the more complicated entropy computations involved.
Therefore, we have devised a series of pruning techniques to improve tree construction efficiency. Our algorithms have been experimentally verified to be highly effective. Their execution times are of an order of magnitude comparable to classical algorithms. Some of these pruning techniques are generalizations of analogous techniques for handling point-valued data. Other techniques, namely pruning by bounding and end-point sampling, are novel. Although our novel techniques are primarily designed to handle uncertain data, they are also useful for building decision trees using classical algorithms when there are tremendous amounts of data tuples.
REFERENCES
[1] M. Chau, R. Cheng, B. Kao, and J. Ng, "Uncertain data mining: An example in clustering location data," in PAKDD, ser. Lecture Notes in Computer Science, vol. 3918. Singapore: Springer, 9–12 Apr. 2006, pp. 199–204.
[2] S. D. Lee, B. Kao, and R. Cheng, "Reducing UK-means to K-means," in The 1st Workshop on Data Mining of Uncertain Data (DUNE), in conjunction with the 7th IEEE International Conference on Data Mining (ICDM), Omaha, NE, USA, 28 Oct. 2007.
[3] K. Chui, B. Kao, and E. Hung, "Mining frequent itemsets from uncertain data," in PAKDD, ser. Lecture Notes in Computer Science, vol. 4426. Nanjing, China: Springer, 22–25 May 2007, pp. 47–58.
[4] L. Breiman, "Technical note: Some properties of splitting criteria," Machine Learning, vol. 24, no. 1, pp. 41–47, 1996.
[5] A. Nierman and H. V. Jagadish, "ProTDB: Probabilistic data in XML," in VLDB. Hong Kong, China: Morgan Kaufmann, 20–23 Aug. 2002, pp. 646–657.
[6] J. Chen and R. Cheng, "Efficient evaluation of imprecise location-dependent queries," in ICDE. Istanbul, Turkey: IEEE, 15–20 Apr. 2007, pp. 586–595.
[7] M. Chau, R. Cheng, B. Kao, and J. Ng, "Uncertain data mining: An example in clustering location data," in PAKDD, ser. Lecture Notes in Computer Science, vol. 3918. Singapore: Springer, 9–12 Apr. 2006, pp. 199–204.
[8] R. Cheng, Y. Xia, S. Prabhakar, R. Shah, and J. S. Vitter, "Efficient indexing methods for probabilistic threshold queries over uncertain data," in VLDB. Toronto, Canada: Morgan Kaufmann, 31 Aug.–3 Sept. 2004, pp. 876–887.
[9] R. Cheng, D. V. Kalashnikov, and S. Prabhakar, "Querying imprecise data in moving object environments," IEEE Trans. Knowl. Data Eng., vol. 16, no. 9, pp. 1112–1127, 2004.
[10] W. K. Ngai, B. Kao, C. K. Chui, R. Cheng, M. Chau, and K. Y. Yip, "Efficient clustering of uncertain data," in ICDM. Hong Kong, China: IEEE Computer Society, 18–22 Dec. 2006, pp. 436–445.
[11] S. D. Lee, B. Kao, and R. Cheng, "Reducing UK-means to K-means," in The 1st Workshop on Data Mining of Uncertain Data (DUNE), in conjunction with the 7th IEEE International Conference on Data Mining (ICDM), Omaha, NE, USA, 28 Oct. 2007.