An Empirical Model of Clustering Algorithm for High-Dimensional Data

Ponnana Janardhana Rao1, Ch. Ramesh2
1Final M.Tech Student, 2Professor
1,2Department of Computer Science and Engineering, AITAM, Tekkali, India
Abstract: Extracting a useful feature set from a raw dataset is an interesting and important research issue: useful features must be identified within the full set of features of the dataset. In this paper we propose an efficient clustering algorithm, fast clustering-based feature selection (FAST). The FAST algorithm works in two steps. In the first step, features are divided into clusters by using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to the target classes is selected from each cluster to form a subset of features. A classification technique is then applied over the clustered data, which is clustered by constructing a minimum spanning tree.
I. INTRODUCTION
In a social network, nodes can be represented as vertices, and those vertices V (v1, v2, ..., vn) are connected through a set of edges E in an undirected graph G(V, E); a non-identifying attribute that describes a node is known as a quasi-identifier. Clustering can be performed on quasi-identifiers such as age and gender; distributed clustering groups similar objects based on the minimum distance between the nodes.
In distributed networks, data can be numerical or categorical. Numerical data can be compared with respect to the difference between quasi-identifier values, and categorical data can be compared using the similarity between the data objects.
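As a rough illustration of this comparison, the following Python sketch mixes an absolute difference for numerical quasi-identifiers such as age with a simple match/mismatch similarity for categorical ones such as gender. The function name and the equal weighting of the two kinds of attributes are our own illustrative assumptions, not part of any cited system.

def mixed_distance(a, b):
    """Toy distance between two records with numerical and categorical fields.

    Numerical fields contribute their absolute difference (values assumed to
    be scaled to a comparable range); categorical fields contribute 0 on a
    match and 1 on a mismatch."""
    total = 0.0
    for key in a:
        if isinstance(a[key], (int, float)):
            total += abs(a[key] - b[key])
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total

# Example: two records described by the quasi-identifiers age and gender
r1 = {"age": 0.25, "gender": "F"}
r2 = {"age": 0.30, "gender": "M"}
print(mixed_distance(r1, r2))  # 0.05 + 1.0 = 1.05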
Text clustering methods are divided into three types: partitioning clustering, hierarchical clustering, and fuzzy clustering. A partitioning algorithm randomly selects k objects and defines them as k clusters. It then calculates the cluster centroids and forms clusters around those centroids, computing the similarity between each text and the centroids. This process is repeated until some criterion specified by the user is met, as sketched below.
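A minimal sketch of this partitioning (k-means style) scheme is given below, assuming documents are already represented as numeric vectors; the stopping criterion is simplified to a fixed number of iterations, and the helper names are illustrative rather than taken from the paper.

import random

def kmeans(vectors, k, iterations=20, seed=0):
    """Simple partitioning clustering: pick k seeds, assign points to the
    nearest centroid, recompute centroids, and repeat a fixed number of times."""
    random.seed(seed)
    centroids = random.sample(vectors, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            # assign each vector to the centroid it is closest to
            distances = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(v)
        for i, members in enumerate(clusters):
            if members:  # recompute the centroid as the mean of its members
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return clusters, centroids

docs = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
clusters, centroids = kmeans(docs, k=2)
print(clusters)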
In this paper we propose an architecture in which every data holder, or player, clusters its documents itself after preprocessing and computes the local and global frequencies of the documents in order to calculate the file or document weights. The distributed k-means algorithm is one of the efficient distributed clustering algorithms. In our work we aim at optimized clustering in distributed networks by improving the traditional clustering algorithm. In our approach, when a new dataset is placed at a data holder, that holder requests the other data holders to forward only the relevant features, instead of entire datasets, in order to cluster the documents.
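The document weights mentioned above could, for instance, be formed from term frequencies local to one data holder scaled by document frequencies aggregated across holders, in the spirit of TF-IDF. The sketch below is only an assumption about how such weights might be computed, not the paper's exact formula.

import math

def document_weights(doc_terms, global_doc_freq, total_docs):
    """Weight each term of one document by its local frequency scaled by an
    inverse global document frequency collected from all data holders."""
    weights = {}
    for term in set(doc_terms):
        tf = doc_terms.count(term)                      # local frequency
        df = global_doc_freq.get(term, 1)               # global document frequency
        weights[term] = tf * math.log(total_docs / df)  # TF-IDF style weight
    return weights

doc = ["privacy", "clustering", "privacy", "data"]
global_df = {"privacy": 20, "clustering": 5, "data": 80}
print(document_weights(doc, global_df, total_docs=100))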
In distributed clustering, nodes can be clustered based on the common edges that connect vertices with similar edge weights; however, edge weight alone does not lead to an optimal solution, because it is not an optimal measure on its own.
Every individual data holder, or player, maintains its own transactions or patterns. In horizontal partitioning, every data holder encrypts the patterns held at its end and forwards them to a centralized server. At the centralized server, the received patterns are decrypted with a decoder and passed to a Boolean matrix to extract the frequent patterns. For experimental purposes we establish connections between the nodes and a central location (key generation center) through network or socket programming; a key can be generated using an improved Lagrange polynomial equation and distributed to the users [8].
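One standard way to realize key distribution with Lagrange polynomials is Shamir-style secret sharing: the key generation center embeds the key as the constant term of a random polynomial, hands each node one point of that polynomial, and any threshold-sized group can reconstruct the key by Lagrange interpolation. The sketch below illustrates this idea over a small prime field; the parameters are illustrative, and the "improved" variant referenced in [8] is not reproduced here.

import random

PRIME = 2 ** 127 - 1  # a known Mersenne prime used as the field modulus

def make_shares(secret, threshold, n_players, prime=PRIME):
    """Split a secret key into n shares; any `threshold` of them reconstruct it."""
    coeffs = [secret] + [random.randrange(prime) for _ in range(threshold - 1)]
    def poly(x):
        return sum(c * pow(x, i, prime) for i, c in enumerate(coeffs)) % prime
    return [(x, poly(x)) for x in range(1, n_players + 1)]

def reconstruct(shares, prime=PRIME):
    """Recover the secret at x = 0 with Lagrange interpolation."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % prime
                den = (den * (xi - xj)) % prime
        secret = (secret + yi * num * pow(den, -1, prime)) % prime
    return secret

shares = make_shares(secret=123456789, threshold=3, n_players=5)
print(reconstruct(shares[:3]))  # any 3 of the 5 shares recover 123456789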
Even though various horizontal and vertical partitioning mechanisms are available, privacy remains the major concern when data is transmitted from the data holders. The centralized server then performs the required data mining operations over the received data.
II. RELATED WORK
Hierarchical clustering has been adopted for word selection in the context of text classification. Distributional clustering has been used to cluster words into groups based either on their participation in particular grammatical relations with other words, by Pereira et al., or on the distribution of class labels associated with each word, by Baker and McCallum. Because distributional clustering of words is agglomerative in nature and results in suboptimal word clusters at a high computational cost, Dhillon et al. proposed a new information-theoretic divisive algorithm for word clustering and applied it to text classification.
In distributed networks or open environments, nodes communicate with each other openly to transmit data, and there is rapid ongoing research on secure mining. Research on privacy-preserving techniques addresses the mining of data in classification, association rule mining, and clustering.
Randomization and perturbation approaches are available for the privacy-preserving process, and privacy can be maintained in two ways. The first is a cryptographic approach, in which real datasets are converted into unrealized datasets by encoding the real data. The second is imputation, in which fake values are interleaved with the real dataset and later removed during mining according to some rules [1], [2].
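As a rough illustration of the perturbation idea only (not the cited methods themselves), a data holder might add bounded random noise to each numeric value and interleave a few fake records before release, so that aggregates remain roughly estimable while individual values are hidden.

import random

def perturb(values, noise_range=5, fake_ratio=0.2, seed=1):
    """Add zero-centred random noise to each value and append some fake
    records drawn from the observed value range."""
    random.seed(seed)
    noisy = [v + random.uniform(-noise_range, noise_range) for v in values]
    fakes = [random.uniform(min(values), max(values))
             for _ in range(int(len(values) * fake_ratio))]
    return noisy + fakes

ages = [23, 31, 45, 52, 38]
print(perturb(ages))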
RELIEF is a feature selection algorithm used in binary classification (generalizable to polynomial classification by decomposition into a number of binary problems). Its strengths are that it does not depend on heuristics, requires only time linear in the number of features and training instances, is noise-tolerant and robust to feature interactions, and is applicable to binary or continuous data. However, it does not discriminate between redundant features, and low numbers of training instances can fool the algorithm.
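A compact sketch of the basic RELIEF weighting scheme for binary classes and numeric features scaled to [0, 1] is shown below; it follows the usual nearest-hit/nearest-miss update but iterates over instances deterministically instead of sampling them randomly, and it is not tied to any particular implementation.

def relief(X, y, n_iterations=None):
    """Basic RELIEF: for each instance, move feature weights towards features
    that differ on the nearest miss and away from features that differ on the
    nearest hit."""
    m, n = len(X), len(X[0])
    weights = [0.0] * n
    n_iterations = n_iterations or m

    def dist(a, b):
        return sum(abs(ai - bi) for ai, bi in zip(a, b))

    for idx in range(n_iterations):
        i = idx % m
        x, label = X[i], y[i]
        hits = [X[j] for j in range(m) if j != i and y[j] == label]
        misses = [X[j] for j in range(m) if y[j] != label]
        near_hit = min(hits, key=lambda h: dist(x, h))
        near_miss = min(misses, key=lambda s: dist(x, s))
        for f in range(n):
            weights[f] += abs(x[f] - near_miss[f]) - abs(x[f] - near_hit[f])
    return [w / n_iterations for w in weights]

X = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.2], [0.8, 0.1]]
y = [0, 0, 1, 1]
print(relief(X, y))  # both features receive positive relevance weights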
Relief-F is a feature selection strategy that chooses instances randomly and changes the feature relevance weights based on the nearest neighbors. On its merits, Relief-F is one of the most successful strategies in feature selection. However, the generality of the selected features is limited and the computational complexity is large; accuracy is not guaranteed, and it is ineffective at removing redundant features.
Irrelevant Feature Removal:
Real-world datasets are raw datasets; they contain irrelevant, inconsistent, or redundant features, which obviously affect the accuracy of the results obtained by mining. To improve accuracy and efficiency we need to remove the irrelevant features; to find them, we first prepare the relevant feature set, from which the irrelevant features can then be identified. The following gives a definition of relevant features.
Suppose F is the full set of features, Fi ∈ F is a feature, Si = F − {Fi}, and S'i ⊆ Si. Let s'i be a value assignment of all features in S'i, fi a value assignment of feature Fi, and c a value assignment of the target concept C. The definition can be formalized as follows.
Definition 1 (Relevant Feature).
Fi is relevant to the target concept C if and only if there exist some s'i, fi, and c such that, for p(S'i = s'i, Fi = fi) > 0,
p(C = c | S'i = s'i, Fi = fi) ≠ p(C = c | S'i = s'i).
Otherwise, feature Fi is an irrelevant feature.
Definition 1 indicates that there are two kinds of relevant features, depending on the choice of S'i: 1) when S'i = Si, the definition shows that Fi is directly relevant to the target concept; 2) when S'i ⊂ Si, the definition may yield p(C | S'i, Fi) = p(C | S'i), which seems to say that Fi is irrelevant to the target concept. However, the definition shows that feature Fi is relevant when S'i ∪ {Fi} is used to describe the target concept. The reason is that either Fi interacts with S'i or Fi is redundant with Si − S'i. In this case, we say Fi is indirectly relevant to the target concept. Most of the information contained in redundant features is already present in other features. As a result, redundant features do not contribute to a better interpretation of the target concept.
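Definition 1 can be checked directly on a small discrete table: a feature is relevant if conditioning on it changes the class distribution for some value assignment of the remaining features. The brute-force sketch below tests one feature against one particular choice of S'i (here, all of the remaining features); it only illustrates the definition and is not part of the FAST algorithm, and the record layout is a hypothetical example.

from collections import Counter

def is_relevant(rows, feature, target, others):
    """Test whether P(C | others, feature) ever differs from P(C | others)."""
    def dist(subset):
        counts = Counter(r[target] for r in subset)
        total = sum(counts.values())
        return {c: counts[c] / total for c in counts}

    contexts = {tuple(r[o] for o in others) for r in rows}
    for ctx in contexts:
        base = [r for r in rows if tuple(r[o] for o in others) == ctx]
        for value in {r[feature] for r in base}:
            cond = [r for r in base if r[feature] == value]
            if dist(cond) != dist(base):
                return True
    return False

data = [{"F1": 0, "F2": "a", "C": 0}, {"F1": 1, "F2": "a", "C": 1},
        {"F1": 0, "F2": "b", "C": 0}, {"F1": 1, "F2": "b", "C": 1}]
print(is_relevant(data, "F1", "C", others=["F2"]))  # True: F1 determines C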
III. PROPOSED SYSTEM
We propose an efficient and empirical feature set extraction model. In this model, irrelevant features are first removed based on the relevant features; a minimum spanning tree is then constructed based on the weights of the edges, the fast clustering algorithm is applied, and each relevant feature is assigned to its respective subset. The result is a forest, and each tree in the forest represents a cluster. We adopt minimum-spanning-tree-based clustering algorithms because they do not assume that data points are grouped around centers or separated by a regular geometric curve, and they have been widely used in practice. The proposed algorithm not only reduces the number of features but also improves performance and accuracy; it efficiently and effectively deals with both irrelevant and redundant features by partitioning the minimum spanning tree (MST) into a forest in which each tree represents a cluster.
Definition 2 (Markov Blanket).
Given a feature Fi ∈ F, let Mi ⊂ F (Fi ∉ Mi). Mi is said to be a Markov blanket for Fi if and only if
p(F − Mi − {Fi}, C | Fi, Mi) = p(F − Mi − {Fi}, C | Mi).
Definition 3 (Redundant Feature).
Let S be a set of features. A feature in S is redundant if and only if it has a Markov blanket within S.
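Checking the exact Markov blanket condition is expensive in practice, so correlation-based methods commonly use an approximation: a feature Fj is treated as covering Fi when Fj is at least as predictive of C as Fi and the two features are more correlated with each other than Fi is with C. The sketch below expresses that commonly used approximation in terms of symmetric uncertainty (SU); it is an assumption about how Definition 3 could be operationalized, not the paper's own procedure, and the SU values in the example are hypothetical.

def is_redundant(fi, candidates, su_with_class, su_between):
    """Approximate redundancy test: fi is redundant if some candidate fj
    'covers' it, i.e. SU(fj, C) >= SU(fi, C) and SU(fi, fj) >= SU(fi, C)."""
    for fj in candidates:
        if fj == fi:
            continue
        if (su_with_class[fj] >= su_with_class[fi]
                and su_between[(fi, fj)] >= su_with_class[fi]):
            return True
    return False

# Hypothetical SU values for three features F1, F2, F3
su_c = {"F1": 0.30, "F2": 0.55, "F3": 0.20}
su_ff = {("F1", "F2"): 0.40, ("F2", "F1"): 0.40, ("F1", "F3"): 0.10,
         ("F3", "F1"): 0.10, ("F2", "F3"): 0.25, ("F3", "F2"): 0.25}
print(is_redundant("F1", ["F2", "F3"], su_c, su_ff))  # True: F2 covers F1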
Minimum Spanning Tree: In a connected undirected graph, a spanning tree is a subgraph constructed from the graph's edges and vertices (nodes). Every edge is assigned a weight, and the weight of a spanning tree is the sum of the weights of its edges. The minimum spanning tree is the spanning tree with the minimum total weight or cost.
To ensure the efficiency of FAST, we adopt the efficient minimum spanning tree (MST) clustering method. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical study, and extensive experiments are carried out to compare FAST with several representative feature selection algorithms. We construct a weighted minimum spanning tree, i.e. an MST that connects all vertices such that the sum of the edge weights is minimal, using the well-known Prim's algorithm.
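A short sketch of Prim's algorithm over a weighted graph given as an adjacency map is shown below; it is a straightforward textbook version, not code from the paper, and the three-feature graph is an illustrative example.

import heapq

def prim(graph):
    """Prim's algorithm: grow a minimum spanning tree from an arbitrary start
    vertex, always adding the cheapest edge that reaches a new vertex."""
    start = next(iter(graph))
    visited = {start}
    # heap of (weight, u, v) edges leaving the visited set
    edges = [(w, start, v) for v, w in graph[start].items()]
    heapq.heapify(edges)
    mst = []
    while edges and len(visited) < len(graph):
        w, u, v = heapq.heappop(edges)
        if v in visited:
            continue
        visited.add(v)
        mst.append((u, v, w))
        for nxt, weight in graph[v].items():
            if nxt not in visited:
                heapq.heappush(edges, (weight, v, nxt))
    return mst

graph = {"F1": {"F2": 0.3, "F3": 0.6}, "F2": {"F1": 0.3, "F3": 0.4},
         "F3": {"F1": 0.6, "F2": 0.4}}
print(prim(graph))  # [('F1', 'F2', 0.3), ('F2', 'F3', 0.4)]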
T-Relevance and F-Correlation Calculation:
The T-Relevance between a feature and the target concept C, the F-Correlation between a pair of features, the feature redundancy F-Redundancy, and the representative feature R-Feature of a feature cluster can be defined. According to these definitions, feature subset selection is the process that identifies and retains the strongly T-Relevant features and selects the R-Features from the feature clusters. The heuristics behind this are that:
1. Irrelevant features have no or only weak correlation with the target concept.
2. Redundant features are assembled in a cluster, and a representative feature can be taken out of each cluster.
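Both T-Relevance and F-Correlation are expressed through SU in the algorithm below. Assuming SU denotes the usual symmetric uncertainty, SU(X, Y) = 2 * (H(X) − H(X|Y)) / (H(X) + H(Y)), the following sketch computes it for discrete variables; it presumes the features have already been discretized, and the example values are hypothetical.

import math
from collections import Counter

def entropy(values):
    """Shannon entropy of a discrete sequence."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * (H(X) - H(X|Y)) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    # conditional entropy H(X|Y) as a weighted sum over the values of Y
    hxy = 0.0
    for value in set(y):
        subset = [xi for xi, yi in zip(x, y) if yi == value]
        hxy += (len(subset) / len(x)) * entropy(subset)
    gain = hx - hxy
    return 2.0 * gain / (hx + hy) if hx + hy > 0 else 0.0

feature = [0, 0, 1, 1, 1, 0]
target  = [0, 0, 1, 1, 0, 0]
print(symmetric_uncertainty(feature, target))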
Algorithm 1. FAST
Inputs: D(F1, F2, ..., Fm, C) – the given data set; θ – the T-Relevance threshold.
Output: S – the selected feature subset.
//==== Part 1: Irrelevant Feature Removal ====
1. for i = 1 to m do
2.   T-Relevance = SU(Fi, C)
3.   if T-Relevance > θ then
4.     S = S ∪ {Fi};
//==== Part 2: Minimum Spanning Tree Construction ====
5. G = NULL; // G is a complete graph
6. for each pair of features {F'i, F'j} ⊂ S do
7.   F-Correlation = SU(F'i, F'j)
8.   add F'i and/or F'j to G with F-Correlation as the weight of the corresponding edge;
9. minSpanTree = Prim(G); // use Prim's algorithm to generate the minimum spanning tree
//==== Part 3: Tree Partition and Representative Feature Selection ====
10. Forest = minSpanTree
11. for each edge Eij ∈ Forest do
12.   if SU(F'i, F'j) < SU(F'i, C) ∧ SU(F'i, F'j) < SU(F'j, C) then
13.     Forest = Forest − Eij
14. S = ∅
15. for each tree Ti ∈ Forest do
16.   FjR = argmax F'k ∈ Ti SU(F'k, C)
17.   S = S ∪ {FjR};
18. return S
[Figure: an example graph over features F0–F6, with F-Correlation edge weights (values such as 0.3, 0.4, 0.6, 0.7, 0.8, 0.9) and the T-Relevance SU(Fi, C) of each feature (values such as 0.2, 0.5, 0.6, 0.7) annotated at the nodes.]
After tree partition, the unnecessary edges are removed; each deletion splits one tree into two disconnected trees (T1, T2). After all unnecessary edges have been removed, a forest is obtained, and each tree in it represents a cluster. Finally, the most relevant (representative) feature of each cluster is selected to form the final feature subset, as sketched below.
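The tree-partition step can be sketched as follows: starting from the MST edges, drop any edge whose F-Correlation is smaller than the T-Relevance of both of its endpoint features, find the connected components of what remains, and keep the feature with the largest SU(F, C) from each component. The helper below assumes the SU values are already available and uses hypothetical numbers; it is an illustration of the procedure, not the authors' code.

def partition_and_select(mst_edges, su_with_class):
    """Remove 'unnecessary' MST edges, then pick one representative feature
    (the highest SU with the class) from every resulting tree."""
    kept = [(u, v) for u, v, w in mst_edges
            if not (w < su_with_class[u] and w < su_with_class[v])]

    # find connected components (the trees of the forest) over all features
    features = {f for u, v, _ in mst_edges for f in (u, v)}
    parent = {f: f for f in features}
    def find(f):
        while parent[f] != f:
            f = parent[f]
        return f
    for u, v in kept:
        parent[find(u)] = find(v)

    trees = {}
    for f in features:
        trees.setdefault(find(f), []).append(f)
    # representative feature = the one most relevant to the class in each tree
    return [max(members, key=lambda f: su_with_class[f]) for members in trees.values()]

mst = [("F1", "F2", 0.40), ("F2", "F3", 0.15)]
su_c = {"F1": 0.30, "F2": 0.55, "F3": 0.20}
print(partition_and_select(mst, su_c))  # edge F2-F3 is removed; F2 and F3 are selected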
IV. CONCLUSION
We conclude our current research work with an efficient fast clustering-based feature selection algorithm that yields efficient and accurate results. Initially, the relevant features are extracted by applying a preprocessing technique; a minimum spanning tree is then constructed over the feature set to form clusters, and the features are assigned to the appropriate decision classes. These machine learning approaches map the attribute values to the final decision labels.
REFERENCES
[1] A. Agresti, An Introduction to Categorical Data Analysis, John Wiley, New York, 1997.
[2] J. Barthélemy, "Remarques sur les propriétés métriques des ensembles ordonnés," Math. Sci. Hum., 61:39–60, 1978.
[3] J. Barthélemy and B. Leclerc, "The median procedure for partitions," in Partitioning Data Sets, pages 3–34, Providence, 1995, American Mathematical Society.
[4] C. L. Blake and C. J. Merz, UCI Repository of Machine Learning Databases, University of California, Irvine, Dept. of Information and Computer Sciences, http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[5] A. Blum and P. Langley, "Selection of relevant features and examples in machine learning," Artificial Intelligence, pages 245–271, 1997.
[6] M. Brown, W. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. Furey, M. Ares, and D. Haussler, "Knowledge-based analysis of microarray gene expression data by using support vector machines," PNAS, 97:262–267, 2000.
[7] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. of Machine Learning Research, pages 1157–1182, 2003.
[8] M. A. Hall, Correlation-based Feature Selection for Machine Learning, PhD thesis, The University of Waikato, Hamilton, New Zealand, 1999.
[9] B. Hanczar, M. Courtine, A. Benis, C. Hannegar, K. Clement, and J. Zucker, "Improving classification of microarray data using prototype-based feature selection," SIGKDD Explorations, pages 23–28, 2003.
[10] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley Interscience, New York, 1990.
BIOGRAPHIES
Ponnana Janardhana Rao completed his B.Tech in Information Technology at Sivani Institute of Technology, Chilakapalem. He is pursuing his M.Tech in Information Technology and Engineering at AITAM, Tekkali, India.
Ch. Ramesh received his B.Tech degree from Nagarjuna University, Nagarjuna Nagar, India, an M.Tech degree in Remote Sensing from Andhra University, Visakhapatnam, and another M.Tech degree in Computer Science and Engineering from JNTUH, Hyderabad. He has submitted his Ph.D. thesis in Computer Science and Engineering at JNTUH, Hyderabad. He is a Professor in the Department of Computer Science and Engineering, AITAM, Tekkali, India. His research interests include Image Processing, Pattern Recognition, and Formal Languages.