International Journal of Engineering Trends and Technology (IJETT) – Volume 17 Number 1 – Nov 2014

An Empirical Model of a Clustering Algorithm for High-Dimensional Data

Ponnana Janardhana Rao¹, Ch. Ramesh²
¹Final M.Tech Student, ²Professor
¹,²Department of Computer Science and Engineering, AITAM, Tekkali, India

Abstract: Extracting a useful feature set from a raw dataset is an important and long-standing research issue: the goal is to select useful features from the full feature set of the dataset. In this paper we propose an efficient clustering-based feature selection algorithm, FAST (fast clustering-based feature selection). The FAST algorithm works in two steps. In the first step, features are divided into clusters using graph-theoretic clustering methods. In the second step, the most representative feature, the one most strongly related to the target classes, is selected from each cluster to form the feature subset. A classification technique is then applied over the clustered data, which is obtained by constructing a minimum spanning tree.

I. INTRODUCTION

In a social network, nodes can be represented as vertices V = (v1, v2, ..., vn) connected through a set of edges E in an undirected graph G(V, E); a non-identifying attribute that describes a node is known as a quasi-identifier. Clustering can be performed on quasi-identifiers such as age and gender, and distributed clustering groups similar objects based on the minimum distance between nodes. In distributed networks, data can be numerical or categorical: numerical data are compared with respect to the difference between quasi-identifier values, while categorical data are compared using a similarity measure between data objects.

Text clustering methods are divided into three types: partitioning clustering, hierarchical clustering, and fuzzy clustering. A partitioning algorithm randomly selects k objects and defines them as k clusters, then computes the cluster centroids and reassigns objects to clusters according to those centroids. It calculates the similarity between each text and the centroids and repeats this process until a user-specified stopping criterion is met.

In the architecture we propose, every data holder (player) clusters its documents itself after preprocessing and computes the local and global frequencies of the documents in order to calculate the file (document) weights. The distributed k-means algorithm is one of the efficient distributed clustering algorithms, and our work moves toward optimized clustering in distributed networks by improving this traditional clustering algorithm. In our approach, when a new dataset is placed at a data holder, it requests the other data holders to forward only the relevant features, instead of the entire datasets, in order to cluster the documents. In distributed clustering, nodes can be clustered based on the common edges connecting vertices with similar edge weights; however, weight alone does not lead to an optimal solution, because it is not an optimal measure on its own.

Every individual data holder (player) maintains its own transactions or patterns. In horizontal partitioning, every data holder encrypts its local patterns and forwards them to a centralized server. At the centralized server, the received patterns are decrypted and passed to a Boolean matrix in order to extract frequent patterns from the received patterns.
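The Boolean-matrix step described above can be made concrete with a small sketch. This is only an illustration of the idea, assuming the patterns received at the server are simple item sets: the function name and the single-item notion of support below are my own simplification, not a procedure specified in this paper.

import numpy as np

def frequent_items(holder_transactions, items, min_support):
    # holder_transactions: one list of transactions per data holder,
    # each transaction being a set of items (already decrypted at the server).
    all_tx = [tx for holder in holder_transactions for tx in holder]
    # Boolean matrix: rows = transactions, columns = items.
    matrix = np.array([[item in tx for item in items] for tx in all_tx], dtype=bool)
    support = matrix.sum(axis=0) / len(all_tx)  # fraction of transactions containing each item
    return [item for item, s in zip(items, support) if s >= min_support]

# Example: two data holders, three items, minimum support 0.5.
holders = [[{"a", "b"}, {"a"}], [{"b", "c"}, {"a", "b"}]]
print(frequent_items(holders, items=["a", "b", "c"], min_support=0.5))  # -> ['a', 'b']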
For experimental purposes we establish connections between the nodes and a central location (the key generation center) through network (socket) programming. Keys are generated using an improved Lagrange polynomial equation and distributed to the users [8]. Even though various horizontal and vertical partitioning mechanisms are available, privacy remains the major concern when data are transmitted securely from the data holders. The centralized server then performs the required data mining operations over the received data.

II. RELATED WORK

Hierarchical clustering has been adopted for word selection in the context of text classification. Distributional clustering has been used to cluster words into groups based either on their participation in particular grammatical relations with other words (Pereira et al.) or on the distribution of class labels associated with each word (Baker and McCallum). Because distributional clustering of words is agglomerative in nature, resulting in suboptimal word clusters and high computational cost, Dhillon et al. proposed a new information-theoretic divisive algorithm for word clustering and applied it to text classification.

In distributed networks and other open environments, nodes communicate with each other openly to transmit data, so there is active research on secure mining. Work on privacy-preserving techniques covers mining tasks such as classification, association rule mining, and clustering. Randomization and perturbation approaches are available for privacy preservation, and privacy can be maintained in two ways: a cryptographic approach, in which real datasets are converted into unrealized datasets by encoding them, and imputation methods, in which fake values are imputed into the real dataset and later removed during mining according to predefined rules [1], [2].

RELIEF is a feature selection algorithm used in binary classification (generalizable to polynomial classification by decomposition into a number of binary problems). Its strengths are that it does not depend on heuristics, requires only time linear in the number of features and training instances, and is noise-tolerant and robust to feature interactions, although it is applicable only to binary or continuous data. However, it does not discriminate between redundant features, and small numbers of training instances can fool the algorithm. Relief-F is a feature selection strategy that chooses instances randomly and updates the feature relevance weights based on the nearest neighbors. On its merits, Relief-F is one of the more successful feature selection strategies, but the generality of the selected features is limited, its computational complexity is large, its accuracy is not guaranteed, and it is ineffective at removing redundant features.
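The basic Relief weighting rule outlined above can be sketched as follows. This is a minimal illustration assuming a binary-class data set with numeric features scaled to [0, 1]; the function name and sampling details are my own, not taken from the cited works.

import numpy as np

def relief_weights(X, y, n_samples=100, seed=0):
    # Returns one relevance weight per feature; larger means more relevant.
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n, m = X.shape
    w = np.zeros(m)
    for _ in range(n_samples):
        i = rng.integers(n)                      # pick a random instance
        dists = np.abs(X - X[i]).sum(axis=1)     # distance to every other instance
        dists[i] = np.inf                        # exclude the instance itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dists, np.inf))    # nearest same-class neighbour
        miss = np.argmin(np.where(~same, dists, np.inf))  # nearest other-class neighbour
        # Relevant features differ little from the hit and a lot from the miss.
        w -= np.abs(X[i] - X[hit]) / n_samples
        w += np.abs(X[i] - X[miss]) / n_samples
    return w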
Irrelevant Feature Removal: Real-world datasets are raw datasets; they contain irrelevant, inconsistent, or redundant features, and these obviously affect the accuracy of the mining results. To improve accuracy and efficiency we need to remove the irrelevant features; to find them, we first characterize the relevant features, after which the irrelevant ones can be identified. The following definitions formalize these notions.

Suppose F is the full set of features, Fi ∈ F is a feature, Si = F − {Fi}, and S'i ⊆ Si. Let s'i be a value assignment of all features in S'i, fi a value assignment of feature Fi, and c a value assignment of the target concept C. Relevance can then be formalized as follows.

Definition 1 (Relevant Feature). Fi is relevant to the target concept C if and only if there exist some s'i, fi, and c such that, for p(S'i = s'i, Fi = fi) > 0,
p(C = c | S'i = s'i, Fi = fi) ≠ p(C = c | S'i = s'i).
Otherwise, feature Fi is an irrelevant feature.

Definition 1 indicates that there are two kinds of relevant features, depending on S'i: 1) when S'i = Si, the definition shows that Fi is directly relevant to the target concept; 2) when S'i ⊊ Si, it may happen that p(C | S'i, Fi) = p(C | S'i), so Fi appears irrelevant to the target concept, yet the definition shows that Fi is relevant when S'i ∪ {Fi} is used to describe the target concept. The reason is that either Fi is interactive with S'i or Fi is redundant with Si − S'i. In this case we say Fi is indirectly relevant to the target concept. Most of the information contained in redundant features is already present in other features; as a result, redundant features do not contribute to a better interpretation of the target concept.

III. PROPOSED SYSTEM

We propose an efficient and empirical feature subset selection model. In this model, irrelevant features are first removed based on the definition of relevant features; a minimum spanning tree is then constructed based on the weights of the edges, and the fast clustering algorithm partitions it, assigning each relevant feature to its respective subset. The result is a forest, and each tree in the forest represents a cluster. We adopt minimum spanning tree based clustering algorithms because they do not assume that data points are grouped around centers or separated by a regular geometric curve, and they have been widely used in practice. The proposed algorithm not only reduces the number of features but also improves performance and accuracy, and it deals efficiently and effectively with both irrelevant and redundant features. Its core step is the partitioning of the minimum spanning tree (MST) into a forest, with each tree representing a cluster.

Definition 2 (Markov Blanket). Given a feature Fi ∈ F, let Mi ⊂ F (Fi ∉ Mi). Mi is said to be a Markov blanket for Fi if and only if
p(F − Mi − {Fi}, C | Fi, Mi) = p(F − Mi − {Fi}, C | Mi).

Definition 3 (Redundant Feature). Let S be a set of features. A feature in S is redundant if and only if it has a Markov blanket within S.

Minimum Spanning Tree: In a connected undirected graph, a spanning tree is a subgraph that connects all vertices (nodes) using a subset of the edges. Every edge is assigned a weight, and the weight of a spanning tree is the sum of the weights of its edges. The minimum spanning tree is the spanning tree whose total weight (cost) is minimum.
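Any standard MST algorithm can be used for this construction. The following is a compact sketch of Prim's algorithm over a dense, symmetric weight matrix; the matrix representation, variable names, and return format are illustrative choices, not code from the FAST paper.

import numpy as np

def prim_mst(weights):
    # weights[i][j] = weight of edge (i, j); returns the MST as (i, j, weight) edges.
    W = np.asarray(weights, dtype=float)
    n = len(W)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True                    # grow the tree from vertex 0
    best_cost = W[0].copy()              # cheapest known edge from the tree to each vertex
    best_from = np.zeros(n, dtype=int)   # tree vertex that cheapest edge comes from
    edges = []
    for _ in range(n - 1):
        # Pick the cheapest edge leaving the current tree.
        v = int(np.argmin(np.where(in_tree, np.inf, best_cost)))
        edges.append((int(best_from[v]), v, float(best_cost[v])))
        in_tree[v] = True
        # Update the cheapest connection for the vertices still outside the tree.
        better = ~in_tree & (W[v] < best_cost)
        best_cost[better] = W[v][better]
        best_from[better] = v
    return edges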
To ensure the efficiency of FAST, we adopt an efficient minimum spanning tree (MST) based clustering method. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical study, and extensive experiments are carried out to compare FAST with several representative feature selection algorithms. We construct a weighted MST, which connects all vertices such that the sum of the edge weights is minimum, using the well-known Prim algorithm.

T-Relevance and F-Correlation calculation: The T-Relevance between a feature and the target concept C, the F-Correlation between a pair of features, the feature redundancy (F-Redundancy), and the representative feature (R-Feature) of a feature cluster can all be defined in terms of symmetric uncertainty (SU). According to these definitions, feature subset selection is the process that identifies and retains the strong T-Relevance features and selects the R-Features from the feature clusters. The underlying heuristics are:
1. Irrelevant features have no or only weak correlation with the target concept.
2. Redundant features are assembled in a cluster, and a single representative feature can be taken out of each cluster.

Algorithm 1. FAST
Inputs: D(F1, F2, ..., Fm, C) – the given data set; θ – the T-Relevance threshold.
Output: S – the selected feature subset.
// = = = = Part 1: Irrelevant Feature Removal = = = =
1. for i = 1 to m do
2.   T-Relevance = SU(Fi, C)
3.   if T-Relevance > θ then
4.     S = S ∪ {Fi}
// = = = = Part 2: Minimum Spanning Tree Construction = = = =
5. G = NULL  // G is a complete graph
6. for each pair of features {F'i, F'j} ⊂ S do
7.   F-Correlation = SU(F'i, F'j)
8.   add F'i and/or F'j to G with F-Correlation as the weight of the corresponding edge
9. minSpanTree = Prim(G)  // use Prim's algorithm to generate the minimum spanning tree
// = = = = Part 3: Tree Partition and Representative Feature Selection = = = =
10. Forest = minSpanTree
11. for each edge Eij ∈ Forest do
12.   if SU(F'i, F'j) < SU(F'i, C) ∧ SU(F'i, F'j) < SU(F'j, C) then
13.     Forest = Forest − {Eij}
14. S = ∅
15. for each tree Ti ∈ Forest do
16.   FRj = argmax F'k ∈ Ti SU(F'k, C)
17.   S = S ∪ {FRj}
18. return S

[Figure: example graph over features F0–F6, with F-Correlation values (0.3, 0.4, 0.6, 0.7, 0.8, 0.9) as edge weights and the T-Relevance SU(Fi, C) shown beside each feature node.]

After tree partition, the unnecessary edges are removed; each deletion splits a tree into two disconnected trees (T1, T2). After all unnecessary edges are removed, a forest is obtained, and each tree in the forest represents a cluster. Finally, the representative (most relevant) feature of each cluster is selected to compose the final feature subset. An illustrative end-to-end sketch of these three parts is given below, after the conclusion.

IV. CONCLUSION

We conclude our current research with an efficient fast clustering-based feature selection algorithm that yields efficient and accurate results. Relevant features are first extracted by applying the preprocessing technique; a minimum spanning tree is then constructed to cluster the feature set, and the features are assigned to the appropriate decision classes. These machine learning steps map the attribute values to the final decision labels.
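As referenced above, the three parts of Algorithm 1 can be put together in a short end-to-end sketch. This is an illustration only: it assumes discrete (or pre-discretized) feature columns, uses a standard entropy-based estimator for symmetric uncertainty, and relies on SciPy for the MST and connected components; the function names below are my own choice, not an implementation published with this paper.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def entropy(values):
    # Shannon entropy of a sample of (possibly joint) discrete values.
    _, counts = np.unique(np.asarray(values), return_counts=True, axis=0)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def symmetric_uncertainty(x, y):
    # SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), lies in [0, 1].
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    hxy = entropy(np.column_stack([x, y]))   # joint entropy H(X, Y)
    return 2.0 * (hx + hy - hxy) / (hx + hy)

def fast_select(X, c, threshold):
    m = X.shape[1]
    # Part 1: keep features whose T-Relevance SU(Fi, C) exceeds the threshold.
    rel = np.array([symmetric_uncertainty(X[:, i], c) for i in range(m)])
    kept = np.where(rel > threshold)[0]
    if len(kept) <= 1:
        return list(kept)
    # Part 2: complete graph weighted by F-Correlation, then a minimum spanning tree.
    k = len(kept)
    corr = np.zeros((k, k))
    for a in range(k):
        for b in range(a + 1, k):
            corr[a, b] = corr[b, a] = symmetric_uncertainty(X[:, kept[a]], X[:, kept[b]])
    weights = 2.0 - corr                 # constant shift keeps weights positive; the MST is unchanged
    np.fill_diagonal(weights, 0.0)       # no self-loops
    mst = minimum_spanning_tree(weights).toarray()
    # Part 3: drop MST edges whose F-Correlation is smaller than both T-Relevances.
    adj = np.zeros((k, k), dtype=bool)
    for a, b in zip(*np.nonzero(mst)):
        if not (corr[a, b] < rel[kept[a]] and corr[a, b] < rel[kept[b]]):
            adj[a, b] = adj[b, a] = True
    # Each remaining tree is a cluster; keep its most relevant feature.
    n_comp, labels = connected_components(adj, directed=False)
    selected = []
    for comp in range(n_comp):
        members = kept[labels == comp]
        selected.append(int(members[np.argmax(rel[members])]))
    return selected

On a small discrete data set, fast_select(X, c, 0.1) would return the column indices of the representative features, one per cluster of the partitioned tree.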
REFERENCES
[1] A. Agresti, An Introduction to Categorical Data Analysis, John Wiley, New York, 1997.
[2] J. Barthélemy, "Remarques sur les propriétés métriques des ensembles ordonnés," Math. Sci. Hum., 61:39–60, 1978.
[3] J. Barthélemy and B. Leclerc, "The median procedure for partitions," in Partitioning Data Sets, pages 3–34, American Mathematical Society, Providence, 1995.
[4] C. L. Blake and C. J. Merz, UCI Repository of Machine Learning Databases, University of California, Irvine, Dept. of Information and Computer Sciences, http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[5] A. Blum and P. Langley, "Selection of relevant features and examples in machine learning," Artificial Intelligence, pages 245–271, 1997.
[6] M. Brown, W. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. Furey, M. Ares, and D. Haussler, "Knowledge-based analysis of microarray gene expression data by using support vector machines," PNAS, 97:262–267, 2000.
[7] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, pages 1157–1182, 2003.
[8] M. A. Hall, Correlation-Based Feature Selection for Machine Learning, PhD thesis, The University of Waikato, Hamilton, New Zealand, 1999.
[9] B. Hanczar, M. Courtine, A. Benis, C. Hannegar, K. Clement, and J. Zucker, "Improving classification of microarray data using prototype-based feature selection," SIGKDD Explorations, pages 23–28, 2003.
[10] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley Interscience, New York, 1990.

BIOGRAPHIES
Ponnana Janardhana Rao completed his B.Tech in Information Technology at Sivani Institute of Technology, Chilakapalem. He is pursuing his M.Tech in Information Technology and Engineering at AITAM, Tekkali, India.

Ch. Ramesh received the B.Tech degree from Nagarjuna University, Nagarjuna Nagar, India, the M.Tech degree in Remote Sensing from Andhra University, Visakhapatnam, and another M.Tech degree in Computer Science and Engineering from JNTUH, Hyderabad. He has submitted his Ph.D. thesis in Computer Science and Engineering at JNTUH, Hyderabad. He is a Professor in the Department of Computer Science and Engineering, AITAM, Tekkali, India. His research interests include Image Processing, Pattern Recognition, and Formal Languages.