Auto-Constructing Feature Clustering Algorithm for Text Classification

Pallavi M. Deshmane, M.E. Computer Science, D.Y.P.I.E.T., Pimpri, Pune, India, Pallavideshmane26@gmail.com
Prof. S. V. Chobe, HOD of IT Dept., D.Y.P.I.E.T., Pimpri, Pune, India, sanchobe@yahoo.com

Abstract
Feature clustering is a powerful technique for reducing the size of the feature vector in text classification. In this paper we propose text classification using a self-constructing feature clustering algorithm. The words in the feature vector are grouped into clusters automatically, and we then have one extracted feature for each cluster. Words that are similar to one another are grouped into the same cluster; if a word is not similar to any existing cluster, a new cluster is created for it automatically. Each cluster is characterized by a membership function with a statistical mean and deviation. Once all words have been processed, the clusters have been constructed automatically.

Index Terms— Feature clustering, feature selection, feature reduction, text classification.

I. INTRODUCTION

The aim of text classification is to automatically assign a new document to one or more predefined classes based on its contents. Text classification is also called text categorization, document categorization, or document classification [6]. Two methods are mainly used: manual classification and automatic classification. Applications of text classification include:
• Email spam filtering: distinguishing spam email from legitimate mail.
• Helping news writers select important topics.
• Categorizing newspaper articles into topics.
• Sorting journals and abstracts into subject categories.

Clustering is one of the most powerful methods for feature extraction. Word clustering groups words with a high degree of pairwise semantic relatedness into clusters, and the grouped features in each word cluster are treated as a single feature. In this way the dimensionality of the feature space can be drastically reduced. The main purposes of feature reduction are to reduce the classifier's computational load and to increase data consistency. There are two main techniques for feature reduction: feature selection and feature extraction [2]. Feature selection methods take a subset of the features, and the classifier uses only that subset, instead of all the original features, to perform the text classification task. Feature extraction methods convert the representation of the original documents into a new representation based on a smaller set of synthesized features. A well-known feature selection approach is based on the information gain measure, defined as the amount by which uncertainty decreases given a piece of information. However, feature selection based methods have a drawback: only a subset of the words is used for classifying the text, so useful information carried by the discarded words may be lost.
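To make the information gain criterion and its drawback concrete, the following is a minimal Python sketch, assuming a document-by-term count matrix X and integer class labels y; the function names are illustrative and not taken from any cited work.

```python
import numpy as np

def information_gain(X, y):
    """Information gain of each term with respect to the class labels:
    IG(t) = H(C) - H(C | t), using binary term occurrence."""
    X = (X > 0)                     # binary occurrence indicators
    y = np.asarray(y)
    classes = np.unique(y)

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    # prior class entropy H(C)
    h_c = entropy(np.array([np.mean(y == c) for c in classes]))

    ig = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        present = X[:, j]
        h_cond = 0.0
        # expected class entropy given the term's presence/absence
        for mask in (present, ~present):
            p = np.mean(mask)
            if p == 0:
                continue
            cond = np.array([np.mean(y[mask] == c) for c in classes])
            h_cond += p * entropy(cond)
        ig[j] = h_c - h_cond
    return ig

def select_top_k(X, y, k):
    """Keep only the k terms with the highest information gain."""
    idx = np.argsort(information_gain(X, y))[::-1][:k]
    return X[:, idx], idx
```

Only the k top-ranked words survive the selection; whatever class information the discarded words carried is simply lost, which is the drawback that motivates feature clustering.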
II. LITERATURE SURVEY

Support vector machines (SVMs) are known as one of the most successful classification methods for many applications, including text classification. Even though the learning ability and computational complexity of training SVMs may be independent of the dimensionality of the feature space, reducing that dimensionality is essential for efficiently handling the large number of terms that arise in practical text classification applications; novel dimension reduction methods have therefore been adopted to reduce the dimension of the document vectors dramatically [1]. Decision functions exist for both the centroid-based classification algorithm and support vector classifiers. The information bottleneck approach to word clustering has also been explored [3], [4]. Divisive information-theoretic feature clustering [5] is an information-theoretic feature clustering approach that is more effective than other feature clustering methods. In these word clustering methods, each new feature is generated by combining a subset of the original words; however, difficulties are associated with these methods.

Disadvantages:
1. Each new feature is generated by combining a subset of the original words, and the number of new features must be specified in advance by the user.
2. A word is assigned to exactly one subset, i.e., hard clustering, based on the similarity magnitudes between the word and the existing subsets, even if the differences among these magnitudes are small.

Proposed system:
1. We propose text classification using a self-constructing feature clustering algorithm, an incremental clustering approach that reduces the number of features used for the text classification task.
2. The words in the feature vector of a document set are represented as distributions and processed one after another.
3. Words that are similar to one another are grouped into the same cluster. Each cluster is characterized by a statistical mean and deviation.
4. If a word is not similar to any existing cluster, a new cluster is created for that word.

Advantages:
1. The self-constructing clustering algorithm is an incremental clustering approach that reduces the dimensionality of the word features in text classification.
2. The number of extracted features is determined automatically.
3. It runs faster than other methods.
4. It produces better extracted features than other methods.

III. IMPLEMENTATION DETAILS

A. Design

Fig 1: Use case diagram
Fig 2: Activity diagram

B. Modules

We divide the system into four modules.

1) Pre-processing: In this module we construct the word weightage patterns of the given document set. We read the document set, remove the stop words, and obtain the feature vector from the given documents. Suppose we are given a document set D of n documents d1, d2, ..., dn, together with the feature vector W of m words w1, w2, ..., wm and p classes c1, c2, ..., cp; we then construct one word weightage pattern for each word in W (Fig 3: Pre-processing flow).

2) Automatic clustering: In this module we group the words using the automatic clustering algorithm. For each word weightage pattern, the similarity of the pattern to each existing cluster is calculated to decide whether the pattern is combined into an existing cluster or a new cluster is created. When a new cluster is created, its membership function is initialized; conversely, when a word weightage pattern is combined into an existing cluster, the membership function of that cluster is updated accordingly (Fig 4: Automatic clustering). A sketch of this incremental step appears below.
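The sketch below is a minimal Python rendering of the incremental step, assuming p-dimensional word weightage patterns, a Gaussian-style membership function built from each cluster's statistical mean and deviation, a user-chosen similarity threshold rho, and an initial deviation sigma0 for newly created clusters; the class name, the threshold test, and the running-moment update formulas are illustrative choices, not formulas taken verbatim from this paper.

```python
import numpy as np

class SelfConstructingClusterer:
    """Incremental clustering of word weightage patterns: a pattern
    joins the most similar existing cluster if its membership exceeds
    the threshold rho; otherwise it seeds a new cluster of its own."""

    def __init__(self, rho=0.5, sigma0=0.1):
        self.rho = rho          # similarity threshold (assumed user-set)
        self.sigma0 = sigma0    # initial deviation of a new cluster
        self.means, self.vars, self.sizes = [], [], []

    def _membership(self, x, j):
        # Gaussian-style membership of pattern x in cluster j
        dev = np.sqrt(self.vars[j]) + self.sigma0
        z = (x - self.means[j]) / dev
        return float(np.exp(-np.sum(z ** 2)))

    def add(self, x):
        x = np.asarray(x, dtype=float)
        if self.means:
            sims = [self._membership(x, j) for j in range(len(self.means))]
            j = int(np.argmax(sims))
            if sims[j] >= self.rho:
                # combine into cluster j: update its size, mean, and
                # variance incrementally with running-moment formulas
                s, m, v = self.sizes[j], self.means[j], self.vars[j]
                new_m = (s * m + x) / (s + 1)
                new_v = (s * (v + m ** 2) + x ** 2) / (s + 1) - new_m ** 2
                self.means[j] = new_m
                self.vars[j] = np.maximum(new_v, 0.0)
                self.sizes[j] = s + 1
                return j
        # no cluster is similar enough: create a new one for this word
        self.means.append(x)
        self.vars.append(np.zeros_like(x))
        self.sizes.append(1)
        return len(self.means) - 1

# usage: one pass over the word patterns assigns every word a cluster
# clusterer = SelfConstructingClusterer(rho=0.5)
# labels = [clusterer.add(x) for x in word_patterns]
# k = len(clusterer.means)   # number of clusters, found automatically
```

Because the threshold test creates clusters on demand, the number of clusters k is an output of the pass rather than an input, which is exactly the property claimed in the advantages above.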
3) Word extraction: Once the word weightage patterns have been grouped into clusters, the words in the feature vector W are clustered as well. For each cluster we obtain one extracted feature, so with k clusters we obtain k extracted features. The elements of the transformation matrix T are derived from the obtained clusters, and feature extraction is then carried out (Fig 5: Word extraction). We propose three weighting methods: best, better, and worst. In the worst weighting approach, each word is allowed to belong to only one cluster, and so it contributes to only one extracted feature.

4) Classification of text: Given the set D of training documents, text classification proceeds on the reduced representation: each training document is transformed to its extracted feature vector, a classifier (e.g., an SVM) is trained on these vectors, and new documents are transformed and classified in the same way (Fig 6: Classification of text).

IV. OUR METHOD

Our proposed method is an incremental, self-constructing clustering approach. The words in the feature vector of the document set are represented as distributions and processed one after another; each word pattern either joins the most similar existing cluster or creates a new one. Suppose we have a document set D of n documents {d1, d2, ..., dn}, together with the feature vector W of m words {w1, w2, ..., wm} and p classes {c1, c2, ..., cp}. For each word wi a word pattern xi = <xi1, xi2, ..., xip> is constructed, where xiq estimates P(cq | wi), the probability of class cq given word wi; these word patterns are what our proposed algorithm works on. Principal component analysis is used to reduce each word pattern from p dimensions to two dimensions. Because ours is a fuzzy-based approach, all cluster-center coordinates should be positive and lie within the range 0 to 1; a transformation algorithm is therefore applied for this purpose, and we finally obtain the word patterns {x1, x2, ..., xm} (Fig 7: Transformation algorithm). Once the word patterns are constructed, the clustering algorithm described in Section III groups the words into clusters, taking the word patterns as input. A sketch of the extraction step that follows clustering is given below.
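To make the extraction step concrete, here is a minimal Python sketch of the hard ("worst") weighting described above, with a soft variant shown for contrast; labels is assumed to be the cluster assignment produced by the clustering pass, and memberships an m x k matrix of cluster memberships, both names being illustrative.

```python
import numpy as np

def extract_hard(doc_vectors, labels, k):
    """Hard ('worst') weighting: each original word belongs to exactly
    one cluster, so it contributes to exactly one extracted feature."""
    n, m = doc_vectors.shape
    T = np.zeros((m, k))            # m x k transformation matrix
    T[np.arange(m), labels] = 1.0   # T[i, j] = 1 iff word i is in cluster j
    return doc_vectors @ T          # n x k reduced document matrix

def extract_soft(doc_vectors, memberships):
    """Soft weighting: every word contributes to every extracted
    feature, weighted by its membership in each cluster."""
    return doc_vectors @ memberships   # (n x m) @ (m x k) -> n x k
```

The reduced n x k document matrix then replaces the original n x m document-by-word matrix as the classifier's input, which is where the execution-time savings reported in Section V come from.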
V. RESULTS

We present experimental results to show the effectiveness of our self-constructing clustering algorithm, using three well-known data sets.

A. Reuters Corpus Volume 1: The RCV1 data set consists of 808,120 stories produced by Reuters. [Figure: class distribution of the RCV1 data set; x axis: class number, y axis: number of stories in each class.] Fig 8 shows the execution times of the feature reduction methods on the RCV1 data (Fig 8: Execution time of all methods on RCV1 data).

B. Newsgroup data set: The newsgroup data set contains more than 20,000 articles, evenly distributed over 20 classes; each class contains about 1,000 articles (Fig 9: Class distribution of the newsgroup data set). Fig 10 shows the execution times of the methods on the newsgroup data (Fig 10: Execution time of all methods on newsgroup data).

C. Cade 12: Cade 12 contains a set of web pages extracted from a web directory, classified into 12 classes (Fig 11: Class distribution of the Cade 12 data set). Fig 12 shows the execution times of the methods on the Cade 12 data (Fig 12: Execution time of all methods on Cade 12 data).

VI. CONCLUSION

The proposed method is new in the text classification field. It uses the good optimization performance of support vector machines to improve classification performance. Automatic clustering is one of the methods developed in machine learning research, and in this paper we apply this clustering technique to text categorization problems. We found that when a document set is transformed into a collection of word patterns, the relevance among word patterns can be measured, and the word patterns can be grouped by applying a similarity-based clustering algorithm. The method is well suited to text categorization problems because of the suitability of the distributional word clustering concept. Many feature clustering methods for text classification have been presented by various researchers, but those methods suffer from unsolved limitations: each new feature is generated by combining a subset of the original words; the mean and variance of the clusters are not considered when similarity with respect to a cluster is computed; and the number of features must be specified in advance by the user. Our self-constructing approach avoids these limitations.

Future scope

This clustering method has been applied here to solve text classification problems. The same technique can also be applied to other problems such as web mining, image segmentation, data sampling, and fuzzy modeling.

ACKNOWLEDGMENT

I have put considerable effort into this project; however, it would not have been possible without the kind support and help of many individuals and organizations, and I would like to extend my sincere thanks to all of them. I am highly indebted to Prof. S. V. Chobe for his guidance and constant supervision, for providing the necessary information regarding the project, and for his support throughout. I would also like to express my gratitude towards my parents and the members of the D.Y.P.I.E.T. Computer Department for their kind cooperation and encouragement.

REFERENCES

[1] H. Kim, P. Howland, and H. Park, "Dimension Reduction in Text Classification with Support Vector Machines," J. Machine Learning Research, vol. 6, pp. 37-53, 2005.
[2] D.D. Lewis, "Feature Selection and Feature Extraction for Text Categorization," Proc. Workshop Speech and Natural Language, pp. 212-217, 1992.
[3] F. Pereira, N. Tishby, and L. Lee, "Distributional Clustering of English Words," Proc. 31st Ann. Meeting of the ACL, pp. 183-190, 1993.
[4] L.D. Baker and A. McCallum, "Distributional Clustering of Words for Text Classification," Proc. ACM SIGIR, pp. 96-103, 1998.
[5] I.S. Dhillon, S. Mallela, and R. Kumar, "A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification," J. Machine Learning Research, vol. 3, pp. 1265-1287, 2003.
[6] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.