Auto-Constructing Feature Clustering Algorithm for Text Classification
Pallavi M. Deshmane
M.E. Comp. Sci., D.Y.P.I.E.T. Pimpri
Pune, India
Pallavideshmane26@gmail.com

Prof. S. V. Chobe
HOD, IT Dept., D.Y.P.I.E.T. Pimpri
Pune, India
sanchobe@yahoo.com
Abstract— Feature clustering is a powerful technique for reducing the size of the feature vector in text classification. In this paper we propose text classification using a self-constructing feature clustering algorithm. The words in the feature vector are grouped into clusters automatically and incrementally, and we then have one feature for each cluster. Words that are similar to each other are grouped into the same cluster; if a word is not similar to any existing cluster, a new cluster is created for it automatically. Each cluster is characterized by a membership function with a statistical mean and deviation. When all the words have been given as input, the clusters have been created automatically.
Index Terms— Feature clustering, feature selection, feature reduction, text classification.
I. INTRODUCTION
The aim of text classification is to automatically assign a new document to one or more predefined classes based on its content. Text classification is also called text categorization, document categorization, and document classification. Two methods are mainly used for text classification: manual classification and automatic classification. Applications of text classification include:
• Email spam filtering: a process that tries to discern spam email from legitimate mail.
• Helping news writers select important topics.
• Categorizing newspaper articles into topics.
• Sorting journals and abstracts into subject categories.
Clustering is one of the most powerful methods for feature extraction. Word clustering is the grouping of words with a high degree of pairwise semantic relatedness into clusters, where each word cluster of grouped features is treated as a single feature. In this way the dimensionality of the features can be drastically reduced.
The main purpose of feature reduction is to reduce the classifier's computational load and to increase data consistency. There are two main techniques for feature reduction: feature selection and feature extraction. Feature selection methods use techniques such as sampling to take a subset of the features, and the classifier then uses only that subset, instead of all the original features, to perform the text classification task. Feature extraction methods convert the representation of the original documents into a new representation based on a smaller set of synthesized features. A well-known feature selection approach is based on the information gain measure, defined as the amount of decreased uncertainty given a piece of information. However, there are problems associated with feature selection based methods: because only a subset of the words is used for classifying the text data, useful information may be ignored.
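To make this concrete, the following is a minimal sketch of information gain based feature selection (our illustration, not code from the paper; scikit-learn's mutual information estimator stands in for the information gain measure, and the data set, the vocabulary size, and k = 500 are arbitrary choices):

```python
# A minimal sketch of feature selection by information gain.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Build a bag-of-words representation of the documents.
docs = fetch_20newsgroups(subset="train")
X = CountVectorizer(stop_words="english", max_features=5000).fit_transform(docs.data)

# Keep only the 500 words sharing the most information with the class
# label; every other word is discarded, which is exactly how useful
# information may be ignored.
selector = SelectKBest(score_func=mutual_info_classif, k=500)
X_reduced = selector.fit_transform(X, docs.target)
print(X_reduced.shape)  # (number of documents, 500)
```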
II. LITERATURE SURVEY
Support vector machines (SVMs) are known as one of the most successful classification methods for many applications, including text classification. Even though the learning ability and computational complexity of training support vector machines may be independent of the dimension of the feature space, reducing computational complexity is an essential issue for efficiently handling a large number of terms in practical applications, so text classification often adopts novel dimension reduction methods to reduce the dimension of the document vectors dramatically [1]. Decision functions exist for the centroid-based classification algorithm and for support vector classifiers. The information bottleneck approach has also been explored [3], [4]. Divisive information-theoretic feature clustering [5] is an information-theoretic feature clustering approach, and it is more effective than other feature clustering methods.
In these word clustering methods, each new feature is generated by combining a subset of the original words. However, difficulties are associated with these methods.
Disadvantages:
1. Each new feature is generated by combining a subset of the original words, so the contribution of individual words is merged away.
2. A word is exactly assigned to one subset, i.e., hard clustering, based on the similarity magnitudes between the word and the existing subsets, even if the differences among these magnitudes are small.
Proposed system
1. We propose text classification using a self-constructing feature clustering algorithm, a clustering approach that reduces the number of words for the text classification task.
2. The words in the feature vector of a document set are represented as distributions and processed one after another.
3. Words that are similar to each other are grouped into the same cluster. Each cluster is characterized by a statistical mean and deviation.
4. If a word is not similar to any existing cluster, a new cluster is created for this word.
Advantages:
1. Text classification using the self-constructing clustering algorithm, an incremental clustering approach, reduces the dimensionality of the words in text classification.
2. The number of extracted features is determined automatically.
3. It runs faster than other methods.
4. It extracts better words than other methods.
III. IMPLEMENTATION DETAILS
A. Design
Fig 1: Use case diagram
Fig 2: Activity diagram
B. Modules
We divide the system into four modules: pre-processing, automatic clustering, word extraction, and classification of text.
Pre-processing
In this module we construct the word weightage patterns of the given document set. We read the document set, remove the stop words, and get the feature vector from the given documents. Next we construct the word weightage patterns. Suppose we are given a document set D of n documents d1, d2, ..., dn, together with the feature vector W of m words w1, w2, ..., wm and p classes c1, c2, ..., cp; we then construct one word weightage pattern for each word in W.
Fig 3: Pre-processing flow
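As an illustration of this step, here is a small sketch (ours, with hypothetical names; the paper's own weighting scheme is not shown) that builds one word weightage pattern per word as the distribution of the word's weight over the p classes:

```python
import numpy as np

def word_weightage_patterns(doc_term, doc_class, p):
    """Build one pattern per word in W: its weight distribution over p classes.

    doc_term  : (n, m) array; doc_term[j, i] is the weight of word wi in dj
    doc_class : length-n array of class labels in {0, ..., p-1}
    Returns an (m, p) array whose row i is the pattern of word wi.
    """
    m = doc_term.shape[1]
    patterns = np.zeros((m, p))
    for q in range(p):
        # Total weight of each word over the documents of class cq.
        patterns[:, q] = doc_term[doc_class == q].sum(axis=0)
    # Normalize rows so each pattern is a distribution over the classes.
    totals = patterns.sum(axis=1, keepdims=True)
    return patterns / np.where(totals == 0.0, 1.0, totals)
```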
Automatic clustering
In this module we group the words using the automatic clustering algorithm. For each word weightage pattern, the similarity of the pattern to each existing cluster is calculated, to decide whether it is combined into an existing cluster or a new cluster is created for it. Once a new cluster is created, the corresponding membership function is initialized. Conversely, when the word weightage pattern is combined into an existing cluster, the membership function of that cluster is updated accordingly.
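For reference, a Gaussian membership function of the kind described here (our reconstruction; the paper's exact formula is not reproduced in this copy) characterizes a cluster G_j by its statistical mean c_j = <c_j1, ..., c_jp> and deviation sigma_j = <sigma_j1, ..., sigma_jp>:

```latex
\mu_{G_j}(\mathbf{x}) \;=\; \prod_{q=1}^{p}
  \exp\!\left( -\,\frac{(x_q - c_{jq})^{2}}{\sigma_{jq}^{2}} \right)
```

A word weightage pattern x is then treated as similar to cluster G_j when its membership value exceeds a predefined threshold; otherwise a new cluster is created, as described above.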
Fig 4: Automatic clustering
Word extraction
1. Once the word weightage patterns have been grouped into clusters, the words in the feature vector W are clustered as well.
2. For each cluster we have one extracted word; since we have k clusters, we obtain k extracted words.
3. The elements of the matrix T are derived from the obtained clusters, and feature extraction is then performed.
4. We propose three weighting methods: best, better, and worst. In the worst weighting approach, each word is only allowed to belong to one cluster, and so it contributes to only one new extracted feature.
Fig 5: Word extraction
Classification of text
Given a set D of training documents, text classification can be done as follows: the training documents are converted to their reduced feature representations and a classifier is trained on them, and a new document is classified by converting it in the same way and applying the trained classifier (a combined sketch of word extraction and classification follows below).
Fig 6: Classification of text
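A compact sketch of these two modules together (our illustration; the variable names and the use of scikit-learn's LinearSVC are assumptions, though the conclusion does state that support vector machines are used): the worst (hard) weighting builds a binary m-by-k matrix T from the cluster assignments, documents are reduced as D' = DT, and an SVM is trained on the reduced features.

```python
import numpy as np
from sklearn.svm import LinearSVC

def hard_weighting_matrix(cluster_of_word, k):
    """Worst (hard) weighting: T[i, j] = 1 iff word wi belongs to cluster j,
    so each word contributes to exactly one extracted feature."""
    m = len(cluster_of_word)
    T = np.zeros((m, k))
    T[np.arange(m), np.asarray(cluster_of_word)] = 1.0
    return T

def classify(D_train, y_train, D_test, cluster_of_word, k):
    """Reduce the document-term matrices with T and classify with an SVM."""
    T = hard_weighting_matrix(cluster_of_word, k)
    clf = LinearSVC().fit(D_train @ T, y_train)   # train on k features, not m
    return clf.predict(D_test @ T)
```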
IV. OUR METHOD
Our proposed method is an incremental clustering approach. The words in the feature vector of the document set are represented as distributions and processed one after another. Initially, each word represents a cluster of its own. Suppose a document set D of n documents {d1, d2, ..., dn} is given, together with the feature vector W of m words {w1, w2, ..., wm} and p classes {c1, c2, ..., cp}. Then one word pattern is constructed for each word, and clusters are created based on these word patterns. The word pattern on which our proposed algorithm works is defined, for word $w_i$, as

$$\mathbf{x}_i = \langle x_{i1}, x_{i2}, \ldots, x_{ip} \rangle, \qquad x_{iq} = P(c_q \mid w_i),$$

where $x_{iq}$ estimates the probability that a document belongs to class $c_q$ given that it contains the word $w_i$. Principal component analysis is used to reduce each word pattern from p dimensions to 2 dimensions. All center coordinates should be positive and lie within the range 0 to 1, since this is a fuzzy-based approach; therefore a transformation algorithm is used for this purpose, and finally we get the transformed word patterns {x1, x2, ..., xm}. Fig 7 represents the transformation algorithm.
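The listing of the transformation algorithm (Fig 7) does not survive in this copy, so the following is a sketch of the two steps just described (ours; the exact transformation in the paper may differ): PCA from p dimensions down to 2, followed by min-max scaling so that every coordinate is positive and lies in [0, 1].

```python
import numpy as np
from sklearn.decomposition import PCA

def transform_patterns(X):
    """Reduce word patterns from p dimensions to 2 and map them into [0, 1].

    X : (m, p) array of word patterns x1, ..., xm.
    Returns an (m, 2) array with all coordinates in [0, 1], as required
    by the fuzzy (membership function) based clustering.
    """
    X2 = PCA(n_components=2).fit_transform(X)
    # Min-max scale each of the two coordinates into the range [0, 1].
    lo, hi = X2.min(axis=0), X2.max(axis=0)
    return (X2 - lo) / np.where(hi - lo == 0.0, 1.0, hi - lo)
```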
Fig 7: Transformation algorithm
Once the word patterns are constructed, we use the clustering algorithm to group the words into clusters. The clustering algorithm, which takes the word patterns as input, is given below.
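The algorithm listing itself was lost in this copy, so here is a reconstruction of the incremental loop described above (a sketch under our reading of the paper: the threshold rho, the smoothing deviation sigma0, and the running-sum update of mean and deviation are standard choices for this family of algorithms, not values quoted from the paper):

```python
import numpy as np

def membership(x, center, sigma):
    """Gaussian membership of pattern x in a cluster (see Section III)."""
    return np.exp(-np.sum((x - center) ** 2 / sigma ** 2))

def self_constructing_clustering(patterns, rho=0.5, sigma0=0.25):
    """Process word patterns one after another; return each word's cluster.

    patterns : (m, 2) array of transformed word patterns.
    rho      : similarity threshold (assumed value).
    sigma0   : smoothing deviation for clusters (assumed value).
    """
    stats = []    # per cluster: [count, sum of patterns, sum of squares]
    labels = []
    for x in patterns:
        # Similarity of this pattern to every existing cluster.
        best_j, best_mu = -1, 0.0
        for j, (n, s, ss) in enumerate(stats):
            center = s / n
            sigma = np.sqrt(np.maximum(ss / n - center ** 2, 0.0)) + sigma0
            mu = membership(x, center, sigma)
            if mu > best_mu:
                best_j, best_mu = j, mu
        if best_mu < rho:
            # Not similar to any existing cluster: create a new one and
            # initialize its membership function with this pattern.
            stats.append([1, x.astype(float).copy(), x.astype(float) ** 2])
            labels.append(len(stats) - 1)
        else:
            # Combine into the most similar cluster; its mean and
            # deviation are updated through the running sums.
            n, s, ss = stats[best_j]
            stats[best_j] = [n + 1, s + x, ss + x ** 2]
            labels.append(best_j)
    return labels
```

Processing every pattern once yields the clusters automatically; in particular, the number of clusters k is not specified in advance, which is the property the advantages above refer to.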
V. RESULTS
In this section we present experimental results to show the effectiveness of our self-constructing clustering algorithm. For this purpose we use three well-known data sets.
A. Reuters Corpus Volume 1
The RCV1 data set consists of 808,120 stories produced by Reuters. As shown in Fig 7, the x-axis indicates the class number and the y-axis indicates the number of stories in each class. Fig 8 shows the execution time of the word reduction methods on the RCV1 data.
Fig 8: Execution time of all methods on the RCV1 data
B. Newsgroup data set
The newsgroup data set contains more than 20,000 articles, evenly distributed over 20 classes. Each class contains about 1,000 articles, as shown in Fig 9.
Fig 9: Class distribution of the newsgroup data set
Fig 10 shows the execution time of the methods on the newsgroup data.
Fig 10: Execution time of the methods on the newsgroup data
C. Cade 12
Cade 12 contains a set of web pages extracted from a web directory. The web pages are classified into 12 classes, as shown in Fig 11.
Fig 11: Class distribution of the Cade 12 data set
Fig 12 shows the execution time of the methods on the Cade 12 data.
Fig 12: Execution time of the methods on the Cade 12 data
VI. CONCLUSION
The proposed method is new in the text classification field. It uses the good optimization performance of support vector machines to improve classification performance. Automatic clustering is one of the methods that have been developed in machine learning research. In this paper we apply this clustering technique to text categorization problems, and we found that when a document set is transformed into a collection of word patterns, the relevance among word patterns can be measured, and the word patterns can be grouped by applying a similarity-based clustering algorithm. This method is well suited to text categorization problems due to the suitability of the distributional word clustering concept.
Many methods for feature clustering in text classification have been presented by various researchers. However, those methods have unsolved limitations: each new feature is generated by combining a subset of the original words; the mean and variance of a cluster are not considered when similarity with respect to the cluster is computed; and these methods require the number of features to be specified in advance by the user.
Future scope
This clustering method has been applied to solve text classification problems. The technique can also be applied to other problems, such as web mining, image segmentation, data sampling, and fuzzy modeling.
ACKNOWLEDGMENT
I have put great effort into this project. However, it would not have been possible without the kind support and help of many individuals and organizations. I would like to extend my sincere thanks to all of them.
I am highly indebted to Prof. S. V. Chobe for their guidance and constant supervision, as well as for providing the necessary information regarding the project and for their support throughout.
I would also like to express my gratitude towards my parents and the members of the D.Y.P.I.E.T Computer Department for their kind cooperation and encouragement.
REFERENCES
[1] H. Kim, P. Howland, and H. Park, "Dimension Reduction in Text Classification with Support Vector Machines," J. Machine Learning Research, vol. 6, pp. 37-53, 2005.
[2] D.D. Lewis, "Feature Selection and Feature Extraction for Text Categorization," Proc. Workshop on Speech and Natural Language, pp. 212-217, 1992.
[3] F. Pereira, N. Tishby, and L. Lee, "Distributional Clustering of English Words," Proc. 31st Ann. Meeting of the ACL, pp. 183-190, 1993.
[4] L.D. Baker and A. McCallum, "Distributional Clustering of Words for Text Classification," Proc. ACM SIGIR, pp. 96-103, 1998.
[5] I.S. Dhillon, S. Mallela, and R. Kumar, "A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification," J. Machine Learning Research, vol. 3, pp. 1265-1287, 2003.
[6] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.