VISVESVARAYA TECHNOLOGICAL UNIVERSITY
BELGAUM
A Project Report On
“SUPERVISED CLASSIFICATION OF TEXT DOCUMENTS”
Submitted in partial fulfillment of the requirements for the award of the degree of
BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE & ENGINEERING
Submitted by
Ravi N (4JC05CS081)
Under the guidance of
Mr. Harish B.S.
Lecturer
Dept of CS&E, SJCE
Mysore-570006
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SRI JAYACHAMARAJENDRA COLLEGE OF ENGINEERING
(Affiliated to Visvesvaraya Technological University, Belgaum)
MYSORE-570006
2008-2009
SRI JAYACHAMARAJENDRA COLLEGE OF ENGINEERING
(Affiliated to Visvesvaraya Technological University, Belgaum)
MYSORE-570006
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
CERTIFICATE
This is to certify that the project work entitled “Supervised Classification of
Text Documents” is a bona fide work carried out by Ravi N (4JC05CS081)
in partial fulfillment of the requirements for the award of the degree of Bachelor of
Engineering in Computer Science & Engineering of the Visvesvaraya Technological
University, Belgaum, during the year 2008-2009. It is certified that all
corrections/suggestions indicated for Internal Assessment have been incorporated in
the Report deposited in the departmental library. The project report has been approved
as it satisfies the academic requirements in respect of Project Work prescribed for the
above-mentioned degree.
Signature of the Guide: (Mr. Harish B.S.)
Signature of the HOD: (Dr. C.N. Ravikumar)
Signature of Principal: (Dr. B.G. Sangameshwara)
Place: Mysore
Date:
Examiners:
1.
2.
ABSTRACT
The past few decades have seen tremendous growth in the volume of text
documents used in enterprises. Text categorization is the task of assigning text
documents to pre-specified classes, which helps in organizing and finding
information in the large document collections that enterprises hold. The task is
challenging mainly because of the large number of attributes in the data set, the
large number of training samples, and the dependencies among attributes. This project
implements two text classifiers, the centroid classifier and the KNN classifier, that
ease the task of assigning the documents in a data set to pre-specified classes. Both
classifiers classify a new document according to how closely its behavior matches
the behavior of the documents belonging to each pre-specified class. This type of
matching allows us to adjust dynamically for classes with different densities and
accounts for dependencies between the terms in the different classes. The project
also evaluates the two classifiers under different document representation schemes,
reports the accuracy each classifier obtains under each scheme, and compares the two
classifiers on the basis of that accuracy.
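The matching idea described above can be illustrated with a minimal sketch. It is not the project's code, which according to the table of contents was written in Perl with PDL; Python is used here only for illustration. The sketch assumes a term-frequency representation and cosine similarity as the closeness measure, and the function names and the four training sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Term-frequency representation: maps each term to its count."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train_centroids(docs):
    """docs is a list of (text, label); returns label -> mean TF vector."""
    sums, counts = defaultdict(Counter), Counter()
    for text, label in docs:
        sums[label].update(tf_vector(text))
        counts[label] += 1
    return {c: {t: w / counts[c] for t, w in vec.items()}
            for c, vec in sums.items()}


def classify_centroid(centroids, text):
    """Assign the class whose centroid is most similar to the document."""
    query = tf_vector(text)
    return max(centroids, key=lambda c: cosine(query, centroids[c]))


def classify_knn(docs, text, k=3):
    """Majority vote among the k training documents most similar to the query."""
    query = tf_vector(text)
    ranked = sorted(docs, key=lambda d: cosine(query, tf_vector(d[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set, for illustration only.
TRAIN = [
    ("the match ended with a late goal", "sports"),
    ("the team won the football match", "sports"),
    ("the bank raised interest rates", "finance"),
    ("stock markets and bank shares fell", "finance"),
]

if __name__ == "__main__":
    centroids = train_centroids(TRAIN)
    print(classify_centroid(centroids, "a football goal"))
    print(classify_knn(TRAIN, "bank interest rates", k=3))
```

The centroid classifier compares a new document against one averaged vector per class, whereas KNN compares it against every training document, so the two differ mainly in how much of the training set is consulted at classification time.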
Acknowledgement
It is with an immense sense of satisfaction and achievement that I have
completed this project. I take this opportunity to acknowledge each and every
individual who contributed towards this endeavor.
I extend my deep regards to Dr. B. G. Sangameshwara, Honorable Principal
of Sri Jayachamarajendra College of Engineering, for providing an excellent
environment for our education and for his encouragement throughout our stay in SJCE.
I would like to convey my heartfelt thanks to our HOD,
Dr. C.N. Ravikumar, Computer Science and Engineering, for giving me the
opportunity to embark upon this topic and for his continued encouragement
throughout the preparation of this presentation.
I would also like to sincerely thank my guide, Mr. Harish B.S., for his
invaluable guidance, constant assistance, support, endurance and constructive
suggestions for the betterment of the work, without which this work would not have
been possible.
I am also indebted to my parents and friends for their continued moral support
throughout the course of the work and for helping me finalize the presentation.
My heartfelt thanks go to all those people who have contributed bits, bytes and
words to accomplish this work.
RAVI N
Declaration
I hereby declare that the work presented in this project entitled “Supervised
Classification of Text Documents” submitted towards completion of project in
Eighth Semester of B.E (CS) at the Sri Jayachamarajendra College of Engineering,
Mysore, is an authentic record of my original work carried out under the guidance of
Mr. Harish B.S, Lecturer, Dept. of Computer Science and Engineering, SJCE,
Mysore. I have not submitted the matter embodied in this project anywhere else.
Ravi N
(4JC05CS081)
TABLE OF CONTENTS
List of Figures
List of Tables
1 Introduction
  1.1 Introduction to Information Retrieval
    1.1.1 Definition of IR
    1.1.2 Different Fields of IR
  1.2 Machine Learning
    1.2.1 Types of Machine Learning
    1.2.2 Clustering
    1.2.3 Indexing
    1.2.4 Retrieval
    1.2.5 Classification
      1.2.5.1 Types of Classification
    1.2.6 Text Classification
      1.2.6.1 Unsupervised Classification
      1.2.6.2 Supervised Classification
      1.2.6.3 Definition of Text Classification
      1.2.6.4 Applications of Text Classification
  1.3 Objective of our project
2 Literature Review
  2.1 History of IR
  2.2 Timeline of IR
  2.3 History of Text Classification
  2.4 Previous Works on Text Classification
3 Document representation
  3.1 Need for Document Representation
    3.1.1 Bag-Of-Words vector space representation
    3.1.2 Binary Document Representation
    3.1.3 Term Frequency Representation
    3.1.4 Probabilistic Representation
    3.1.5 TF-IDF representation
4 Text classifiers
  4.1 Definition of Text Classification (TC)
  4.2 Classifier learning
  4.3 Types of classifiers
    4.3.1 Centroid classifier
      4.3.1.1 Pseudocode of Centroid classifier
      4.3.1.2 Computational complexity of Centroid classifier
    4.3.2 KNN classifier
      4.3.2.1 Pseudocode of KNN classifier
      4.3.2.2 Computational complexity of KNN classifier
5 Implementation aspects
  5.1 Perl
    5.1.1 Applications of Perl
    5.1.2 Data structures in Perl
  5.2 Perl Data Language (PDL)
    5.2.1 Advantages of using PDL
    5.2.2 A brief Outline on PDL Functions
  5.3 Use of PDL in our project
  5.4 Organisation of our code
  5.5 Scripts of our project
6 Results and analysis
  6.1 Centroid Classifier
  6.2 KNN Classifier
  6.3 Comparison of Centroid and KNN classifier
Conclusion and Future Enhancements
References
List of Figures
Fig. 1      Schematic representation of Information
Fig. 5.1    Usage of PDL in our Source Code
Fig. 5.2    Organization of the code
Fig. 6.1.1  Evaluation of Centroid classifier based upon different Document Representation schemes
Fig. 6.1.2  Accuracy obtained by the Centroid classifier in each Document Representation
Fig. 6.2.1  Evaluation of KNN (k=2) classifier based upon different Document Representation schemes
Fig. 6.2.2  Accuracy obtained by the KNN classifier in each Document Representation
Fig. 6.2.3  Evaluation of KNN (k=3) classifier based upon different Document Representation schemes
Fig. 6.2.4  Accuracy obtained by the KNN (k=3) in each Document Representation
Fig. 6.2.5  Evaluation of KNN (k=50) classifier based upon different Document Representation schemes
Fig. 6.2.6  Accuracy obtained by the KNN (k=50) in each Document Representation
Fig. 6.3.1  Comparison of the Centroid and KNN (k=2) classifier
List of Tables
Table 2.1  Previous works on Text classification
Table 3.1  Example Documents
Table 3.2  Vectors in Binary Independence Model
Table 3.3  Vectors in Binary Independence Model
Table 3.4  Vectors in Probabilistic Representation
Table 3.5  Vectors in tf-idf Representation
Table 6.1  Result of the Centroid Classifier
Table 6.2  Result of the KNN (k=2) Classifier
Table 6.3  Result of the KNN (k=3) Classifier
Table 6.4  Result of the KNN (k=50) Classifier