VISVESVARAYA TECHNOLOGICAL UNIVERSITY
BELGAUM

A Project Report On
“SUPERVISED CLASSIFICATION OF TEXT DOCUMENTS”

Submitted in partial fulfillment of the requirements for the award of the degree of Bachelor of Engineering in
COMPUTER SCIENCE & ENGINEERING

Submitted by
Ravi N (4JC05CS081)

Under the guidance of
Mr. Harish B.S.
Lecturer, Dept of CS&E, SJCE
Mysore-570006

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SRI JAYACHAMARAJENDRA COLLEGE OF ENGINEERING
(Affiliated to Visvesvaraya Technological University, Belgaum)
MYSORE-570006
2008-2009

CERTIFICATE

This is to certify that the project work entitled “Supervised Classification of Text Documents” is a bona fide work carried out by Ravi N (4JC05CS081), in partial fulfillment of the requirements for the award of the degree of Bachelor of Engineering in Computer Science & Engineering of the Visvesvaraya Technological University, Belgaum, during the year 2008-2009. It is certified that all corrections and suggestions indicated for Internal Assessment have been incorporated in the report deposited in the departmental library. The project report has been approved as it satisfies the academic requirements in respect of Project Work prescribed for the above-mentioned degree.

Signature of the Guide: (Mr. Harish B.S.)
Signature of the HOD: (Dr. C.N. Ravikumar)
Signature of the Principal: (Dr. B.G. Sangameshwara)

Place: Mysore
Date:
Examiners:
1.
2.

ABSTRACT

Over the past few decades, the volume of text documents used in enterprises has grown tremendously. Text categorization is the task of assigning text documents to pre-specified classes, which helps in organizing and finding information in the large document collections that enterprises hold.
Categorizing these documents is challenging, mainly because of the large number of attributes in the data set, the large number of training samples, and the dependencies among attributes. This project implements two text classifiers, the centroid classifier and the KNN classifier, which ease the task of assigning the documents in the data set to pre-specified classes. The classifiers classify a new document according to how closely its behavior matches the behavior of the documents belonging to the different pre-specified classes. This type of matching allows us to dynamically adjust for classes with different densities and accounts for dependencies between the terms in the different classes. The project also evaluates the two classifiers under different document representation schemes, reports the accuracy each classifier obtains under each scheme, and compares the two classifiers on the basis of that accuracy.

Acknowledgement

It is with an immense sense of satisfaction and achievement that I have completed this project. I take this opportunity to acknowledge each and every individual who contributed towards this endeavor.

I extend my deep regards to Dr. B. G. Sangameshwara, Honorable Principal of Sri Jayachamarajendra College of Engineering, for providing an excellent environment for our education and for his encouragement throughout our stay in SJCE.

I would like to convey my heartfelt thanks to our HOD, Dr. C.N. Ravikumar, Computer Science and Engineering, for giving me the opportunity to embark upon this topic and for his continued encouragement throughout the preparation of this presentation.

I would also like to sincerely thank my guide, Mr.
Harish B.S., for his invaluable guidance, constant assistance, support, endurance and constructive suggestions for the betterment of the work, without which this work would not have been possible.

I am also indebted to my parents and friends for their continued moral support throughout the course of the work and for helping me finalize the presentation. My heartfelt thanks go to all those who have contributed bits, bytes and words to accomplish this work.

RAVI N

Declaration

I hereby declare that the work presented in this project entitled “Supervised Classification of Text Documents”, submitted towards completion of the project in the Eighth Semester of B.E. (CS) at the Sri Jayachamarajendra College of Engineering, Mysore, is an authentic record of my original work carried out under the guidance of Mr. Harish B.S., Lecturer, Dept. of Computer Science and Engineering, SJCE, Mysore. I have not submitted the matter embodied in this project anywhere else.

Ravi N (4JC05CS081)

TABLE OF CONTENTS

Chapter No.  Title
List of Figures
List of Tables
1 Introduction
    1.1 Introduction to Information Retrieval
        1.1.1 Definition of IR
        1.1.2 Different Fields of IR
    1.2 Machine Learning
        1.2.1 Types of Machine Learning
        1.2.2 Clustering
        1.2.3 Indexing
        1.2.4 Retrieval
        1.2.5 Classification
            1.2.5.1 Types of Classification
        1.2.6 Text Classification
            1.2.6.1 Unsupervised Classification
            1.2.6.2 Supervised Classification
            1.2.6.3 Definition of Text Classification
            1.2.6.4 Applications of Text Classification
    1.3 Objective of our project
2 Literature Review
    2.1 History of IR
    2.2 Timeline of IR
    2.3 History of Text Classification
    2.4 Previous Works on Text Classification
3 Document representation
    3.1 Need for Document Representation
        3.1.1 Bag-Of-Words vector space representation
        3.1.2 Binary Document Representation
        3.1.3 Term Frequency Representation
        3.1.4 Probabilistic Representation
        3.1.5 TF-IDF representation
4 Text classifiers
    4.1 Definition of Text Classification (TC)
    4.2 Classifier learning
    4.3 Types of classifiers
        4.3.1 Centroid classifier
            4.3.1.1 Pseudo code of Centroid Classifier
            4.3.1.2 Computational complexity of centroid classifier
        4.3.2 KNN classifier
            4.3.2.1 Pseudocode of kNN classifier
            4.3.2.2 Computational complexity of centroid classifier
5 Implementation aspects
    5.1 Perl
        5.1.1 Applications of Perl
        5.1.2 Data structures in Perl
    5.2 Perl Data Language (PDL)
        5.2.1 Advantages of using PDL
        5.2.2 A brief Outline on PDL Functions
    5.3 Use of PDL in our project
    5.4 Organisation of our code
    5.5 Scripts of our project
6 Results and analysis
    6.1 Centroid Classifier
    6.2 KNN Classifier
    6.3 Comparison of Centroid and KNN classifier
Conclusion and Future Enhancements
References

List of Figures

S.No.  Title
Fig. 1 Schematic representation of Information
Fig. 5.1 Usage of PDL in our Source Code
Fig. 5.2 Organization of the code
Fig. 6.1.1 Evaluation of Centroid classifier based upon different Document Representation schemes
Fig. 6.1.2 Accuracy obtained by the Centroid classifier in each Document Representation
Fig. 6.2.1 Evaluation of KNN (k=2) classifier based upon different Document Representation schemes
Fig. 6.2.2 Accuracy obtained by the KNN classifier in each Document Representation
Fig. 6.2.3 Evaluation of KNN (k=3) classifier based upon different Document Representation schemes
Fig. 6.2.4 Accuracy obtained by the KNN (k=3) in each Document Representation
Fig. 6.2.5 Evaluation of KNN (k=50) classifier based upon different Document Representation schemes
Fig. 6.2.6 Accuracy obtained by the KNN (k=50) in each Document Representation
Fig. 6.3.1 Comparison of the Centroid and KNN (k=2) classifier

List of Tables

S.No.  Title
Table 2.1 Previous works on Text classification
Table 3.1 Example Documents
Table 3.2 Vectors in Binary Independence Model
Table 3.3 Vectors in Binary Independence Model
Table 3.4 Vectors in Probabilistic Representation
Table 3.5 Vectors in tf-idf Representation
Table 6.1 Result of the Centroid Classifier
Table 6.2 Result of the KNN (k=2) Classifier
Table 6.3 Result of the KNN (k=3) Classifier
Table 6.4 Result of the KNN (k=50) Classifier