International Journal of Engineering Trends and Technology (IJETT) - Volume 35 Number 4 - May 2016

AUTOMATIC EDUCATIONAL DOCUMENT CLASSIFICATION USING NATURAL LANGUAGE PROCESSING

Spoorthi M, Srilekha K, Sanjana J, Kushal Kumar B N
Department of CSE, K.S Institute of Technology, Bengaluru, India

Abstract—Automatic document classification is the task of assigning a document to one or more classes or categories. In this paper we apply supervised machine learning to the content-based classification of textual documents confined to four educational departments - Civil, Computer Science, Mechanical and Electrical Engineering. The study uses the TF-IDF algorithm along with natural language processing for feature construction and selection, and investigates the ID3 (Iterative Dichotomiser 3) algorithm as a classifier for the given data set. The results give 80% accuracy.

Keywords—Supervised machine learning; Natural language processing; TF-IDF; Iterative Dichotomiser 3.

I. INTRODUCTION

In the present world, educational documents from various fields are increasing because of the variety of sources of information available to students. Manual classification of these documents is known to be an expensive and time-consuming job. Hence, various machine learning approaches suggest the automatic construction of classifiers, which makes the task easier. The main goal of classification is to extract information from textual resources and to support operations such as retrieval, classification (supervised, unsupervised and semi-supervised) and summarization.

In this paper we consider supervised, content-based document classification. Content-based classification is classification in which the weight given to particular subjects in a document determines the class to which the document is assigned; supervised document classification is classification in which some external mechanism (such as human feedback) provides information on the correct classification of the documents. Applications include topic-specific processing and information extraction, filtering and routing of documents, along with easier access to and management of large document collections. The techniques investigated in this paper are Natural Language Processing (NLP), text classification and machine learning algorithms, which work together to automatically analyze, classify and discover patterns in different types of documents.

Natural language processing is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages. The various NLP operations and modules help in building better statistical machine learning algorithms. Here, NLP operations are used for the data cleaning process.

Text classification (TC) is the task of assigning documents to predefined categories. TC systems can be built manually by means of knowledge-engineering techniques, i.e. by defining a set of logical rules that encode expert knowledge on how to classify documents under the given set of categories; in this paper, however, the classifier is learned from labeled examples. The classification task starts with a training set D = {d1, ..., dn} of documents that are already labeled with classes C1, C2, and so on (e.g. computer, civil). The task is then to determine the class best suited to a new document. For example, each incoming document about a topic such as "computers", "computing" or "networks" would automatically be labeled with the class "Computer Science".
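Concretely, the training set can be pictured as a list of labeled examples. The snippet below is a toy illustration of this formulation; the sample texts and the variable names are invented for illustration and are not taken from the paper's data set.

    # A toy training set D = {d1, ..., dn}: each document is paired with its
    # class label (the texts here are invented placeholders).
    training_set = [
        ("routing protocols in computer networks", "Computer Science"),
        ("design of reinforced concrete beams", "Civil"),
        ("thermodynamics of internal combustion engines", "Mechanical"),
        ("analysis of three-phase induction motors", "Electronics"),
    ]

    # The classifier's task: assign the best-suited class to a new document.
    new_document = "operating systems and distributed computing"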
The remainder of the paper is organized as follows. Section 2 presents the proposed system and the overall process. Section 3 explains data cleaning and feature extraction. Section 4 considers feature vector representation in the form of the TF-IDF algorithm. Section 5 describes the ID3 classifier algorithm. Section 6 reports the experimental results, followed by the conclusion in Section 7.

II. PROPOSED SYSTEM

The proposed system is a composition of several techniques and can be divided into two distinct phases - the training phase and the testing phase.

The training/learning phase (as shown in Figure.1) requires the user to provide a set of sample documents with textual information belonging to the four departments considered - Computer Science, Electronics, Mechanical and Civil Engineering. The document collection used for learning/training should be as representative as possible of the documents one expects to encounter in the future. These documents are fed into the Learner for the creation of the training data. The Learner analyzes the sample documents and tries to identify the content that is specific to each category. The analysis mainly involves generalization and abstraction of the data. The analyzed content is then preprocessed and cleaned. The output is a model for each category, represented by a set of classifier features that are constructed and selected with the help of natural language processing and the TF-IDF algorithm.

The testing/classification phase (as shown in Figure.1) requires the user to provide a new set of documents, which are preprocessed to obtain a testing data set. This testing data set is then passed on to the Classifier, which assigns the new documents to their specific categories by comparing the testing data set to the previously created training data set using the ID3 classification algorithm. As the testing data set can be large, efficiency is very important in the classification phase. The output is a set of automatically created directories containing the classified textual documents belonging to the fields mentioned in the abstract.

Figure.1. Document Classification

III. DATA CLEANING AND FEATURE EXTRACTION

Before data mining algorithms can be used, a target data set must be assembled. The sample documents contain a massive amount of data that may not actually be required for the classification process; this could include stop words, noisy data, ambiguous data or missing values. Data cleaning and data preprocessing remove these unnecessary data, and the result is a model with a good set of features. The various processes (as shown in Figure.2) done under data cleaning are as follows; a code sketch of the full pipeline follows this section.

A. Tokenization
Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. In this system, we use the 'nltk.tokenize' module to do this. The list of tokens becomes the input for further processing.

B. Upper Case Conversion
The tokens obtained from the previous phase contain words with different cases. Hence all upper case letters are converted to lower case.

C. Removal of Stopwords and Punctuation
The data now contains many words that may not be necessary for the classification of documents. These are usually the most common words in a language, such as "a", "is", "the" and "because". In this system, we use the 'nltk stopwords' corpus to remove them. Similarly, the various punctuation marks are also removed.

D. Part-of-Speech Tagging
Part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech. Here, we use the 'nltk.tag' module to tag the tokens, and only nouns are retained.

E. Calculation of Frequencies
The words remaining after data cleaning are all strings of characters. The term frequency (count) of each word is calculated, and words are filtered based on a threshold frequency value.

Figure.2. Data Cleaning and Feature Extraction
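As a concrete illustration, the following is a minimal sketch of steps A-E using the NLTK modules named above. The function name clean_document and the threshold value of 2 are our own choices for illustration; the paper does not state the threshold it uses.

    # Data-cleaning pipeline following steps A-E above, built on NLTK.
    # Requires: nltk.download('punkt'), nltk.download('stopwords'),
    #           nltk.download('averaged_perceptron_tagger')
    from collections import Counter
    import string

    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    def clean_document(text, threshold=2):
        # A. Tokenization: break the raw text into a list of tokens.
        tokens = word_tokenize(text)
        # B. Case conversion: map every token to lower case.
        tokens = [t.lower() for t in tokens]
        # C. Remove stopwords and punctuation.
        stop = set(stopwords.words('english'))
        tokens = [t for t in tokens
                  if t not in stop and t not in string.punctuation]
        # D. POS tagging: keep only nouns (Penn Treebank tags starting 'NN').
        nouns = [word for word, tag in nltk.pos_tag(tokens)
                 if tag.startswith('NN')]
        # E. Frequencies: keep words whose count meets the threshold.
        counts = Counter(nouns)
        return {word: n for word, n in counts.items() if n >= threshold}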
IV. FEATURE VECTORS

A feature vector is an n-dimensional vector of numerical features that represents some object. Many algorithms in machine learning require a numerical representation of objects, since such representations facilitate processing and statistical analysis. Hence, to improve the classification quality for all classifiers, we use the TF-IDF algorithm.

TF-IDF, short for term frequency-inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus [1,2]. Typically, the tf-idf weight is composed of two terms: the first computes the normalized term frequency (TF) and the second the inverse document frequency (IDF).

TF (term frequency) measures how frequently a term occurs in a document. Since documents differ in length, a term is likely to appear many more times in a long document than in a short one. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a form of normalization:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

IDF (inverse document frequency) measures the importance of a term. As many words occur frequently across documents, the frequent terms need to be weighed down while the rare ones are scaled up. This is done by computing:

IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
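The two formulas above translate directly into code. The sketch below is a minimal implementation of TF and IDF as just defined; the guard that returns 0.0 when a term appears in no document, and the toy corpus, are our additions for illustration.

    import math

    def tf(term, doc_tokens):
        # TF(t) = (times t appears in the document) / (total terms in it).
        return doc_tokens.count(term) / len(doc_tokens)

    def idf(term, corpus):
        # IDF(t) = log_e(total documents / documents containing t).
        containing = sum(1 for doc in corpus if term in doc)
        return math.log(len(corpus) / containing) if containing else 0.0

    def tf_idf(term, doc_tokens, corpus):
        return tf(term, doc_tokens) * idf(term, corpus)

    # Example on three tiny tokenized documents (invented for illustration).
    corpus = [["network", "protocol", "network"],
              ["beam", "concrete"],
              ["engine", "network"]]
    print(tf_idf("network", corpus[0], corpus))  # (2/3) * ln(3/2) ≈ 0.270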
V. ITERATIVE DICHOTOMISER 3 CLASSIFIER

In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan [3] and used to generate a decision tree from a dataset.

A. Algorithm
The ID3 algorithm begins with the original set S as the root node. On each iteration, it runs through every unused attribute of the set S and calculates the entropy H(S) (or the information gain IG(A)) of that attribute. It then selects the attribute with the smallest entropy (or largest information gain). The set S is then split on the selected attribute to produce subsets of the data. The algorithm continues to recurse on each subset, considering only attributes never selected before. Recursion on a subset may stop in one of these cases:

- Every element in the subset belongs to the same class (+ or -); the node is then turned into a leaf and labeled with the class of the examples.
- There are no more attributes to be selected, but the examples still do not belong to the same class (some are + and some are -); the node is then turned into a leaf and labeled with the most common class of the examples in the subset.
- There are no examples in the subset, which happens when no example in the parent set matches a specific value of the selected attribute; a leaf is then created and labeled with the most common class of the examples in the parent set.

B. Measures
The algorithm uses two measures for finding the splitting attribute - entropy and information gain.

1) Entropy
Entropy H(S) is a measure of the amount of uncertainty in the (data) set S (i.e. entropy characterizes the data set S):

H(S) = - ∑ p(i) log2 p(i)        (1)

where
S - the current (data) set for which entropy is being calculated (changes on every iteration of the ID3 algorithm)
p(i) - the proportion of the number of elements in class i to the number of elements in set S

2) Information Gain
Information gain IG(A) is the measure of the difference in entropy from before to after the set S is split on an attribute A - in other words, how much the uncertainty in S is reduced by splitting S on A:

IG(A,S) = H(S) - ∑ p(i) H(i)        (2)

where
H(S) - entropy of set S
p(i) - the proportion of the number of elements in subset i to the number of elements in set S
H(i) - entropy of subset i
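A compact sketch of ID3 built on equations (1) and (2) is given below. The representation of training examples as (feature-dictionary, label) pairs is an assumption made for illustration; the sketch also branches only on attribute values observed in the training examples, so the third stopping case above would instead be handled at prediction time by falling back to the parent's majority class.

    import math
    from collections import Counter

    def entropy(labels):
        # H(S) = -sum(p(i) * log2(p(i)))  ... equation (1)
        total = len(labels)
        return -sum((n / total) * math.log2(n / total)
                    for n in Counter(labels).values())

    def information_gain(examples, attribute):
        # IG(A,S) = H(S) - sum(|S_v|/|S| * H(S_v))  ... equation (2)
        gain = entropy([label for _, label in examples])
        for value in {feats[attribute] for feats, _ in examples}:
            subset = [label for feats, label in examples
                      if feats[attribute] == value]
            gain -= (len(subset) / len(examples)) * entropy(subset)
        return gain

    def id3(examples, attributes):
        labels = [label for _, label in examples]
        # Stop: all examples share one class, or no attributes remain;
        # label the leaf with the (most common) class.
        if len(set(labels)) == 1 or not attributes:
            return Counter(labels).most_common(1)[0][0]
        # Split on the attribute with the largest information gain.
        best = max(attributes, key=lambda a: information_gain(examples, a))
        tree = {}
        for value in {feats[best] for feats, _ in examples}:
            subset = [(f, l) for f, l in examples if f[best] == value]
            tree[value] = id3(subset, [a for a in attributes if a != best])
        return (best, tree)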
VI. EXPERIMENTAL RESULTS

The results of the system can be summarized as follows:

A. Preprocessing
The application of NLP operations and the TF-IDF algorithm shows a great improvement over naive preprocessing techniques by extracting the rare words that occur in each document for a better classification process, thereby reducing the classification time from many hours to mere minutes.

B. Classification
With around 100 documents of varying sizes, Iterative Dichotomiser 3 was able to successfully classify nearly 80% of the documents, performing well compared to other algorithms on the given data set. According to Table.1, the numbers of Computer Science, Mechanical, Electronics and Civil documents correctly classified were 8, 18, 25 and 32 respectively, with around 17 documents not classified correctly.

Table.1. Documents Classified

Department          Documents Classified
C.S                 8
MECH                18
EC                  25
Civil               32
Not Classified      17
Total Documents     100

The graphical representation of the document classification is shown in Figure.3, with the number of documents on the y-axis and the categories on the x-axis.

Figure.3. Number of Documents vs Categories

C. Graphical User Interface
The graphical user interface is highly flexible and provides the user an easy and interactive experience in classifying the documents. The user can either specify a directory path containing the documents to be classified or select specific documents from a list (as shown in Figure.4). Once the 'classify' button is pressed, the documents are classified and the created directories are printed. The selected documents can be viewed again, in case any document is to be added or removed (as shown in Figure.5).

Figure.4. Document Classification
Figure.5. The Documents Selected

VII. CONCLUSION

The paper deals with the classification of various educational documents, which helps in topic-specific processing and information extraction, filtering and routing of documents, along with easier access to various documents. NLP operations along with TF-IDF prove to work well for preprocessing, thereby helping in the better classification of the documents from the given data. We have used ID3 as the classifier for the data set, which showed a good result. Finally, the paper focuses on the development and evaluation of a system that proves to be helpful in various educational sectors of the country.

ACKNOWLEDGEMENT

The authors would like to thank VGST (Vision Group on Science and Technology), Government of Karnataka, India for providing infrastructure facilities through the K-FIST Level II project at KSIT, CSE R&D Department, Bengaluru.

REFERENCES

[1] Rajaraman, A.; Ullman, J. D. (2011). "Data Mining". Mining of Massive Datasets. pp. 1-17. doi:10.1017/CBO9781139058452.002. ISBN 9781139058452.
[2] Juan Ramos, "Using TF-IDF to Determine Word Relevance in Document Queries", Department of Computer Science, Rutgers University, 23515 BPO Way, Piscataway, NJ, 08855.
[3] Quinlan, J. R. 1986. "Induction of Decision Trees". Mach. Learn. 1, 1 (Mar. 1986), 81-106.
[4] Murat Can Ganiz, Melike Tutkan, Selim Akyokus. "A Novel Classifier Based on Meaning for Text Classification", IEEE, 2015.
[5] Pang-Ning Tan, Michael Steinbach, Vipin Kumar: Introduction to Data Mining, Pearson Education, 2005.