International Journal of Engineering Trends and Technology (IJETT) - Volume 35 Number 4 - May 2016

AUTOMATIC EDUCATIONAL DOCUMENT CLASSIFICATION USING NATURAL LANGUAGE PROCESSING

Spoorthi M, Srilekha K, Sanjana J, Kushal Kumar B N
Department of CSE, K.S Institute of Technology, Bengaluru, India
Abstract—Automatic document classification is the task of assigning a document to one or more classes or categories. In this paper we apply supervised machine learning to content-based classification of textual documents confined to four educational departments: Civil, Computer Science, Mechanical and Electrical Engineering. The study uses the TF-IDF algorithm along with natural language processing for feature construction and selection, and investigates the ID3 (Iterative Dichotomiser 3) algorithm as a classifier for the given data set. The results show 80% accuracy.
Keywords- Supervised machine learning; natural language processing; TF-IDF; Iterative Dichotomiser 3.
I. INTRODUCTION
In the present world, educational documents from various fields are increasing because of the variety of sources of information available to students. Manual classification of these documents is known to be an expensive and time-consuming job. Hence, various machine learning approaches suggest the automatic construction of classifiers that make the task easier. The main goal of classification is to extract information from textual resources and to support operations such as retrieval, classification (supervised, unsupervised and semi-supervised) and summarization.
In this paper we consider supervised, content-based document classification. Content-based classification is classification in which the weight given to particular subjects in a document determines the class to which the document is assigned. Supervised document classification is classification in which some external mechanism (such as human feedback) provides information on the correct classification for the documents. Applications include topic-specific processing and information extraction, filtering and routing of documents, along with easier access to various documents, which also helps in their management.
The techniques investigated in this paper are natural language processing (NLP), text classification and machine learning algorithms, which work together to automatically analyze, classify and discover patterns in different types of documents.
Natural language processing is a field of computer science,
artificial intelligence, and computational linguistics concerned
with the interactions between computers and human (natural)
languages. The various NLP operations and modules help in
building better statistical machine learning algorithms. Here
NLP operations are used for the data cleaning process.
One approach to text classification (TC) is to manually build automatic TC systems by means of knowledge-engineering techniques, i.e. manually defining a set of logical rules that encode expert knowledge on how to classify documents under the given set of categories.
The classification task starts with a training set D = {d1, ..., dn} of documents that are already labeled with classes C1, C2, and so on (e.g. computer, civil). The task is then to determine which class best suits a new document.
For example, each incoming document about topics like "computers", "computing" or "networks" would automatically be labeled with the class "Computer Science".
The remainder of the paper is organized as follows. Section 2 presents the proposed system and how the overall process takes place. Section 3 explains the data cleaning and feature extraction. Section 4 covers feature vector representation in the form of the TF-IDF algorithm. The ID3 classifier algorithm follows in Section 5. Finally, the experimental results are given in Section 6, followed by the conclusion in Section 7.
II. PROPOSED SYSTEM
The proposed system involves the composition of various techniques and can mainly be divided into two distinct phases: the training phase and the testing phase.
The training/learning phase (as shown in Figure 1) requires that the user provide a set of sample documents with textual information belonging to the four main departments considered: Computer Science, Electronics, Mechanical and Civil Engineering. The document collection used for learning/training should be as representative as
possible for the documents that one expects to encounter in the
future. These documents are then fed into the Learner for the
creation of the training data.
The Learner analyzes the sample documents and tries to identify the content that is specific to each category. The analysis mainly involves generalization and abstraction of the data. The analyzed content is then preprocessed and cleaned. The output is a model for each category, represented by a set of classifier features that are constructed and selected with the help of natural language processing and the TF-IDF algorithm.
The testing/classification phase (as shown in Figure 1) requires that the user provide a new set of documents, which are preprocessed to obtain a testing data set. This testing data set is then passed on to the Classifier.
The Classifier assigns the new documents to their specific categories by comparing the testing data set with the previously created training data set using the ID3 classification algorithm. As the testing data set can be large, efficiency is very important in the classification phase. The output is a set of automatically created directories that contain the classified textual documents belonging to the various fields mentioned in the abstract.
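As a rough sketch of this two-phase flow (not the authors' actual implementation), the following Python outline shows how labeled training documents could feed a learner and how classified documents could be routed into per-category directories; train_learner and classify_document are hypothetical placeholders for the Learner and Classifier described above:

    import os
    import shutil

    CATEGORIES = ["Computer Science", "Electronics", "Mechanical", "Civil"]

    def run_pipeline(training_dir, testing_dir, output_dir):
        # Training phase: each subdirectory of training_dir is named after a
        # category and holds representative sample documents for it.
        labeled_docs = []
        for category in CATEGORIES:
            category_dir = os.path.join(training_dir, category)
            for name in os.listdir(category_dir):
                with open(os.path.join(category_dir, name), encoding="utf-8") as f:
                    labeled_docs.append((f.read(), category))
        model = train_learner(labeled_docs)  # hypothetical Learner

        # Testing phase: classify each new document and copy it into a
        # directory named after its predicted category.
        for name in os.listdir(testing_dir):
            path = os.path.join(testing_dir, name)
            with open(path, encoding="utf-8") as f:
                category = classify_document(model, f.read())  # hypothetical Classifier
            target = os.path.join(output_dir, category)
            os.makedirs(target, exist_ok=True)
            shutil.copy(path, target)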
III. DATA CLEANING AND FEATURE EXTRACTION

Before data mining algorithms can be used, a target data set must be assembled. The sample documents contain a massive amount of data that may not actually be required for the classification process; this could include stop words, noisy data, ambiguous data or missing values. Data cleaning and preprocessing remove these unnecessary data, and the result is a model with a good set of features. The various processes (as shown in Figure 2) carried out under data cleaning are as follows:

A. Tokenization

Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. In this system, we use the 'nltk.tokenize' module to do the same. The list of tokens becomes the input for further processing.
B. Upper Case Conversion
The tokens obtained from the previous phase contain
several words with different cases. Hence all the upper case
letters are converted to lower case letters.
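A minimal sketch of steps A and B with NLTK (assuming the 'punkt' tokenizer models have been fetched once via nltk.download('punkt'); the sample sentence is purely illustrative):

    from nltk.tokenize import word_tokenize

    text = "Reinforced concrete is widely used in Civil Engineering."
    tokens = word_tokenize(text)          # step A: tokenization
    tokens = [t.lower() for t in tokens]  # step B: case conversion
    # ['reinforced', 'concrete', 'is', 'widely', 'used', 'in',
    #  'civil', 'engineering', '.']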
C. Removal of Stopwords and Punctuation

The data now contains many words that may not be necessary for the classification of documents. These are usually the most common words in a language, such as "a", "is", "the" and "because". In this system, we use the 'nltk stopwords' module. Similarly, the various punctuation marks are also removed.
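Continuing the sketch above, step C could look as follows (assuming the NLTK stopwords corpus has been downloaded via nltk.download('stopwords')):

    import string
    from nltk.corpus import stopwords

    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens
              if t not in stop_words and t not in string.punctuation]
    # ['reinforced', 'concrete', 'widely', 'used', 'civil', 'engineering']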
D. Part-of-Speech Tagging
Part-of-speech tagging (POS tagging or POST), also called
grammatical tagging or word-category disambiguation, is the
process of marking up a word in a text (corpus) as
corresponding to a particular part of speech. Here, we use
'nltk.tag' module to tag the tokens and only 'Nouns' are
considered.
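Continuing the sketch, step D with NLTK's pos_tag (assuming the 'averaged_perceptron_tagger' model has been downloaded) keeps only tokens whose Penn Treebank tag marks them as nouns:

    from nltk import pos_tag

    tagged = pos_tag(tokens)  # e.g. ('concrete', 'NN')
    # NN, NNS, NNP, NNPS are the noun tags, so filter on the 'NN' prefix
    nouns = [word for word, tag in tagged if tag.startswith('NN')]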
E. Calculation of Frequencies
The words considered after data cleaning are all strings of
characters. So the Term Frequency or Count of a word is
calculated and words are filtered based on a threshold
frequency value.
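A sketch of step E; the threshold value here is illustrative, since the paper does not state the one it used:

    from collections import Counter

    THRESHOLD = 2  # illustrative value, not taken from the paper
    counts = Counter(nouns)
    features = [word for word, count in counts.items() if count >= THRESHOLD]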
Figure 1. Document Classification

Figure 2. Data Cleaning and Feature Extraction
IV. FEATURE VECTORS
A feature vector is an n-dimensional vector of numerical features that represents some object. Many algorithms in machine learning require a numerical representation of objects, since such representations facilitate processing and statistical analysis. Hence, to improve the classification quality for all classifiers, we use the TF-IDF algorithm.
TF-IDF, short for term frequency-inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus [1], [2]. Typically, the TF-IDF weight is composed of two terms: the first computes the normalized term frequency (TF) and the second is the inverse document frequency (IDF).

- TF (term frequency) measures how frequently a term occurs in a document. Since every document differs in length, a term may appear many more times in long documents than in short ones. The term frequency is therefore often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:

TF(t) = (Number of times term t appears in a document) /
(Total number of terms in the document).

- IDF (inverse document frequency) measures the importance of a term. Since many words occur many times across various documents, frequent terms need to be weighed down while rare ones are scaled up; this is done by computing the following:
IDF(t) = log_e(Total number of documents / Number of
documents with term t in it).
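The two formulas above translate directly into Python; the following is a minimal sketch over a toy corpus of cleaned token lists (the corpus contents are invented for illustration):

    import math

    def tf(term, document):
        # number of times term t appears / total number of terms
        return document.count(term) / len(document)

    def idf(term, corpus):
        # natural log, as in the IDF definition above; assumes the
        # term occurs in at least one document of the corpus
        containing = sum(1 for doc in corpus if term in doc)
        return math.log(len(corpus) / containing)

    def tf_idf(term, document, corpus):
        return tf(term, document) * idf(term, corpus)

    corpus = [["beam", "load", "concrete"],
              ["compiler", "parser", "token"],
              ["beam", "circuit", "voltage"]]
    print(tf_idf("concrete", corpus[0], corpus))  # rare term -> higher weight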
V. ITERATIVE DICHOTOMISER 3 CLASSIFIER
In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan [3], used to generate a decision tree from a data set.
A. Algorithm

The ID3 algorithm begins with the original set S as the root node. On each iteration, the algorithm iterates through every unused attribute of the set S and calculates the entropy H(S) (or the information gain IG(A)) of that attribute. It then selects the attribute with the smallest entropy (or largest information gain). The set S is then split by the selected attribute to produce subsets of the data. The algorithm continues to recurse on each subset, considering only attributes never selected before.
Recursion on a subset may stop in one of these cases:

- Every element in the subset belongs to the same class (+ or -); the node is then turned into a leaf and labeled with the class of the examples.

- There are no more attributes to be selected, but the examples still do not belong to the same class (some are + and some are -); the node is then turned into a leaf and labeled with the most common class of the examples in the subset.

- There are no examples in the subset, which happens when no example in the parent set matches a specific value of the selected attribute. A leaf is then created and labeled with the most common class of the examples in the parent set.

B. Measures

The algorithm uses two measures for finding the splitting attribute: entropy and information gain.

1) Entropy

Entropy H(S) is a measure of the amount of uncertainty in the (data) set S (i.e. entropy characterizes the data set S):

H(S) = - Σ pi log2 pi        (1)

Where,
S - the current (data) set for which entropy is being calculated (changes on every iteration of the ID3 algorithm)
pi - the proportion of the number of elements in class i to the number of elements in set S

2) Information Gain

Information gain IG(A) is the measure of the difference in entropy from before to after the set S is split on an attribute A; in other words, how much uncertainty in S was reduced after splitting set S on attribute A:

IG(A, S) = H(S) - Σ P(i) H(i)        (2)

Where,
H(S) - entropy of set S
P(i) - the proportion of the number of elements in subset i to the number of elements in set S
H(i) - entropy of subset i
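As a sketch, equations (1) and (2) can be computed as follows in Python, over a toy data set of (attributes, class) examples; the feature names and labels are invented for illustration:

    import math
    from collections import Counter

    def entropy(examples):
        # H(S) = - sum(pi * log2(pi)) over the classes in the set  -- eq. (1)
        counts = Counter(label for _, label in examples)
        total = len(examples)
        return -sum((c / total) * math.log2(c / total)
                    for c in counts.values())

    def information_gain(examples, attribute):
        # IG(A, S) = H(S) - sum(P(i) * H(i)) over the subsets produced
        # by splitting S on attribute A  -- eq. (2)
        subsets = {}
        for attributes, label in examples:
            subsets.setdefault(attributes[attribute], []).append((attributes, label))
        remainder = sum(len(s) / len(examples) * entropy(s)
                        for s in subsets.values())
        return entropy(examples) - remainder

    def best_attribute(examples, attributes):
        # ID3 selects the attribute with the largest information gain
        return max(attributes, key=lambda a: information_gain(examples, a))

    data = [({"beam": 1, "parser": 0}, "civil"),
            ({"beam": 1, "parser": 0}, "civil"),
            ({"beam": 0, "parser": 1}, "cs"),
            ({"beam": 0, "parser": 1}, "cs")]
    print(best_attribute(data, ["beam", "parser"]))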
VI. EXPERIMENTAL RESULTS
The results of the system can be summarized as follows:
A. Preprocessing

The application of NLP operations and the TF-IDF algorithm shows a great improvement over naive preprocessing techniques by isolating the rare words that occur in each document, enabling a better classification process and thereby reducing the classification time from many hours to mere minutes.
B. Classification

With around 100 documents of varying sizes, Iterative Dichotomiser 3 was able to successfully classify nearly 80% of the documents for the given data set, comparing well with other algorithms.

According to Table 1, the numbers of Computer Science, Mechanical, Electronics and Civil documents correctly classified were 8, 18, 25 and 32 respectively, with 17 documents not classified correctly.
Table 1. Documents Classified

Departments        Documents Classified
C.S                8
MECH               18
EC                 25
Civil              32
Not Classified     17
Total Documents    100
The graphical representation of the document classification is shown in Figure 3, with the number of documents on the y-axis and the various categories on the x-axis.
Figure 3. Number of Documents vs. Categories
C. Graphical User Interface

The graphical user interface is highly flexible and provides the user with an easy and interactive experience when classifying documents. The user can either specify a directory path containing the documents to be classified or select specific documents from a list of documents (as shown in Figure 4).
Figure 4. Document Classification

Once the 'classify' button is pressed, the documents are classified and the output of the created directories is printed. The selected documents can be viewed again, in case any document is to be added or removed (as shown in Figure 5).

Figure 5. The Documents Selected

VII. CONCLUSION

This paper deals with the classification of various educational documents, which helps in topic-specific processing and information extraction, filtering and routing of documents, along with easier access to various documents.

NLP operations along with TF-IDF prove to work well for preprocessing, thereby helping in the better classification of the documents from the given data. We have used ID3 as the classifier for the data set, which showed good results. Finally, the paper focuses on the development and evaluation of a system that proves helpful in various educational sectors of the country.
ACKNOWLEDGEMENT

The authors would like to thank VGST (Vision Group on Science and Technology), Government of Karnataka, India for providing infrastructure facilities through the K-FIST Level II project at KSIT, CSE R&D Department, Bengaluru.
REFERENCES

[1] Rajaraman, A.; Ullman, J. D. (2011). "Data Mining". Mining of Massive Datasets. pp. 1-17. doi:10.1017/CBO9781139058452.002. ISBN 9781139058452.
[2] Juan Ramos, "Using TF-IDF to Determine Word Relevance in Document Queries", Department of Computer Science, Rutgers University, 23515 BPO Way, Piscataway, NJ, 08855.
[3] Quinlan, J. R. 1986. "Induction of Decision Trees". Machine Learning 1, 1 (Mar. 1986), 81-106.
[4] Murat Can Ganiz, Melike Tutkan, Selim Akyokus, "A Novel Classifier Based on Meaning for Text Classification", IEEE, 2015.
[5] Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, Pearson Education, 2005.