Uploaded by junaidamjad543

Text Classification

Assignment 2
Submitted to: Dr Kashif Javed
Submitted by: Muhammad Junaid
2018-EE-71 (Section C)
ELECTRICAL ENGINEERING DEPARTMENT,
UET LAHORE
Abstract
In this assignment we classify web pages into 20 classes using text classification. The data
set was already preprocessed into a bag-of-words (BOW) representation. The classifiers we
used are Random Forest, Support Vector Machine (SVM) and Logistic Regression. The 20
classes are People, Television, Health, Art, Cable, Culture, Music, Film, Business, Politics,
Sports, Media, Review, Technology, Stage, Entertainment, Online, Industry, Variety and
Multimedia. Finally, we applied an ANOVA test to check whether the accuracies of the
classifiers are statistically different from each other.
Introduction
Text classification assigns a text document to one or more pre-defined classes using a
machine learning technique. The classification is usually based on significant words or
features extracted from the document, a representation usually known as a bag of words
(BOW). Since the classes are pre-defined, it is a supervised machine learning task.
The data set we used contains stemmed words from websites belonging to 20 classes and was
preprocessed into BOW form. Each word acts as a feature, and the value of a feature for a
given sample is the frequency of occurrence of that word in the sample. The data set has
1560 samples. We split it randomly into 70% training and 30% testing data three times,
applied three classifiers (Random Forest, Support Vector Machine (SVM) and Logistic
Regression) to each split, and averaged the accuracies over the three splits to get the final
accuracy of each classifier.
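To make the BOW representation concrete, here is a minimal sketch. The three toy documents are hypothetical and are not taken from the assignment's data set; the sketch only shows how each word becomes a feature whose value is its count in a sample.

```python
from collections import Counter

# Hypothetical toy documents (already stemmed); not from the assignment's data set.
docs = [
    "film review film actor",
    "music review band",
    "sport team sport sport",
]

# Vocabulary: every distinct word becomes one feature (column), sorted for a stable order.
vocab = sorted({word for doc in docs for word in doc.split()})

def to_bow(doc, vocab):
    """Return the word-count vector of `doc` over `vocab`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

# One row per document, one column per vocabulary word.
bow_matrix = [to_bow(doc, vocab) for doc in docs]

print(vocab)
for row in bow_matrix:
    print(row)
```

Each row is one sample and each column one word, which matches the form in which the preprocessed data set is described above.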
To compare the accuracies of the three classifiers we also performed an ANOVA test.
ANOVA (analysis of variance) is a statistical test that checks whether the means of several
groups differ significantly, assuming each group is approximately normally distributed.
Automatic text classification has several useful applications such as classifying text
documents in electronic format, spam filtering, improving search results of search engines,
opinion detection and opinion mining from online reviews of products, movies or political
situations and text sentiment mining.
Methodology
Data Set: We started with the preprocessed data set in BOW form. We loaded it into our
environment with the pandas library and converted it to NumPy arrays as input to the
classifiers. We then split the data set randomly three times, each time into 70% training
and 30% testing data.
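The loading and splitting step can be sketched as follows. This is only an illustration under assumptions: the report does not give the file name, column names or vocabulary size, so the sketch builds a synthetic table with a hypothetical "label" column and 50 hypothetical word columns instead of reading the real file. The split seeds are also arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the BOW table (assumption): 1560 samples, 50 word-count
# features, 20 classes. With the real file this would be a pd.read_csv(...) call.
rng = np.random.default_rng(0)
feature_names = [f"w{i}" for i in range(50)]          # hypothetical column names
df = pd.DataFrame(rng.integers(0, 5, size=(1560, 50)), columns=feature_names)
df["label"] = rng.integers(0, 20, size=1560)          # hypothetical label column

# Convert to NumPy arrays for the classifiers.
X = df[feature_names].to_numpy()
y = df["label"].to_numpy()

# Three independent random 70/30 splits; the seeds 0, 1, 2 are arbitrary.
splits = []
for seed in range(3):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=seed
    )
    splits.append((X_train, X_test, y_train, y_test))
    print(seed, X_train.shape, X_test.shape)
```

With 1560 samples, a 30% test share gives 468 test samples and 1092 training samples per split.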
Classifiers: We imported all three classifiers (Random Forest, Support Vector Machine (SVM)
and Logistic Regression) from scikit-learn, trained each model on the training data and then
tested it on the testing data.
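A minimal sketch of this train-and-test loop is given below. It is not the code from the assignment: the data is synthetic (generated with make_classification), and the hyperparameters (defaults, plus max_iter=1000 for Logistic Regression) are assumptions, since the report does not state the settings used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); the real BOW data set is not reproduced here.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)

def make_classifiers():
    """Return fresh, untrained models; the hyperparameters are assumptions."""
    return {
        "Random Forest": RandomForestClassifier(random_state=0),
        "SVM": SVC(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
    }

accuracies = {name: [] for name in make_classifiers()}

# Three random 70/30 splits; every classifier is trained and tested on each one.
for seed in range(3):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=seed
    )
    for name, clf in make_classifiers().items():
        clf.fit(X_train, y_train)
        accuracies[name].append(accuracy_score(y_test, clf.predict(X_test)))

# Final accuracy of a classifier = mean over the three splits.
for name, accs in accuracies.items():
    print(f"{name}: mean accuracy = {sum(accs) / len(accs):.4f}")
```

The numbers printed here come from synthetic data and are not comparable to the accuracies reported in this assignment.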
ANOVA: We collected the accuracies of the three classifiers on the three splits and then
applied an ANOVA test to them to check for a statistically significant difference between
the classifiers.
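As an illustration of this step, a one-way ANOVA can be run with scipy.stats.f_oneway. The per-split accuracies below are hypothetical placeholders, because the report gives only the averages and not the per-split values. The 0.05 significance level is also an assumption.

```python
from scipy.stats import f_oneway

# Hypothetical per-split accuracies (one value per random split); the report
# gives only the averages, so these numbers are illustrative placeholders.
rf_acc = [0.75, 0.76, 0.77]
svm_acc = [0.78, 0.79, 0.80]
lr_acc = [0.83, 0.84, 0.85]

# One-way ANOVA: the null hypothesis is that all three groups have the same mean.
f_stat, p_value = f_oneway(rf_acc, svm_acc, lr_acc)

alpha = 0.05  # conventional significance level (assumption)
significant = p_value < alpha

print(f"F = {f_stat:.3f}, p = {p_value:.6f}, significant at {alpha}: {significant}")
```

A p-value below the chosen level suggests that at least one classifier's mean accuracy differs from the others; the test alone does not say which one.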
Experimental Settings
We started by importing the libraries and the data set:
Figure 1: Importing Libraries and Data set
Then we split the data set randomly and applied the classifiers:
Figure 2: Classifiers
Then we applied the ANOVA test:
Figure 3: ANOVA Test
Results
The average accuracy of each classifier over the three splits, in percent, is:
Random Forest = 75.85470085%
SVM = 78.84615385%
Logistic Regression = 84.04558405%
The accuracies of the three classifiers on each of the three random splits are compared in
the following figure:
Figure 4: Classifiers’ Accuracies
Conclusion
In this assignment we observed the behavior of three classifiers on a text classification
task. On the given data set all three classifiers give reasonably good accuracies (about 76%
to 84%), with Logistic Regression the highest and Random Forest the lowest. These could be
improved further by changing classifier parameters such as depth, max leaf nodes and
random state (Random Forest), kernel and tolerance (SVM), and number of iterations
(Logistic Regression).
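One possible way to explore such parameters is a small cross-validated grid search, sketched below. The data is synthetic and the candidate values are arbitrary assumptions; only the parameter names (max_depth, max_leaf_nodes, kernel, tol, max_iter) are actual scikit-learn arguments.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); not the assignment's BOW data set.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=10, n_classes=3, random_state=0
)

# Candidate values are arbitrary examples, not recommendations.
searches = {
    "Random Forest": GridSearchCV(
        RandomForestClassifier(random_state=0),
        {"max_depth": [5, None], "max_leaf_nodes": [20, None]},
        cv=3,
    ),
    "SVM": GridSearchCV(
        SVC(), {"kernel": ["linear", "rbf"], "tol": [1e-3, 1e-4]}, cv=3
    ),
    "Logistic Regression": GridSearchCV(
        LogisticRegression(), {"max_iter": [200, 1000]}, cv=3
    ),
}

# Fit each search and keep the best parameter combination found by 3-fold CV.
best_params = {}
for name, search in searches.items():
    search.fit(X, y)
    best_params[name] = search.best_params_
    print(f"{name}: best params = {search.best_params_}, "
          f"CV accuracy = {search.best_score_:.4f}")
```

Selecting parameters by cross-validation on the training data keeps the test split untouched for the final accuracy estimate.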
We applied an ANOVA test rather than a t-test to find out whether the classifiers are
statistically different. A t-test determines whether two populations are statistically
different from each other, whereas ANOVA determines whether three or more populations
are statistically different from each other. Since we compare three classifiers, ANOVA is
the appropriate test.
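The relationship between the two tests can be checked numerically: with exactly two groups, a one-way ANOVA and an equal-variance two-sample t-test give the same p-value, and the F statistic equals the square of the t statistic. The sketch below uses hypothetical accuracy values, not results from this assignment.

```python
import math
from scipy.stats import f_oneway, ttest_ind

# Hypothetical accuracies of two classifiers over three splits (placeholders).
group_a = [0.75, 0.76, 0.77]
group_b = [0.83, 0.84, 0.85]

# Two-sample t-test (equal variances assumed, the scipy default).
t_stat, t_p = ttest_ind(group_a, group_b)

# One-way ANOVA on the same two groups.
f_stat, f_p = f_oneway(group_a, group_b)

print(f"t^2 = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
print(f"t-test p = {t_p:.6f}, ANOVA p = {f_p:.6f}")

# For two groups the two tests agree.
same_statistic = math.isclose(t_stat ** 2, f_stat, rel_tol=1e-6)
same_p_value = math.isclose(t_p, f_p, rel_tol=1e-6)
```

With three or more groups, running a t-test on every pair would inflate the chance of a false positive, which is why a single ANOVA is preferred in that case.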
References
[1] https://scikit-learn.org/stable/
[2] https://www.statology.org/one-way-anova-python/
[3] https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
[4] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
[5] https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
[6] https://www.raybiotech.com/learning-center/t-test-anova/