Assignment 2
Submitted to: Dr Kashif Javed
Submitted by: Muhammad Junaid, 2018-EE-71 (Section C)
ELECTRICAL ENGINEERING DEPARTMENT, UET LAHORE

Abstract

In this assignment we classify web pages into 20 classes using text classification. We used a preprocessed data set in bag-of-words (BOW) form. The classifiers we used are Random Forest, Support Vector Machine (SVM) and Logistic Regression. The 20 classes are People, Television, Health, Art, Cable, Culture, Music, Film, Business, Politics, Sports, Media, Review, Technology, Stage, Entertainment, Online, Industry, Variety and Multimedia. Finally, we applied an ANOVA test to check whether the classifiers' accuracies are statistically different from each other.

Introduction

Text classification assigns a text document to one of a set of pre-defined classes using a machine learning technique. The classification is usually done on the basis of significant words or features extracted from the text document, a representation usually known as bag of words (BOW). Since the classes are pre-defined, it is a supervised machine learning task.

The data set we used contains stemmed words from websites belonging to 20 different classes, preprocessed into BOW form. Each word acts as a feature, and every sample is represented by these features: for each sample, the value of a feature is the frequency of occurrence of that word. The data set contains 1560 samples. We split it randomly into 70% training and 30% testing data three times, applied three classifiers (Random Forest, Support Vector Machine (SVM) and Logistic Regression) to each split, and averaged the accuracies over the three splits to get the final accuracy of each classifier. To compare the accuracies of the three classifiers, we also performed an ANOVA test.
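
The procedure described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the assignment: the original BOW data set is not reproduced here, so synthetic data from `make_classification` stands in for it, and the classifier settings (scikit-learn defaults, plus `max_iter=1000` for Logistic Regression), the split seeds and the helper `make_classifiers` are all assumptions.

```python
# Sketch only: synthetic data stands in for the original BOW matrix.
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical stand-in data: 600 samples, 50 features, 4 classes
# (the assignment used 1560 samples and 20 classes).
X, y = make_classification(
    n_samples=600, n_features=50, n_informative=8,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)


def make_classifiers(seed):
    """Return fresh, untrained classifiers (assumed default settings)."""
    return {
        "Random Forest": RandomForestClassifier(random_state=seed),
        "SVM": SVC(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
    }


# One list of test accuracies per classifier, one entry per split.
accuracies = {name: [] for name in make_classifiers(0)}

# Three random 70% / 30% train/test splits (seeds 0, 1, 2 are arbitrary).
for seed in range(3):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    for name, clf in make_classifiers(seed).items():
        clf.fit(X_train, y_train)
        accuracies[name].append(clf.score(X_test, y_test))

# Average accuracy of each classifier over the three splits.
mean_accuracy = {name: float(np.mean(vals)) for name, vals in accuracies.items()}

# One-way ANOVA across the three groups of accuracies.
f_stat, p_value = f_oneway(*accuracies.values())

for name, acc in mean_accuracy.items():
    print(f"{name}: mean accuracy = {acc:.4f}")
print(f"ANOVA: F = {f_stat:.4f}, p = {p_value:.4f}")
```

A small p-value would suggest that at least one classifier's mean accuracy differs from the others. With only three accuracies per classifier, the test has little statistical power, so the result should be read with caution.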
ANOVA (analysis of variance) is a statistical test that checks whether the means of several groups, each assumed to be normally distributed, differ significantly from one another.

Automatic text classification has several useful applications, such as classifying text documents in electronic format, spam filtering, improving the results of search engines, opinion detection and opinion mining from online reviews of products, movies or political situations, and text sentiment mining.

Methodology

Data Set: We started with the preprocessed data set in BOW form. We loaded it into our environment with the pandas library and converted it into NumPy arrays as input to the classifiers. We then split the data set randomly three times, each time with 70% for training and 30% for testing.

Classifiers: We imported all three classifiers (Random Forest, Support Vector Machine (SVM) and Logistic Regression) from scikit-learn (sklearn), trained each model on the training data and then tested it on the testing data.

ANOVA: We collected the accuracies of the three classifiers on the three splits and applied an ANOVA test to them to check for a statistically significant difference between the classifiers.

Experimental Settings

We start by importing the libraries and the data set:
Figure 1: Importing Libraries and Data set

Then we split the data set randomly and apply the classifiers:
Figure 2: Classifiers

Then we apply the ANOVA test:
Figure 3: ANOVA Test

Results

The average accuracies (in percent) of the three classifiers over the three splits are:
Random Forest = 75.85470085
SVM = 78.84615385
Logistic Regression = 84.04558405

The accuracies of the three classifiers on each of the three random splits are compared in the following figure:
Figure 4: Classifiers' Accuracies

Conclusion

In this assignment we observed the behavior of three classifiers on text classification.
On the given data set all three classifiers give reasonably good accuracies, which could be improved further by tuning classifier parameters such as depth, max leaf nodes and random state (Random Forest), kernel and tolerance (SVM), and number of iterations (Logistic Regression). We applied an ANOVA test to find the statistical difference between the classifiers. We did not apply a t-test because a t-test determines whether two populations are statistically different from each other, whereas ANOVA determines whether three or more populations are statistically different from each other, and we had three classifiers to compare.

References

[1] https://scikit-learn.org/stable/
[2] https://www.statology.org/one-way-anova-python/
[3] https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
[4] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
[5] https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
[6] https://www.raybiotech.com/learning-center/t-test-anova/