02462 SIGNALS AND DATA, DTU COMPUTE, SPRING 2021 EXAM PROJECT ON TEXT CLASSIFICATION AND SENTIMENT ANALYSIS Jonas Schmidt Damgaards (s205822), Cornelius Erichs (s204085), Mikkel Thestrup (s204107) Group 9

Introduction

The world has become increasingly digitalized over the years. Thirty years ago, most articles could only be found in physical newspapers, and very little information was accessible on the internet. Because the internet has grown so much, a computer program must now search through huge amounts of data to find the best match for a user's query. Here a computer program assigns the user's input to a category and finds the best match within that category to show the user; this application is called text classification. In other cases we are interested in classifying the overall emotions in a text; here sentiment analysis is used. This application is useful for analyzing underlying patterns in texts. Both applications are widely used across the internet today, in all sorts of search algorithms. These applications are investigated in this paper: first we classify the genres of movies, and afterwards we apply sentiment analysis to Donald Trump's inauguration speech.

Methods

We use the IMDB dataset, which consists of the genre, rating, plot, title, and publishing year of movies listed on the website. First we split our dataset into a training and a test set; this technique is used to determine whether our model works on unseen data. The data has been split into 15% for testing and 85% for training. After the data is split, we translate each word into a vector, which is needed for a text classification problem. To obtain a vector for each word, we use a pre-defined GloVe dataset to build a dictionary: looking up a word in the dictionary returns a vector of size 100.
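This lookup and averaging step can be sketched as follows (a minimal sketch: the file format assumed is the standard GloVe text format, and the whitespace tokenization and function names are our own illustration, not taken from the project code):

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe text file into a word -> vector dictionary."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

def plot_vector(plot, embeddings, dim=100):
    """Mean of the GloVe vectors of the words in a movie plot.

    Words missing from the dictionary are skipped; a plot with no
    known words maps to the zero vector.
    """
    words = plot.lower().split()
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vectors, axis=0)
```

Each movie plot is thereby reduced to a single 100-dimensional vector, which is what the PCA and the classifiers below operate on.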
This is then used to make one vector for each movie, namely the mean of the word vectors of its plot. With these vectors we are able to perform PCA, which was done with the PCA function from sklearn.decomposition. After the dimensionality reduction we are able to plot our principal components and the explained variance of each component. In order to use this method as a classifier, we apply the dimensionality reduction to the test set to determine the genre.

We use two different classifiers: spatial distance cosine (SDC) and Naive Bayes (NB). For SDC we compute one mean vector for each movie genre and look at the cosine distance between these mean vectors and the test data. The SDC similarity is given by

similarity = (A · B) / (||A|| ||B||),

and Naive Bayes by

p(θ|x) = p(x|θ) p(θ) / p(x).

Both were implemented with Python libraries.

By using the given code for FastText, we classified each movie's genre by calling the predict function on each movie plot. The PCA of the FastText embeddings was done the same way as for our baseline method, by calling sklearn's PCA function; we are hereafter able to plot the principal components and the explained variance of each component. To test our models on data outside the IMDB dataset, we wrote our own plot to mimic a category: "It was a dark night when Peter arrived at the motel. He quickly realised something was wrong. During the night strange noises came from the walls and items fell on the ground."

For the sentiment analysis of the speech held by Donald Trump, we cleaned the dataset and made window-wise sentiment calculations, where each value in the signal is the sum of the sentiment scores of the words in the window. With this method we were able to plot the sentiment signal throughout the speech. Since the sentiment signal can change value quickly, we apply the Savgol filter from SciPy to smooth the signal. In order to make our results reproducible, we use a seed/random state of 15 throughout the code.
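The SDC classifier can be sketched as follows (a minimal sketch assuming each movie is already represented by a mean embedding vector as above; the function and variable names are our own illustration, not from the project code):

```python
import numpy as np
from scipy.spatial.distance import cosine

def genre_means(vectors, labels):
    """One mean vector per genre, computed from the training embeddings."""
    means = {}
    for genre in set(labels):
        rows = [v for v, g in zip(vectors, labels) if g == genre]
        means[genre] = np.mean(rows, axis=0)
    return means

def predict_sdc(vector, means):
    """Assign the genre whose mean vector has the smallest cosine
    distance, i.e. the largest similarity (A . B) / (||A|| ||B||)."""
    return min(means, key=lambda g: cosine(vector, means[g]))
```

Note that scipy's `cosine` returns the cosine *distance* (1 minus the similarity), so minimizing it is equivalent to maximizing the similarity in the equation above.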
Results

Table for accuracy:

Method         | Accuracy
SDC Train      | 53.9%
SDC Test       | 49.3%
NB Train       | 45.9%
NB Test        | 42.4%
FastText Test  | 54.2%

Fig. 1. The table above shows the accuracy of the different methods

The principal components of the GloVe method:
Fig. 2. First two principal components of the GloVe method
Fig. 3. Second and third principal components of the GloVe method
Fig. 4. Third and fourth principal components of the GloVe method
Fig. 5. Explained variance ratio for the GloVe method

Confusion matrices:
Fig. 6. Confusion matrix for the GloVe method, using Naive Bayes to classify genres
Fig. 7. Confusion matrix for the GloVe method, using cosine distance to classify genres

The principal components of the FastText implementation:
Fig. 8. First two principal components of the FastText implementation
Fig. 9. Second and third principal components of the FastText implementation
Fig. 10. Third and fourth principal components of the FastText implementation
Fig. 11. Explained variance ratio for the FastText implementation

Sentiment signals:
Fig. 12. Sentiment signal for Donald Trump's speech, window size: 5 and stride: 3

When we use the FastText, NB, and SDC classifiers on the movie plot we wrote ourselves (see Methods), FastText and SDC predicted the genre correctly as Thriller, while NB predicted Romantic. Although SDC predicted the correct genre for our own movie plot, it is a hollow victory: the confusion matrix (fig. 7) shows that it predicts Thriller in the majority of cases. When performing sentiment analysis on Donald Trump's inauguration speech, we observed that most windows had a positive sentiment score, some of them high (up to 10, fig. 12), and only a few negative ones, which could be interpreted as him trying to paint a positive picture of the future ahead.

Fig. 13.
Sentiment signal for Donald Trump's speech with a smoothing filter applied; the filter has window size: 7 and polynomial order: 2

Discussion

When we performed PCA on the GloVe vectors from the training set, we observed that the clusters lie fairly close to each other and often overlap. We also observed that the Romantic genre was the only one of the categories that was slightly shifted towards the negative end of the PC1 axis (fig. 2). This trend continues for PC2 vs PC3 (fig. 3) and PC3 vs PC4 (fig. 4), although in a different direction, showing that our training data is not as clearly separated as one could have hoped when performing PCA. This observation also shows up in the confusion matrix for SDC (fig. 7): since Thriller, Sci-Fi, and Family all lie in one big cluster and Romantic is the only slightly shifted genre, the classifier mostly guesses Thriller or Romantic. We chose to work with 10 principal components for the GloVe method, based on fig. 5: this is where we see the largest variance, and after 10 principal components the slope of the cumulative explained variance begins to flatten.

We used two different methods for classification, NB and SDC. With NB we got an accuracy of 42.4% and with SDC 49.3%. The difference could be due to NB assuming that the features are independent of each other, while SDC makes no such assumption. Although assuming independence is faster, we see a significant loss in accuracy with this assumption. Comparing the SDC and NB methods with the FastText implementation, both achieve a lower accuracy than FastText, which reached 54.2%. The PCA of the FastText embeddings also shows a much clearer separation of the clusters (fig. 8). Overall, however, the accuracy is still on the low end.

Since the sentiment score is calculated window-wise, a small window can result in rapid fluctuations (fig.
12), therefore a smoothing filter has been applied (fig. 13). We still observe fluctuations in the score, but the small rapid fluctuations are less prominent. We achieved the desired smoothing with a window size of 7 and a polynomial order of 2 for the Savgol filter.

Conclusion

Based on our results, we can conclude that the FastText method had the highest accuracy in classifying the genres of the IMDB dataset, namely 54.2%. Furthermore, based on our results for SDC and NB, the naive independence assumption of NB leads to a significant accuracy reduction, since the features are not independent of each other. By analysing Donald Trump's inauguration speech, we can conclude that many emotionally loaded words were used, especially positive ones.
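As a final illustration, the window-wise sentiment scoring and Savgol smoothing described in Methods can be sketched as follows (a minimal sketch: the word-score dictionary used in the test and the helper names are our own illustration, while the filter parameters, window size 7 and polynomial order 2, are those used in the report):

```python
import numpy as np
from scipy.signal import savgol_filter

def sentiment_signal(words, scores, window=5, stride=3):
    """Sum of per-word sentiment scores in each sliding window.

    Words missing from the score dictionary contribute 0.
    """
    signal = []
    for start in range(0, len(words) - window + 1, stride):
        chunk = words[start:start + window]
        signal.append(sum(scores.get(w, 0) for w in chunk))
    return np.array(signal, dtype=float)

def smooth(signal, window_length=7, polyorder=2):
    """Savitzky-Golay smoothing, as applied to the speech signal."""
    return savgol_filter(signal, window_length=window_length,
                         polyorder=polyorder)
```

A larger `window` in `sentiment_signal` averages out more word-level noise before the filter is even applied, which is the trade-off between fig. 12 and fig. 13.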