
Exam project on text classification and sentiment analysis

02462 SIGNALS AND DATA, DTU COMPUTE, SPRING 2021
EXAM PROJECT ON TEXT CLASSIFICATION AND SENTIMENT ANALYSIS
Jonas Schmidt Damgaards (s205822), Cornelius Erichs (s204085), Mikkel Thestrup (s204107)
Group 9
Introduction
Over the years, the world has become increasingly digitalized. Thirty years ago most articles could only be found in physical newspapers, and very little information was accessible on the internet.
Since the internet has grown so much, a computer program now has to look through huge amounts of data to find the best match for a user's search. The program assigns the user's input to a category and finds the best match within that category to show the user. This application is called text classification.
In other cases we are interested in classifying the overall emotions in a text; for this, sentiment analysis is used. This application is useful for analyzing underlying patterns in texts. Both applications are widely used across the internet today, in all sorts of search algorithms, and both are investigated in this paper. First we classify the genres of movies; afterwards we apply sentiment analysis to Donald Trump's inauguration speech.
Methods
We use the IMDB dataset, which contains the genre, rating, plot, title and publishing year for movies listed on the website.
First we split our dataset into a training set and a test set, which lets us determine whether our model works on unseen data; 15% of the data is used for testing and 85% for training.
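A split like this can be sketched with scikit-learn; the plot texts and genre labels below are illustrative stand-ins for the real IMDB data:

```python
from sklearn.model_selection import train_test_split

# Illustrative stand-ins for the real data: plot texts and genre labels
plots = ["A detective hunts a serial killer.",
         "Two strangers fall in love in Paris.",
         "Aliens invade Earth.",
         "A family road trip goes wrong."]
genres = ["Thriller", "Romantic", "Sci-Fi", "Family"]

# 85% training / 15% test, with a fixed random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    plots, genres, test_size=0.15, random_state=15)
```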
After the data is split, we translate each word into a vector, which is needed for a text classification problem. To obtain a vector for each word, we use a pre-defined GloVe dataset to build a dictionary; looking up a word in the dictionary yields a vector of size 100. From these we compute one vector per movie: the mean of the word vectors of its plot.
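A minimal sketch of this embedding step is shown below; the GloVe filename and the 100-dimensional variant are assumptions, and the tokenization (lowercased whitespace split) is a simplification:

```python
import numpy as np

def load_glove(path="glove.6B.100d.txt"):
    # Build a word -> vector dictionary from a GloVe text file
    # (filename and 100-dim variant are assumptions)
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

def plot_vector(plot, embeddings, dim=100):
    # Mean of the GloVe vectors of the plot words found in the dictionary
    vectors = [embeddings[w] for w in plot.lower().split() if w in embeddings]
    if not vectors:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vectors, axis=0)
```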
With these vectors we are able to perform PCA, which was done with the PCA function from sklearn.decomposition. After the dimensionality reduction we can plot our principal components and the explained variance of each component.
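The PCA step can be sketched as follows; random data stands in for the actual 100-dimensional mean plot vectors:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the 100-dimensional mean plot vectors (random data for illustration)
rng = np.random.default_rng(15)
X = rng.normal(size=(200, 100))

pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)

# explained_variance_ratio_ gives the variance captured per principal component;
# its cumulative sum is what fig. 5 and fig. 11 plot
cumulative = np.cumsum(pca.explained_variance_ratio_)
```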
In order to use this method as a classifier, we apply the same dimensionality reduction to the test set before determining the genre. We use two different classifiers: spatial distance cosine (SDC) and Naive Bayes (NB). For SDC we compute one mean vector per movie genre and look at the cosine distance between these mean vectors and the test data.
The SDC similarity is found by: similarity = (A · B) / (‖A‖ ‖B‖), and Naive Bayes by: p(θ|x) = p(x|θ) p(θ) / p(x). Both were implemented with Python libraries.
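The SDC classifier can be sketched in a few lines; the toy two-dimensional "genre mean vectors" below are illustrative (the real ones are 100-dimensional), and the report does not name the specific NB library class used:

```python
import numpy as np
from scipy.spatial.distance import cosine  # cosine distance = 1 - similarity

def sdc_predict(x, genre_means):
    # Choose the genre whose mean vector is closest in cosine distance
    return min(genre_means, key=lambda g: cosine(x, genre_means[g]))

# Toy 2-dimensional genre mean vectors for illustration
genre_means = {"Thriller": np.array([1.0, 0.0]),
               "Romantic": np.array([0.0, 1.0])}
```

For NB, one option would be a scikit-learn naive Bayes classifier fitted on the same vectors, but the exact class is an assumption since the report only mentions "Python libraries".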
Using the given code for FastText, we classified each movie's genre by calling the predict function on each movie plot. The PCA for the FastText embeddings was done the same way as for our baseline method, using sklearn's PCA function, after which we could plot the principal components and the explained variance of each component.
To test our models on data outside the IMDB dataset, we wrote our own plot to mimic a category: "It was a dark night when Peter arrived at the motel. He quickly realised something was wrong. During the night strange noises came from the walls and items fell on the ground."
For the sentiment analysis of the speech held by Donald Trump, we cleaned the dataset and made window-wise sentiment calculations, where each value in the signal is the sum of the sentiment scores of the words in the window. With this method we were able to plot the sentiment signal throughout the speech. Since the sentiment signal can change value quickly, we apply the Savitzky-Golay (savgol) filter from SciPy to smooth the signal.
In order to make our results reproducible, we use a seed/random state of 15 throughout the code.
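The window-wise scoring and the smoothing step can be sketched as below; the tiny sentiment lexicon and the repeated toy text are assumptions for illustration, since the report's actual word-score list is not shown:

```python
import numpy as np
from scipy.signal import savgol_filter

# Toy sentiment lexicon (an assumption; the actual word-score list is not shown)
lexicon = {"great": 3, "again": -1}

def sentiment_signal(words, scores, window=5, stride=3):
    # Each value is the sum of the word scores inside one window
    signal = []
    for start in range(0, len(words) - window + 1, stride):
        signal.append(sum(scores.get(w, 0) for w in words[start:start + window]))
    return np.array(signal, dtype=float)

words = ("we will make america great again " * 5).split()  # 30 toy words
raw = sentiment_signal(words, lexicon, window=5, stride=3)

# Savitzky-Golay smoothing as in fig. 13: window size 7, polynomial order 2
smooth = savgol_filter(raw, window_length=7, polyorder=2)
```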
Results
Table for accuracy

Method          Accuracy
SDC Train       53.9%
SDC Test        49.3%
NB Train        45.9%
NB Test         42.4%
FastText Test   54.2%

Fig. 1. In the table above we see the accuracy for the different methods
The principal components of the GloVe method
Fig. 2. First two principal components of the GloVe method
Fig. 3. Second and third principal components of the GloVe method
Fig. 4. Third and fourth principal components of the GloVe method
Fig. 5. Explained variance ratio for the GloVe method
Confusion matrices
Fig. 6. Confusion matrix with use of the GloVe method, using Naive Bayes to classify genres
Fig. 7. Confusion matrix with use of the GloVe method, using cosine distance to classify genres
The principal components of the FastText implementation
Fig. 8. First two principal components of the FastText implementation
Fig. 9. Second and third principal components of the FastText implementation
Fig. 10. Third and fourth principal components of the FastText implementation
Fig. 11. Explained variance ratio for the FastText implementation
Sentiment signals
Fig. 12. Sentiment signal for Donald Trump's speech, window size 5 and stride 3
When we use the FastText, NB, and SDC classifiers on a movie plot we wrote ourselves (see Methods), FastText and SDC predicted the genre correctly (Thriller), while NB predicted Romantic. Although SDC predicted the correct genre for our own movie plot, it is a hollow victory: the confusion matrix (fig. 7) shows that it predicts Thriller in the majority of cases.
When performing sentiment analysis on Donald Trump's inauguration speech, we observed that most windows had a positive sentiment score, some even a very high one (up to 10, fig. 12), while only a few were negative. This could be interpreted as him trying to paint a positive picture of the future ahead.
Fig. 13. Sentiment signal for Donald Trump's speech with a smoothing filter applied; the filter has window size 7 and polynomial order 2
Discussion
When we performed PCA on our GloVe vectors from the training set, we observed that the clusters lie fairly close to each other and often overlap. We also observed that the Romantic genre was the only category shifted slightly toward the negative end of the PC1 axis (fig. 2). This trend continues for PC2 vs PC3 (fig. 3) and PC3 vs PC4 (fig. 4), although in a different direction, showing that our training data is not as clearly separated under PCA as one could have hoped.
This observation also shows up in our confusion matrix for SDC (fig. 7): since Thriller, Sci-Fi and Family all lie in one big cluster and Romantic is the only slightly shifted genre, the classifier ends up guessing only Thriller and Romantic.
We have chosen to work with 10 principal components for the GloVe method, based on fig. 5: this is where we see the largest variance, and after 10 principal components the slope of the cumulative explained variance begins to flatten.
We have used two different methods for classification, NB and SDC. With NB we got an accuracy of 42.4%, and with SDC we got an accuracy of 49.3%. The difference could stem from NB assuming that the features are independent of each other, while SDC makes no such assumption. Although assuming independence is faster, we see a significant loss in accuracy with this assumption.
When comparing the SDC and NB methods with the FastText implementation, we see that both achieve a lower accuracy than FastText, which reached 54.2%. In the FastText PCA we also see a much clearer separation of the clusters (fig. 8). Overall, however, the accuracy is still on the low end.
Since the sentiment score is calculated window-wise, a small window can result in rapid fluctuations (fig. 12); therefore a smoothing filter has been applied (fig. 13). We still observe fluctuations in the score, but small rapid fluctuations are less prominent. We achieved the desired smoothing with a window size of 7 and a polynomial order of 2 for the savgol filter.
Conclusion
Based on our results, we can conclude that the FastText method had the highest accuracy for classifying the genres of the IMDB dataset texts, at 54.2%. Furthermore, based on our results for SDC and NB, the naive independence assumption of NB causes a significant reduction in accuracy, since the features are not independent of each other.
By analysing Donald Trump's inauguration speech, we can conclude that many emotionally loaded words were used, positive words especially often.