Are they lovin' it? Sentiment Classification on #mcdonalds

LiU – Text Mining VT 2013 – Project Presentation
Paolo Elena
Scary first slide, huh?
● OpenOffice Impress isn't what you call a wonder of open source, and I can't afford better
● But worry not, it won't continue in Comic Sans MS
● I did, however, use a Windows machine for all the programming
● Even more scary, perhaps
● Not entirely my fault: we have always used Windows here
● It does make things harder for:
  ● Getting little-known Python libraries to run
  ● Installing tools that make life easier with Python or allow it to run faster
  ● Everything else
Introduction
● Sentiment analysis is the identification of the prevailing opinion, if any, in a text; the most common categorizations are positive-negative(-neutral) and subjective-objective
● I focused on supervised learning algorithms trained on in-domain data (the same dataset on which the results are later evaluated); many other approaches are possible
● I chose Twitter data for a given hashtag (#mcdonalds)
● The idea was to then assess how well the models carry over to data collected about the same topic in another language and machine-translated into the training language
Data Collection
● I used code snippets described in the recommended book "Mining the Social Web" by Matthew A. Russell – they are available for free on GitHub (look for ptwobrussell)
● The Twitter API has changed since the book was published, but the author has kept the code updated
● I downloaded 2100 tweets containing the hashtag #mcdonalds (and only that hashtag, since looser filtering gave too many unrelated tweets), selecting English-language tweets only and excluding retweets
● I exported them to an Excel table (library xlwt) and labelled them by hand as "GOOD", "BAD", "NEUTR"
  → BORING, but also a dive into interesting aspects of American culture (or lack thereof), such as Man vs. Fast Food
● After cleaning, I was left with 1908 classified data entries that I imported back into a Python list (library xlrd)
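A minimal sketch of the spreadsheet round trip with xlwt and xlrd; the file name, column layout (text in column 0, hand-written GOOD/BAD/NEUTR label in column 1) and placeholder tweets are my assumptions, not the original script:

    import xlwt
    import xlrd

    # Placeholder tweets; in the project these were the ~2100 downloaded tweets
    tweets = ["i'm lovin it #mcdonalds", "worst fries ever #mcdonalds"]

    # Write the tweets to an .xls sheet for hand-labelling
    book = xlwt.Workbook()
    sheet = book.add_sheet("tweets")
    for row, text in enumerate(tweets):
        sheet.write(row, 0, text)
        sheet.write(row, 1, "")   # empty cell where the GOOD/BAD/NEUTR label goes by hand
    book.save("mcdonalds_tweets.xls")

    # ...after manual labelling, read the (text, label) pairs back into a Python list
    book = xlrd.open_workbook("mcdonalds_tweets.xls")
    sheet = book.sheet_by_index(0)
    labelled = [(sheet.cell_value(r, 0), sheet.cell_value(r, 1))
                for r in range(sheet.nrows)]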
Characteristics of Twitter data
● Short bits of text (max 140 characters), often ill-structured and noisy (odd punctuation, bad grammar, non-standard spellings, slang, mistakes)
● Emoticons appear fairly often
● Links are sometimes included → very hard to judge the sentiment of those tweets, as one would likely have to open the link → I left those tweets out
● Often the sentiment is hard to judge even for a human
● Sometimes a single word conveys all the sentiment in a tweet
● All of the above make it hard to achieve high classification performance
Preprocessing and Feature Extraction - 1
● A key step in all kinds of text classification problems
● I stuck to bag-of-words models, trying some combinations of the steps listed below, with binary features: contains(%word): True/False (a sketch follows this list)
● Tokenization: I used two tokenizers – the basic nltk word tokenizer, and a "sentiment-aware" tokenizer written by a Stanford linguist (found at sentiment.christopherpotts.net) that preserves constructs such as emoticons
● Stemming: I used the Porter stemmer (stemming is in any case said to remove too much information for sentiment analysis)
● Stopword removal: I used the basic list included in nltk
● Collocations: sometimes I included the N most frequent bigrams among the features
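A minimal sketch of the binary contains(word) features with the basic nltk tokenizer, Porter stemming and stopword removal; the function name and the tiny vocabulary standing in for the top-N most frequent tokens are illustrative:

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    stop = set(stopwords.words("english"))

    def extract_features(tweet, vocabulary):
        # Binary contains(word) features over a fixed vocabulary
        tokens = nltk.word_tokenize(tweet.lower())
        tokens = [stemmer.stem(t) for t in tokens if t not in stop]
        return {"contains(%s)" % w: (w in tokens) for w in vocabulary}

    # Toy vocabulary standing in for the N most frequent (stemmed) tokens
    vocab = [stemmer.stem(w) for w in ["love", "hate", "fries", "wait"]]
    print(extract_features("I love the fries at #mcdonalds :)", vocab))

Note how the basic tokenizer splits ":)" apart, which is exactly what the sentiment-aware tokenizer avoids.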
Preprocessing and Feature Extraction - 2
● Negation marking: since bag-of-words cannot distinguish between "good" alone and "good" in "not good", one suggested strategy is to identify some negating words (e.g. "not", "never", all those ending in "n't") and append the suffix "_NEG" to all tokens following them, until the end of the sentence (see the sketch after this list)
● The main classifier used was the standard Naive Bayes classifier included in nltk
● With the standard feature selection (top N most frequent tokens), none of the above preprocessing steps proved distinctly superior to the others (I settled on N=125)
● I also chose a featureset by hand as a comparison, then turned to a more advanced feature selection method
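A minimal sketch of the _NEG marking described above; the negation word list and the punctuation that ends the negation scope are illustrative choices:

    import re

    NEGATIONS = {"not", "no", "never"}  # plus any token ending in "n't"

    def mark_negation(tokens):
        # Append _NEG to every token after a negation word, until the end of the sentence
        marked, negated = [], False
        for tok in tokens:
            if tok in NEGATIONS or tok.endswith("n't"):
                negated = True
                marked.append(tok)
            elif re.match(r"[.!?;]+$", tok):   # sentence-ending punctuation resets the scope
                negated = False
                marked.append(tok)
            else:
                marked.append(tok + "_NEG" if negated else tok)
        return marked

    print(mark_negation("the fries were not good at all .".split()))
    # ['the', 'fries', 'were', 'not', 'good_NEG', 'at_NEG', 'all_NEG', '.']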
Mutual information feature selection
● The goal is to choose features that have a different probability of occurring in tweets belonging to different classes (e.g. ":)" occurs much more often in positive tweets than in neutral or negative ones, hence it carries more information than "the", which is roughly uncorrelated with the class)
● One way to do this is to score the relative frequencies of tokens for each class (frequency of the token in tweets of that class versus overall frequency of the token) using the chi-squared metric, and select the N tokens with the highest score (sketched below)
● This led to a big improvement in classification performance (see http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/)
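A rough sketch of the scoring step, along the lines of the streamhacker post linked above, written against a current nltk version; it assumes the labelled data is available as (label, token list) pairs, and best_words is just an illustrative name:

    from nltk.probability import FreqDist, ConditionalFreqDist
    from nltk.metrics import BigramAssocMeasures

    def best_words(labelled_tokens, n):
        # labelled_tokens: iterable of (label, tokens); returns the n highest-scoring tokens
        word_fd = FreqDist()
        label_fd = ConditionalFreqDist()
        for label, tokens in labelled_tokens:
            for tok in tokens:
                word_fd[tok] += 1
                label_fd[label][tok] += 1
        total = word_fd.N()
        scores = {}
        for word, freq in word_fd.items():
            # Sum the chi-squared association of the word with each class
            scores[word] = sum(
                BigramAssocMeasures.chi_sq(label_fd[label][word],
                                           (freq, label_fd[label].N()),
                                           total)
                for label in label_fd.conditions())
        return sorted(scores, key=scores.get, reverse=True)[:n]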
Results – NB classifier
● The class distribution was 35.5% GOOD, 27% BAD, 37.5% NEUTRAL – fairly even, so accuracy comparisons are reasonably safe
● I chose a 75%-25% training-test split (a training and evaluation sketch follows this list)
● Before mutual information feature selection, the classifier hovered around 55% accuracy on the test set and 61% on the training set. Not really satisfying, but since the baseline for a 3-class problem is around 33%, it was still much better than a 3-sided die (which is also claimed not to exist)
● The hand-picked features scored 64% - 57%, slightly better than the basic automatic methods
● With feature selection, training accuracy jumped to 69% and test accuracy to 62% – again with negligible differences due to pre-processing
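A minimal sketch of training and evaluating the nltk Naive Bayes classifier on a 75%-25% split; the toy featuresets stand in for the real (feature_dict, label) pairs built on the previous slides:

    import random
    import nltk

    # Toy featuresets standing in for the real (feature_dict, label) pairs
    featuresets = [({"contains(love)": True}, "GOOD"),
                   ({"contains(gross)": True}, "BAD"),
                   ({"contains(open)": True}, "NEUTR")] * 20

    random.shuffle(featuresets)
    cutoff = int(len(featuresets) * 0.75)          # 75%-25% training-test split
    train_set, test_set = featuresets[:cutoff], featuresets[cutoff:]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print("train accuracy:", nltk.classify.accuracy(classifier, train_set))
    print("test accuracy: ", nltk.classify.accuracy(classifier, test_set))
    classifier.show_most_informative_features(30)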
A glimpse at the selected features
● Comparing the 30 top features shows how much more information is conveyed by the ones chosen by mutual information scoring
A glimpse at the selected features - 2
● Problem: the built-in show_most_informative_features for NLTK classifiers only shows features that appear in all 3 classes → interesting features may be missing; for instance ":)" does not show up, as it is never associated with the BAD label
● Still, a conceivable extension is to identify informative words and look at the tweets that contain them to identify prevailing topics (e.g. comments on a new product, comments on the service at restaurants)
● Just by manually printing the tweets that contain one of the highest-scoring tokens seen before, one can see that "gave" appears only in complaints about wrong deliveries, "people" mostly appears in complaints about customers, whether by employees or fellow customers, etc.
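This kind of manual inspection needs nothing more than a filter over the labelled tweets; labelled_tweets is assumed to be the (text, label) list read back from the spreadsheet, and the function name is mine:

    def tweets_containing(token, labelled_tweets):
        # Print every labelled tweet whose (lower-cased, whitespace-split) tokens include the given token
        for text, label in labelled_tweets:
            if token in text.lower().split():
                print(label, "|", text)

    # e.g. tweets_containing("gave", labelled_tweets)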
Other classifiers
● A more complex but better choice for sentiment analysis is the Maximum Entropy classifier, better known in statistics as logistic regression
● Problem: the nltk implementation only works if one reverts to an older scipy version, and is really slow anyway (about 100x slower than NB)
● The scikit-learn library for Python does logistic regression, but only for binary problems
● Solution: I exported my featuresets to R and used the nnet package, which has a multinomial logistic regression implementation
● Performance turned out to be very similar, though, in terms of accuracy, precision and recall
Other classifiers - 2
● Another good classifier is the Support Vector Machine – natively it only handles binary classification, though → I fused labels together and ran GOOD vs NOT_GOOD and BAD vs NOT_BAD classifications
● I used the LinearSVC classifier from the scikit-learn library, through the SklearnClassifier wrapper provided by nltk (see the sketch after this list)
● SVM outperforms Naive Bayes on both "normal" feature sets (73% vs 71% on the test set and 78% vs 77% on the training set) and scored feature sets (79.5% vs 78% on the test set and 85% vs 83% on the training set) – though this might just be due to different precision-recall behaviour (the f-measures differ by only 0.5 percentage points)
● Feature selection again proves to be the most important step
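A minimal sketch of the binary setup: labels are fused into GOOD vs NOT_GOOD and a LinearSVC is trained through nltk's SklearnClassifier wrapper; the toy featuresets stand in for the real ones:

    from nltk.classify.scikitlearn import SklearnClassifier
    from sklearn.svm import LinearSVC

    # Toy featuresets standing in for the real (feature_dict, label) pairs
    featuresets = [({"contains(love)": True}, "GOOD"),
                   ({"contains(gross)": True}, "BAD"),
                   ({"contains(open)": True}, "NEUTR")] * 20

    # Fuse labels into a binary GOOD vs NOT_GOOD problem
    binary = [(feats, "GOOD" if label == "GOOD" else "NOT_GOOD")
              for feats, label in featuresets]

    cutoff = int(len(binary) * 0.75)
    svm = SklearnClassifier(LinearSVC()).train(binary[:cutoff])
    print(svm.classify({"contains(love)": True}))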
Other datasets
● I decided to test all my preprocessing functions on another dataset: the built-in movie reviews corpus of NLTK
● Given the much larger size of each review and the variability of the language used, I chose to extract 1000 features (as I had done in the corresponding lab); a sketch of the setup follows this list
● Feature scoring again gave a big boost in performance: from 84%-79% to 89%-88% (it is encouraging that there is very little discrepancy between training and test error, suggesting that the model generalizes well)
● SVM overfits the data markedly, achieving 100% training accuracy unless we lower nFeats to 250, and does not surpass 85% on the test set
● A very sparse space of binary features like this is not the best input for an SVM, and SVMs are also known to require careful tuning
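A minimal sketch of building equivalent featuresets from the built-in corpus; the 1000-word vocabulary follows the slide, and this is the plain version without feature scoring:

    import nltk
    from nltk.corpus import movie_reviews

    # The 1000 most frequent words across the corpus as the vocabulary
    all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
    vocab = [w for w, _ in all_words.most_common(1000)]

    def review_features(fileid):
        # Binary contains(word) features for a single review
        words = set(w.lower() for w in movie_reviews.words(fileid))
        return {"contains(%s)" % w: (w in words) for w in vocab}

    featuresets = [(review_features(fid), category)
                   for category in movie_reviews.categories()
                   for fid in movie_reviews.fileids(category)]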
Other language?
● I chose McDonald's because it is a well-known brand in my home country, Italy, as well
● I decided to download and label a small sample (300 tweets) in Italian and translate them (by uploading the table to Google Docs and using the =GoogleTranslate(cell, lang_from, lang_to) function in the online spreadsheet)
● However, I found huge differences even before starting
● First: the use of Twitter and the way people write is different: tweets in English had a mean of 12 tokens / 67 characters, whereas tweets in Italian are longer: 19 tokens / 94 characters
● Italian tweets are more articulated and structured (I suspect because of the different demographics of Twitter users in the two countries: Twitter is not as widespread in Italy)
They may be lovin' it... but noi no ("not us")
(or: McPasta, no grazie – "no thanks")
● Second, the baseline sentiment in the tweets is different! The class distribution is 11% GOOD, 23% NEUTRAL and 66% BAD
● It wasn't this extreme when I picked the topic, but McDonald's then released a new, badly-received ad (at least according to Twitter users) and announced a partnership with Barilla to offer pasta options in its restaurants → OXYMORON: McDonald's and Italian food culture are like oil and water, they do not mix
● Disgust, mockery and sarcasm (← REALLY hard for a computer to pick up) abounded in the tweets
● Classifiers without feature scoring managed 50% accuracy, while the one with it only got 44%, suggesting that the scoring makes features more domain-specific
Other developments
● It would be interesting to compare these models, on the same dataset, with models built on available sentiment lexicons: lists of words associated with sentiment scores (e.g. SentiWordNet, AFINN)
● Such lexicons can be used both to build classifiers independent of the training set (and therefore score texts without having to label a sample manually) and to help build better models; the former approach is particularly interesting, as labelled data is hard to produce (a toy sketch follows this list)
● Part-of-speech tagging can improve the above models, and rules can be introduced (e.g. defining a set of augmenters, such as "very", and inverters, such as "not")
● Also, Python and nltk in particular are quite slow – many people in the field of natural language processing use the Stanford NLP tools and API, written in Java
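As a toy illustration of the lexicon idea (the entries below are made up, not real AFINN scores): a text is scored by summing the valences of the words it contains, so no labelled training data is needed.

    # Toy AFINN-style lexicon: word -> valence (made-up scores, for illustration only)
    lexicon = {"love": 3, "good": 2, "bad": -2, "terrible": -3, "disgusting": -3}

    def lexicon_score(tokens):
        # Sum the valence of known tokens; the sign gives a rough polarity
        return sum(lexicon.get(t.lower(), 0) for t in tokens)

    print(lexicon_score("i love the new fries".split()))      # > 0 -> positive
    print(lexicon_score("the new ad is terrible".split()))    # < 0 -> negative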
Sources
● http://sentiment.christopherpotts.net/ – a comprehensive tutorial on sentiment analysis by a Stanford professor of linguistics
● https://github.com/ptwobrussell/Mining-the-Social-Web – related to the book of the same name, a set of Python scripts for mining social websites such as Twitter
● http://fjavieralba.com/basic-sentiment-analysis-with-python.html – an example of a method I did not cover (a rule-based classifier built on a tagged lexicon)
● http://scikit-learn.org/ – a machine learning library for Python, which contains many classifiers, preprocessing steps and a Pipeline construct for putting the input through an arbitrary sequence of those steps before training the classifiers
● http://nlp.stanford.edu/software/classifier.shtml – the aforementioned Stanford classifier