LiU – Text Mining VT 2013 – Project Presentation
Paolo Elena
Are they lovin' it? Sentiment Classification on #mcdonalds

Scary first slide, huh?
● OpenOffice Impress isn't what you'd call a wonder of open source, and I can't afford better
● But worry not, it won't continue in Comic Sans MS
● I did, however, use a Windows machine for all the programming
● Even more scary, perhaps
● Not entirely my fault: we have always used Windows here
● It does make things harder for:
  ● Getting little-known Python libraries to run
  ● Installing tools that make life easier with Python or allow it to run faster
  ● Everything else

Introduction
● Sentiment analysis is the identification of the prevailing opinion, if any, in a text; the most common categorizations are positive–negative(–neutral) and subjective–objective
● I focused on supervised learning algorithms trained on in-domain data (the same dataset on which the results are later evaluated); many other approaches are possible
● I chose Twitter data for a given hashtag (#mcdonalds)
● The idea was then to assess how well the models perform on data collected about the same topic in another language and machine-translated into the training language

Data Collection
● I used code bits described in the recommended book "Mining the Social Web" by Matthew A. Russell – those are available for free on GitHub (look for ptwobrussell)
● The Twitter API has changed since the book was published, but the author has kept the code updated
● I downloaded 2100 tweets containing the hashtag #mcdonalds (and only that, since not filtering gave too many unrelated tweets), selecting English language only and excluding retweets
● I exported them to an Excel table (library xlwt) and labelled them by hand as "GOOD", "BAD", "NEUTR" → BORING, and a dive into interesting aspects of American culture (or lack thereof), such as Man vs. Fast Food
● After cleaning, I was left with 1908 classified data entries that I imported back into a Python list (library xlrd)

Characteristics of Twitter data
● Short bits of text (max 140 characters), often ill-structured and noisy (weird punctuation, bad grammar, non-standard spellings, slang, mistakes)
● Emoticons appear fairly often
● Links are sometimes included → very hard to judge sentiment in those tweets, as one would likely have to open them → I left those tweets out
● The sentiment is often hard to judge even for a human
● Sometimes a single word conveys all the sentiment in a tweet
● All of the above make it hard to achieve high classification performance

Preprocessing and Feature Extraction - 1
● A key step in all kinds of text classification problems
● I stuck to bag-of-words models, trying some combinations of the steps below. Binary features: contains(%word): True/False
● Tokenization: I used two tokenizers – the basic nltk word tokenizer, and a "sentiment-aware" tokenizer made by a Stanford linguist (found at sentiment.christopherpotts.net) that preserves constructs such as emoticons
● Stemming: I used the Porter stemmer (stemming is in any case said to remove too much information for sentiment analysis)
● Stopword removal: I used the basic list included in nltk
● Collocations: sometimes I included the N most frequent bigrams among the features
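
Sketch: binary bag-of-words features
● A minimal sketch (my own illustration, not the project's actual code) of the pipeline on the previous slide, assuming the basic nltk tokenizer, Porter stemming and the nltk stopword list; the function names are made up for this example

# requires the nltk "punkt" and "stopwords" data packages
import nltk
from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop = set(stopwords.words('english'))

def preprocess(text):
    # basic nltk tokenization + lowercasing, stopword removal, Porter stemming
    tokens = [t.lower() for t in nltk.word_tokenize(text)]
    return [stemmer.stem(t) for t in tokens if t not in stop]

def build_vocabulary(tweets, n=125):
    # "standard" feature selection: keep the n most frequent tokens overall
    # (the slides later settle on N=125)
    fd = FreqDist(tok for tweet in tweets for tok in preprocess(tweet))
    return [tok for tok, _ in fd.most_common(n)]

def extract_features(tweet, vocabulary):
    # binary features: contains(word) is True iff the word occurs in this tweet
    tokens = set(preprocess(tweet))
    return {'contains(%s)' % w: (w in tokens) for w in vocabulary}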
Preprocessing and Feature Extraction - 2
● Negation marking: since bag-of-words cannot distinguish between "good" alone and "good" in "not good", one suggested strategy is to identify some negating words (e.g. "not", "never", all those ending in "n't") and append the suffix "_NEG" to all tokens following them, until the end of the sentence
● The main classifier used was the standard Naive Bayes classifier included in nltk
● With the standard feature selection (top N most frequent tokens), none of the above preprocessing steps proved distinctly superior to the others (I settled on N=125)
● I chose a feature set by hand as a comparison, then turned to a more advanced feature selection method

Mutual information feature selection
● The goal is to choose features that have a different probability of occurring in tweets belonging to different classes (e.g. ":)" occurs much more often in positive tweets than in neutral or negative ones, hence it carries more information than "the", which is roughly uncorrelated with the class)
● One approach is to score the relative frequency of each token for each class (frequency of the token in tweets of that class versus overall frequency of the token) using the chi-squared metric, and to select the N tokens with the highest scores
● This led to a big improvement in classification performance (see http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/)
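
Sketch: chi-squared token scoring
● A rough sketch of the scoring step, following the approach in the streamhacker tutorial linked above; function and variable names are my own and the exact bookkeeping may differ from the project code

from nltk import ConditionalFreqDist, FreqDist
from nltk.metrics import BigramAssocMeasures

def best_tokens(labelled_tweets, n=125):
    # labelled_tweets: iterable of (token_list, label) pairs
    word_fd = FreqDist()                   # overall token frequencies
    label_word_fd = ConditionalFreqDist()  # token frequencies per class
    for tokens, label in labelled_tweets:
        for tok in tokens:
            word_fd[tok] += 1
            label_word_fd[label][tok] += 1

    total = word_fd.N()
    label_totals = {lab: label_word_fd[lab].N() for lab in label_word_fd.conditions()}

    # score each token by how unevenly it is distributed across the classes
    scores = {}
    for tok, freq in word_fd.items():
        scores[tok] = sum(
            BigramAssocMeasures.chi_sq(label_word_fd[lab][tok],
                                       (freq, label_totals[lab]),
                                       total)
            for lab in label_totals)
    return sorted(scores, key=scores.get, reverse=True)[:n]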
Results – NB classifier
● Class distribution was 35.5% GOOD, 27% BAD, 37.5% NEUTRAL – fairly even, so it is quite safe to make accuracy comparisons
● I chose a 75%-25% training-test split
● Before mutual information feature selection, the classifier hovered around 55% accuracy on the test set and 61% on the training set. Not really satisfying, but since the baseline for 3 classes is around 33%, it was still much better than a 3-sided die (also claimed not to exist)
● The hand-picked features scored 64% - 57% (training - test), slightly better than the basic automatic methods
● With feature selection, training accuracy jumped to 69% and test accuracy to 62% – again, the differences due to preprocessing were negligible

A glimpse at the selected features
● Comparing the 30 top features shows how much more information is conveyed by the ones chosen by mutual information scoring

A glimpse at the selected features - 2
● Problem: the built-in show_most_informative_features for NLTK classifiers does not show features that do not appear in all 3 classes → interesting features may be missing; for instance ":)" does not show up, as it is never associated with the BAD label
● Still, a conceivable extension is to identify informative words and look at the tweets that contain them to identify prevailing topics (e.g. comments on a new product, comments on the service at restaurants)
● Just by printing the tweets that contain a given high-scoring token, one can see that "gave" appears only in complaints about wrong deliveries, "people" mostly appears in complaints about customers (either by employees or by fellow customers), etc.

Other classifiers
● A more complex, but better, choice for sentiment analysis is the Maximum Entropy classifier, better known in statistics as logistic regression
● Problem: the nltk implementation only works if one reverts to an older scipy library, and it is really slow anyway (100x slower than NB)
● The scikit-learn library for Python does logistic regression, but only for binary problems
● Solution: I exported my feature sets to R and used the nnet package, which has a multinomial logistic regression implementation
● The performances were really similar, though, in accuracy, precision and recall

Other classifiers - 2
● Another good classifier is the Support Vector Machine – it only works natively on binary classifications, though → I fused labels together and went for GOOD vs NOT_GOOD and BAD vs NOT_BAD classifications
● I used the LinearSVC classifier from the scikit-learn library, through the SklearnClassifier wrapper provided by nltk
● SVM outperforms Naive Bayes on both "normal" feature sets (73% vs 71% on the test set and 78% vs 77% on the training set) and scored feature sets (79.5% vs 78% on the test set and 85% vs 83% on the training set) – though this might just be due to different precision-recall behaviour (the f-measures differ by only 0.5 percentage points)
● Feature selection confirms itself as the most important step
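
Sketch: fused labels and the SVM wrapper
● A minimal sketch of the binary GOOD vs NOT_GOOD setup described above, wrapping scikit-learn's LinearSVC in nltk's SklearnClassifier; the helper names are assumptions for illustration

from nltk.classify import NaiveBayesClassifier
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify.util import accuracy
from sklearn.svm import LinearSVC

def fuse_labels(featuresets, keep='GOOD'):
    # collapse the 3-class labels into keep vs NOT_keep
    return [(feats, lab if lab == keep else 'NOT_' + keep)
            for feats, lab in featuresets]

def compare_nb_svm(train_set, test_set, keep='GOOD'):
    # train_set / test_set: lists of (feature_dict, label) pairs built as on the earlier slides
    train_b, test_b = fuse_labels(train_set, keep), fuse_labels(test_set, keep)
    nb = NaiveBayesClassifier.train(train_b)
    svm = SklearnClassifier(LinearSVC()).train(train_b)
    return accuracy(nb, test_b), accuracy(svm, test_b)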
Other datasets
● I decided to test all my preprocessing functions on another dataset: the built-in movie reviews corpus of NLTK
● Given the much bigger size of each review and the variability of the language used, I chose to extract 1000 features (as I had done in the lab on this dataset)
● Feature scoring again gave a big boost in performance: from 84%-79% to 89%-88% (it is really encouraging that there is very little discrepancy between training and test error, suggesting that the model is quite general)
● SVM overfits the data markedly, achieving 100% training accuracy unless we lower nFeats to 250, and does not surpass 85% on the test set
● A very sparse space of binary features like this is not the best input for an SVM, which is also known to require careful tuning

Other language?
● I chose McDonalds as it is a well-known brand also in my home country, Italy
● I decided to download and label a small sample (300 tweets) in Italian and translate them (uploading the table to Google Docs and using the =GoogleTranslate(cell, lang_from, lang_to) function in the online spreadsheet)
● However, I found huge differences even before starting
● First: the use of Twitter and the way people write are different: tweets in English had a mean of 12 tokens / 67 characters, whereas tweets in Italian are longer: 19 tokens / 94 characters
● Italian tweets are more articulated and structured (I guess because of the different demographic of Twitter users in the two countries: Twitter is not as widespread in Italy)

They may be lovin' it... but noi no (or: McPasta, no grazie)
● Second, the baseline sentiment in the tweets is different! The class distribution is 11% GOOD, 23% NEUTRAL and 66% BAD
● It wasn't this extreme when I picked the topic, but then McDonalds released a new, badly-received ad (at least according to Twitter users) and announced a partnership with Barilla to offer pasta options in their restaurants → OXYMORON: McDonalds and Italian food culture are like oil and water, they do not mix
● Disgust, mockery and sarcasm (← REALLY hard for a computer to pick up) abounded in the tweets
● Classifiers without feature scoring managed 50% accuracy, while the one with it only got 44%, suggesting that the scoring makes features more domain-specific

Other developments
● It would be interesting to compare these models, on the same data set, with models built on available sentiment lexicons: lists of words associated with sentiment scores (e.g. SentiWordNet, AFINN) – a toy sketch of this idea follows the Sources slide
● Those lexicons can be used both to build classifiers independent of the training set (and therefore score texts without having to label a sample manually) and to help build better models; the former approach is particularly interesting, as labelled data is hard to produce
● Part-of-Speech tagging can make the above models better, and rules can be introduced (e.g. defining a set of augmenters such as "very", inverters such as "not", etc.)
● Also, Python and nltk in particular are quite slow – many people in the field of natural language processing use the Stanford NLP tools and API, written in Java

Sources
● http://sentiment.christopherpotts.net/ – a comprehensive tutorial on sentiment analysis by a Stanford professor of linguistics
● https://github.com/ptwobrussell/Mining-the-Social-Web – related to the book of the same name, a set of Python scripts for mining social websites such as Twitter
● http://fjavieralba.com/basic-sentiment-analysis-with-python.html – an example of a method I did not cover (a rule-based classifier built on a tagged lexicon)
● http://scikit-learn.org/ – a machine learning library for Python, which contains many classifiers, preprocessing steps and a Pipeline construct for putting the input through an arbitrary sequence of those steps before training the classifiers
● http://nlp.stanford.edu/software/classifier.shtml – the aforementioned Stanford classifier
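
Appendix sketch: lexicon-based scoring (not implemented in the project)
● A toy illustration of the lexicon idea mentioned under "Other developments": summing AFINN valences per tweet; the file name and the decision threshold are assumptions

import nltk

def load_afinn(path='AFINN-111.txt'):
    # AFINN is distributed as tab-separated "term<TAB>integer score" lines
    lexicon = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            if line.strip():
                term, score = line.rsplit('\t', 1)
                lexicon[term] = int(score)
    return lexicon

def lexicon_label(text, lexicon, threshold=1):
    # sum the valence of every known token (multi-word AFINN entries are ignored here);
    # the sign of the total gives the label
    total = sum(lexicon.get(tok.lower(), 0) for tok in nltk.word_tokenize(text))
    if total >= threshold:
        return 'GOOD'
    if total <= -threshold:
        return 'BAD'
    return 'NEUTR'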