Julie Kane Ahkter (kanej)
Steven Soria (ssoriajr)

Sentiment Analysis: Facebook Status Messages
CS224N Final Project

Abstract
While recent NLP-based sentiment analysis has centered on Twitter and on product/service reviews, we believe it is possible to classify the emotion in Facebook status messages more accurately because of how they are written. Facebook status messages are more succinct than reviews, and they are easier to classify than tweets because their higher character limit allows for better-formed writing and a more accurate portrayal of emotion. We analyze the suitability of various approaches to Facebook status messages by comparing the performance of a Maximum Entropy ("MaxEnt") classifier, a MaxEnt classifier augmented with Labeled LDA ("LDA") data, a MaxEnt classifier augmented with part-of-speech ("POS") tagging, and a MaxEnt classifier augmented with both LDA and POS data. We perform both binary and multi-class sentiment classification. In both cases, a MaxEnt classifier augmented with POS data performs best, achieving an average binary classification F1 score of approximately 85% and an average multi-class F1 score of approximately 67%.

Introduction/Background
With the transition from forum- and blog-based Internet communication to social networking sites such as Facebook and Twitter, there exist new opportunities for improved information mining via NLP sentiment analysis. Prior success at such analysis has been elusive due to the inherent difficulty of extracting a single sentiment label from long passages, where fluctuations over the document length make classification difficult. While sentiment analysis of Twitter data has surged in recent years, it is problematic for the opposite reason: tweets are limited to 140 characters. Tweets also rely on heavy abbreviation and are more likely to be fragmented expressions, making it difficult to use a part-of-speech ("POS") tagger.

This project focuses on classifying the sentiment of Facebook status updates ("status updates") using binary and multi-class labels. Facebook distinguishes a user's status update from links to news articles or other sources of information, and from comments written in response to another Facebook user. Unlike tweets, status updates can use up to 420 characters. Status updates are therefore more often written in mostly sentence-like structures that can benefit from POS analysis.

Why perform sentiment analysis on Facebook status updates? According to a January 2010 article on InsideFacebook.com, users spent nearly 7 hours per person on Facebook in December 2009, far more than on any of the other top 10 parent companies on the Internet. From a marketing standpoint, understanding user sentiment as it relates to a topic of interest clearly allows more effective ad targeting. If a user is trending positively or negatively about health care reform, appropriate political party ads might appear sympathetic to the user's viewpoint. Similarly, a user who tends to write playful status messages might be shown ads for a local comedy club. Beyond the obvious marketing appeal, however, is a more sociologically compelling rationale: modeling the ebb and flow of consumer sentiment across a broad swath of topics, hierarchically encapsulated by user, by locally-defined user communities, regionally, nationally, and globally. To our knowledge, there exists no prior work that focuses exclusively on sentiment analysis of Facebook status messages.
Nonetheless, we found the work of (Ramage, Dumais, Liebling) on augmenting a MaxEnt classifier with LDA data particularly insightful. Their work classifies tweets by topic rather than by sentiment, extrapolating labels from latent topics among the words in a tweet, but the promise that a Labeled LDA could combine the best of supervised learning while still discovering features in an unsupervised fashion seemed like a promising approach for modeling the inherent complexities of sentiment. Due to time constraints for this project, we chose to let the LDA tell us the ratio of words more closely associated with a fixed set of pre-labeled sentiments. (O'Connor, Balasubramanyan, Routledge, Smith) was the first paper we read that shared a similar vision: using NLP to provide insight into public opinion by modeling sentiment within various societal clusters. Importantly, however, their work performs binary labeling against a pre-defined corpus of known positive/negative words. Our work aims to learn sentiment non-deterministically, training on nothing more than the sentence and a label for the sentence as a whole.

Data
To collect data, we created and registered a Facebook Connect application called iFeel. Registering the app with the specific Facebook privileges it requires allowed us to host it from our stanford.edu accounts, making it available to anyone, not just our friends, and allowing us to gather status updates quickly and efficiently. iFeel is a PHP-based app that uses the newer Facebook Graph APIs, which use simple URL-based queries to return result sets such as status updates. Around iFeel we wrapped an HTML-based front end that disclosed privacy and disclaimer information such as the period of data retention, how the data would be used, the anonymity of the data, and the project's scope and purpose. At the time of this writing, iFeel lives at http://www.stanford.edu/~ssoriajr, although it is subject to removal at any time.

iFeel collects the 25 most recent status updates from a logged-on Facebook user and each of their friends. Since Facebook users may have intersecting sets of friends, we performed a unique sort after merging all status updates to eliminate repeated status messages. Over a period of one week, we collected 62,202 unique, raw (unlabeled) Facebook status messages.

After collecting the data, the next step involved preprocessing the raw Facebook data, converting it into labeled data, and breaking it up into training and test sets. For this project, we used the Stanford Classifier v2.0 (the MaxEnt classifier), the Stanford Tagger v3.0 (2010-05-26, English-only) (the POS tagger), and the Stanford Topic Modeling Toolbox v0.2.1 (the Labeled LDA). All of these tools are available from http://www-nlp.stanford.edu/software/. Some massaging of the data between stages was required. Our high-level sequence involved the following:
1) Raw data collection via iFeel
2) Sentiment labeling
3) Transformation into train/test sets for classifier-only tests
4) Transformation into train/test sets for LDA
5) Transformation into train/test sets for POS
6) Transformation of train/test sets for final classification by the classifier
7) Adjusting classifier, LDA, and/or POS features and repeating until the best model is found
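To make the collection and de-duplication step (step 1 above) concrete, here is a minimal Python sketch. iFeel itself was a PHP application; the Graph API endpoint and field names below are assumptions in the spirit of the 2010-era URL-query interface described above, not a recipe that will work against Facebook's current API.

```python
import json
import urllib.request

# Illustrative only: iFeel itself was a PHP app. The endpoint and field names
# below are assumptions modeled on the 2010-era Graph API URL queries described
# above; they are not a working recipe against today's API.
GRAPH_URL = "https://graph.facebook.com/{uid}/statuses?limit=25&access_token={token}"

def fetch_statuses(uid, token):
    """Return the text of the 25 most recent status updates for one user."""
    with urllib.request.urlopen(GRAPH_URL.format(uid=uid, token=token)) as resp:
        payload = json.loads(resp.read().decode("utf-8"))
    return [entry.get("message", "") for entry in payload.get("data", [])]

def collect_corpus(user_ids, token):
    """Merge statuses across a user and their friends, then de-duplicate,
    the equivalent of the `sort -u` step described above."""
    merged = []
    for uid in user_ids:
        merged.extend(fetch_statuses(uid, token))
    return sorted(set(s.strip() for s in merged if s and s.strip()))
```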
PREPROCESSING
Adapting a strategy used by (O'Connor, Balasubramanyan, Routledge, Smith), we labeled status messages by filtering for known emoticon equivalence classes, as defined by the Wikipedia article List of Emoticons (http://en.wikipedia.org/wiki/List_of_emoticons). We chose to filter for 14 categories: angry, evil, frown, rock-on, shock, smiley, wink, angel, blush, fail, neutral, shades, skeptical, and tongue. To prevent training on the very emoticons upon which we based our labeling, we then stripped these emoticons out of the data.

For binary classification, we chose to classify into positive and negative sentiment labels, represented by:
POSITIVE: smiley, wink, tongue, angel, shades, blush, rock-on
NEGATIVE: frown, shock, skeptical, evil, angry, fail
Certainly arguments can be made about our choices (is shock always negative?), but we chose to maximize our dataset size given the time constraints on data collection. After dividing the categories in this way and truncating the positive and negative sets so that they are the same size, we ended up with 4,320 usable, labeled samples. We split the data into roughly 75% training / 25% testing, resulting in a training set of 3,500 samples and a test set of 820 samples. Positive and negative samples were split into groups of 25 and then interleaved so that they were approximately evenly distributed within the training and test set files.

For the multi-class case, we chose to classify into four sentiment labels: unhappy, happy, skeptical, and playful, chosen to maximize the amount of usable data. These labels are represented by:
UNHAPPY: evil, frown, shock, angry
HAPPY: smiley
SKEPTICAL: skeptical
PLAYFUL: angel, rock-on, shades, tongue, wink
Again truncating each class so that all are equal, we end up with 3,612 usable, labeled samples spanning the classes above. Since we reasoned that multi-class classification would be harder, we chose to favor training set size over test set size, splitting the data into roughly 90% training / 10% testing. This resulted in a training set of 3,300 samples and a test set of 312 samples. As before, each label was interleaved roughly evenly into the training and test files.

Next, we had to create versions of the training and test set files that were compatible with each tool. After running the POS tagger and the LDA against their respective training/test sets, we used Java programs to merge, in various ways commensurate with our testing objectives, the resulting POS file of tagged status updates with the resulting LDA file of word-to-label ratios, producing the final training and test sets fed to the classifier.
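The sketch below illustrates the emoticon labeling, emoticon stripping, and grouped interleaving described above. It is illustrative only: the emoticon lists are a small subset of the 14 equivalence classes we actually used, and our real preprocessing pipeline is not reproduced here.

```python
import re

# Small illustrative subset of the 14 emoticon equivalence classes; the real
# lists were taken from the Wikipedia "List of emoticons" article.
EMOTICON_CLASSES = {
    "smiley": [":)", ":-)", ":D"],
    "wink": [";)", ";-)"],
    "frown": [":(", ":-("],
    "skeptical": [":-/"],
}
POSITIVE = {"smiley", "wink", "tongue", "angel", "shades", "blush", "rock-on"}
NEGATIVE = {"frown", "shock", "skeptical", "evil", "angry", "fail"}
STRIP_RE = re.compile("|".join(re.escape(e)
                      for emos in EMOTICON_CLASSES.values() for e in emos))

def label_and_strip(status):
    """Return (label, text with the labeling emoticons removed), or None if the
    update has no recognized emoticon or mixes positive and negative classes."""
    hits = {cls for cls, emos in EMOTICON_CLASSES.items()
            if any(e in status for e in emos)}
    labels = set()
    if hits & POSITIVE:
        labels.add("POSITIVE")
    if hits & NEGATIVE:
        labels.add("NEGATIVE")
    if len(labels) != 1:
        return None
    return labels.pop(), STRIP_RE.sub("", status).strip()

def interleave(pos, neg, group=25):
    """Truncate to equal size and interleave in groups of 25, as described above."""
    n = min(len(pos), len(neg))
    out = []
    for i in range(0, n, group):
        out += pos[i:i + group] + neg[i:i + group]
    return out
```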
Features
Based on previous experience, we expected n-gram word features to be the most significant features, but we also felt word-shape features would play a bigger role in this project, where sentiment polarity is often expressed by "!!", "??", "?!?!", "……", and so on. To this end, and consistently across all our experiments, we split words in MaxEnt on the regular expression [ ,] (space or comma). We expected that the degree to which we needed to tweak features between the binary and multi-label classification tasks would indicate whether Facebook messages truly encode sentiment in a way that lets NLP models scale to multiple sentiment labels.

TWO CLASSES: POSITIVE AND NEGATIVE
Our baseline results were determined by passing the training and test sets, without any POS or LDA data, directly into the classifier, using what we defined to be a minimal n-gram-based feature set. With those results, we adjusted classifier-specific features, such as word shapes, observed any changes in the classifier's performance, and saved the best feature set for the classifier. LDA features were more limited, but the resulting word-to-label ratios became the LDA's best feature set and were combined with our classifier's best features. POS features were computed differently, since the POS tagger simply outputs tagged versions of the training and test data files. We therefore merged the LDA and POS results, counting characteristics of the POS-tagged files such as the number of nouns, verbs, adjectives, et cetera. Those real-valued counts and ratios, combined with the LDA's real-valued ratios, were used to produce new training and test files. Specifically, our testing typically used the following combinations of data:

fbs.binomial.train/test: binary labeled data with no POS tagging or features
fbs.binomial.LDA.train/test: binary labeled data augmented with LDA features
fbs.binomial.combined.train/test: binary labeled data with POS-tagged sentences, augmented with both LDA and POS features (could also be generated with POS-tagged sentences but only POS features)
fbs.mc.train/test: multi-class labeled data with no POS tagging or features
fbs.mc.LDA.train/test: multi-class labeled data augmented with LDA features
fbs.mc.combined.train/test: multi-class labeled data with POS-tagged sentences, augmented with both LDA and POS features (could also be generated with POS-tagged sentences but only POS features)

CLASSIFIER
From past experience we have seen that n-grams are valuable features in a MaxEnt model, so we set a baseline that enabled all of the built-in n-gram-related features of the Stanford Classifier. The Stanford Classifier also allows us to specify the split-word regular expression, and we chose to split on spaces and commas. Our baseline feature set for the classifier-only configuration consists of:

Stanford Classifier Baseline Features for Binary Classification
Property                       Feature Name
useSplitWords                  SW-str
useSplitWordPairs              SWP-str1-str2
useSplitFirstLastWords         SFW-str, SLW-str
useSplitPrefixSuffixNGrams     S#B-str, S#E-str

With this feature set to start with, our baseline F1 scores for the classifier were:

Stanford Classifier Baseline F1 Scores for Binary Classification
Negative F1 Score    Positive F1 Score
0.759                0.737
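For concreteness, a classifier-only baseline run can be driven by a small properties file. The sketch below writes one: the feature property names mirror the baseline table above, while the file names, column layout, and the trainFile/testFile/goldAnswerColumn keys follow the ColumnDataClassifier's documented properties-file pattern and are illustrative rather than our exact configuration.

```python
# Sketch: emit a minimal properties file for a classifier-only baseline run of
# the Stanford ColumnDataClassifier. Feature property names mirror the baseline
# table above; file names and column layout are examples, not our exact setup.
BASELINE_PROPS = """\
trainFile=fbs.binomial.train
testFile=fbs.binomial.test
goldAnswerColumn=0
displayedColumn=1
1.splitWordsRegexp=[ ,]
1.useSplitWords=true
1.useSplitWordPairs=true
1.useSplitFirstLastWords=true
1.useSplitPrefixSuffixNGrams=true
"""

with open("fbs.binomial.baseline.prop", "w") as props:
    props.write(BASELINE_PROPS)
```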
Even with these high F1 scores, we observed several interesting examples of misclassification. Consider the following POSITIVE status update: is excited to officially have Fridays off work!! 3-day weekends and 4-day work weeks. Gotta love it!! In our baseline run, this is classified as NEGATIVE. The problem here is the word work. Among the top 200 weighted features, the classifier reports this:
(1-SW-work,NEGATIVE) 0.7427
Observe that there are two instances of work in the sentence, so this classification is not an unreasonable one. Similarly, is excited is also given a negative score of 0.33, despite the fact that both is and excited are learned as positive. The sentence is thus scored as 74% likely to be negative. If we switch to the Stanford Classifier's built-in word-shape feature set "dan2", the sentence is correctly classified as positive.

Classification example: is excited to officially have Fridays off work!! 3-day weekends and 4-day work weeks. Gotta love it!!

Classifier with baseline features
Feature               NEGATIVE    POSITIVE
1-SW-work             0.74        -0.74
1-SFW-is              -0.08       0.08
1-SW-excited          -0.56       0.56
1-SWP-is-excited      0.33        -0.33
Prob:                 0.74        0.26

Classifier with "dan2" features
Feature               NEGATIVE    POSITIVE
1-SSHAPE-WT-x!        -0.92       0.92
1-SW-work             0.83        -0.83
1-SFW-is              -0.19       0.19
1-SW-excited          -0.43       0.43
1-SWP-is-excited      0.20        -0.20
Prob:                 0.25        0.75

The "dan2" feature set creates features for word shapes with lowercase words, uppercase words, digits, mixed-case words, punctuated words, and equivalence classes for words of the same shape with 3 characters or fewer. The key word-shape feature for this example is the x! shown above: the example has two words ending in a "!" character, which are now modeled. The overwhelming weight of x! as a positive influencer is enough to properly classify this example. Still, "dan2" remains imperfect, especially in borderline cases and extremely short status updates (a la Twitter), such as the status update Disneyland tonight. Switching to the built-in "chris4" feature set corrects this misclassification.

Classification example: Disneyland tonight

Classifier with "dan2" features
Feature                      NEGATIVE    POSITIVE
1-SW-tonight                 -0.47       0.47
1-SW-Disneyland              0.38        -0.38
1-SSHAPE-WT-x                -0.49       0.49
1-SSHAPE-WT-Xx               -0.30       0.30
1-SFW-Disneyland             0.09        -0.09
1-SWP-Disneyland-tonight     0.00        0.00
1-SLW-tonight                -0.10       0.10
Prob:                        0.52        0.48

Classifier with "chris4" features
Feature                      NEGATIVE    POSITIVE
1-SW-tonight                 -0.45       0.45
1-SW-Disneyland              0.34        -0.34
1-SSHAPE-xxxxx               -0.28       0.28
1-SSHAPE-Xxxxx               -0.41       0.41
1-SFW-Disneyland             0.07        -0.07
1-SWP-Disneyland-tonight     0.00        0.00
1-SLW-tonight                -0.09       0.09
Prob:                        0.40        0.60

The key feature introduced by "chris4" is that longer mixed-case words are learned to be positive. "chris4" also treats distinguishable 2-character head and tail sequences as separate features, and the interiors of words are equivalence-classed similarly to "dan2" in terms of case, digits, punctuation, and others. This method also ensures that words of 4 characters or fewer are modeled precisely. In this case, while Disneyland remains an apparently negative word according to the classifier, the equivalence class Xxxxx is positive, resulting in a 12-point shift of probability mass to the positive classification. (A simplified illustration of this kind of shape function appears after the tables below.) "chris4" addressed all the word-shape features we wanted to encode. In our testing, "chris2" performed identically to "chris4", but we ultimately chose to standardize on "chris4" since it handles Unicode characters better. Our final feature set appears below. For more information on the built-in word-shape feature sets of the Stanford Classifier, refer to the Javadocs included in the distribution (stanford-classifier-2008-04-18/javadoc/index.html) and select the ColumnDataClassifier class.

Stanford Classifier "Best" Features for Binary Classification
Property                        Value     Feature Name
useClassFeature                 true      n/a
1.splitWordsRegexp              [ ,]      n/a
1.useSplitWords                 true      SW-str
1.useSplitWordPairs             true      SWP-str1-str2
1.useSplitFirstLastWords        true      SFW-str, SLW-str
1.useSplitPrefixSuffixNGrams    true      S#B-str, S#E-str
1.splitWordShape                chris4    SSHAPE-Xx…xX, SSHAPE-xx…xx, SSHAPE-dx…xd, SSHAPE-Xd…xX, SSHAPE-dX…s…+d, SSHAPE-d-…/X, and so on

Note that the "1." prefix in the property name reflects the column to which the feature is applied. The F1 scores for all testing performed on the Stanford Classifier's built-in feature sets were:

Stanford Classifier F1 Scores for Binary Classification
Feature Set    POSITIVE    NEGATIVE
Baseline       0.737       0.759
chris1         0.802       0.790
chris2         0.822       0.809
chris4         0.822       0.809
dan1           0.730       0.743
dan2           0.808       0.793
jenny1         0.807       0.792
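The following is a deliberately simplified word-shape function in the spirit of the dan2/chris4 descriptions above. It is not the Stanford Classifier's implementation, and the head/tail length is an arbitrary choice for illustration.

```python
def word_shape(word, head_tail=2):
    """Simplified word shape: letters map to x/X, digits to d, punctuation is
    kept; short words are modeled precisely, and longer words keep their head
    and tail character classes with an equivalence-classed interior."""
    def cls(ch):
        if ch.isdigit():
            return "d"
        if ch.isalpha():
            return "X" if ch.isupper() else "x"
        return ch  # punctuation kept as-is, e.g. "!"
    shape = [cls(c) for c in word]
    if len(shape) <= 2 * head_tail:
        return "".join(shape)
    interior = "".join(sorted(set(shape[head_tail:-head_tail])))
    return "".join(shape[:head_tail]) + "…" + interior + "…" + "".join(shape[-head_tail:])

# word_shape("Disneyland") -> "Xx…x…xx"; word_shape("it!!") -> "xx!!"
```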
Despite the success of the classifier alone on the binary classification task, misclassification seems to increase as the status update gets longer. In many cases, unigram features such as a single article or preposition carry sentiment that is highly subjective and depends on the surrounding context. As hypothesized earlier, these longer status messages are mostly in sentence-style form, so we expected the POS tagger to help classify such sentences properly. We present examples of this in the next section.

Aware that an overabundance of positively or negatively labeled data in our training set could cause learning error, we did a quick analysis of the composition of both our training and test sets, shown below.

Ratio of Labeled Classes in Training Data
Class     POSITIVE    NEGATIVE
Count     1715        1785
Ratio     49.00%      51.00%

Ratio of Labeled Classes in Test Data
Class     POSITIVE    NEGATIVE
Count     445         375
Ratio     54.27%      45.73%

CLASSIFIER PLUS POS TAGGER
The next set of features we turned to was POS tags. We used the Stanford POS tagger described above to provide a POS tag for each word in the Facebook status message. The first features we considered simply appended the POS tag to each word in the sentence, increasing our F1 scores by ~5%. One benefit of this method is the disambiguation of certain words. For example, tagging "scratch" as either "scratch_VB" or "scratch_NN" results in different words for the classifier to consider, allowing the classifier to weight words positively or negatively depending on their part of speech.

Consider the following longer status update that the Classifier alone classified incorrectly, but which was classified correctly when POS tags were included: Click "LIKE" if I ever made you smile in your life. Set this as your status and see how many people you have made smile xox

Classifier alone
Feature            NEGATIVE    POSITIVE
1-SW-in            -0.25       0.25
1-SSHAPE-xxxxx     -0.28       0.28
1-SW-this          0.34        -0.34
1-SW-your          0.23        -0.23
1-SSHAPE-Xxx       0.35        -0.35
1-SW-Click         0.20        -0.20
1-SSHAPE-Xxxxx     -0.41       0.41
1-SW-people        0.33        -0.33
1-SW-you           -0.22       0.22
1-SSHAPE-X         0.34        -0.34
1-SW-smile         -0.72       0.72
Prob:              0.66        0.34

Classifier with POS tags
Feature                     NEGATIVE    POSITIVE
1-SW-this_DT                0.28        -0.28
1-SW-._.                    0.26        -0.26
1-SSHAPE-xx_xXX             0.45        -0.45
1-SW-made_VBN               0.34        -0.34
1-SSHAPE-XxX_xXX            -0.21       0.21
1-SSHAPE-xxX_xX$            0.31        -0.31
1-SSHAPE-._.                0.26        -0.26
1-SW-smile_VBP              -0.22       0.22
1-SSHAPE-X_XXX              0.28        -0.28
1-SW-you_PRP                -0.42       0.42
1-SSHAPE-xxX_xXX            0.25        -0.25
1-SW-in_IN                  -0.30       0.30
1-SSHAPE-XXX_XX             -0.34       0.34
1-SW-smile_NN               -0.22       0.22
1-SW-have_VBP               -0.28       0.28
1-SW-ever_RB                -0.32       0.32
1-SW-many_JJ                0.24        -0.24
1-SWP-your_PRP$-life_NN     -0.20       0.20
Prob:                       0.10        0.90

The Classifier alone heavily weighted words such as "this," "Click," and "people," and short capitalized words (such as "Set" and "Click"), as negative features, pushing the sentence slightly negative despite its obvious positive sentiment. Once the words were POS tagged, however, the Classifier was able to determine the correct sentiment with high confidence. The POS tags avoided the problem of particular words being weighted negative simply because they appeared in more negative samples.
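A minimal sketch of the word_TAG transformation described above. We used the Stanford Tagger v3.0 in the actual pipeline; here NLTK's tagger stands in purely for illustration (and assumes the standard NLTK tokenizer and tagger models are installed).

```python
import nltk  # stand-in for the Stanford Tagger v3.0 used in our actual pipeline

def tag_words(status):
    """Turn 'Cole made me breakfast' into the word_TAG form described above,
    e.g. ['Cole_NNP', 'made_VBD', 'me_PRP', 'breakfast_NN']."""
    tokens = nltk.word_tokenize(status)
    return ["{}_{}".format(word, tag) for word, tag in nltk.pos_tag(tokens)]

def to_classifier_line(label, status):
    """One tab-separated classifier row: gold label, then the POS-tagged text."""
    return "{}\t{}".format(label, " ".join(tag_words(status)))
```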
Consider another, simpler example. The Classifier alone incorrectly classified Cole made me breakfast as negative sentiment, while the Classifier plus POS tags classified Cole_NNP made_VBD me_PRP breakfast_NN correctly.

Classifier alone
Feature                NEGATIVE    POSITIVE
CLASS                  0.62        -0.62
1-SWP-made-me          0.02        -0.02
1-SW-me                -0.27       0.27
1-SWP-me-breakfast     0.00        0.00
1-SSHAPE-xx            -0.04       0.04
1-SW-breakfast         -0.09       0.09
1-SLW-breakfast        0.00        0.00
1-SW-Cole              0.00        0.00
1-SW-made              0.10        -0.10
1-SFW-Cole             0.00        0.00
1-SSHAPE-xxxxx         -0.28       0.28
1-SSHAPE-xxxx          0.02        -0.02
1-SWP-Cole-made        0.00        0.00
1-SSHAPE-Xxxx          0.18        -0.18
Total:                 0.25        -0.25
Prob:                  0.62        0.38

Classifier with POS tags
Feature                       NEGATIVE    POSITIVE
CLASS                         -0.71       0.71
1-SWP-Cole_NNP-made_VBD       0.00        0.00
1-SW-me_PRP                   -0.19       0.19
1-SW-breakfast_NN             -0.15       0.15
1-SW-made_VBD                 -0.06       0.06
1-SFW-Cole_NNP                0.00        0.00
1-SSHAPE-xx_xXX               0.45        -0.45
1-SWP-made_VBD-me_PRP         0.03        -0.03
1-SLW-breakfast_NN            0.00        0.00
1-SSHAPE-xxX_xXX              0.25        -0.25
1-SW-Cole_NNP                 0.00        0.00
1-SWP-me_PRP-breakfast_NN     0.00        0.00
1-SSHAPE-XxX_xXX              -0.21       0.21
1-SSHAPE-xxX_XX               0.15        -0.15
Total:                        -0.45       0.45
Prob:                         0.29        0.71

Here we see that a word with a relatively strong negative connotation, "made," was given a positive connotation once it was assigned the VBD tag by the POS tagger. "Made me" retained its negative sentiment, likely due to the connotation of being forced to do something, but "made" itself became more positive. While "made" can have negative connotations, it can also have positive ones, such as when used in the context of creation rather than force. Also, the word shape "Xxxx" is deemed to be a negative feature; when the POS tags are added, there are more varieties of word shapes, and the prevalence of Xxxx in negative status messages no longer incorrectly sways the status message toward a negative classification.

In an attempt to improve our classification even further, we next added a class of features that looks at the overall grammatical structure of the sentence. We considered the number of nouns (N), the number of verbs (V), the number of adjectives and adverbs (A), the ratio of nouns to adjectives, the ratio of verbs to adverbs, the number of terminals, and the number of exclamation points (E). The table below reflects a small sampling of the possible combinations of these features that we tested. The best results occurred when we included features for the number of nouns, the number of verbs, the number of adjectives and adverbs, the ratio of nouns to adjectives, and the ratio of verbs to adverbs.

Trial                           F1 Score: Negative    F1 Score: Positive
Classifier alone                0.809                 0.822
+POS, no additional features    0.857                 0.884
+POS with NVA                   0.860                 0.886
+POS with All                   0.860                 0.886
+POS with Ratios                0.857                 0.884
+POS with Terminals & Ratios    0.855                 0.882
+POS with NVAA                  0.860                 0.886
+POS with NVAE                  0.856                 0.882
+POS with NVA & Ratios          0.863                 0.888
+POS with NVAE & Ratios         0.859                 0.885

By adding the NVA features and the noun-to-adjective and verb-to-adverb ratios, we observed a ~1% increase in the F1 score.
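The grammatical-structure counts and ratios are straightforward to compute from the tagged tokens. The sketch below mirrors the features listed above (N, V, A, the noun/adjective and verb/adverb ratios, and E); how we actually handled zero denominators is not documented in this report, so the fallback here is an assumption.

```python
def grammatical_features(tagged_tokens):
    """Real-valued grammatical-structure features over word_TAG tokens, mirroring
    the description above: noun, verb, and adjective+adverb counts, the
    noun-to-adjective and verb-to-adverb ratios, and the exclamation count.
    The zero-denominator fallback is an assumption, not documented behavior."""
    tags = [tok.rsplit("_", 1)[-1] for tok in tagged_tokens]
    nouns = sum(t.startswith("NN") for t in tags)
    verbs = sum(t.startswith("VB") for t in tags)
    adjs = sum(t.startswith("JJ") for t in tags)
    advs = sum(t.startswith("RB") for t in tags)
    excl = sum(tok.startswith("!") for tok in tagged_tokens)
    return {
        "N": float(nouns),
        "V": float(verbs),
        "A": float(adjs + advs),
        "N/adj": nouns / adjs if adjs else float(nouns),
        "V/adv": verbs / advs if advs else float(verbs),
        "E": float(excl),
    }

# grammatical_features("My_PRP$ work-husband_NN 's_POS last_JJ day_NN".split())
#   -> {'N': 2.0, 'V': 0.0, 'A': 1.0, 'N/adj': 2.0, 'V/adv': 0.0, 'E': 0.0}
```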
Consider the example My work-husband's last day. This is a sentence that a human would easily classify as negative: the speaker's "work-husband" (a designation typically reserved for a close friend at work) is leaving.

Classifier + POS tags
Feature                  NEGATIVE    POSITIVE
CLASS                    -0.71       0.71
1-SW-My_PRP$             0.09        -0.09
1-SSHAPE-'xX_XX          -0.04       0.04
1-SLW-day_NN             0.04        -0.04
1-SSHAPE-XxX_X$          0.09        -0.09
1-SSHAPE-xx-_xXX         -0.06       0.06
1-SFW-My_PRP$            0.05        -0.05
1-SSHAPE-xx_xXX          0.45        -0.45
1-SW-day_NN              -0.33       0.33
1-SWP-'s_POS-last_JJ     0.05        -0.05
1-SW-'s_POS              0.06        -0.06
1-SW-last_JJ             0.05        -0.05
1-SWP-last_JJ-day_NN     0.22        -0.22
Prob:                    0.49        0.51

Classifier + POS tags + NVA + ratios
Feature                  NEGATIVE    POSITIVE
CLASS                    -0.69       0.69
1-Value (#N)             -0.04       0.04
2-Value (#V)             0.09        -0.09
3-Value (#A)             -0.02       0.02
4-Value (#N/#A)          0.01        -0.01
5-Value (#V/#A)          0.05        -0.05
6-SW-My_PRP$             0.09        -0.09
6-SFW-My_PRP$            0.05        -0.05
6-SSHAPE-xx-_xXX         -0.02       0.02
6-SW-last_JJ             0.05        -0.05
6-SWP-'s_POS-last_JJ     0.05        -0.05
6-SWP-last_JJ-day_NN     0.22        -0.22
6-SLW-day_NN             0.04        -0.04
6-SSHAPE-xx_xXX          0.47        -0.47
6-SSHAPE-'xX_XX          -0.05       0.05
6-SW-day_NN              -0.30       0.30
6-SW-'s_POS              0.09        -0.09
6-SSHAPE-XxX_X$          0.09        -0.09
Prob:                    0.59        0.41

Here we see that the Classifier plus POS tags alone fails because of the strongly positive connotation of the word "day." However, the additional features added by the NVA counts and ratios were enough to pull the sentence from very slightly positive to the correct classification of negative. Here, the number of verbs (0.0), the ratio of verbs to adverbs (0.0), and the ratio of nouns to adjectives (2.0) suggested the sentence should be negative. The lack of descriptive words provided the Classifier with valuable insight into the sentiment of the sentence.

CLASSIFIER PLUS LDA
As stated, the motivation for incorporating a Gibbs-sampled LDA, as implemented in the Stanford Topic Modeling Toolbox, is to employ an unsupervised learning algorithm to discover latent features within our training data. Modifying a traditional LDA so that the set of topics assigned to words within a status update is constrained to a fixed set of labels is known as Labeled LDA, and it is what we have referred to throughout this paper as simply "LDA." The output of the Stanford LDA gives a relative fractional count of the words in an update that were assigned to the POSITIVE label and of those assigned to the NEGATIVE label. For example, a training file with LDA output (fbs.binomial.LDA.train) might contain a line such as:

NEGATIVE    0.2    6.8    10.5 hours working at graduation = sun burnt + sore throat + exhaustion = Party Pooper

With only two labels, POSITIVE and NEGATIVE, the LDA output in columns 2 and 3 above indicates that 0.2 words in the status update were assigned to the NEGATIVE label and 6.8 were assigned to the POSITIVE label. These two real-valued columns became real-valued features for our classifier to use. Combining the LDA and the classifier resulted in the following feature set.

Stanford Classifier + LDA "Best" Features (Binary Classification)
Property                        Value
1.realValued                    true
1.binnedValues                  -1.0, 0.0, 5.0, 10.0, 15.0, 100.0
1.binnedValuesStatistics        true
2.realValued                    true
2.binnedValues                  -1.0, 0.0, 5.0, 10.0, 15.0, 100.0
2.binnedValuesStatistics        true
3.useClassFeature               true
3.splitWordsRegexp              [ ,]
3.useSplitWords                 true
3.useSplitWordPairs             true
3.useSplitFirstLastWords        true
3.useSplitPrefixSuffixNGrams    true
3.splitWordShape                chris4

For columns 1 and 2 above, the Stanford Classifier's "binnedValues" feature prevents data scarcity by discretizing the LDA output into ranged bins. "binnedValuesStatistics" tells the Classifier to log data about the collection sizes of the bins for later analysis.
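To illustrate how the LDA output and the text column fit together, the sketch below produces one row of the binary LDA training file in the format of the example line above, along with a rough rendition of what binnedValues does with the real-valued columns. The helper names are hypothetical.

```python
def combined_line(label, lda_negative, lda_positive, tagged_text):
    """One row of fbs.binomial.LDA.train in the format of the example above:
    gold label, the two real-valued LDA word-count columns, then the text."""
    return "{}\t{:.1f}\t{:.1f}\t{}".format(label, lda_negative, lda_positive, tagged_text)

def binned(value, edges=(-1.0, 0.0, 5.0, 10.0, 15.0, 100.0)):
    """Roughly what binnedValues does: map a real value to the (lo, hi] bucket
    it falls into, so sparse real values become discrete features."""
    for lo, hi in zip(edges, edges[1:]):
        if lo < value <= hi:
            return "({},{}]".format(lo, hi)
    return None

# combined_line("NEGATIVE", 0.2, 6.8, "10.5 hours working at graduation ...")
#   -> "NEGATIVE\t0.2\t6.8\t10.5 hours working at graduation ..."
# binned(6.8) -> "(5.0,10.0]"
```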
Like the Classifier, the LDA also has customizable features. Unfortunately, the LDA did not perform to our expectations. The default LDA pipeline often rendered status updates unusable by the LDA: of the 3,500 status updates in the training set, the LDA returned metrics for only 1,446 of them, less than 40% of the total. In an attempt to give the LDA more data to work with, we implemented the "relaxed LDA feature set" below, assuming that eliminating DocumentMinimumLengthFilter would yield a wealth of usable data. However, eliminating that particular stage caused the LDA to throw a runtime exception.

Default LDA features
  Column(2)                            // indicates the status update is in column 2
  CaseFolder                           // lowercase everything
  SimpleEnglishTokenizer               // tokenize on space characters
  WordsAndNumbersOnlyFilter            // ignore non-words and non-numbers
  TermCounter                          // collect counts (needed below)
  TermMinimumLengthFilter(3)           // take terms with >= 3 characters
  TermMinimumDocumentCountFilter(4)    // filter terms appearing in < 4 docs
  TermDynamicStopListFilter(30)        // filter out the 30 most common terms
  DocumentMinimumLengthFilter(5)       // take only docs with >= 5 terms

Relaxed LDA feature set
  Column(2)                            // indicates the status update is in column 2
  SimpleEnglishTokenizer               // tokenize on space characters
  TermCounter                          // collect counts (needed below)
  TermMinimumLengthFilter(3)           // take terms with >= 3 characters
  TermMinimumDocumentCountFilter(4)    // filter terms appearing in < 4 docs
  DocumentMinimumLengthFilter(5)       // take only docs with >= 5 terms

The distribution of the LDA data across bins is heavily skewed toward 0-counts, as shown below:

BinnedValuesStatistics for column 1
  Val-(-1.0,0.0]   = {POSITIVE=1109.0, NEGATIVE=1068.0}
  Val-(0.0,5.0]    = {POSITIVE=564.0,  NEGATIVE=293.0}
  Val-(5.0,10.0]   = {POSITIVE=37.0,   NEGATIVE=318.0}
  Val-(10.0,15.0]  = {POSITIVE=4.0,    NEGATIVE=82.0}
  Val-(15.0,100.0] = {POSITIVE=1.0,    NEGATIVE=24.0}

BinnedValuesStatistics for column 2
  Val-(-1.0,0.0]   = {POSITIVE=994.0,  NEGATIVE=1180.0}
  Val-(0.0,5.0]    = {POSITIVE=262.0,  NEGATIVE=532.0}
  Val-(5.0,10.0]   = {POSITIVE=318.0,  NEGATIVE=64.0}
  Val-(10.0,15.0]  = {POSITIVE=101.0,  NEGATIVE=8.0}
  Val-(15.0,100.0] = {POSITIVE=40.0,   NEGATIVE=1.0}

Observe from the example line above that the LDA recognized only 7 words despite there being at least 11 legitimate words. To improve the word count parsed by the LDA, we wanted greater flexibility in tokenization than the simple space-splitting of the SimpleEnglishTokenizer. After we consulted with Evan Rosen, one of the Stanford Topic Modeling Toolbox developers, he acknowledged that the LDA cannot handle some input characters, implements only a couple of the scalanlp.text.tokenize methods, and remains primarily a beta effort.
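The kind of tokenization flexibility we wanted is easy to illustrate outside the toolbox. The regular expression below keeps words, numbers, and sentiment-bearing punctuation runs rather than splitting only on spaces; it is a hypothetical stand-in, not how the toolbox's SimpleEnglishTokenizer behaves.

```python
import re

# Keep words, numbers, and sentiment-bearing punctuation runs instead of
# splitting only on spaces. Hypothetical stand-in for a more flexible tokenizer.
TOKEN_RE = re.compile(r"[A-Za-z']+|\d+|[!?]+|\.{2,}")

def relaxed_tokenize(status):
    return TOKEN_RE.findall(status.lower())

# relaxed_tokenize("10.5 hours working at graduation = sun burnt")
#   -> ['10', '5', 'hours', 'working', 'at', 'graduation', 'sun', 'burnt']
```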
The final result of LDA classification of binary sentiment for Facebook status updates is disappointing. Although we ran a number of feature and iteration combinations, none of them showed an improvement, even over the MaxEnt classifier alone. Since the Classifier feature sets "chris2" and "chris4" had performed equally before, we ran both sets with the LDA to check for any differentiation. The final results, highlighting our best overall binary classification model, are below.

Trial                              F1 Score: Negative    F1 Score: Positive
Classifier alone                   0.809                 0.822
+LDA (defaults, chris2)            0.670                 0.667
+LDA (defaults, chris4)            0.670                 0.667
+LDA (defaults, binned, chris4)    0.665                 0.659
+LDA (defaults, 5k, chris2)        0.669                 0.667
+LDA (defaults, 5k, chris4)        0.669                 0.667
+LDA (relaxed, chris2)             0.637                 0.600
+LDA (relaxed, chris4)             0.638                 0.602
+LDA (relaxed, 5k, chris2)         0.630                 0.587
+LDA (relaxed, 5k, chris4)         0.630                 0.587
+POS with NVA & Ratios             0.863                 0.888
+POS with NVA & Ratios + LDA       0.753                 0.773

Due to the errors we encountered with the LDA itself, we are unable to make any assertive conclusions. Is the serious degradation of scores simply implementation error, or is the use of an LDA ultimately of no benefit in the context of our sentiment analysis task?

FOUR CLASSES: HAPPY, UNHAPPY, SKEPTICAL, PLAYFUL
While binary classification can be helpful when the objective is general trend analysis, deeper semantic extraction requires more granular sentiment analysis. To say that a person feels more positive than negative about something conveys little understanding of the nature of the positivity. Is there happiness, or merely tolerance? Is there playfulness, or merely acknowledgment? For this project, we chose to replace our binary classification task with a four-class classification of happy, unhappy, skeptical, and playful sentiments. As discussed previously, these were primarily chosen because those emoticon categories contained enough data to generate sufficient training and test set sizes. As before, each of the four classes was truncated to provide a roughly even distribution across the training and test sets.

Having experimentally derived a "best model" on the binary classification task, our objective in this section was to determine whether that model scales to the harder task of multi-label classification. For good measure, we ran the LDA again, curious to see whether multiple classes would showcase its modeling strengths, perhaps indicating that the LDA was inherently weaker with fewer class labels to which it could assign words.

CLASSIFIER
Using the same set of "best features" as in the binary classification task, our baseline F1 scores using only the MaxEnt classifier are below. Although "chris4" showed a 0.017 net degradation versus "chris2", we decided to stick with "chris4" throughout the remainder of our testing.

Stanford Classifier F1 Scores for Multi-Label Classification
Feature Set    HAPPY    PLAYFUL    SKEPTICAL    UNHAPPY
chris2         0.480    0.496      0.736        0.667
chris4         0.476    0.496      0.731        0.659
(For prior reference, the binary baseline scored 0.737 POSITIVE / 0.759 NEGATIVE, and the binary best model scored 0.888 POSITIVE / 0.863 NEGATIVE.)

Consistent with our expectation, the multi-label classification task proved more difficult than the simpler binary task.

Ratio of Labeled Classes in Training Data
Class     HAPPY     PLAYFUL    SKEPTICAL    UNHAPPY
Count     843       831        813          813
Ratio     25.54%    25.18%     24.63%       24.63%

Ratio of Labeled Classes in Test Data
Class     HAPPY     PLAYFUL    SKEPTICAL    UNHAPPY
Count     60        72         90           90
Ratio     19.23%    23.08%     28.85%       28.85%

Although the test data contains slightly higher proportions of "skeptical" and "unhappy" data, the training set maintains a relatively even distribution, so we are not concerned about training bias toward those classes.
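A quick way to perform this kind of class-balance check on any of our tab-separated data files is sketched below (trivial, and the file name is just an example):

```python
from collections import Counter

def label_distribution(lines):
    """Class balance of a train/test file whose first tab-separated column is
    the gold label, as in our data files; reproduces the ratio tables above."""
    counts = Counter(line.split("\t", 1)[0] for line in lines if line.strip())
    total = float(sum(counts.values()))
    return {label: (n, 100.0 * n / total) for label, n in counts.items()}

# with open("fbs.mc.train") as f:
#     print(label_distribution(f))
```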
CLASSIFIER PLUS LDA
Having been disappointed with the LDA results during binary classification, we were optimistic that the inclusion of more labels to classify would give the LDA some sense of purpose. We ran tests comparing the same default and relaxed LDA feature sets described in the binary classification section, as well as the default 1,500 iterations versus 5,000 iterations. In addition, we performed an intermediate experiment to determine whether LDA results would improve if the data were augmented with POS tags but only LDA features (i.e., no POS features). The net result is somewhere in between: better than the classifier alone, but far worse than the combination of POS features and the classifier. The results of these tests, along with an LDA-only baseline matching the features from the binary LDA tests, are summarized below. In the sections that follow, LDA discussions are assumed to include POS-tagged data.

Trial                         F1: Happy    F1: Playful    F1: Skeptical    F1: Unhappy
Classifier-only               0.476        0.496          0.731            0.659
+LDA (no POS, best)           0.392        0.500          0.637            0.536
+LDA (relaxed, chris2)        0.451        0.614          0.847            0.547
+LDA (relaxed, chris4)        0.451        0.614          0.847            0.547
+LDA (5k, relaxed, chris2)    0.454        0.650          0.857            0.554
+LDA (5k, relaxed, chris4)    0.451        0.644          0.857            0.560
+LDA (default, chris2)        0.453        0.655          0.880            0.565
+LDA (default, chris4)        0.453        0.655          0.876            0.557
+LDA (binned, chris4)         0.416        0.605          0.859            0.485
+LDA (default, 5k, chris2)    0.471        0.650          0.886            0.559
+LDA (default, 5k, chris4)    0.471        0.650          0.886            0.559

Interestingly, probability mass shifts from "happy" and "unhappy" to "playful" and "skeptical" when using LDA plus POS tags (with no POS features). The overall improvement is still not what we had hoped for, but it does reflect an 8% improvement over the classifier alone. While we would like to give the LDA credit for doing its job, we suspect the increase is more attributable to the POS tags than to the number of labels. Nonetheless, considering that LDA alone generally degrades the classifier-only results, the improvement with POS tags is noteworthy. Comparing the LDA results against the classifier-only results also reveals the reason for the higher "skeptical" and "unhappy" scores: the distinction between playful and happy is subtle, and the classifier has low confidence when distinguishing between the two. One might argue the same relationship holds between skeptical and unhappy, but the semantics are more polarized in that pair; between playfulness and a state of happiness there is arguably a much higher correlation.

Consider the following "playful" example: lol, you gotta see this! w=HtTp://u.MavRev.cO09dp2. The classifier gives this status update a "skeptical" label, which is clearly wrong. The LDA, on the other hand, gives it a "happy" label. This is still wrong, but at least the LDA gets closer to the underlying sentiment expressed by the update.

LDA classification example: lol, you gotta see this! w=HtTp://u.MavRev.cO09dp2

Classifier-only
Feature            HAPPY    PLAYF    SKEPT    UNHAP
1-SW-this!         -0.46    0.16     0.43     -0.12
1-SWP-see-this!    -0.24    0.01     0.35     -0.13
1-SWP-you-gotta    -0.16    -0.11    0.33     -0.06
1-SW-you           0.00     0.53     0.23     -0.75
1-SWP-gotta-see    -0.18    -0.06    0.31     -0.07
1-SWP--you         0.01     -0.32    0.33     -0.01
1-SW-lol           -0.21    0.23     0.08     -0.10
Prob:              0.22     0.18     0.55     0.05
Sentence Prob for SKEPT: 0.5473878585366191

Classifier with LDA
Feature            HAPPY    PLAYF    SKEPT    UNHAP
1-Value            0.79     0.14     0.02     -0.25
2-Value            -0.10    0.95     0.37     -0.27
3-Value            -0.01    0.63     0.88     -0.51
4-Value            0.16     0.03     0.11     0.81
5-SW-you_PRP       0.03     0.39     -0.10    -0.32
5-SSHAPE-xx_xXX    0.31     -1.77    0.01     1.37
5-SW-this_DT       -0.37    0.52     0.18     -0.33
5-SW-lol_NN        0.01     0.22     -0.03    -0.21
5-SW-!_.           0.41     0.29     -0.21    -0.50
5-SSHAPE-!_.       0.41     0.29     -0.21    -0.50
5-SSHAPE-=_XXX     0.23     -0.21    0.39     -0.41
5-SSHAPE-x_XX      0.04     -0.73    0.40     0.29
Prob:              0.42     0.39     0.17     0.01
Sentence Prob for HAPPY: 0.6659759502662165
In the classifier-only classification, the features that include "this!" seem strangely mapped to "skeptical." Looking at other status updates containing "this!", we find examples such as "Guys! You have to see this! Disney's most shocking hidden message!...", where the containing sentence is labeled "skeptical." "You gotta see" might understandably be associated with skepticism, and "lol" is the only feature we would expect to associate with "playful" that actually is. At least we take comfort that the classifier is not very confident of its labeling, with a likelihood of only 54.7%.

Looking at the LDA feature weights, the "1-Value", "2-Value", "3-Value", and "4-Value" feature names reflect the LDA word-to-label ratio counts, mapping to the class labels PLAYFUL, SKEPTICAL, UNHAPPY, and HAPPY, respectively. Tellingly, "playful" (1-Value) is most strongly associated with "happy", with "playful" itself only a distant second. Similarly, "unhappy" (3-Value) is most closely associated with "skeptical" but, strangely, is also surprisingly correlated with "playful". Stranger still, "skeptical" (2-Value) is most strongly correlated with "playful", with its true association to itself a distant second. And "happy" (4-Value) is most strongly correlated with "unhappy," with itself second. The classifier struggles against this weirdness but is helped by the POS tagger: "this" is separated from "!" by the tagger, resulting in a significant shift of probability mass from "skeptical" to "happy" and "playful". The classifier's word-shape features seem to interpret the POS tags themselves as closely correlated with "skeptical," and the "playful" label is heavily penalized by the presence of xx_xXX word shapes, although it is not apparent which word that corresponds to. Most problematic, however, is that "!" is more strongly associated with "happy" than with "playful". The "!" can be explained away as a likely symptom of the training samples containing an unexpected prevalence of "!" in "skeptical" labels. The LDA must be given credit for being a factor in shifting probability mass toward "happy" and "playful", as described above, and the POS tagger for allowing us to learn the difference between "!" and "this!". We can also take comfort in the fact that the classifier at least acknowledged its uncertainty, scoring the labeling with only 66.5% likelihood. In hindsight, the same benefit induced by POS tagging could have been realized by the classifier alone with letter-shape features around sentence and word endings.

CLASSIFIER PLUS POS TAGGER
The process for adding POS tags was exactly the same as in the binary case above, but this time we observed F1 score improvements ranging from 2% to 20%, depending on the class.
Consider the following "unhappy" sentence that the Classifier alone classified incorrectly as "happy," but which was classified correctly when POS tags were included: lost photos from Paris stupid replace command.

Classifier alone
Feature              HAPPY    PLAYF    SKEPT    UNHAP
CLASS                -0.50    0.90     1.05     -0.91
1-SW-stupid          0.23     -0.19    -0.11    0.06
1-SW-from            -0.19    0.01     0.24     -0.05
1-SW-lost            -0.42    -0.23    -0.25    0.90
1-SW-photos          -0.01    0.15     0.26     -0.40
1-SW-Paris           -0.11    -0.05    0.03     0.13
1-SWP-photos-from    0.15     -0.16    0.08     -0.07
1-SWP-Paris-         -0.10    -0.01    -0.02    0.14
1-SW-replace         -0.06    -0.00    -0.00    0.06
1-SSHAPE-xxxxx       0.84     -1.16    -0.83    1.03
1-SWP--stupid        0.21     -0.09    -0.03    -0.09
1-SSHAPE-xxxx        0.07     -0.29    -0.32    0.56
1-SSHAPE-Xxxxx       1.01     -0.60    -0.10    -0.16
1-SW-                0.10     -0.09    -0.05    0.06
1-SSHAPE-            0.10     -0.09    -0.05    0.06
Prob:                0.44     0.02     0.10     0.44

Classifier with POS tags
Feature                      HAPPY    PLAYF    SKEPT    UNHAP
CLASS                        0.43     1.48     -1.60    -0.49
1-SW-from_IN                 0.02     -0.04    -0.21    0.23
1-SSHAPE-xx_xXX              0.38     -1.69    0.10     1.37
1-SW-photos_NNS              0.13     0.06     0.04     -0.23
1-SW-lost_VBN                0.02     -0.02    -0.10    0.11
1-SSHAPE-xxX_xXX             -0.03    -0.35    -0.39    0.76
1-SSHAPE-XxX_xXX             0.49     -0.40    -0.06    0.09
1-SW-stupid_JJ               0.23     -0.14    -0.06    -0.03
1-SWP-photos_NNS-from_IN     0.17     -0.12    0.05     -0.10
1-SW-Paris_NNP               -0.10    -0.03    -0.01    0.14
Prob:                        0.46     0.02     0.01     0.51

Here, the word-shape features without the POS tags cause the classifier to classify the sentence as happy, despite its obviously unhappy sentiment. The addition of the POS tags removes those errant features and replaces them with features that are slightly rarer and so better equipped to determine sentiment.

As in the binary case, we next added the same class of POS-based features that look at the overall grammatical structure of the sentence. The table below reflects a small sampling of the possible combinations of these features that we tested. The best results occurred when we included features for the number of nouns, the number of verbs, the number of adjectives and adverbs, and the number of exclamation marks.

Trial                           F1: Happy    F1: Playful    F1: Skeptical    F1: Unhappy
Classifier alone                0.476        0.496          0.731            0.659
+POS, no add'l features         0.503        0.661          0.954            0.656
+POS with NVA                   0.528        0.656          0.960            0.681
+POS with All                   0.514        0.624          0.954            0.674
+POS with Ratios                0.500        0.645          0.876            0.543
+POS with Terminals & Ratios    0.510        0.619          0.949            0.652
+POS with NVAA                  0.524        0.640          0.960            0.674
+POS with NVAE                  0.521        0.677          0.960            0.685
+POS with NVA & Ratios          0.503        0.624          0.960            0.670
+POS with NVAE & Ratios         0.503        0.624          0.960            0.670

By adding the NVAE features, we observed a 1% to 3% increase in the F1 score.
Consider the example thinks he has the flu!. This is a sentence that a human would easily classify as unhappy (no one wants the flu), but it could easily confuse a classifier because it appears "happy" on paper.

Classification example: thinks he has the flu!
Classifier + POS tags (weights toward HAPPY): CLASS 0.43, 1-SLW-!_. 1.21, 1-SW-the_DT 0.38, 1-SSHAPE-xx_xXX 0.38, 1-SW-!_. 0.37, 1-SSHAPE-!_. 0.37, 1-SWP-he_PRP-has_VBZ 0.25, 1-SWP-has_VBZ-the_DT 0.17, 1-SSHAPE-xxX_xXX -0.03, 1-SW-he_PRP -0.04, 1-SWP-flu_NN-!_. -0.05, 1-SFW-thinks_VBZ -0.08, 1-SSHAPE-xxX_XX -0.12, 1-SW-thinks_VBZ -0.16, 1-SW-flu_NN -0.19, 1-SW-has_VBZ -0.78
Classifier + POS tags + NVAE (weights toward HAPPY): CLASS 0.46, 5-SLW-!_. 1.26, 1-Value (#N) 1.04, 5-SW-the_DT 0.41, 5-SW-!_. 0.37, 5-SSHAPE-!_. 0.37, 5-SSHAPE-xx_xXX 0.32, 5-SWP-he_PRP-has_VBZ 0.21, 5-SWP-has_VBZ-the_DT 0.14, 5-SSHAPE-xxX_xXX 0.07, 3-Value (#A) 0.04, 2-Value (#V) 0.03, 5-SW-he_PRP 0.01, 5-SFW-thinks_VBZ -0.06, 5-SWP-flu_NN-!_. -0.06, 5-SSHAPE-xxX_XX -0.06, 5-SW-thinks_VBZ -0.13, 5-SW-flu_NN -0.22, 4-Value (#E) -0.29, 5-SW-has_VBZ -0.70
(The per-class weights for PLAYFUL, SKEPTICAL, and UNHAPPY could not be recovered from the source layout.)

Here we see the Classifier plus POS tags alone fail because of the strongly positive connotation of the exclamation point, selecting happy with a probability of 0.542. However, the additional NVAE features are enough to pull the sentence from very slightly happy to the correct classification of unhappy, selecting the proper sentiment with a probability of 0.500. Here, the number of nouns (1.0) and the number of verbs (2.0) suggested the sentence should be unhappy.

Error Analysis
"Whenever I go on a ride, I'm always thinking of what's wrong with the thing and how it can be improved." (Walt Disney)

DATA COLLECTION
Ideally, we wanted to collect 1,000,000 raw status updates. We knew going in that we would label via emoticon, and we assumed a signal-to-noise ratio of 90%. We anticipated duplicates resulting from shared friends and removed them with the sort -u (unique) Linux command. However, we neglected to consider that many Facebook users re-post their status updates with only slight changes, or that some status updates are common to many different users. For example, one user posted a succession of the following status updates:
2 days
2 days !
2 days!!!!
2 days baby!!
On major holidays, many similar status messages can be found, such as these from many different users:
happy mother's day to all the mothers out there
Happy Mother's Day to all the moms out there...
Happy Mother's Day to all my mommy friends out there!
Happy Mothers Day
Happy mother's day to all the mommies out there
Contrast the above with a sentence like "I'm not happy!": would this be classified correctly? We chose to retain the distinction of capitalization, as well as punctuation, but this could also hurt us if frequent repetition within a certain sentiment inappropriately influences our feature weights. In this specific example, it turns out that our model does appropriately label "I'm not happy!" as NEGATIVE, with 60.2% confidence. Understandably, "not" is heavily correlated with "negative," assigned a weight of 1.26. Another example, "2 days until I die!", does not fare so well. This time, the classifier declares the sentence to be "positive," with 91.2% confidence. Surprisingly, it is not because of the word "die," which is assigned no weight. Instead, "2 days" associates strongly with "positive," as do words ending with "!". Would we have classified this correctly had we implemented better filtering among updates with common word prefixes and sentiments? We propose a solution to this in the "Pre-processing" section of Conclusions and Future Work.
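As a hypothetical illustration of that proposed filtering, a crude normalization-plus-prefix de-duplication pass would already collapse the "2 days" family of updates; an alignment-model approach, as suggested later, would be more principled.

```python
import re

def near_duplicate_filter(statuses, prefix_words=4):
    """A crude stand-in for the smarter de-duplication proposed above: normalize
    case and punctuation and keep only the first status seen for each short word
    prefix, so "2 days", "2 days !" and "2 days!!!!" collapse into one sample.
    The prefix length is an arbitrary illustrative choice."""
    seen, kept = set(), []
    for status in statuses:
        normalized = re.sub(r"[^a-z0-9 ]+", "", status.lower())
        key = " ".join(normalized.split()[:prefix_words])
        if key and key not in seen:
            seen.add(key)
            kept.append(status)
    return kept
```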
PREPROCESSING
Although we had a decent-sized raw corpus, we had no time to hand-label the data or to verify that all of it was properly labeled by emoticons. We spot-checked a few samples for sanity and were generally pleased with the results. Still, the fact that emoticons served as our sentiment barometer also eliminated what could have been an important learned feature: imagine a word- or letter-shape feature that models emoticons, or even features for specific emoticons. The lack of time to hand-label our data probably hurt our results. Further, the need to remove the emoticons once labeled led to complicated regular-expression filtering that was not foolproof. In some cases, we observed emoticons that snuck their way into sentences; in particular, we observed a handful of sentences labeled NEGATIVE that had had :( removed, but not :). Sufficient time to hand-label would have eliminated this error altogether.

Additionally, we assumed garbage strings such as URLs would be a source of error, since many status updates we collected contained URLs. It turns out that sparsity actually helps here: since nearly every link is unique, their feature weights turn out to be 0. Similarly, we tokenized only on space and comma characters when, perhaps, we should have tokenized on all characters that provide no sentiment insight. We also focused on word-shape features and did not include any letter-shape features. Such features might have given more weight to borderline cases, and letter-shape features might also have been able to model word-based prefixes and suffixes.

BINARY
Despite the successes in the binomial case, the Classifier + POS tags + other POS features still misclassified many sentences. Consider the example internet working again.

Best binary model (Classifier + POS tags + NVA & ratios)
Feature             NEGATIVE    POSITIVE
CLASS               -0.69       0.69
1-Value (#N)        -0.04       0.04
2-Value (#V)        0.09        -0.09
3-Value (#A)        -0.02       0.02
4-Value (#N/#A)     0.01        -0.01
5-Value (#V/#A)     0.05        -0.05
6-SW-internet_NN    0.15        -0.15
6-SW-working_VBG    0.64        -0.64
6-SW-again_RB       0.21        -0.21
6-SLW-again_RB      0.24        -0.24
6-SSHAPE-xxX_xXX    0.19        -0.19
6-SSHAPE-xx_xXX     0.47        -0.47
Prob:               0.93        0.07

internet_NN working_VBG again_RB    gold: POSITIVE    guess: NEGATIVE (0.9307304154458833)

The predominant negative feature here is still the word "working." Adding POS tags and features does not teach the Classifier that this sentence is positive. While working is generally considered negative, in the case of the Internet most users prefer that it be working! Context, such as knowing that working refers to the Internet rather than to a job, might help improve examples such as this. We considered using the Stanford Parser to incorporate additional grammatical features, but found by spot-checking examples that the Stanford Parser, trained on PTB data, did not do a great job of determining dependencies for Facebook status messages. This makes sense, considering that the type of language used in the WSJ is very different from the type of language used on Facebook. To correct this issue, a new training corpus would need to be created by parsing Facebook status messages.
MULTICLASS
The multi-class case also has room for improvement. Consider the example looks like im waking up early next semester.

Classifier + POS tags + NVAE
Feature                     HAPPY    PLAYFUL    SKEPTICAL    UNHAPPY
CLASS                       0.46     1.45       -1.56        -0.36
1-Value (#N)                1.04     1.05       0.96         0.84
2-Value (#V)                0.03     0.24       0.10         0.53
3-Value (#A)                0.04     0.29       0.15         -0.03
4-Value (#E)                -0.29    -0.12      -0.21        -0.27
5-SW-looks_VBZ              0.15     -0.01      -0.06        -0.08
5-SWP-looks_VBZ-like_IN     -0.06    0.03       -0.01        0.04
5-SW-like_IN                -0.24    -0.03      -0.12        0.39
5-SW-im_NN                  -0.36    -0.03      0.41         -0.02
5-SW-waking_VBG             -0.07    -0.03      -0.01        0.11
5-SW-up_RB                  -0.04    -0.02      -0.14        0.19
5-SWP-up_RB-early_RB        -0.01    -0.01      -0.01        0.04
5-SW-early_RB               -0.12    -0.19      -0.19        0.50
5-SW-next_JJ                -0.19    0.17       0.37         -0.35
5-SW-semester_NN            0.07     -0.08      -0.05        0.07
5-SSHAPE-xx_xXX             0.32     -1.74      0.07         1.44
5-SSHAPE-xx_XX              -0.03    -0.07      -0.22        0.29
5-SSHAPE-xxX_xXX            0.07     -0.37      -0.30        0.60
Prob:                       0.04     0.03       0.01         0.92

looks_VBZ like_IN im_NN waking_VBG up_RB early_RB next_JJ semester_NN    gold: SKEPTICAL    guess: UNHAPPY (0.913034316024457)

Here, the number of nouns in the status message (2.0) and the words "next" and "im" pull the sentence toward unhappy, even though the author used a skeptical (:-/) smiley. There is a very fine line between whether this is truly a skeptical message or an unhappy one; even a human would struggle to classify this status message.

Conclusions and Future Work
If the long-term vision of this project is sentiment classification of Facebook status updates, with a correlation to topic, we have barely scratched the surface of the work to be done in pursuit of that goal. Within the broader vision, we constrained the scope of this project to evaluating how well the "best" binary classification model would scale to the harder problem of multi-label classification. Would LDA be the key, would POS, would feature sets, or would it be some combination of all of these? We have shown that our best model does, in fact, scale to the multi-class task in the sense that a similar set of features produced the best classification results for both problems. Facebook status updates, more generously capped at 420 characters, benefited tremendously from POS tagging given their more sentence-like structure. Surprisingly, the LDA only hurt our results, but we acknowledge that, due to inherent problems with the LDA code itself, these results are ultimately inconclusive. Of course, with an increased number of labels comes increased uncertainty: while the models scaled in terms of relevant feature sets, misclassification rose, as expected. To mitigate this, we note the following areas in which our work could be extended and improved:

• Semantic Analysis: The next major phase of this project would be to classify the topic of status updates. Correlating sentiment and topic, particularly by using a parser to determine the noun or verb of interest, would allow for true semantic analysis. We believe this has wide-ranging uses in marketing/adware applications and in sociological data mining for policy makers and world organizations.

• Data Collection: As mentioned in the error analysis, hand-labeling our Facebook corpus rather than using emoticons as a proxy for sentiment would allow us to use emoticons as features instead. This would remove a source of potential error (multiple smileys or "fake smileys") and replace it with what would likely be a very strong set of features. Hand-labeling would also have allowed us to use much more of the over 62,000 status updates we collected, resulting in more reliable model performance data.

• Labeling: Another angle to consider is our very approach to labeling.
For example, our multi-class labeling actually resulted in 4,382 "happy" sentences, 1,284 "playful" sentences, 1,254 "skeptical" sentences, and 903 "unhappy" sentences. To prevent learning from being unduly influenced by one sentiment over another, we capped each of these at 903 in the training set. But was that the correct approach? Of the total usable samples (i.e., the ones we could auto-label using emoticons), is it not valuable to model the natural distribution of sentiments? Is it the case that people are more likely to update their Facebook status when they are happy about something than when they are unhappy? Are they more likely to distance themselves from Facebook when unhappy? With more time, it would have been an interesting experiment to use this distribution for the classifier, the LDA, or both; it could have had a significant impact, especially in low-confidence examples. We could also have employed Laplace smoothing or another technique to smooth this distribution, creating the possibility of an "unseen" label class.

• Pre-processing: Implementing smarter regular-expression filtering to weed out non-alphanumeric characters that do not convey sentiment would increase the amount of relevant data and possibly improve our results. Also, a more advanced method for removing duplicates, such as using an alignment model to remove the two or three highest-scored "translations" of the current status update, would avoid the problem of over-weighting certain words/features.

• Classifier: We explored only the word feature sets provided as part of the Stanford Classifier package, but the source is all there to add a custom feature set. It is possible our feature wish-list could have been implemented via character-level shapes instead of creating new features. In our analysis of LDA performance on the multi-label classification task, we observed LDA correlations that model inherent ambiguities in the labels, such as between "happy" and "playful." One idea is to create a feature that represents the joint over two or more LDA features. In the LDA case, the misclassification may have been reduced had we been able to consider "happy" and "playful" together, or even "happy" and "unhappy," in the hope of polarizing the feature weights assigned to these two.

• LDA: Clearly, the LDA results observed here bring down the overall score of our model substantially. As described, one restrictive pipeline stage could not be removed without an exception being thrown, and the LDA does not implement regular-expression tokenizers. Also, the data sparsity problem results in highly skewed label associations, increasing misclassification. Fixing these issues would hopefully yield useful features from the LDA, improving our F1 scores.

• Feature Engineering: Most of the features used in our "best" model may be considered unigram or, at most, bigram in nature. In addition to the joint over LDA result columns proposed above, it might be similarly useful to compute the joint over both POS features and LDA features. For example, are HAPPY updates more likely to have a higher count of adverbs? Are SKEPTICAL updates more likely to have higher word and noun counts? The Stanford Classifier's grouping features may have been able to model this, but we did not have time to explore that capability.

• Parser: Training on Facebook data instead of PTB data would allow the Stanford Parser to provide accurate dependency data and open up more options for possible features.
The language used in the WSJ is so different from the language used on Facebook that the dependencies produced by the Stanford Parser were not accurate enough to justify using them.

References
Dwergs. "What's the maximum length of a Facebook status update?" February 19, 2009. http://reface.me/status-updates/whats-the-maximum-length-of-a-facebook-status-update/
Eldon, Eric. "ComScore, Quantcast, Compete, Nielsen Show a Strong December for Facebook Traffic in the US." January 22, 2010. http://www.insidefacebook.com/2010/01/22/comscore-quantcast-compete-show-a-strongdecember-for-facebook-traffic-in-the-us/
Go, Alec; Huang, Lei; Bhayani, Richa. "Twitter Sentiment Analysis." Stanford University, CS224N Natural Language Processing (Professor Chris Manning), Spring 2009. http://nlp.stanford.edu/courses/cs224n/2009/fp/3.pdf
Harris, Jonathan; Kamvar, Sep. "We Feel Fine." August 2005. http://www.wefeelfine.org
O'Connor, Brendan; Balasubramanyan, Ramnath; Routledge, Bryan R.; Smith, Noah A. "From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series." Carnegie Mellon University, 2010. http://anyall.org/oconnor_balasubramanyan_routledge_smith.icwsm2010.tweets_to_polls.pdf
Pang, Bo; Lee, Lillian. "Opinion Mining and Sentiment Analysis." Cornell University, 2008. http://www.cs.cornell.edu/home/llee/omsa/omsa-published.pdf
Ramage, Daniel; Dumais, Susan; Liebling, Dan. "Characterizing Microblogs with Topic Models." Stanford University, 2010. http://nlp.stanford.edu/pubs/twitter-icwsm10.pdf
Ramage, Daniel; Hall, David; Nallapati, Ramesh; Manning, Christopher D. "Labeled LDA: A Supervised Topic Model for Credit Attribution in Multi-Labeled Corpora." Stanford University, 2009. http://www.stanford.edu/~dramage/papers/llda-emnlp09.pdf