Sentiment Analysis – Opinion Mining for the Technical Generation
By: Bryan Dickens
ENGL 202C Section 14
3/28/14

Audience and Scope

The objective of this document is to train future readers who will work in textual data analysis to apply sentiment analysis to their text data, producing sentiment output that can then be used in correlation studies within their research. Specifically, the audience is the student or students who will replace my position in Dr. Tucker’s D.A.T.A. laboratory; the document will bring them up to speed on what they must understand about text data mining and the applications being built on the sentiment data. The audience is expected to have some prior knowledge of machine learning, a general sense of how data mining works, and proficient coding experience. They will most likely be college underclassmen in an engineering or science major. The document is therefore relatively technical but light on specific functional details, since its readers are not graduate students with this as their only focus of study.

The document first introduces sentiment analysis and its uses, then explains the process in detail with visual descriptions. It closes with future development and a summary of what has been covered.

What is Sentiment Analysis?

Figure 1. What exactly is Sentiment Analysis?

Sentiment Analysis is a relatively new field within data mining, a larger category of computer science and machine learning. It gathers large amounts of consumer text and determines the positive or negative sentiment of each post. The resulting sentiment value is most often used to find correlations with any number of other features of interest.
A good example is the earlier experiment by Mike Thelwall and his coauthors, “Sentiment Strength Detection in Short Informal Text,” which takes short, informal MySpace comments and accurately classifies the emotions users expressed when they wrote them.

A polarity summary of text is a valuable piece of data for an end user, as Amitava Das and his coauthors examine in their paper “Sentiment Analysis: What Is the End User’s Requirement?” They state that analyzing the sentiment of opinion sentences on a review site lets the site use the sentiment summary to cater better to each viewer. They also show the importance of temporal change graphs and how tracking sentiment over time can reveal useful data. Figure 2 shows a sentiment polarity graph of book blog posts over time, with a strong peak on July 16th, 2005, which happened to be the Harry Potter book release date.

Figure 2. A graphical example of sentiment tracking over time with the release of the new Harry Potter book.

The Sentiment Analysis Process

The process is best explained through a diagram accompanied by a step-by-step description of its parts. Figure 3 shows an example from a recent research paper by Akshi Kumar and Teeja Sebastian, “Sentiment Analysis on Twitter,” in which they collected tweets and extracted useful sentiment data from them. A walkthrough of the three main modules of the process follows: retrieval, preprocessing, and scoring.

Figure 3. The System Breakdown of Sentiment Analysis highlighting the three main modules of “Retrieval”, “Preprocessing” and “Scoring”.

Retrieval Module

The process starts with the retrieval module, because Sentiment Analysis requires a large amount of data to be gathered. In this example, Kumar and Sebastian used a Twitter API call that extracts tweets from a Twitter feed and sends them to a database for further analysis.
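As a rough illustration of the "send it into a database" step, the sketch below stores a handful of posts in an in-memory SQLite database and reads them back for the later modules. It is not the code from Kumar and Sebastian's paper: the table layout, the function names `store_posts` and `load_bodies`, and the sample posts are all assumptions made for this example, and the hard-coded list stands in for whatever an API call would return.

```python
import sqlite3

def store_posts(conn, posts):
    """Save raw (source, body) pairs so later modules can read them."""
    # Hypothetical table layout, chosen only for this sketch.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "id INTEGER PRIMARY KEY, source TEXT, body TEXT)"
    )
    conn.executemany("INSERT INTO posts (source, body) VALUES (?, ?)", posts)
    conn.commit()

def load_bodies(conn):
    """Return the stored text bodies in insertion order."""
    rows = conn.execute("SELECT body FROM posts ORDER BY id").fetchall()
    return [row[0] for row in rows]

# Made-up posts standing in for the result of an API call.
sample_posts = [
    ("twitter", "Loving the weather in Happy Valley today :)"),
    ("twitter", "This phone is not good at all"),
]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
store_posts(conn, sample_posts)
bodies = load_bodies(conn)
print(bodies)
```

Keeping the raw text in a database separates collection from analysis, so the same stored posts can be cleaned and scored more than once.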
However, this kind of extraction can be applied to any collection of text documents, such as pages in a book, blog posts, or reviews on Amazon. To analyze a single piece of text, retrieval may not be needed at all; the text can be submitted directly to the next module.

Preprocessing Module

Next comes the preprocessing module, in which the data is cleaned and scrubbed before it is submitted for scoring. Think of this as the car wash before you sell your car: you want what comes out of this module to be in the best condition possible. Data gathered in bulk is very messy, with misspellings, made-up words, sarcasm, and spam, to name just a few of the problems. How can one know whether a tweet of “Happy Valley” refers to the place in State College or just to a valley that is happy? Does someone writing the same negative message a hundred times make a review a hundred times more negative? Problems like these are extremely common, and some, such as detecting sarcasm in online text, are still being solved today. Bing Liu states in one of his papers on this topic that “In most applications, the user needs to know additional details that may not be present in the individual text, but can be better clustered from the analysis of the entire data as a whole.” In this step of the diagram, Kumar and Sebastian converted everything in a tweet that was not plain text into understandable text. This required removing tags, substituting emoticons with their word equivalents, and correcting spelling. In any Sentiment Analysis it is important to have a solid preprocessing module, because the quality of the output data depends on it.

Scoring Module

The final module is scoring. Scoring in Sentiment Analysis breaks the text excerpt down into sentences and then breaks each sentence down into words. Each word is looked up in a sentiment dictionary, which maps it to a score based on how it has been used historically.
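The lookup step just described can be sketched in a few lines. This is a minimal illustration, not the scoring code of any particular tool: the `LEXICON` table is a made-up toy dictionary with one number per word, the function names are hypothetical, and the sentence splitter is a simple regular expression that would stumble on abbreviations and other edge cases.

```python
import re

# Toy sentiment dictionary (an assumption for this sketch): word -> score
# between -1.0 (negative) and 1.0 (positive). Real dictionaries are far larger.
LEXICON = {
    "good": 0.7, "great": 0.9, "happy": 0.8,
    "bad": -0.7, "terrible": -0.9, "boring": -0.6,
}

def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def split_words(sentence):
    """Lowercase the sentence and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", sentence.lower())

def lookup_scores(text):
    """For each sentence, pair every word with its dictionary score.

    Words missing from the dictionary get 0.0, which treats them as neutral.
    """
    return [
        [(word, LEXICON.get(word, 0.0)) for word in split_words(sentence)]
        for sentence in split_sentences(text)
    ]

scored = lookup_scores("The plot was great. The ending was boring!")
for sentence in scored:
    print(sentence)
```

The sketch stops at the per-word lookup. It does not yet combine the words into a sentence score or account for how the words relate to one another.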
Each dictionary entry records how often the word is used in positive, neutral, and negative contexts, expressed as percentages. The words are then combined back into the sentence, and the relations between them are taken into account so that negations and adjectives are applied to the correct terms. Once the entire sentence has been classified as a whole, its sentiment score is returned. The scores of all the sentences are then added together, and the total is returned as the score for the submitted piece of text. Figure 4, from Stanford Sentiment Lab, is an excellent layout of a sentence that has been mapped out during Sentiment Analysis. You can see how the negation at the beginning of the statement turns the whole sentence negative, even though most of the words used in the sentence are positive.

Figure 4. A Sentiment Tree breakdown of a sentence and how the scoring system works.

The details of how the exact scores are calculated from the sentiment dictionary are omitted here. They are not necessary for using Sentiment Analysis correctly, and they involve complex machine learning algorithms that, fortunately, have already been implemented and released as open source for anyone to use. The sentiment scores produced for each piece of text can then be correlated with any number of other features.

Conclusion and Future Development

In summary, the Sentiment Analysis process involves gathering textual data, passing it through a rigorous preprocessing module, and then scoring the text against a polarity-mapped table. This analysis can be used with any size or quantity of text data to get an accurate sentiment score, and that score becomes the input to a later correlation that helps solve a problem. Applying this process is still an expanding, largely uncharted territory in which Sentiment Analysis is solving problems that some would never have thought of before.
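To make the idea of correlating sentiment scores with another feature concrete, the sketch below computes a Pearson correlation coefficient between two short series. The numbers are invented purely for illustration and do not come from any study, and the function name `pearson_r` is a choice made for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations from the mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

# Made-up values for illustration only: an average daily sentiment score
# and a hypothetical poll percentage for the same five days.
daily_sentiment = [0.10, 0.25, 0.20, 0.40, 0.35]
poll_percent = [41.0, 43.0, 44.0, 47.0, 45.0]

r = pearson_r(daily_sentiment, poll_percent)
print(round(r, 3))
```

A value of r near 1 or -1 means the two series move together closely, and a value near 0 means they do not. Either way, the coefficient only measures association between the two series.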
The use of sentiment data to yield high correlations with political public opinion polls is currently being pioneered at Carnegie Mellon through the study of Twitter sentiment. “The Web is so mainstream that there’s no question that the Web is representative somehow of the population,” O’Connor said in an interview about his team’s study on presidential popularity. Throughout the current pursuit of correlations using Sentiment Analysis, it is important to remember that correlation between two things does not mean that one causes the other. The process is only a way to make accurate predictions when it is repeated on similar data, and it must be accompanied by causation studies before a specific correlation can be described as cause and effect.

Works Cited

Figure 1. Applying Sentiment Analysis. Digital image. Sdg Blog. N.p., n.d. Web. 26 Mar. 2014.
Figure 2. Das, Amitava, Sivaji Bandyopadhyay, and Bjorn Gamback. Excitement on July 16th. Digital image. ACM. N.p., n.d. Web. 26 Mar. 2014.
Figure 3. Kumar, Akshi, and Teeja Mary Sebastian. Sentiment Analysis System Architecture. Digital image. IJCSI. N.p., n.d. Web. 26 Mar. 2014.
Figure 4. Sentiment Trees. Digital image. Stanford NLP. N.p., n.d. Web. 26 Mar. 2014.

Das, Amitava, Sivaji Bandyopadhyay, and Bjorn Gamback. "Sentiment Analysis: What Is the End User's Requirement?" ACM, 13 June 2012. Web. 5 Mar. 2014.
Kumar, Akshi, and Teeja Mary Sebastian. "Sentiment Analysis on Twitter." IJCSI, July 2012. Web. 5 Mar. 2014.
Liu, Bing. Sentiment Analysis and Opinion Mining. San Rafael: Morgan & Claypool Publishers, 2012. Ebook Library. Web. 5 Mar. 2014.
Spice, Byron. "Carnegie Mellon Study of Twitter Sentiments Yields Results Similar to Public Opinion Polls." N.p., n.d. Web. 5 Mar. 2014.
Thelwall, Mike, Kevan Buckley, Georgios Paltoglou, Di Cai, and Arvid Kappas. "Sentiment Strength Detection in Short Informal Text." Journal of the American Society for Information Science and Technology 61.12 (2010): 2544–2558.