Sentiment Analysis Review

Sentiment Analysis – Opinion Mining
for the Technical Generation
By: Bryan Dickens
ENGL 202C Section 14
3/28/14
Audience and Scope
The objective is to train future readers who will work on textual data analysis in how
sentiment analysis can be applied to their text data to produce sentiment values that can then
be used in correlation studies within their research. Specifically, the audience is the student
or students who will replace me in Dr. Tucker’s D.A.T.A. laboratory; this document will bring
them up to speed on what they must understand about the text data mining being done and the
applications of the sentiment data. The audience is expected to have some prior knowledge of
machine learning, a general sense of how data mining works, and solid coding experience. They
will most likely be college underclassmen in an engineering or science major. The document is
designed to be relatively technical while staying high-level on specific functional details,
since the readers are not graduate students with this as their only focus of study.
The document first introduces sentiment analysis and its uses, then explains the process in
detail with visual aids. This is followed by future development and a conclusion summarizing
what has been covered.
What is Sentiment Analysis?
Figure 1. What exactly is Sentiment Analysis?
Sentiment Analysis is a relatively new field within data mining, a larger area of computer
science closely tied to machine learning. Sentiment mining gathers large amounts of consumer
text and determines the positive or negative sentiment of each post. This sentiment value is
most often used to look for correlations with any number of other features of interest. A good
example is the study “Sentiment Strength Detection in Short Informal Text” by Mike Thelwall
and colleagues, which takes short informal text from MySpace comments and produces accurate
sentiment classifications of the emotions users expressed when they wrote the text.
A polarity summary of text is valuable to an end user, as examined by Amitava Das and his
co-authors in the paper “Sentiment Analysis: What is the End User’s Requirement”. They state
that analyzing the sentiment of opinion sentences on a review site lets the site use the
sentiment summary to cater better to each site viewer. They also show the importance of
temporal change graphs and how tracking sentiment over time can reveal useful data. Figure 2
below shows a graph of sentiment polarity over time for book blog posts, with a strong peak on
July 16th, 2005, which was the release date of a Harry Potter book.
Figure 2. A graphical example of sentiment tracking over time with the release of the new Harry
Potter book.
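Tracking sentiment over time amounts to grouping scored posts by date and averaging each
group. The following is a minimal sketch of that idea, not the method used in the cited paper.
The function name daily_mean_sentiment, the -1 to +1 score range, and the sample scores are all
assumptions made up for illustration.

```python
from collections import defaultdict
from datetime import date

def daily_mean_sentiment(posts):
    """Average sentiment score per calendar day.

    `posts` is an iterable of (date, score) pairs, where score is assumed
    to be a polarity value from -1.0 (negative) to +1.0 (positive).
    """
    totals = defaultdict(lambda: [0.0, 0])  # day -> [sum of scores, count]
    for day, score in posts:
        totals[day][0] += score
        totals[day][1] += 1
    # Sort by day so the result reads as a time series.
    return {day: s / n for day, (s, n) in sorted(totals.items())}

# Made-up scores for illustration only; not data from the cited study.
posts = [
    (date(2005, 7, 15), 0.1),
    (date(2005, 7, 15), 0.3),
    (date(2005, 7, 16), 0.9),
    (date(2005, 7, 16), 0.7),
    (date(2005, 7, 17), 0.2),
]
trend = daily_mean_sentiment(posts)
peak_day = max(trend, key=trend.get)  # day with the highest average score
```

Plotting trend against its dates would give a graph of the same general shape as Figure 2,
with peak_day marking the spike.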
The Sentiment Analysis Process
The process is best explained through a diagram together with a step-by-step description of
its parts. Figure 3 below comes from a recent research paper by Akshi Kumar and Teeja
Sebastian, “Sentiment Analysis on Twitter”, in which they collected tweets and extracted
useful sentiment data. It is followed by a walkthrough of the three main modules of the
process: retrieval, preprocessing, and scoring.
Figure 3. The System Breakdown of Sentiment Analysis highlighting the main three modules of
“Retrieval”, “Preprocessing” and “Scoring”.
Retrieval Module
The first module is retrieval. Sentiment Analysis requires a large amount of data, which must
be gathered before anything else can happen. In this example, the authors use a Twitter API
call that extracts tweets from a Twitter feed and sends them to a database for further
analysis. However, the same kind of extraction can be applied to any collection of text
documents, such as pages in a book, blog posts, or reviews on Amazon. For Sentiment Analysis
of a single piece of text, retrieval may not be needed at all, and the text can be submitted
directly to the next module.
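The storage half of the retrieval module can be sketched with Python's built-in sqlite3
module. This is only an illustration under stated assumptions: the table layout, the helper
names store_posts and load_posts, and the sample posts are hypothetical, and the step that
actually calls the Twitter API is left out because it depends on the API version and
credentials in use.

```python
import sqlite3

def store_posts(conn, posts):
    """Save raw (source, text) pairs so later modules can read them back."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "id INTEGER PRIMARY KEY, source TEXT, body TEXT)"
    )
    conn.executemany("INSERT INTO posts (source, body) VALUES (?, ?)", posts)
    conn.commit()

def load_posts(conn):
    """Return every stored post body in insertion order."""
    rows = conn.execute("SELECT body FROM posts ORDER BY id")
    return [body for (body,) in rows]

# Hypothetical input; a real system would fill this list from an API or scraper.
collected = [
    ("twitter", "Loving the weather in Happy Valley today :)"),
    ("blog", "The new book was not good at all."),
]
conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
store_posts(conn, collected)
texts = load_posts(conn)
```

Keeping the raw text in a database means the preprocessing and scoring modules can be rerun
later without collecting the data again.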
Preprocessing Module
Next comes the preprocessing module, in which the data must be cleaned and scrubbed before it
is submitted for scoring. Think of this as the car wash before you sell your car: you want the
end product of this module to be in the best condition possible. Data collected in bulk is
very messy, with misspellings, made-up words, sarcasm, and spam, to name a few of the
problems. How can a program know whether a tweet mentioning “Happy Valley” refers to the area
around State College or just to a valley that is happy? Does someone posting the same negative
message a hundred times make the overall review dreadfully negative? Problems like these are
extremely common, and some, like “detecting sarcasm in online text”, are still being solved
today. Bing Liu states in his work on this topic that “In most applications, the user needs to
know additional details that may not be present in the individual text, but can be better
clustered from the analysis of the entire data as a whole.” In this step of the diagram, the
authors converted everything in a tweet that was not plain text into understandable text. This
required removing tags, substituting emoticons with their corresponding words, and correcting
spelling. In any Sentiment Analysis it is important to have a solid preprocessing module, as
it is the backbone of the quality of the output data.
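Two of the cleaning steps described above, removing tags and substituting emoticons, can be
sketched with regular expressions, along with a simple answer to the repeated-message
question. This is a rough sketch, not the pipeline from the Kumar and Sebastian paper. The
emoticon table, the function names, and the sample posts are assumptions, "tags" is
interpreted here as markup, links, @mentions, and the # symbol, and spell correction is left
out because it needs a full dictionary.

```python
import re

# Hypothetical emoticon-to-word table; a real one would be far larger.
EMOTICONS = {":)": "happy", ":(": "sad", ":D": "delighted"}

def clean_post(text):
    """Strip markup, links, and @mentions, then map emoticons to words."""
    text = re.sub(r"<[^>]+>", " ", text)        # HTML-style tags
    text = re.sub(r"https?://\S+", " ", text)   # links
    text = re.sub(r"@\w+", " ", text)           # user mentions
    text = text.replace("#", "")                # keep the hashtag word itself
    for emoticon, word in EMOTICONS.items():
        text = text.replace(emoticon, f" {word} ")
    return " ".join(text.split())               # collapse extra whitespace

def drop_duplicates(posts):
    """Keep the first copy of each post so repeated spam counts only once."""
    seen, unique = set(), []
    for post in posts:
        key = post.lower()
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique

# Made-up posts for illustration.
raw = [
    "<b>Great</b> game @psu_fan :) http://example.com",
    "Worst service ever :(",
    "Worst service ever :(",
]
cleaned = drop_duplicates([clean_post(p) for p in raw])
```

Links are removed before emoticons are replaced so that characters inside a URL are never
mistaken for an emoticon.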
Scoring Module
Last is the scoring module. Scoring in Sentiment Analysis is done by breaking the text excerpt
into sentences and then breaking each sentence into words. Each word is looked up in a
sentiment dictionary, which maps it to a score based on historical usage: the percentage of
the time the word appears in positive, neutral, or negative contexts. The words are then
combined back into the sentence, and the relationships between the words are taken into
account so that negation of terms and the adjectives that modify them are handled properly.
Once the entire sentence has been classified as a whole, the sentiment score for that sentence
is returned. The sentence scores are then added together, and the resulting score for the
submitted text is returned. Figure 4 below, from Stanford Sentiment Lab, is an excellent
layout of a sentence that has been mapped out during Sentiment Analysis. You can see how the
negation at the beginning of the statement turns the whole sentence negative, even though most
of the words used in the sentence are positive.
Figure 4. A Sentiment Tree breakdown of a sentence and how the scoring system works.
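The scoring steps described above (split into sentences, look up each word, account for
negation, add up the sentences) can be sketched in a few lines. This is a deliberately simple
dictionary-based sketch and not the tree-based model behind Figure 4. The lexicon values, the
negation list, and the rule that a negation flips only the next sentiment word are all
assumptions chosen for illustration.

```python
import re

# Tiny hypothetical lexicon: word -> polarity in [-1, 1].
LEXICON = {"good": 0.7, "great": 0.9, "happy": 0.8,
           "bad": -0.7, "awful": -0.9, "sad": -0.6}
NEGATIONS = {"not", "no", "never"}

def score_sentence(sentence):
    """Sum word polarities, flipping the sign of a word that follows a negation."""
    total, negate = 0.0, False
    for word in re.findall(r"[a-z']+", sentence.lower()):
        if word in NEGATIONS:
            negate = True
            continue
        polarity = LEXICON.get(word, 0.0)  # unknown words count as neutral
        if polarity:
            total += -polarity if negate else polarity
            negate = False  # the negation applies to one sentiment word only
    return total

def score_text(text):
    """Split text into sentences and add up the sentence scores."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(score_sentence(s) for s in sentences)

positive = score_text("The food was great. The staff was happy.")
negative = score_text("The movie was not good.")
```

Here positive comes out above zero and negative below zero, because "not" flips the
polarity of "good". Real systems handle negation scope far more carefully than this
one-word rule.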
Though the details of exactly how scores are calculated from the mapped sentiment dictionary
are omitted here, they are not necessary to use Sentiment Analysis correctly. They involve
complex machine learning algorithms that, thankfully, have already been implemented and
released as open source for anyone to use. The sentiment scores produced for each piece of
text can then be correlated with any number of other features.
Conclusion and Future Development
In summary, the Sentiment Analysis process involves gathering textual data, passing it through
a rigorous preprocessing module, and then scoring the text against a polarity-mapped table.
This analysis can be used with any size or quantity of text data to produce a sentiment score,
and that score is the output that feeds a later correlation study aimed at solving a problem.
Application of this analysis is still an expanding, largely uncharted territory, in which
Sentiment Analysis is solving problems that some would never have thought of before. The use
of sentiment data to yield high correlations with political public opinion polls is currently
being pioneered at Carnegie Mellon with the study of Twitter sentiment. “The Web is so
mainstream that there’s no question that the Web is representative somehow of the population,”
O’Connor said in an interview about his team’s study on presidential popularity.
Throughout the pursuit of correlations with sentiment, it is important to remember that two
things being correlated does not mean that one causes the other. The entire process is just a
way to make accurate predictions on repeated attempts with similar data, and it must be
accompanied by causation studies before a specific correlation can be described as cause and
effect.
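As a concrete illustration of the correlation step, the Pearson correlation coefficient
between a series of sentiment scores and another feature can be computed directly. This is a
minimal sketch: the function name pearson and both data series are made up for illustration
and are not taken from the Carnegie Mellon study.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# Made-up weekly values for illustration only.
sentiment = [0.10, 0.25, 0.15, 0.40, 0.35]   # average sentiment score per week
approval = [44.0, 47.0, 46.0, 50.0, 48.0]    # poll approval percentage per week
r = pearson(sentiment, approval)
```

A value of r near 1 says the two series rise and fall together. It says nothing about whether
one causes the other, which is exactly the caution raised above.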
Works Cited
Figure 1
Applying Sentiment Analysis. Digital image. Sdg Blog. N.p., n.d. Web. 26 Mar. 2014.
Figure 2
Das, Amitava, Sivaji Bandyopadhyay, and Bjorn Gamback. Excitement on July 16th. Digital
image. ACM. N.p., n.d. Web. 26 Mar. 2014.
Figure 3
Kumar, Akshi, and Teeja Mary Sebastian. Sentiment Analysis System Architecture. Digital image.
IJCSI. N.p., n.d. Web. 26 Mar. 2014.
Figure 4
Sentiment Trees. Digital image. Stanford NLP. N.p., n.d. Web. 26 Mar. 2014.
Das, Amitava, Sivaji Bandyopadhyay, and Bjorn Gamback. "Sentiment Analysis: What Is the End
User's Requirement?" ACM, 13 June 2012. Web. 5 Mar. 2014.
Kumar, Akshi, and Teeja Mary Sebastian. "Sentiment Analysis on Twitter." IJCSI, July 2012. Web.
5 Mar. 2014.
Liu, Bing. Sentiment Analysis and Opinion Mining. San Rafael: Morgan & Claypool Publishers,
2012. Ebook Library. Web. 05 Mar. 2014.
Spice, Byron. "Carnegie Mellon Study of Twitter Sentiments Yields Results Similar to Public
Opinion Polls." Carnegie Mellon Study of Twitter Sentiments Yields Results Similar to
Public Opinion Polls. N.p., n.d. Web. 05 Mar. 2014.
Thelwall, Mike, Kevan Buckley, Georgios Paltoglou, Di Cai, and Arvid Kappas. "Sentiment
Strength Detection in Short Informal Text." Journal of the American Society for
Information Science and Technology 61.12 (2010): 2544–2558.