Crowd explicit sentiment analysis
A. Montejo-Raez, M.C. Diaz-Galiano, F. Martinez-Santiago, L.A.
Urena-Lopez
Computer Science Department, University of Jaen
Knowledge-Based Systems 69 (2014)
Presenter: 劉憶年, 2015/11/3
Outline
Introduction
Sentiment analysis in social media
Using the crowd to collect affective terms
Crowd explicit sentiment analysis
Experiments and results
Conclusions and further work
2
Introduction (1/3)
There has been a significant increase in information sources
and, therefore, in the data registered by our society since the
advent of information and communication technologies. We all
are now aware of it, but we may not realize the
impressive numbers behind this fact.
We cannot ignore this explosion and how Big Data has
been revealed as the new major challenge in
computation.
Our ability to produce unlimited information may contrast
with storage limits and computer performance, which,
although still continuously advancing, may be overtaken
soon by human and non-human uploaded data.
3
Introduction (2/3)
This paper is only a small proposal on how we can face
the new era of overwhelming availability of information,
providing a strategy based on the construction of
knowledge from that continuous stream of data by
keeping a tiny, but filtered, part of it.
In fact, we do not have to keep all of it, as we simply have too much.
New knowledge can be constructed with the help of
simple heuristics. It is similar to the approach followed by
our brain, which also filters thousands of incoming
stimuli every second, generating from the rest
what may not be the best understanding of reality, but
knowledge more than valid for our survival as human
beings.
4
Introduction (3/3)
Following this reasoning, we have designed an approach
to polarity classification in sentiment analysis. It can be
considered a simple approach, but the results confirm its
validity, encouraging us to discover new areas where the
idea ‘‘Let the crowd express itself’’ could be applied.
5
Sentiment analysis in social media (1/3)
Sentiment Analysis (also known as Opinion Mining) is
one of the most active research areas in Natural
Language Processing nowadays, with special interest in
the classification of texts into positive, negative or neutral.
Supervised strategies have reported the best results
since the earliest studies and are still the choice for many
solutions, from Information Theory based features (with
SVM classifier) to more complex learned rules.
Unsupervised approaches have relied mainly on the use
of lexicons where words are associated with polarity
scores, although more advanced solutions have been
proposed that use intensive lexical analysis or deep
learning approaches, the latter being a very promising
and innovative way to tackle the problem.
6
Sentiment analysis in social media (2/3)
The second group of proposals (unsupervised ones) is
mainly based on the creation of a list of affective terms,
which is, again, usually closely related to the domain of
the targeted texts and hard to generate in different
languages.
At any rate, these studies follow the line of concept-level
sentiment analysis, as the analyzed texts are
represented by a vector of ‘‘feelings’’ rather than pure
term vectors. Most of these knowledge bases do not
consider contextual information that may modify the
polarity of collected concepts, as some terms may
become subjective when accompanying certain words, or
even completely change their polarity values when used
along with some modifiers or within the context of certain
phrases.
7
Sentiment analysis in social media (3/3)
Twitter has been found to be very useful in many
scenarios, like real-time Recommender Systems, cinema
revenue prediction or even crime prediction, among
others.
8
Using the crowd to collect affective terms
-- The WeFeelFine project (1/3)
Since the year 2005, the website WeFeelFine has been
harvesting from social media millions of sentences
containing ‘‘I feel’’ or ‘‘I am feeling’’ expressions, creating
a huge database of sentences related to feelings or
emotions.
Although the main goal of the project is to serve as a
monitor of the human state at a global level, we found
that the collected data could be useful in sentiment
analysis. The current list of feelings stored contains 2178
different feelings, although the 200 most frequent ones
represent 70% of a total of almost 2 million sentences.
These are the ones considered in this study.
9
Using the crowd to collect affective terms
-- The WeFeelFine project (2/3)
10
Using the crowd to collect affective terms
-- The WeFeelFine project (3/3)
WeFeelFine is a very interesting project and its
continuous trawling of data could represent a valuable
resource in sentiment analysis, as considered by
previous studies, where a bag of sentiment words is
created using WeFeelFine lists of feelings and
augmented with synonyms and antonyms from
Thesaurus.
11
Using the crowd to collect affective terms
-- The MeSientoX corpus (1/2)
As the generation of a feelings database based on
simple regular expressions is not a difficult task, we
decided to test this approach for Spanish. Instead of
translating the texts from the WeFeelFine database,
almost two million Spanish tweets that contain the words
‘‘me siento’’ (‘‘I feel’’) were collected by means of the
Twitter API.
The tweets were retrieved over 35 days, between
December 2012 and January 2013, collecting a total of
1,863,758 tweets.
Our first attempt at polarity classification with this data
was performed with promising results [self-reference
removed], and also with a retrieval based solution.
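The collection step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the regular expression, the function name `group_by_feeling`, and the sample tweets are all assumptions made up for the example, and only the single word right after "me siento" is captured.

```python
import re
from collections import defaultdict

# Hypothetical pattern: capture the single word that follows "me siento".
FEELING_RE = re.compile(r"\bme siento\s+([a-záéíóúñü]+)", re.IGNORECASE)


def group_by_feeling(tweets):
    """Map each captured word X to the tweets that contain 'me siento X'."""
    docs = defaultdict(list)
    for tweet in tweets:
        match = FEELING_RE.search(tweet)
        if match:
            docs[match.group(1).lower()].append(tweet)
    return dict(docs)


# Invented sample tweets, for illustration only.
sample = [
    "Hoy me siento cansada de todo",
    "Me siento feliz con mis amigos",
    "me siento cansado, necesito dormir",
    "Nada que decir",
]
feeling_docs = group_by_feeling(sample)
print(sorted(feeling_docs))
```

A pattern this simple also captures intensifiers (in "me siento muy cansado" it would return "muy"), which is one reason a later filtering step is needed.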
12
Using the crowd to collect affective terms
-- The MeSientoX corpus (2/2)
A unified form is the merging of the two forms derived
from gender variants in Spanish, so tweets with the
expression ‘‘Me siento cansada’’ or ‘‘Me siento cansado’’
would be under the same feeling cansado. We also
discarded those unified forms that could be considered
non-sentiment words, such as words in a non-Spanish
language (alone, crazy). This is the only step involving
human intervention, though the effort is minimal (the
emotions extracted were labeled in less than ten
minutes).
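One possible way to build unified forms is sketched below. It assumes a naive heuristic that is not taken from the paper: a form ending in "a" is merged into its "-o" variant only when both forms were actually observed. The counts and the `NON_SENTIMENT` set (standing in for the manual filtering step) are invented for the example.

```python
from collections import Counter


def unify_gender_forms(counts):
    """Merge a form ending in 'a' into its '-o' variant when both were observed."""
    unified = Counter()
    for form, n in counts.items():
        masculine = form[:-1] + "o"
        if form.endswith("a") and masculine in counts:
            unified[masculine] += n
        else:
            unified[form] += n
    return unified


# Invented frequencies, for illustration only.
raw = Counter({"cansado": 5, "cansada": 7, "feliz": 4, "sola": 2, "alone": 1})

# Stands in for the manual step that discards non-sentiment words.
NON_SENTIMENT = {"alone"}

unified = unify_gender_forms(raw)
kept = {form: n for form, n in unified.items() if form not in NON_SENTIMENT}
print(kept)
```

Requiring both variants to be present avoids rewriting words that merely end in "a", at the cost of leaving a feminine form unmerged when its masculine counterpart never appears in the data.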
Out of a total of 344 different unified forms, the 201
most frequent sentiment words were selected, of which
84 were considered as positive and 117 as negative.
13
Crowd explicit sentiment analysis (1/4)
Explicit Semantic Analysis proposes the use of a
collection of documents to form the indexes of new
documents.
In our case, the WeFeelFine data is taken as the base
for generating English feelings documents, whereas
tweets extracted from MeSientoX are used to generate
Spanish feelings documents. Each feeling X is
represented as a compilation of the tweets retrieved
containing the expression Me siento X, assuming that the
term X refers to a feeling. Thus, instead of projecting a
document onto a space of articles, it is projected onto a
space of feelings collected automatically from social
media posts. The values in the vector are cosine
distances obtained by means of Latent Semantic
Analysis.
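The projection onto a space of feelings can be illustrated with a small sketch. It is a simplification under stated assumptions: plain term counts stand in for the Latent Semantic Analysis step, the score computed is cosine similarity (higher means closer), and the function names and sample posts are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def project_onto_feelings(text, feeling_docs):
    """Represent a text as one similarity value per feeling document."""
    doc = bag_of_words(text)
    return {
        feeling: cosine(doc, bag_of_words(" ".join(posts)))
        for feeling, posts in feeling_docs.items()
    }


# Invented feeling documents: each feeling is the compilation of its posts.
feeling_docs = {
    "happy": ["i feel happy about this sunny day", "i feel happy with my friends"],
    "tired": ["i feel tired after work", "i feel tired and sleepy today"],
}
vector = project_onto_feelings("a sunny day with friends", feeling_docs)
print(vector)
```

The resulting dictionary plays the role of the vector of feelings: one entry per feeling, whatever the vocabulary of the input text.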
14
Crowd explicit sentiment analysis (2/4)
15
Crowd explicit sentiment analysis (3/4)
Therefore, a document is preprocessed to obtain its
vector and then multiplied by a low-rank approximation
(by Singular Value Decomposition) of the feeling-to-term
matrix created from the corpora generated from microblog posts.
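In matrix terms, this step can be written as below. The symbols are introduced here for illustration and do not appear on the slides: M is assumed to be the feeling-to-term matrix (one row per feeling, one column per term), d the term vector of the document, k the number of singular values kept, and f the resulting vector of feelings.

```latex
M = U \Sigma V^{\top}, \qquad
M_k = U_k \Sigma_k V_k^{\top}, \qquad
f = M_k \, d
```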
16
Crowd explicit sentiment analysis (4/4)
This second way of computing the final polarity value
only takes into consideration the order of the feelings, not
their actual distance to the target document. As
feelings ‘‘emerge’’ from the collected data, and due to the
randomness of texts captured, the cosine distance may
not add relevant information to the model.
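The contrast between a distance-based score and an order-based score can be sketched as follows. The slides do not reproduce either formula, so both functions are illustrative assumptions rather than the authors' definitions; in particular the 1/rank weighting and the sample values are invented.

```python
def weighted_polarity(similarities, polarity):
    """Sum feeling polarities (+1 or -1) weighted by similarity to the document."""
    return sum(polarity[f] * s for f, s in similarities.items())


def rank_polarity(similarities, polarity):
    """Use only the ordering: the feeling at rank r contributes polarity / r."""
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return sum(polarity[f] / rank for rank, f in enumerate(ranked, start=1))


# Invented polarity labels and similarity values, for illustration only.
polarity = {"happy": 1, "proud": 1, "tired": -1}
similarities = {"happy": 0.40, "proud": 0.05, "tired": 0.35}
same_order = {"happy": 0.90, "proud": 0.10, "tired": 0.20}

print(weighted_polarity(similarities, polarity))
print(rank_polarity(similarities, polarity))
print(rank_polarity(same_order, polarity))
```

Changing the similarity values without changing their order leaves the rank-based score untouched, which is the property described above: only the order of the feelings matters, not their distance to the document.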
17
Crowd explicit sentiment analysis
-- Integrating SenticNet 3
The latest available version (3-beta) has been enhanced by
considering further knowledge sources. Although even
fewer concepts are considered in SenticNet 3 compared
to SenticNet 2, the inclusion of common and
common-sense knowledge has resulted in a more coherent net of
emotional concepts. We found that it could be the
straightforward solution for labeling crowd-based
emotional concepts (feelings).
18
Experiments and results (1/4)
For the English experiments, the Emoticon data set from
Stanford University was selected. In order to enable the
comparison of results with other approaches, only the
test set is considered. It contains 177 negative tweets
and 182 positive tweets, manually labeled. For
experiments with Spanish, we selected the SFU Review
corpus. It is composed of 400 reviews divided into eight
categories: cars, hotels, washing machines, books, cell
phones, music, computers, and movies. Each category
contains 50 positive and 50 negative reviews, defined as
positive or negative based on the number of stars given
by the reviewer (1–2 = negative; 4–5 = positive; 3-star
reviews are not included). These reviews were collected
from the Ciao web site.
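The star-based labeling rule for the reviews can be written as a short sketch. The function name `star_label` and the sample review identifiers are hypothetical; only the mapping itself (1 and 2 stars negative, 4 and 5 positive, 3 excluded) comes from the description above.

```python
def star_label(stars):
    """Map a star rating to a polarity label; 3-star reviews get no label."""
    if stars in (1, 2):
        return "negative"
    if stars in (4, 5):
        return "positive"
    return None  # 3-star reviews are not included in the corpus


# Invented (review id, stars) pairs, for illustration only.
reviews = [("r1", 5), ("r2", 3), ("r3", 1), ("r4", 4)]
labeled = [
    (review_id, star_label(stars))
    for review_id, stars in reviews
    if star_label(stars) is not None
]
print(labeled)
```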
19
Experiments and results (2/4)
20
Experiments and results (3/4)
This finding makes us think of the complexity of
sentiment representation, so documents are better
represented by several emotional states instead of pure
polarity classes. This is in agreement with a treatment of
the sentiment analysis problem at a concept-level.
One reason for such a behavior may be the big
difference in the quality of the texts between the two data
sets. WeFeelFine texts show good grammar and very few
misspellings or jargon terms, and their sentences are longer
and richer in expressiveness, whilst the compilation of
tweets in the MeSientoX corpus is far from well-written
Spanish. Also, the cosine distance may not reflect
the real contrast between different feelings in the latter
case.
21
Experiments and results (4/4)
Nevertheless, the accuracy obtained is high, taking into
consideration that no normalization is applied to the text
obtained from WeFeelFine or to the test data.
Thus, our approach shows that good performance can
be obtained with this straightforward solution, based
purely on capturing emotional expressions from blogs
and other channels of social communication.
In this case, the results of our approach outperform the
lexical based solution proposed by these authors.
22
Conclusions and further work (1/3)
Crowd Explicit Sentiment Analysis has been introduced
as a stream-based approach for polarity classification. Its
simple design allows for the construction of polarity
classifiers in different languages and domains without the
need for complex linguistic resources or architectures.
Nevertheless, further research has to be performed as
many issues could be explored in order to improve the
proposed method. For example, we have found that
large quantities of texts without relevant content are
captured by the expressions used. Thus, a selection of
terms and posts has to be done.
23
Conclusions and further work (2/3)
Thus, the difference in the accuracy value may be due to
the length of the documents or related to language
issues. This needs further analysis and study by
exploring more comparable corpora.
Therefore, a more accurate capturing technique could be
useful, although the method is intended to avoid
overly sophisticated solutions, as its strength lies in its
simplicity.
24
Conclusions and further work (3/3)
Nevertheless, we plan to use these methods while
constructing the models from the vectors of feelings that
CESA generates. Social-based representation of
emotions has been found to be a valid solution to model
affective communication. It could outperform the two
approaches for final polarity calculation, as we expect to
confirm with future experimentation.
In any case, the solutions that could be constructed
based on the idea of ‘‘let the crowd express itself’’ are
very suitable for big data environments. We believe that
on-line learning algorithms and evolving training data
could be the key to modeling the knowledge that
emerges from the vast amount of texts published every
second, everywhere.
25