Analysis of Opinion Mining Techniques

Aayush Shah, Siddharth Shah, Lynette D’Mello
Department of Computer Engg
Dwarkadas J. Sanghvi COE
Mumbai, India
shahaayush24@gmail.com, shahsid104@gmail.com, lynette.dmello@djsce.ac.in
ABSTRACT
The amount of data on the internet is increasing day by day, which has made it difficult to view the data in a structured manner. This has led to the development of new and effective mining techniques for structuring the data. In this paper we analyze two different opinion mining techniques, namely the Naïve Bayes classifier and Peter Turney's "Thumbs Up or Thumbs Down?" method. As per the evaluation on the selected set of movie reviews, the Naïve Bayes classifier performed better than the Thumbs Up or Thumbs Down? method.
Keywords
Opinion Mining; Sentiment Analysis; Naïve Bayes; Peter Turney; Thumbs Up or Thumbs Down?; Laplacian Smoothing
INTRODUCTION
In today’s world, whenever a customer plans on buying a new product, the first thing one does is surf the internet and go through the many blogs and sites where one can read reviews about the product. Based on the reviews one reads, one decides whether or not to buy that product. However, with the growth of the internet, a vast number of reviews have become available for every product. As a result, it becomes difficult for the customer to read all the available reviews and decide on the product. Thus, it becomes very important to mine this large body of data and classify it. This is where opinion mining finds its application.
Opinion mining, also known as sentiment analysis, involves the use of natural language processing (NLP) to determine the view of the writer about a particular topic. It has various applications, such as determining the popularity of a product, determining the success of an ad campaign, and analyzing public opinion about company strategies and political campaigns [6].
The main task in sentiment analysis is evaluating the contextual polarity of a text at the document, sentence, or feature/aspect level. The various techniques of sentiment analysis are categorized as follows: keyword spotting, lexical affinity, statistical methods, and concept-level techniques [7]. In this paper two different sentiment analysis techniques are reviewed, namely:
1. Thumbs Up or Thumbs Down?
2. Naïve Bayes Classifier.
Results of the two methods are compared using a data set of movie
reviews.
LITERATURE REVIEW
The area of opinion mining and sentiment analysis has garnered massive interest recently, and since 2001 many research problems and opportunities in this field have come forward. An extensive survey of more than 300 papers was performed by Pang and Lee [3], covering the common challenges, applications and major steps of opinion mining. The different possible tasks and the works published on opinion mining were presented by Liu [8]; some of the tasks covered are subjectivity and sentiment classification, sentiment lexicon generation, opinion summarization, etc. More recently, Medhat et al. [8] presented a brief description of feature selection methods and a detailed discussion of sentiment classification techniques.
A method by Pang and Lee [8] involved applying cut-based classification for subjectivity classification, using the physical proximity between the items to be classified. A slightly better accuracy than that of Pang and Lee was achieved by Xuan et al. [8], who constructed a total of 22 syntactic patterns covering adjectives, adverbs, verbs and nouns to evaluate the subjectivity of the text.
Earlier work by Hatzivassiloglou and McKeown (1997) is similar to the method proposed by Peter Turney [1]. Their work is based on determining the semantic orientation of adjectives: a four-step supervised learning algorithm was used to evaluate the semantic orientation of adjectives from constraints on conjunctions.
Before proposing his method, Peter Turney carried out multiple studies related to key phrase extraction and the semantic orientation of keywords. He also compared Pointwise Mutual Information and Information Retrieval (PMI-IR) with Latent Semantic Analysis (LSA) and showed the former to have better accuracy, as a result of which he employed the PMI-IR algorithm in his proposed method [4].
The Naïve Bayes classifier is a popular method that is still widely used for text categorization. It is also being used in opinion mining as well as in automatic medical diagnosis [5].
IMPLEMENTATION
Thumbs Up or Thumbs Down? by Peter Turney
The method suggested by Peter Turney [1] is a simple unsupervised learning algorithm with which reviews can be classified as recommended or not recommended.
The algorithm is divided into three steps. In the first step, phrases containing adjectives or adverbs are extracted from the input text using a part-of-speech tagger. In the second step, the semantic orientation of each extracted phrase is calculated. In the third step, the average semantic orientation of the extracted phrases is calculated; the item discussed in the review is recommended if the average is positive, and not recommended otherwise.
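The three steps above can be sketched as a small driver. This is an illustrative sketch only: `extract_phrases` and `semantic_orientation` are hypothetical stand-ins for the POS-pattern extractor and the PMI-IR scorer described below, stubbed here with a tiny hand-made lexicon (the two scores are borrowed from Tables 2 and 3) purely for illustration.

```python
# Sketch of the three-step classifier; not the author's actual code.
# TOY_SO stands in for the PMI-IR semantic-orientation scores.
TOY_SO = {"excellent film": 1.0613, "worst movie": -0.8746}

def extract_phrases(text):
    # Stand-in for step 1: pretend the POS tagger found these phrases.
    return [p for p in TOY_SO if p in text.lower()]

def semantic_orientation(phrase):
    # Stand-in for step 2: look up a precomputed SO value.
    return TOY_SO[phrase]

def classify_review(text):
    # Step 3: average the SO values and threshold at zero.
    phrases = extract_phrases(text)
    if not phrases:
        return "not recommended"
    avg_so = sum(semantic_orientation(p) for p in phrases) / len(phrases)
    return "recommended" if avg_so > 0 else "not recommended"

print(classify_review("An excellent film!"))    # -> recommended
print(classify_review("The worst movie ever mam."))  # -> not recommended
```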
Classifying Reviews:
In the first step, phrases containing adjectives or adverbs are extracted instead of isolated adjectives. The reason is that an isolated adjective may not always indicate the context intended by the writer. For example, the adjective “low” in the phrase “low salary” may have a negative semantic orientation, but in the phrase “low cost” it may have a positive semantic orientation.
Hence, two consecutive words are extracted using the part-of-speech tagger if their tags match one of the following patterns [1]:
Table 1. Patterns of tags for extracting two-word phrases

First Word            Second Word              Third Word (Not Extracted)
JJ                    NN or NNS                Anything
RB, RBR, or RBS       JJ                       Not NN nor NNS
JJ                    JJ                       Not NN nor NNS
NN or NNS             JJ                       Not NN nor NNS
RB, RBR, or RBS       VB, VBD, VBN, or VBG     Anything
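The pattern matching of Table 1 can be sketched on pre-tagged text. This is a minimal illustration assuming Penn Treebank tags; a real implementation would first run a part-of-speech tagger (e.g. from NLTK) to produce the `(word, tag)` pairs.

```python
# Extract two-word phrases whose tags fit one of the five Table 1
# patterns. The third word's tag constrains the match but the word
# itself is not extracted.
ADJ = {"JJ"}
ADV = {"RB", "RBR", "RBS"}
NOUN = {"NN", "NNS"}
VERB = {"VB", "VBD", "VBN", "VBG"}

def matches(t1, t2, t3):
    """True if the tag triple fits a Table 1 pattern; t3 is None at
    the end of the text (which counts as 'not a noun')."""
    third_not_noun = t3 not in NOUN
    if t1 in ADJ and t2 in NOUN:
        return True                 # row 1: third word can be anything
    if t1 in ADV and t2 in ADJ:
        return third_not_noun       # row 2
    if t1 in ADJ and t2 in ADJ:
        return third_not_noun       # row 3
    if t1 in NOUN and t2 in ADJ:
        return third_not_noun       # row 4
    if t1 in ADV and t2 in VERB:
        return True                 # row 5: third word can be anything
    return False

def extract(tagged):
    """tagged: list of (word, Penn Treebank tag) pairs."""
    phrases = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        t3 = tagged[i + 2][1] if i + 2 < len(tagged) else None
        if matches(t1, t2, t3):
            phrases.append(f"{w1} {w2}")
    return phrases

print(extract([("this", "DT"), ("is", "VBZ"), ("an", "DT"),
               ("excellent", "JJ"), ("film", "NN")]))  # -> ['excellent film']
```

Note how rows 2 to 4 reject a following noun: in “so lame movie”, the adjective belongs with the noun, so “so lame” should not be extracted on its own.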
Where,
JJ = adjectives,
NN = nouns,
RB = adverbs,
VB = verbs,
NNP and NNPS indicate singular and plural proper nouns.

In the second step, the semantic orientation of each extracted phrase is evaluated using the Pointwise Mutual Information - Information Retrieval (PMI-IR) algorithm. The PMI between two words is given by the following formula:

PMI(word1, word2) = log2 [ p(word1 & word2) / ( p(word1) p(word2) ) ]   (1)

The statistical dependence between the words can be measured using the above ratio. The PMI is calculated by submitting queries to a search engine and recording the number of hits. The NEAR operator returns only those documents in which the two words are separated by at most ten words from each other.

In order to calculate the semantic orientation (SO), the words “excellent” and “poor” are used as two reference words. This is because, when a review is evaluated on a scale of 1 to 5, 1 is generally equivalent to “poor” and 5 is equivalent to “excellent”. If the phrase is strongly associated with “excellent” the SO value is positive, and if it is strongly associated with “poor” the SO value is negative:

SO(phrase) = PMI(phrase, “excellent”) - PMI(phrase, “poor”)   (2)

By making some minor algebraic changes to equations (1) and (2), the following formula for SO is obtained:

SO(phrase) = log2 [ hits(phrase NEAR “excellent”) hits(“poor”) / ( hits(phrase NEAR “poor”) hits(“excellent”) ) ]   (3)

Also, 0.01 is added to the hits in order to prevent division by zero, and phrases for which both hits(phrase NEAR “excellent”) and hits(phrase NEAR “poor”) are less than four are eliminated.

In the third and final step, the review is categorized as recommended if the calculated average SO value is positive, or else as not recommended.

Naïve Bayes Classifier with Laplacian Smoothing
The Naive Bayes classifier is a simple probabilistic classifier based on Bayes’ theorem with strong (naïve) independence assumptions. It is one of the simplest text classification methods and has several applications in document categorization, email spam detection, language detection, personal email sorting and sentiment detection. Several methods, such as random forests and Support Vector Machines, have shown better performance than Naïve Bayes; however, Naïve Bayes is much more efficient because of its lower computational requirements (in both memory and CPU) and its smaller training data set.

In this method, the words are categorized into different classes, which in accordance with our dataset are the positive and negative classes. The classifier assumes that the words in a given class are conditionally independent of each other; because of this very simple assumption, the classifier is termed “naïve”. However, the accuracy of the text classifier is not affected much by this assumption [2].

In order to determine whether a document belongs to a particular class, the probability of each word belonging to that class must first be evaluated, using the following formula:

P(xi | c) = (Count of xi in documents of class c) / (Total no. of words in documents of class c)   (4)

where xi is an individual word of the document and c is a particular class (positive or negative).

The probability of a particular document belonging to a class ci is calculated using Bayes’ theorem, which is given by:

P(A | B) = P(B | A) * P(A) / P(B)   (5)

Hence, using the above formula, the probability of a particular document d belonging to a class ci is given as follows:

P(ci | d) = P(d | ci) * P(ci) / P(d)   (6)

Using equations (4) and (6), the following equation is obtained:

P(ci | d) = ( Π P(xi | ci) ) * P(ci) / P(d)   (7)

Thus, in order to find the class in which the document should be classified, the product of the probabilities of each word of the document given a particular class (the likelihood) is multiplied by the probability of that class; after calculating this for all the classes, the class with the highest probability is selected.

As duplicate words do not provide any additional information, they are eliminated from the document. This type of Naïve Bayes algorithm is called Bernoulli Naïve Bayes.
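A minimal sketch of such a classifier is given below, following equations (4) to (7) with Bernoulli-style unique words and add-k smoothing in the spirit of equation (8). The two-document training set is a toy example, not the paper's dataset.

```python
from math import log

def train(docs):
    """docs: list of (set_of_words, label). Returns the count tables
    needed for equations (4)-(7)."""
    word_counts, class_counts, totals = {}, {}, {}
    for words, label in docs:
        class_counts[label] = class_counts.get(label, 0) + 1
        for w in set(words):  # Bernoulli: each word counted once per doc
            word_counts[(w, label)] = word_counts.get((w, label), 0) + 1
            totals[label] = totals.get(label, 0) + 1
    return word_counts, class_counts, totals

def classify(words, model, k=1):
    """Pick the class maximizing log P(c) + sum_i log P(x_i | c);
    P(d) is a common factor and can be ignored."""
    word_counts, class_counts, totals = model
    n_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c in class_counts:
        score = log(class_counts[c] / n_docs)  # log prior
        for w in set(words):
            # smoothed word likelihood, as in equation (8)
            p = (word_counts.get((w, c), 0) + k) / ((k + 1) * totals[c])
            score += log(p)
        if score > best_score:
            best, best_score = c, score
    return best

model = train([({"excellent", "film"}, "positive"),
               ({"worst", "movie"}, "negative")])
print(classify({"excellent", "film"}, model))  # -> positive
```

Working in log space avoids the underflow that the raw product of many small probabilities would cause on longer documents.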
As the total word count for each class is computed differently in Bernoulli Naïve Bayes, each document is reduced to a set of unique words with no duplicates.

Laplacian Smoothing
When a word appears in a test document but was never seen in the training set, the probability of both the classes comes out to be zero, leaving nothing to compare. Laplacian smoothing is implemented to solve this problem [2]:

P(xi | cj) = ( Count(xi) + k ) / ( (k + 1) * (No. of words in class cj) )   (8)

Generally, k is chosen as 1, resulting in an equal probability for either class to contain the new word.

RESULTS AND DISCUSSIONS
A data set of 40 movie reviews from IMDB was taken. 20 of these reviews were used for training the Naive Bayes classifier and the remaining 20 were used for testing both the techniques. The training and the testing sets both have an equal number of positive and negative reviews.

Thumbs Up or Thumbs Down? by Peter Turney
In order to determine the hits for each phrase as per equation (3), the Yahoo search engine was used.

Table 2. An example of the evaluation of a movie review that is classified as not recommended

Extracted Phrase      Part-Of-Speech Tags    Semantic Orientation
Old boy               JJ NN                  -0.1023
All-time favorite     JJ NN                  -0.276
Worst movie           JJ NN                  -0.8746
Proper horror         JJ NN                  0.4164
Light part            JJ NN                  -0.2501
So lame               RB JJ                  -0.2767
Not know              RB VB                  0.096
Younger brother       JJ NN                  -0.276
Short films           JJ NN                  0.1705
Not waste             RB VB                  -5.713
Not worth             RB JJ                  -0.3450
Average Semantic Orientation                 -7.4308

Since the average SO value of the above review is negative, the movie review is correctly classified as not recommended.

Table 3. An example of a movie review that is classified as recommended

Extracted Phrase      Part-Of-Speech Tags    Semantic Orientation
Excellent day         JJ NN                  2.1825
Better aesthetics     JJ NN                  2.0915
Actual persona        JJ NN                  1.4428
Excellent film        JJ NN                  1.0613
Average Semantic Orientation                 6.7781

Since the average SO value of the above review is positive, the movie review is correctly classified as recommended.

On processing the 20 reviews of the data set as per the above procedure, 12 of the reviews were classified correctly while the remaining 8 were classified incorrectly. Hence, the overall accuracy was found to be 60%.

Naive Bayes Classifier with Laplacian Smoothing
On evaluating the movie review used in Table 2 with the Naïve Bayes classifier, the following results were obtained:

Table 4. Class probabilities for the review of Table 2

Recommended    Not Recommended
3.66*10^-4     2.2*10^-2

Since the probability of the not recommended class is greater, the movie review is correctly classified as not recommended.

On evaluating the movie review used in Table 3 with the Naïve Bayes classifier, the following results were obtained:

Table 5. Class probabilities for the review of Table 3

Recommended    Not Recommended
6.59*10^-3     2.74*10^-4

Since the probability of the recommended class is greater, the movie review is correctly classified as recommended.

On processing the 20 reviews of the data set as per the above procedure, 14 of the reviews were classified correctly while the remaining 6 were classified incorrectly. Hence, the overall accuracy was found to be 70%.
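As a quick numerical check, the final "Average Semantic Orientation" rows of Tables 2 and 3 equal the totals of the listed SO values; only the sign of this figure drives the recommended/not recommended decision.

```python
# SO values listed in Tables 2 and 3 of the paper.
table2 = [-0.1023, -0.276, -0.8746, 0.4164, -0.2501, -0.2767,
          0.096, -0.276, 0.1705, -5.713, -0.3450]
table3 = [2.1825, 2.0915, 1.4428, 1.0613]

print(round(sum(table2), 4))  # -7.4308 -> not recommended
print(round(sum(table3), 4))  # 6.7781  -> recommended
```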
CONCLUSION
Two different techniques of opinion mining have been compared, and as per the evaluation on the selected set of movie reviews, the Naïve Bayes classifier performed better than the method proposed by Peter Turney. Opinion mining is a rapidly developing field with many new and improved methods being devised, and it has tremendous scope in analyzing the vast data contained in micro-blogging sites. More and more opinion mining methods based on machine learning are being developed, and these methods may provide much higher accuracy than the methods currently in use.
REFERENCES
[1] Turney, P.D. (2002). Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, Pennsylvania, pp. 417-424.
[2] Narayanan, V., Arora, I., and Bhatia, A. (2013). Fast and accurate sentiment classification using an enhanced Naive Bayes model. Intelligent Data Engineering and Automated Learning - IDEAL 2013, Springer Berlin Heidelberg, pp. 194-201.
[3] Pang, B. and Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, vol. 2, no. 1-2, pp. 1-135.
[4] Turney, P.D. (2001). Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001), Freiburg, Germany, pp. 491-502.
[5] Naïve Bayes classifier [Online]. Available: https://en.wikipedia.org/wiki/Naive_Bayes_classifier
[6] Opinion mining [Online]. Available: http://searchbusinessanalytics.techtarget.com/definition/opinion-miningsentiment-mining
[7] Sentiment analysis [Online]. Available: https://en.wikipedia.org/wiki/Sentiment_analysis
[8] Ravi, K. and Ravi, V. (2015). A survey on opinion mining and sentiment analysis: Tasks, approaches and applications. Knowledge-Based Systems.