Analysis of Opinion Mining Techniques

Aayush Shah, Siddharth Shah, Lynette D'Mello
Department of Computer Engg, Dwarkadas J. Sanghvi COE, Mumbai, India
shahaayush24@gmail.com, shahsid104@gmail.com, lynette.dmello@djsce.ac.in

ABSTRACT
The volume of data on the internet grows day by day, making it difficult to view the data in a structured manner. This has driven the development of new and effective mining techniques for structuring the data. In this paper we analyze two opinion mining techniques, namely the Naïve Bayes classifier and Peter Turney's "Thumbs up or Thumbs down?" method. On the selected set of movie reviews, the Naïve Bayes classifier performed better than "Thumbs up or Thumbs down?".

Keywords
Opinion Mining; Sentiment Analysis; Naïve Bayes; Peter Turney; Thumbs up or Thumbs Down?; Laplacian Smoothing

INTRODUCTION
In today's world, whenever customers plan to buy a new product, the first thing they do is surf the internet and read reviews of the product on multiple blogs and sites. Based on those reviews, they decide whether or not to buy the product. However, with the growth of the internet, the number of reviews available for each product has become enormous, and it is difficult for a customer to read all of them before deciding. It therefore becomes very important to mine this large body of data and classify it. This is where opinion mining finds its application. Opinion mining, also known as sentiment analysis, uses natural language processing (NLP) to determine the writer's view of a particular topic. It has various applications, such as measuring the popularity of a product, gauging the success of an ad campaign, and analyzing public opinion about company strategies and political campaigns [6].
The main task in sentiment analysis is evaluating the contextual polarity of text at the document, sentence, or feature/aspect level. Techniques for sentiment analysis are commonly categorized as keyword spotting, lexical affinity, statistical methods, and concept-level techniques [7]. In this paper two sentiment analysis techniques are reviewed, namely: 1. Thumbs up or Thumbs down? 2. Naïve Bayes classifier. The results of the two methods are compared on a data set of movie reviews.

LITERATURE REVIEW
The area of opinion mining and sentiment analysis has attracted massive interest, and since 2001 many research problems and opportunities in this field have come forward. Pang and Lee [8] performed an extensive survey of more than 300 papers covering the common challenges, applications, and major steps of opinion mining. Liu [8] presented the different tasks possible and the work published in opinion mining, covering subjectivity and sentiment classification, sentiment lexicon generation, opinion summarization, and related tasks. Most recently, Medhat et al. [8] presented a brief description of feature selection methods and a detailed discussion of sentiment classification techniques. A method by Pang and Lee [8] applied cut-based classification to subjectivity classification, using the physical proximity between the items to be classified. Xuan et al. [8] achieved slightly better accuracy than Pang and Lee by constructing a total of 22 syntactic patterns covering adjectives, adverbs, verbs, and nouns to evaluate the subjectivity of text. Earlier work by Hatzivassiloglou and McKeown (1997) is similar to the method proposed by Peter Turney [1]: it determines the semantic orientation of adjectives using a four-step supervised learning algorithm that evaluates orientation from constraints on conjunctions.
Before proposing his method, Peter Turney carried out multiple studies on key phrase extraction and the semantic orientation of keywords. He also compared Pointwise Mutual Information with Information Retrieval (PMI-IR) against Latent Semantic Analysis (LSA) and showed the former to have better accuracy, which is why he employed the PMI-IR algorithm in his proposed method [4]. The Naïve Bayes classifier is a popular method that is still widely used for text categorization. It is currently used in opinion mining as well as in automatic medical diagnosis [5].

IMPLEMENTATION
Thumbs Up or Thumbs Down? by Peter Turney
The method suggested by Peter Turney [1] is a simple unsupervised learning algorithm with which reviews can be classified as recommended or not recommended. The algorithm is divided into three steps. In the first step, phrases containing adjectives or adverbs are extracted from the input text using a part-of-speech tagger. In the second step, the semantic orientation of each extracted phrase is calculated. The third step calculates the average semantic orientation of the extracted phrases; the item discussed in the review is recommended if the average is positive, and not recommended otherwise.

Classifying Reviews: In the first step, phrases containing adjectives or adverbs are extracted rather than isolated adjectives, because an isolated adjective may not indicate the context intended by the writer. For example, the adjective "low" in the phrase "low salary" may have a negative semantic orientation, but in the phrase "low cost" it may have a positive one. Hence, two consecutive words are extracted by the part-of-speech tagger if their tags match one of the patterns in Table 1 [1].
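The extraction step above can be sketched in a few lines of Python. This is a minimal illustration, not Turney's original implementation: the function names (`matches_pattern`, `extract_phrases`) are my own, and the tagged input is assumed to come from any Penn Treebank-style part-of-speech tagger (e.g. NLTK's `pos_tag`); here a pre-tagged sentence is supplied directly so the sketch is self-contained.

```python
# Sketch of step 1: extract two-word phrases whose POS tags match the
# patterns of Table 1. Tag groups follow the Penn Treebank tagset.
ADJ = {"JJ"}
ADV = {"RB", "RBR", "RBS"}
NOUN = {"NN", "NNS"}
VERB = {"VB", "VBD", "VBN", "VBG"}

def matches_pattern(t1, t2, t3):
    """True if the tag triple fits a row of Table 1.
    t3 is the tag of the (not extracted) third word, or None at end of text."""
    third_not_noun = t3 not in NOUN              # None also counts as "not a noun"
    if t1 in ADJ and t2 in NOUN:                 # JJ + NN/NNS, third word: anything
        return True
    if t1 in ADV and t2 in ADJ and third_not_noun:   # RB* + JJ, third not a noun
        return True
    if t1 in ADJ and t2 in ADJ and third_not_noun:   # JJ + JJ, third not a noun
        return True
    if t1 in NOUN and t2 in ADJ and third_not_noun:  # NN/NNS + JJ, third not a noun
        return True
    if t1 in ADV and t2 in VERB:                 # RB* + VB*, third word: anything
        return True
    return False

def extract_phrases(tagged):
    """tagged: list of (word, tag) pairs for one review."""
    phrases = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        t3 = tagged[i + 2][1] if i + 2 < len(tagged) else None
        if matches_pattern(t1, t2, t3):
            phrases.append(f"{w1} {w2}")
    return phrases

tagged_review = [("this", "DT"), ("is", "VBZ"), ("a", "DT"),
                 ("truly", "RB"), ("excellent", "JJ"), ("film", "NN")]
print(extract_phrases(tagged_review))  # -> ['excellent film']
```

Note that "truly excellent" is not extracted here: its RB + JJ pattern requires the third word not to be a noun, and "film" is tagged NN, exactly as Table 1 prescribes.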
Table 1. Patterns of tags for extracting two-word phrases

  First Word         Second Word             Third Word (Not Extracted)
  JJ                 NN or NNS               anything
  RB, RBR, or RBS    JJ                      not NN nor NNS
  JJ                 JJ                      not NN nor NNS
  NN or NNS          JJ                      not NN nor NNS
  RB, RBR, or RBS    VB, VBD, VBN, or VBG    anything

Here JJ denotes adjectives; NN and NNS denote singular and plural nouns; RB, RBR, and RBS denote adverbs; VB, VBD, VBN, and VBG denote verbs; NNP and NNPS denote singular and plural proper nouns.

In the second step, the semantic orientation of each extracted phrase is evaluated using the Pointwise Mutual Information-Information Retrieval (PMI-IR) algorithm. The PMI between two words is given by:

PMI(word1, word2) = log2[ p(word1 & word2) / (p(word1) * p(word2)) ]    (1)

This ratio measures the statistical dependence between the two words. To calculate the semantic orientation (SO), the words "excellent" and "poor" are used as reference words, because reviews rated on a scale of 1 to 5 typically equate 1 with "poor" and 5 with "excellent". If a phrase is strongly associated with "excellent" its SO value is positive, and if it is strongly associated with "poor" its SO value is negative:

SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor")    (2)

The PMI values are estimated by submitting queries to a search engine and recording the number of hits. The NEAR operator returns only those documents in which the two words are separated by at most ten words. With some minor algebraic manipulation of equations (1) and (2), the SO can be computed directly from hit counts:

SO(phrase) = log2[ (hits(phrase NEAR "excellent") * hits("poor")) / (hits(phrase NEAR "poor") * hits("excellent")) ]    (3)

A value of 0.01 is added to the hit counts to prevent division by zero, and phrases for which both hits(phrase NEAR "excellent") and hits(phrase NEAR "poor") are less than four are eliminated.

In the third and final step, the review is classified as recommended if the calculated average SO value is positive, and as not recommended otherwise.

Naïve Bayes Classifier with Laplacian Smoothing
The Naïve Bayes classifier is a simple probabilistic classifier based on Bayes' theorem with strong and naïve independence assumptions; because it assumes that the words within a class are conditionally independent of each other, the classifier is termed "naïve". It is one of the simplest text classification methods and has applications in document categorization, email spam detection, language detection, personal email sorting, and sentiment detection. Several methods, such as random forests and Support Vector Machines, have shown better performance than Naïve Bayes; however, Naïve Bayes is much more efficient because of its lower computational requirements (in both memory and CPU) and its smaller training data set.

In this method, words are categorized into different classes, which for our data set are the positive and negative classes. The independence assumption does not greatly affect the accuracy of text classification [2]. To determine whether a document belongs to a particular class, the probability of each word belonging to that class is first computed:

P(xi | c) = (count of xi in documents of class c) / (total number of words in documents of class c)    (4)

where xi is an individual word of the document and c is a particular class (positive or negative). The probability of a particular document belonging to a class ci is then calculated using Bayes' theorem:

P(A | B) = P(B | A) * P(A) / P(B)    (5)

Hence, the probability of a particular document d belonging to a class ci is given by:

P(ci | d) = P(d | ci) * P(ci) / P(d)    (6)
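Steps 2 and 3 of Turney's method can be sketched as follows. This is an illustrative sketch only: the hit counts and the two reference totals are made-up numbers (in Turney's experiments they came from real "phrase NEAR excellent" / "phrase NEAR poor" search-engine queries), and the function names are my own.

```python
from math import log2

# Illustrative totals standing in for hits("excellent") and hits("poor").
HITS_EXCELLENT = 10_000.0
HITS_POOR = 8_000.0

def semantic_orientation(hits_near_excellent, hits_near_poor):
    """Equation (3); 0.01 is added to the NEAR counts to avoid division by zero."""
    return log2(((hits_near_excellent + 0.01) * HITS_POOR) /
                ((hits_near_poor + 0.01) * HITS_EXCELLENT))

def classify_review(phrase_hits):
    """phrase_hits: one (hits NEAR 'excellent', hits NEAR 'poor') pair per phrase.
    Phrases where both counts are below four are discarded (step 2); the
    review is recommended if the average SO is positive (step 3)."""
    sos = [semantic_orientation(e, p)
           for e, p in phrase_hits if e >= 4 or p >= 4]
    avg = sum(sos) / len(sos)
    return ("recommended" if avg > 0 else "not recommended"), avg

# Two phrases strongly co-occurring with "excellent"; the third phrase is
# discarded because both of its counts are below four.
print(classify_review([(120, 15), (300, 40), (2, 1)]))
```

Running this classifies the hypothetical review as recommended, since both surviving phrases have positive SO under these made-up counts.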
Using equations (4) and (6), the following equation is obtained:

P(ci | d) = (∏ P(xi | ci)) * P(ci) / P(d)    (7)

Thus, to find the class to which a document should be assigned, the product of the probabilities of each word of the document given a particular class (the likelihood) is multiplied by the probability of that class; after calculating this for all classes, the class with the highest probability is selected. As duplicate words do not provide any additional information, they are eliminated from the document. This variant of the Naïve Bayes algorithm is called Bernoulli Naïve Bayes.

Laplacian Smoothing
When a word that did not appear in the training set is encountered, the probability for both classes comes out to be zero, leaving nothing to compare. Laplacian smoothing is implemented to solve this problem [2]:

P(xi | cj) = (Count(xi) + k) / ((k + 1) * (number of words in class cj))    (8)

Generally, k is chosen as 1, giving either class an equal probability of containing the new word. As the total word count for each class is computed differently in Bernoulli Naïve Bayes, each document is reduced to a set of unique words with no duplicates.

RESULTS AND DISCUSSIONS
A data set of 40 movie reviews from IMDB was taken. Twenty of these reviews were used for training the Naïve Bayes classifier, and the remaining twenty were used for testing both techniques. The training and testing sets each contain an equal number of positive and negative reviews.

Thumbs up or Thumbs down? by Peter Turney
The Yahoo search engine was used to determine the hits for each phrase as per equation (3).

Table 2. An example of the evaluation of a movie review that is classified as not recommended

  Extracted Phrase               Part-Of-Speech Tags    Semantic Orientation
  Old boy                        JJ NN                  -0.1023
  All-time favorite              JJ NN                  -0.276
  Worst movie                    JJ NN                  -0.8746
  Proper horror                  JJ NN                  0.4164
  Light part                     JJ NN                  -0.2501
  So lame                        RB JJ                  -0.2767
  Not know                       RB VB                  0.096
  Younger brother                JJ NN                  -0.276
  Short films                    JJ NN                  0.1705
  Not waste                      RB VB                  -5.713
  Not worth                      RB JJ                  -0.3450
  Average Semantic Orientation                          -7.4308

Since the average SO value of this review is negative, the movie review is correctly classified as not recommended.

Table 3. An example of a movie review that is classified as recommended

  Extracted Phrase               Part-Of-Speech Tags    Semantic Orientation
  Excellent day                  JJ NN                  2.1825
  Better aesthetics              JJ NN                  2.0915
  Actual persona                 JJ NN                  1.4428
  Excellent film                 JJ NN                  1.0613
  Average Semantic Orientation                          6.7781

Since the average SO value of this review is positive, the movie review is correctly classified as recommended.

On processing the 20 test reviews by the above procedure, 12 of the reviews were classified correctly while the remaining 8 were classified incorrectly. Hence, the overall accuracy was found to be 60%.

Naïve Bayes Classifier with Laplacian Smoothing
Evaluating the movie review used in Table 2 with the Naïve Bayes classifier gave the following results:

Table 4. Class probabilities for the review of Table 2

  Review     Recommended    Not Recommended
  Table 2    3.66*10^-4     2.2*10^-2

Since the probability of the not recommended class is greater, the movie review is correctly classified as not recommended.

Evaluating the movie review used in Table 3 with the Naïve Bayes classifier gave the following results:

Table 5. Class probabilities for the review of Table 3

  Review     Recommended    Not Recommended
  Table 3    6.59*10^-3     2.74*10^-4

Since the probability of the recommended class is greater, the movie review is correctly classified as recommended.

On processing the 20 test reviews by the above procedure, 14 of the reviews were classified correctly while the remaining 6 were classified incorrectly. Hence, the overall accuracy was found to be 70%.
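The Naïve Bayes classifier with Laplacian smoothing can be sketched as below. This is a minimal illustration under the paper's own conventions, not a reference implementation: each review is reduced to its set of unique words (the Bernoulli variant), the smoothed word probability follows equation (8) with k = 1 exactly as given above, and equation (7) scores each class. Log-probabilities replace the raw product to avoid numerical underflow; the tiny training set is made up for illustration.

```python
from collections import Counter
from math import log

def train(docs_by_class):
    """docs_by_class: {class_name: list of token lists (one per review)}.
    Returns per-class word counts, total word count, and class prior."""
    model = {}
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    for c, docs in docs_by_class.items():
        counts = Counter()
        for doc in docs:
            counts.update(set(doc))          # Bernoulli: unique words per review
        model[c] = (counts, sum(counts.values()), len(docs) / total_docs)
    return model

def classify(model, doc, k=1):
    """Pick the class maximizing equation (7), using equations (4) and (8)
    for the smoothed per-word probabilities (P(d) is constant, so dropped)."""
    def score(c):
        counts, total, prior = model[c]
        s = log(prior)
        for w in set(doc):                   # duplicates carry no information
            s += log((counts[w] + k) / ((k + 1) * total))   # equation (8)
        return s
    return max(model, key=score)

model = train({
    "positive": [["excellent", "film"], ["great", "excellent", "acting"]],
    "negative": [["worst", "movie"], ["not", "worth", "watching"]],
})
print(classify(model, ["an", "excellent", "movie"]))  # -> positive
```

The unseen word "an" contributes the same smoothed factor to both classes, so it no longer zeroes out either probability; the decision rests on the words the training set has actually seen.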
CONCLUSION
The two opinion mining techniques were compared, and on the selected set of movie reviews the Naïve Bayes classifier performed better than the method proposed by Peter Turney. Opinion mining is a rapidly developing field with many new and improved methods being devised, and it has tremendous scope for analyzing the vast data contained in micro-blogging sites. More and more opinion mining methods based on machine learning are being developed, and these promise much higher accuracy than the methods currently in use.

REFERENCES
[1] Turney, P.D. (2002). Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, Pennsylvania, pp. 417-424.
[2] Narayanan, V., Arora, I., and Bhatia, A. "Fast and accurate sentiment classification using an enhanced Naive Bayes model." Intelligent Data Engineering and Automated Learning - IDEAL 2013. Springer Berlin Heidelberg, 2013, pp. 194-201.
[3] Pang, B., and Lee, L. "Opinion mining and sentiment analysis." Foundations and Trends in Information Retrieval, vol. 2, no. 1-2, 2008.
[4] Turney, P.D. (2001). Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001), Freiburg, Germany, pp. 491-502.
[5] Naive Bayes classifier [Online]. Available: https://en.wikipedia.org/wiki/Naive_Bayes_classifier
[6] Opinion mining [Online]. Available: http://searchbusinessanalytics.techtarget.com/definition/opinion-miningsentiment-mining
[7] Sentiment analysis [Online]. Available: https://en.wikipedia.org/wiki/Sentiment_analysis
[8] Ravi, K., and Ravi, V. "A survey on opinion mining and sentiment analysis: Tasks, approaches and applications." Knowledge-Based Systems, 2015.