Sentiment analysis is the use of NLP, statistics, or machine learning methods to extract, identify, or otherwise characterize the sentiment content of a text unit. The rise of social media has produced an avalanche of data, and careful analysis of this data yields valuable information. Sentiment analysis is one such type of analysis. Several methods exist for performing it, and their success rates differ depending on the data source and methodology.
This paper provides a comparative study of three methods: proximity based, topic detection, and knowledge enhancer.
The proximity-based approach considers a set of features based on word proximities in a written text. It relies on three proximity-based features, namely proximity distribution, mutual information between proximity types, and proximity patterns. The topic detection method focuses on techniques that detect the topics that are highly correlated with the positive and negative opinions present in the text. In the third method, a domain-specific seed-based enrichment technique facilitates the extraction of keywords, entities, synonyms, and parts of speech from tweets, which are then used for tweet classification and sentiment analysis. The knowledge enhancer and synonym binder modules are then applied to the extracted information.
Sentiment analysis, suicide prediction, social media, data analysis.
Social media websites and applications are open forums where users post their opinions and emotions. Users from a wide range of demographics use these platforms. Blogs, review forums, tweets, and Facebook comments can all be treated as data. When analyzed correctly, this data reveals the majority opinion about a particular topic. User reviews on a web topic are particularly useful. These reviews usually express positive, negative, or mixed sentiment, which affects other users' behavior and their psychological and cognitive activities.
Classical web sentiment analysis systems cannot meet real-time demands. Sentiment analysis can also be used to classify certain traits in a person by comparing that person to other people who share those traits. This in turn makes it possible to predict what a person is going to do by considering what similar people have done. Success in the development of statistical natural language processing (NLP) has led to improvements in fundamental text analysis such as part-of-speech (POS) tagging, phrase chunking, dependency analysis, and parsing. Using these components as fundamental building blocks, many NLP researchers have become interested in analyzing text "semantically" or "contextually". For example, named entity tagging, semantic role tagging, and discourse parsing are being investigated in the NLP field.
This paper focuses on the sentiment analysis methods that can be used to recognize users of social networking sites who are likely to commit suicide. We compare three different methods to identify the correct methodology for our study. The first method we consider is the proximity-based approach. The next approach uses a topic detection methodology. In the third method, knowledge enhancer, a system is proposed to process short texts and filter them precisely based on the semantics of the information they contain. The paper is organized as follows. Section 2 discusses the literature review. Section 3 tabulates the results of the comparative study. Section 4 summarizes the conclusions of the study.
S.M. Shamimul Hasan et al. [1] build on the idea that, in goal-oriented writing, a person who starts writing positively about a topic or subject continues this positive trend for a period of time. Later they use inflexion words like "however" and then start writing negatively about the topic. Within a paragraph, people do not repeatedly alternate one positive and one negative word. Typically, segments of a written text capture a concept or trend of thought over a short period of time, and such trends can fluctuate as we move through the document. The average distance between positive-oriented words is therefore expected to be small in segments bearing positive sentiment, and likewise for negative-oriented words in segments bearing negative sentiment. Conversely, the average distance between positive-oriented words is expected to be relatively large in segments bearing negative sentiment, and the same holds for negative-oriented words in segments bearing positive sentiment.
The methodology uses the Stone dictionary [5] to identify the polarity of each word (negative or positive). A pair of words is selected from the unique word list. The distances between the two words of the pair are then measured, and the number of occurrences of that pair is counted for each particular distance in a segment of the given text. The resulting word patterns are then classified using either a supervised or an unsupervised approach.
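The distance-counting step can be illustrated with a minimal sketch. It is not the authors' implementation: the tiny `POLARITY` lexicon stands in for the Stone dictionary, the function names are hypothetical, and pairs are grouped by polarity type (Positive-Positive, Positive-Negative, Negative-Negative) rather than by individual word pair to keep the example short.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy lexicon; the paper takes word polarity from the Stone dictionary.
POLARITY = {"good": "pos", "great": "pos", "bad": "neg", "poor": "neg"}

def proximity_counts(tokens, lexicon=POLARITY):
    """Count how often each polarity-pair type occurs at each token distance."""
    # Keep (position, polarity) for every token found in the lexicon.
    hits = [(i, lexicon[t]) for i, t in enumerate(tokens) if t in lexicon]
    counts = Counter()
    for (i, p), (j, q) in combinations(hits, 2):
        # Order-independent label: "pos-pos", "pos-neg" or "neg-neg".
        pair_type = "-".join(sorted((p, q), reverse=True))
        counts[(pair_type, j - i)] += 1
    return counts

def mean_distance(counts, pair_type):
    """Average distance for one pair type, or None if the type never occurs."""
    total = sum(n for (t, _), n in counts.items() if t == pair_type)
    if total == 0:
        return None
    return sum(d * n for (t, d), n in counts.items() if t == pair_type) / total

segment = "the food was good and the service was great however parking was bad".split()
counts = proximity_counts(segment)
```

For this segment, `mean_distance(counts, "pos-pos")` gives the average gap between positive words, which could then be fed to a classifier as one feature among others.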
Fig 1: General schematic of proximity approach
Keke Cai et al. [2] propose an architecture that determines not only the sentiment of the author but also the cause of that sentiment. The overall process starts by collecting web comments that discuss a given object from external content repositories, such as blogs, message boards, and news articles, and creating a web content data warehouse. Next, the snippets related to the subjects are extracted from the warehouse, and a sentiment score is calculated for each snippet: the sum of the positive sentiment word scores minus the sum of the negative sentiment word scores, divided by the square root of the snippet length. Snippets are then classified into categories to form the sentiment taxonomy. Sentiment taxonomies are created by measuring the relative sentiment expressed by the words in each snippet and using this numeric score to partition the snippets into positive, negative, and neutral categories. Lastly, the most significant topics related to each sentiment category are identified. The sentiment topic detection component detects the most significant topics hidden behind each sentiment category using a combination of PMI [4] and word support metrics. The PMI value evaluates the uniqueness of a word to each sentiment category.
PMI and word support evaluate the importance of a word to each sentiment category from different points of view. When considered together, they can detect relevant sentiment topic words.
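The snippet score described above can be sketched as follows. This assumes the reading (positive sum minus negative sum) divided by the square root of the length. The word scores, the `threshold` value, and the function names are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical word scores; the paper derives polarity from external lexical resources.
POSITIVE = {"excellent": 1.0, "reliable": 0.5}
NEGATIVE = {"slow": 0.5, "broken": 1.0}

def snippet_score(tokens):
    """(sum of positive scores - sum of negative scores) / sqrt(snippet length)."""
    if not tokens:
        return 0.0
    pos = sum(POSITIVE.get(t, 0.0) for t in tokens)
    neg = sum(NEGATIVE.get(t, 0.0) for t in tokens)
    return (pos - neg) / math.sqrt(len(tokens))

def categorize(score, threshold=0.1):
    """Partition by score; the threshold is an assumed value, not from the paper."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

score = snippet_score("excellent and reliable but slow".split())
label = categorize(score)
```

Dividing by the square root of the length rather than the length itself dampens the score of long snippets without letting a single sentiment word dominate a short one.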
Overall, topic words are identified through the following steps:
1. Classify documents into categories of positive, negative and neutral using sentiment classification techniques.
2. Identify all words in the documents and filter out all stop words and sentiment words, keeping only non-sentiment words as candidate sentiment topic words.
3. Calculate the frequency of the words remaining in step 2 in each single sentiment category as well as across all categories to establish word supports.
4. Calculate the PMI value of the words in each sentiment category based on the formula.
5. Combine the frequency of the words in each category with their PMI values and select the most frequent words with high PMI values as the final sentiment topic words.
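Steps 2 to 5 can be sketched as below, assuming step 1 has already assigned each document to a category. The source does not reproduce the PMI formula, so the standard definition PMI(w, c) = log(P(w, c) / (P(w) P(c))) is assumed here. The stop-word list, the sentiment lexicon, the `min_support` cutoff, and the ranking rule are likewise hypothetical.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "is", "a", "and", "was"}            # toy stop-word list
SENTIMENT_WORDS = {"good", "bad", "great", "terrible"}   # toy sentiment lexicon

def topic_words(docs_by_category, min_support=2, top_k=3):
    """docs_by_category maps a category name to a list of token lists."""
    # Steps 2-3: drop stop words and sentiment words, then count word support.
    support = {
        cat: Counter(t for doc in docs for t in doc
                     if t not in STOP_WORDS and t not in SENTIMENT_WORDS)
        for cat, docs in docs_by_category.items()
    }
    category_total = {cat: sum(c.values()) for cat, c in support.items()}
    overall = Counter()
    for c in support.values():
        overall.update(c)
    grand_total = sum(overall.values())

    result = {}
    for cat, counts in support.items():
        scored = []
        for word, n in counts.items():
            if n < min_support:          # assumed support cutoff
                continue
            # Step 4: PMI(w, c) = log(P(w, c) / (P(w) * P(c))).
            p_wc = n / grand_total
            p_w = overall[word] / grand_total
            p_c = category_total[cat] / grand_total
            scored.append((math.log(p_wc / (p_w * p_c)), n, word))
        # Step 5: rank by PMI, breaking ties by support and then alphabetically.
        scored.sort(key=lambda s: (-s[0], -s[1], s[2]))
        result[cat] = [word for _, _, word in scored[:top_k]]
    return result

docs = {
    "positive": [["battery", "is", "great"], ["battery", "life", "good"], ["screen", "great"]],
    "negative": [["delivery", "was", "terrible"], ["delivery", "bad", "screen"]],
}
topics = topic_words(docs)
```

In this toy corpus "screen" appears in both categories and falls below the support cutoff in each, so it is not selected as a topic word for either.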
Rabia Batool et al. [3] suggest a technique to extract valuable information from tweets and classify the tweets into different categories based on the knowledge contained in them. The collected tweets are given to the Alchemy API, which accepts unstructured text, processes it using natural language processing and machine learning techniques, and returns keywords and the users' sentiments about those keywords. The proposed system extracts participating keywords and their associated sentiments using the Alchemy API, which is able to extract sentiments at the topic level. After knowledge extraction, all tweets, participating keywords, and associated sentiments are stored in the repository for further processing.
First, the Twitter Search API is used to find and archive tweets containing a specific keyword in XML format. The preprocessor uses a DOM parser to split the data, and slang is then removed. The knowledge generator classifies tweets into categories according to the knowledge in them, using the keywords and topic-level sentiments returned by the Alchemy API. The knowledge enhancer adds knowledge that the Alchemy API did not extract as a keyword, using part-of-speech tagging and entity extraction. To achieve this, the proposed system incorporates subjects, verbs, objects, and entities into the knowledge; however, adding just verbs and entities already increases the information collected from tweets. The synonym binder binds synonyms to each entity and keyword extracted by the knowledge generator and the knowledge enhancer. The WordNet dictionary is the source of synonyms, and the Jaws API is used to retrieve them. The synonym binder connects synonyms with words and stores them in the data store so that data can be classified more precisely. For example, it binds the word "workout" to its synonym "exercise"; "exercise" is present in the seed list but "workout" is not. It also handles many word-form problems, e.g., it extracts "calorie" as a synonym of "calories" and "exercise" as a synonym of "exercises". Finally, the filter engine further classifies tweets into categories. The filtering process is domain specific: seed lists are used to identify which category the extracted knowledge belongs to.
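The slang removal, synonym binding, and seed-list filtering stages can be illustrated with a minimal sketch. All lookup tables here are hypothetical stand-ins: the paper uses a slang list, WordNet accessed through the Jaws API, and domain-specific seed lists, none of which are reproduced. Collapsing repeated letters to two is a simple heuristic assumed for this example.

```python
import re

# Hypothetical stand-ins for the slang list, WordNet synonyms and seed lists.
SLANG = {"plz": "please", "u": "you"}
SYNONYMS = {"workout": {"exercise"}, "calories": {"calorie"}}
SEED_LISTS = {"fitness": {"exercise", "calorie"}, "diet": {"sugar", "diet"}}

def normalize(tweet):
    """Lowercase, expand slang, and collapse runs of 3+ identical characters."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    tokens = [SLANG.get(t, t) for t in tokens]
    # Heuristic: "gooood" -> "good" (a run of three or more becomes two).
    return [re.sub(r"(.)\1{2,}", r"\1\1", t) for t in tokens]

def bind_synonyms(tokens):
    """Return the tokens together with any synonyms bound to them."""
    bound = set(tokens)
    for t in tokens:
        bound |= SYNONYMS.get(t, set())
    return bound

def classify(tweet):
    """Return the sorted categories whose seed list overlaps the bound words."""
    words = bind_synonyms(normalize(tweet))
    return sorted(cat for cat, seeds in SEED_LISTS.items() if words & seeds)

tokens = normalize("Plz try this gooood workout")
categories = classify("Plz try this gooood workout")
```

Here the tweet never mentions "exercise", yet it is assigned to the fitness category because "workout" is bound to that seed word, which mirrors the example given in the text.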
Fig 2: Framework of the sentiment analysis system
Fig 3: Proposed system architecture for tweet classification and sentiment analysis
In this section we list the results of our comparative study. The parameters under consideration range from objectives and methods to databases and advantages.
PARAMETERS PROXIMITY BASED TOPIC DETECTION KNOWLEDGE ENHANCER
Basic Idea
Proximity based: Proximity-based sentiment analysis is able to extract sentiments from a specific domain, with excellent performance.
Topic detection: This technique goes beyond sentiment classification by focusing on techniques that detect the topics that are highly correlated with the positive and negative opinions. Such techniques, when coupled with sentiment classification, can help business analysts understand both the overall sentiment scope and the drivers behind the sentiment.
Knowledge enhancer: Precise extraction of valuable information from short text messages posted on social media (Twitter) is a collaborative task. In this approach, tweets are analyzed to classify data and sentiments from Twitter more precisely. Moreover, the extracted knowledge is further enhanced using a domain-specific seed-based enrichment technique.

Methodology
Proximity based: Proximity approach and machine learning approach.
Topic detection: Semantic-based approach and machine learning approach.
Knowledge enhancer: Seed-based enrichment and synonym binder.

Database used by the technique
Proximity based: Blogs, user opinions, books.
Topic detection: Consumer Generated Media (CGM) such as blogs, message boards and news articles.
Knowledge enhancer: Online posts and tweets (mainly Twitter). It uses Archivist, a service that uses the Twitter Search API to find and archive tweets containing a specific keyword. It also uses Grabeeter, which searches and grabs tweets posted by individuals.

Database used for sentiment analysis
Proximity based: It uses the Stone dictionary to identify polarity.
Topic detection: It uses two external NLP resources, the Inquirer database and WordNet, to establish positive and negative words.
Knowledge enhancer: Tweets returned by Archivist are in XML format and need preprocessing before they are stored in a repository. A DOM parser is used to parse the XML documents and store them in a repository (a relational database).

Number of categories of sentiments
Proximity based: 3 categories: (1) Positive-Positive, (2) Positive-Negative, (3) Negative-Negative.
Topic detection: 5 categories: (1) Positive, (2) Neutral, (3) Neutral, (4) Neutral, (5) Negative.
Knowledge enhancer: 3 categories: (1) Positive, (2) Negative, (3) Neutral.

Main approaches used
Proximity based: (1) Proximity approach: unsupervised approach, mean-median approach. (2) Machine learning: KNN, SVM, J48, Naïve Bayes.
Topic detection: (1) Semantic approach: STD, Chi-square. (2) Machine learning: Naïve Bayes, SVM, Maximum Entropy.
Knowledge enhancer: (1) Preprocessor, (2) Knowledge Generator, (3) Knowledge Enhancer, (4) Synonym Binder, (5) Filter Engine.

Data Cleansing
Proximity based: It is not used by this approach.
Topic detection: It is not used by this approach.
Knowledge enhancer: Twitter data contains slang and repeated characters, such as 'plz' and 'gooood' instead of 'please' and 'good'. Users repeat a character in tweets to emphasize a word. This kind of noise can affect the knowledge extraction process. The proposed system supports the removal of 1300 slang terms from tweets; the slang remover replaces 'plz' and 'gooood' with 'please' and 'good' respectively.

Success Rate
Proximity based: 72.5%
Topic detection: The STD approach is around 40% more accurate than the Chi-square approach.
Knowledge enhancer: Applying the Knowledge Enhancer and Synonym Binder modules to the extracted information increases information gain by 0.1% to 55%.

Advantages/Issues solved by the technique
Proximity based: This is the first attempt at focusing on only proximity-based features.
Topic detection: This approach is able to find the drivers behind the sentiments, which was not possible before. The overall solution not only determines the sentiment about a given topic, but also uncovers the potential root causes of the sentiments.
Knowledge enhancer: Processing short text to understand its context and extract the desired knowledge is a challenging task. It needs sequential processing of the text to mine the actual semantics of the contained information. In addition, this approach can answer questions such as: What are these tweets about? How are they related? What sentiments have users expressed about the topic? For example, if a user posts tips to decrease blood glucose level, this is very informative for a diabetic patient; however, it is not retrieved when searching with the keyword "diabetes".

Disadvantages
Proximity based: This methodology is affected by the polarity dictionary that is used. If the dictionary is not accurate enough, or not large enough, the result could be significantly impacted.
Topic detection: The disadvantages of this approach are as follows. First, building labeled documents can be challenging. Some approaches try to use existing movie reviews as training sets, but it is unclear whether movie reviews translate well to other domains, such as the food or financial industries. This leads to the second drawback: the approach may not be adaptive enough to work across different data sets or domains.
Knowledge enhancer: This approach does not include spelling correction. Hence, misspelled words cannot be classified by this approach.

In this research report we have studied three methodologies used for sentiment analysis: the proximity-based approach, topic detection, and the Knowledge Enhancer module. First, we studied the methodologies in detail, including the data sets, the dictionaries used, and the examples provided. Next, the attributes for comparison were identified. Lastly, the properties of those attributes were identified for each method and the comparative study was recorded. The parameters for comparison include objectives, advantages, shortcomings, databases, and methodologies used.
[1] Shamimul H. and Donald A.A. 2011. Detecting Human Sentiment from Text using a Proximity-Based Approach. Journal of Digital Information Management, vol. 9, no. 5.
[2] Keke C., Scott S. and Li Zhang. 2008. Leveraging Sentiment Analysis for Topic Detection. IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology.
[3] Rabia B., Asad M. K., Jahanzeb M. and Sungyoung L. 2013. Precise Tweet Classification and Sentiment Analysis. IEEE.
[4] Pantel P. and Lin D. 2002. Discovering Word Senses from Text. Proceedings of ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 613-619.
[5] Stone P., Dunphy W., Smith M. and Ogilvie D. 1966. The General Inquirer: A Computer Approach to Content Analysis. The MIT Press.