MIS696A Reading in MIS
Research Proposal
Detect movie spoilers: An external resource independent approach
Candidate ShanJiang
Date September 29, 2010
Introduction
As more ordinary users become comfortable with the Web, an increasing number of people are writing
reviews (Hu & Liu, 2004), which provide a rich source of useful information to customers. As a result,
people tend to consult online customer reviews before purchasing products. This holds true in the movie
industry as well. To avoid wasting time and money on poor movies, people like to read the comments of
those who have already watched them.
Usually, the more information a review contains, the more customers benefit from it. However, this does
not hold in the movie domain. Products such as movies are intended to give customers ‘experiential
consumption’ (Tsang & Yan, 2010), and part of their value lies in unexpected information. Revealing
such information seriously harms customers’ motivation to purchase the product, and thus the profit the
product would otherwise have made. For example, people who know the criminal in a detective film in
advance are less likely to enjoy the movie. This type of unwanted information associated with
experiential consumption is called a spoiler. In the domain of movies, the term movie spoiler is often
used.
Movie spoilers are very common on existing online movie review websites such as IMDB (www.imdb.com)
and Yahoo Movie Review (http://movies.yahoo.com). To make things worse, few websites have a mechanism
for users to hide movie spoilers (need reference). Yet most people who consult movie reviews wish only
to know how the movie is rated in quality, not the details of the plot. Therefore, a way to screen movie
spoilers from people who do not want to know too much about the story would be useful.
The objective of this study is to find a method to automatically detect movie spoilers in movie
review corpora. Text mining approaches including opinion mining, sentiment analysis and affect analysis
will be employed.
Related Work
Mining and analyzing customer reviews has been a popular research topic in text mining. One thread of
research focuses on the structured part of the review, i.e. numerical rating data (Cheung, Kwok, Law &
Tsui, 2003; Jindal & Liu, 2006). These studies aimed at ranking products according to the quality
reflected by reviews (Cheung, Kwok, Law & Tsui, 2003) or at detecting review spammers (Jindal & Liu,
2006; Lim, Nguyen, Jindal, Liu & Lauw, 2010). More research has applied text mining techniques to
extract sentiments and features from the unstructured comments in the body of customer reviews
(Nasukawa & Yi, 2003; Lee, 2004; Hu & Liu, 2004), and most of it has focused on sentiment classification
(Turney, 2002; Dave, Lawrence & Pennock, 2003). An important finding is that classifiers work well on
whole reviews but not at the sentence level, because a single sentence carries too little information
(Dave, Lawrence & Pennock, 2003). This indicates that sentence-level analysis cannot be performed with
simple sentiment analysis. Nevertheless, semi-structured comments have been studied, in which each
comment body contains separate pros and cons (Liu, Hu & Cheng, 2005) or comments are given for
different product features (Hu & Liu, 2004; Liu, Hu & Cheng, 2005). In these works, researchers strove
to analyze comments at the sentence level in order to convert unstructured customer reviews into a more
organized, feature-classified comment summary. A survey of techniques and approaches for opinion
mining and sentiment analysis can be found in Pang & Lee (2008).
Movie review mining is a special type of customer review analysis that uses movie reviews as datasets. It
distinguishes itself from other kinds of review analysis in that factual information is always mixed
with real-life review data and ironic words are often used in writing reviews (Chaovalit & Zhou, 2005),
which can mislead sentiment analysis or affect analysis. Past research mainly focused on classifying
movie reviews according to expressed sentiment (Pang, Lee & Vaithyanathan, 2002; Chaovalit & Zhou,
2005; Kennedy & Inkpen, 2006) or on automatic review summarization (Zhuang, Jing & Zhu, 2006).
Techniques used include multi-knowledge based approaches (Zhuang et al., 2006), machine learning (Pang
et al., 2002; Kennedy & Inkpen, 2006; Chaovalit & Zhou, 2005) and semantic orientation (Kennedy &
Inkpen, 2006; Chaovalit & Zhou, 2005). They are summarized in Table 1.1.
Table 1.1 A summary of movie review research.
Study | Objective | Techniques used
Pang et al., 2002 | Sentiment classification | Machine learning
Chaovalit & Zhou, 2005 | Sentiment classification | Machine learning, semantic orientation
Kennedy & Inkpen, 2006 | Sentiment classification | Machine learning, semantic orientation
Zhuang et al., 2006 | Summarization | Multi-knowledge based
However, very few researchers have mentioned the concept of movie spoilers in past studies. Formally, the
spoiler effect denotes the phenomenon that a consumer’s interest in consuming a particular narrative is
reduced after exposure to a spoiler (Tsang & Yan, 2010). Spoiler exposure creates a satiation effect
(Redden, 2008) and an explanation effect (Wilson, Centerbar, Kermer & Gilbert, 2005), which reduce
customers’ intent to actually purchase the information-providing product. Among these studies,
semantic information about spoilers is rarely used. In a study of movie studios’ return on investment,
textual information from movie spoilers was used merely as a substitute for movie scripts, because
manuscripts are difficult to obtain directly in electronic form (Eliashberg, Hui & Zhang, 2007). Only one
study addresses the automatic detection of movie spoilers (Guo & Ramakrishnan, 2010).
Research Gap and Research Question
Guo and Ramakrishnan developed a topic model based on Latent Dirichlet Allocation (LDA) to detect
comments and reviews containing spoilers by comparing them with movie synopses from IMDB and then
ranking them by their similarity to the synopsis. In general, the closer a review is to the synopsis in
topic, the more likely it is to contain spoilers. However, this approach works only when a synopsis
exists, which is not always the case. It is important to discover movie spoilers whether or not we
have the plot of the movie. Therefore, our approach should be independent of external information.
The key is to find the semantic and stylometric clues of spoilers and characterize them. Hence the
research questions are:
Q1: What textual features distinguish evaluation comments from movie spoilers?
Q2: How can we build a classifier that separates movie spoilers from other comments with high accuracy?
Q3: How scalable and robust is such a classifier?
Proposed Method
The difficulty of this problem is that movie spoilers often exist at the sentence level (that is, as just a
part of a whole comment), and they cannot be told apart by checking whether a sentence is subjective or
objective. Consider the following two sentences, both of which are spoilers:
Objective: In the end the heroine died.
Subjective: The death of the heroine was very sad to me.
This means that subjective/objective analysis cannot be used directly to distinguish normal comments
from spoilers. Nor can the decision be based on the affect expressed:
Spoiler: The sudden appearance of a zombie at the beginning scared me!
Normal comment: The film is overall scary.
Even worse, whether movie-specific terms are used cannot serve as the critical filter either. Both of
the following sentences are spoilers, but only the first contains a movie-specific name:
The criminal was Kelly Brondson.
The criminal was someone you could hardly imagine: the best friend!
Therefore, no single sentence-level feature can distinguish spoilers from normal comments.
Similarly, no single technique can solve this problem. Hence the first step should be to find a
set of features or clues that are potentially useful in classifying movie reviews.
Feature sets are usually constructed to facilitate classification. In particular, large feature sets allow
greater discriminatory potential across larger data sets, increasing scalability and classification
performance. For example, 23 features organized into 5 groups were used in text authorship
analysis (Abbasi & Chen 2008). Another example is the large set of cues employed in detecting fake
websites (Abbasi, Zhang, Zimbra, Chen & Nunamaker 2010). However, to the best of our knowledge,
almost no researchers have identified feature sets that can be used to distinguish movie spoilers. Guo
and Ramakrishnan’s approach was based on LDA topic distribution similarity (Guo & Ramakrishnan 2010),
so the only feature they employed can be viewed as the topic proportion of a review comment, which is
also not at the sentence level. In this study, we will construct a set of clues that may be useful in
classification. They are summarized in Table 2.1.
Table 2.1 (Preliminary) Sentence-level cues for movie spoiler detection
Sentence-level clue | Description | Example & implication
POS tag n-grams | Sequential part-of-speech tag combinations in the sentence | Noun-be-noun (criminal was Watson) and noun-verb (heroine died) patterns are likely to be spoilers
Sentiment polarity | Positive or negative classification | To be examined
Subjectivity | Whether the sentence is subjective or objective | Objective sentences are more likely to be spoilers
Tense of verbs | Past tense, future tense, etc. | Spoilers tend to be in past tense, while normal comments are not (I like the movie)
Movie-specific terms | Whether the sentence contains terms specific to the movie, such as names. We just need to determine whether it contains a word that cannot be found in a dictionary. | Spoilers tend to have movie-specific terms (Donbrido died at last)
Feature frequency | The frequency of the subject that appears in a sentence. It can be extracted with the method of Hu and Liu (2004). | To be examined
Average word length | Average length of the words | To be examined
Sentence length | Length of the sentence | To be examined
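As a rough illustration of how a few of the cues in Table 2.1 might be computed, the following Python sketch extracts sentence length, average word length, a crude past-tense indicator and the out-of-dictionary terms of a sentence. The tiny word lists, the "-ed" heuristic and the function name extract_cues are assumptions made purely for illustration; the actual study would more likely rely on a part-of-speech tagger and a full dictionary.

```python
import re

# Hypothetical, deliberately tiny resources. A real system would use a full
# dictionary and a part-of-speech tagger instead of these stand-ins.
DICTIONARY = {"the", "criminal", "was", "film", "is", "overall", "scary",
              "died", "at", "last", "in", "end", "heroine", "i", "like", "movie"}
IRREGULAR_PAST = {"was", "were", "did", "had", "went", "saw"}


def tokenize(sentence):
    """Split a sentence into alphabetic word tokens."""
    return re.findall(r"[A-Za-z]+", sentence)


def extract_cues(sentence):
    """Return a dict of simple sentence-level cues for one sentence."""
    words = [w.lower() for w in tokenize(sentence)]
    n = len(words)
    # Crude tense cue: irregular past forms or an "-ed" suffix. This will
    # misfire on words such as "need", which is why a tagger is preferable.
    past = [w for w in words if w in IRREGULAR_PAST or w.endswith("ed")]
    # Movie-specific terms are approximated as words missing from the dictionary.
    unknown = [w for w in words if w not in DICTIONARY]
    return {
        "sentence_length": n,
        "avg_word_length": sum(len(w) for w in words) / n if n else 0.0,
        "has_past_tense": bool(past),
        "unknown_terms": unknown,
    }


for text in ["Donbrido died at last.", "The film is overall scary."]:
    print(text, extract_cues(text))
```

Each sentence yields one feature record, so the records for manually tagged spoilers and non-spoilers can be compared cue by cue.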
The approach we are going to use in this study consists of three steps. The datasets will
come from IMDB or Yahoo movie reviews.
Feature Selection: In step 1, we will manually identify and tag spoilers in a sample review set (about
500 comments). Then, the features above will be extracted from both spoilers and non-spoilers and
compared between the two groups. In this step, we will find out which features are most useful for our task.
Classifier Training: In step 2, after determining the feature set to use, we will train an SVM or SVM
ensemble on these features and the spoiler tags.
Evaluation: In step 3, we will evaluate the classifier on the rest of the data set. The evaluation metrics are
the percentage of spoilers not detected and the percentage of normal comments tagged as spoilers.
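The two evaluation metrics of step 3 can be sketched in Python as follows. The function name spoiler_error_rates and the eight example labels are made up for illustration, and the choice to return 0.0 when a class is empty is an assumption, not part of the proposal.

```python
def spoiler_error_rates(true_labels, predicted_labels):
    """Return (miss_rate, false_alarm_rate) for boolean spoiler labels.

    miss_rate: share of true spoilers that the classifier failed to flag.
    false_alarm_rate: share of normal comments wrongly flagged as spoilers.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    spoilers = normals = missed = false_alarms = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth:
            spoilers += 1
            missed += 0 if pred else 1
        else:
            normals += 1
            false_alarms += 1 if pred else 0
    # Assumption: report 0.0 when a class has no examples at all.
    miss_rate = missed / spoilers if spoilers else 0.0
    false_alarm_rate = false_alarms / normals if normals else 0.0
    return miss_rate, false_alarm_rate


# Made-up labels for eight held-out sentences (True = spoiler).
truth = [True, True, True, True, False, False, False, False]
preds = [True, True, True, False, False, False, True, True]
miss, false_alarm = spoiler_error_rates(truth, preds)
print(f"missed spoilers: {miss:.0%}, false alarms: {false_alarm:.0%}")
# prints "missed spoilers: 25%, false alarms: 50%"
```

Reporting the two rates separately matters here because the two kinds of error cost users differently: a missed spoiler ruins the movie, while a false alarm only hides a harmless sentence.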
These steps are modeled in Picture 3.1.
Picture 3.1 Proposed model
[Figure: movie reviews and manually tagged reviews feed a feature extractor that produces feature sets (POS tags, sentiment polarity, tense, etc.); the selected features train an SVM or ensemble SVM spoiler classifier. The stages are feature selection, classifier training and evaluation.]
Potential Outcomes of Research
This research may produce an IT artifact that can detect movie spoilers in movie review
comments.
With this tool, when a user writes and submits a comment, the system will automatically detect the
spoiler sentences in it and hide them from other users. These sentences become visible only when a
user clicks ‘I want to see spoilers’.
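The hiding behaviour described above might look roughly like the following Python sketch. The naive sentence splitter, the placeholder text and the keyword-based toy_classifier are all stand-ins invented for illustration; in the envisioned tool the trained spoiler classifier would take the place of toy_classifier.

```python
import re


def split_sentences(comment):
    """Naively split a comment at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", comment.strip())
    return [p for p in parts if p]


def render_comment(comment, is_spoiler, show_spoilers=False,
                   placeholder="[spoiler hidden]"):
    """Return the comment with spoiler sentences hidden unless requested."""
    shown = []
    for sentence in split_sentences(comment):
        if is_spoiler(sentence) and not show_spoilers:
            shown.append(placeholder)
        else:
            shown.append(sentence)
    return " ".join(shown)


def toy_classifier(sentence):
    """Stand-in for the trained classifier: flags sentences mentioning 'died'."""
    return "died" in sentence.lower()


comment = "The film is overall scary. In the end the heroine died. Worth watching!"
print(render_comment(comment, toy_classifier))
print(render_comment(comment, toy_classifier, show_spoilers=True))
```

Passing the classifier in as a function keeps the display logic independent of how spoilers are detected, so the classifier can be replaced without touching the rendering code.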
References
Guo S., Ramakrishnan N. 2010. Finding the Storyteller: Automatic Spoiler Tagging using Linguistic
Proceedings of the 23rd International Conference on Computational Linguistics, pp.412–420.
Chaovalit P., Zhou L. 2005. Movie Review Mining: a Comparison between Supervised and Unsupervised
Classification Approaches. Proceedings of the 38th Annual Hawaii International
Conference on System Sciences (HICSS'05) - Track 4, pp.112.3.
Pang B., Lee L., Vaithyanathan S. 2002. Thumbs up?: sentiment classification using machine
learning techniques. Proceedings of the ACL-02 conference on Empirical methods in natural
language processing, pp. 79-86.
Kennedy A., Inkpen D. 2006. Sentiment classification of movie reviews using contextual valence
shifters. Computational Intelligence, vol. 22, pp. 110-125.
Zhuang L., Jing F., Zhu X. 2006. Movie review mining and summarization. Proceedings of the 15th
ACM international conference on Information and knowledge management.
Eliashberg J., Hui S.K., Zhang Z.J. 2007. From story line to box office: A new approach for
green-lighting movie scripts. Management Science, 53 (6), 881–893.
Redden J.P. 2008. Reducing Satiation: The Role of Categorization Level. Journal of Consumer
Research, 34 (5), 624-34.
Wilson T.D., Centerbar D., Kermer D., Gilbert D. 2005. The Pleasure of Uncertainty: Prolonging
Positive Moods in Ways People do Not Anticipate. Journal of Personality and Social Psychology, 88
(1), 5-21.
Cheung K.W., Kwok T., Law M.H., Tsui K. 2003. Mining customer product ratings for personalized
marketing. Decision Support Systems, vol. 35, pp. 231-243.
Turney P.D. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised
classification of reviews. In ACL 2002, pp. 417–424.
Jindal N., Liu B. 2006. Identifying comparative sentences in text documents. Proceedings of the 29th
annual international ACM SIGIR conference on Research and development in information retrieval.
Lim E.P., Nguyen V.A., Jindal N., Liu B., Lauw H.W. 2010. Detecting Product Review Spammers using
Rating Behaviours. Working paper.
Hu M., Liu B. 2004. Mining and Summarizing Customer Reviews. KDD04, Seattle, WA.
Liu B., Hu M., Cheng J. 2005. Opinion observer: analyzing and comparing opinions on the Web.
Proceedings of the 14th international conference on World Wide Web.
Nasukawa T., Yi J. 2003. Sentiment Analysis: Capturing Favorability Using Natural Language
Processing. Presented at K-CAP'03, Sanibel Island, Florida.
Lee T. 2004. Use-centric mining of customer reviews. In WITS.
Dave K., Lawrence S., Pennock D.M. 2003. Mining the peanut gallery: Opinion extraction and
semantic classification of product reviews. In WWW12, pp. 519–528.
Pang B., Lee L. 2008. Opinion Mining and Sentiment Analysis. Foundations and Trends in Information
Retrieval, v.2 n.1-2, pp. 1-135.
Abbasi A., Zhang Z., Zimbra D., Chen H., Nunamaker J.F. 2010. Detecting Fake Websites: The
Contribution of Statistical Learning Theory. MIS Quarterly, 34 (3), pp. 435-461.
Abbasi A., Chen H. 2005. Applying authorship analysis to extremist-group Web forum messages.
IEEE Intelligent Systems, 20(5), 67–75.