MIS696A Reading in MIS Research Proposal

Detect movie spoilers: An external resource independent approach

Candidate: ShanJiang
Date: September 29, 2010

Introduction

As more and more ordinary users become comfortable with the Web, an increasing number of people are writing reviews (Hu & Liu, 2004), which provide a rich source of useful information for customers. As a result, people tend to consult online customer reviews before purchasing a product. This holds true in the movie industry: to avoid wasting time and money on poor movies, people like to read the comments of those who have already watched them.

Usually, the more information reviews contain, the more customers benefit from the opinions. However, this does not hold in the movie domain. Products such as movies are intended to provide customers with ‘experiential consumption’ (Tsang & Yan, 2010), and part of their value lies in unexpected information. Revealing such information seriously harms customers’ motivation to purchase the product, and thus the profit the product would otherwise have made. For example, people who know the criminal in advance are less likely to enjoy a detective film. This type of unwanted information associated with experiential consumption is called a spoiler; in the domain of movies, the term movie spoiler is often used.

Movie spoilers are very common on existing online movie review websites such as IMDB (www.imdb.com) and Yahoo Movie Review (http://movies.yahoo.com). To make things worse, few websites have a mechanism for users to hide movie spoilers (need reference). Yet most people who consult movie reviews only wish to know how the movie is rated in quality, not the details of the plot. Therefore, a way to screen movie spoilers from people who do not want to know too much about the story would be useful.
The objective of this study is to find a method to automatically detect movie spoilers in movie review corpora. Text mining approaches including opinion mining, sentiment analysis and affect analysis will be employed.

Related Work

Mining and analyzing customer reviews has been a popular research topic in text mining. One thread of research focuses on the structured part of the review, i.e. numerical rating data (Cheung, Kwok, Law & Tsui, 2003; Jindal & Liu, 2006). These studies aimed at ranking products according to the quality reflected by reviews (Cheung, Kwok, Law & Tsui, 2003) or at detecting review spammers (Jindal & Liu, 2006; Lim, Nguyen, Jindal, Liu & Lauw, 2010). More research has applied text mining techniques to extract sentiments and features from the unstructured comments in the body of customer reviews (Nasukawa & Yi, 2003; Lee, 2004; Hu & Liu, 2004), and most of it has focused on sentiment classification (Turney, 2002; Dave, Lawrence & Pennock, 2003). An important finding is that classifiers work well on whole reviews but not at the sentence level, because a single sentence carries too little information (Dave, Lawrence & Pennock, 2003). This indicates that sentence-level analysis cannot be performed with simple sentiment analysis. Nevertheless, some work treats comments as semi-structured, either because each comment body contains separate pros and cons (Liu, Hu & Cheng, 2005) or because comments are given for different product features (Hu & Liu, 2004; Liu, Hu & Cheng, 2005). In these works, researchers strove to analyze comments at the sentence level in order to convert unstructured customer reviews into a more organized, feature-classified comment summary. A survey of techniques and approaches for opinion mining and sentiment analysis can be found in (Pang & Lee, 2008).

Movie review mining is a special type of customer review analysis that uses movie reviews as datasets.
It distinguishes itself from other kinds of review analysis in that factual information is always mixed with real-life review data and ironic words are used in writing reviews (Chaovalit & Zhou, 2005), which might mislead sentiment analysis or affect analysis. Past research mainly focused on classifying movie reviews according to expressed sentiment (Pang, Lee & Vaithyanathan, 2002; Chaovalit & Zhou, 2005; Kennedy & Inkpen, 2006) or on automatic review summarization (Zhuang, Jing & Zhu, 2006). Techniques used include multi-knowledge based approaches (Zhuang et al., 2006), machine learning (Pang et al., 2002; Kennedy & Inkpen, 2006; Chaovalit & Zhou, 2005) and semantic orientation (Kennedy & Inkpen, 2006; Chaovalit & Zhou, 2005). They are summarized in Table 1.1.

Table 1.1 A summary of movie review research.

| | Pang et al., 2002 | Chaovalit & Zhou, 2005 | Kennedy & Inkpen, 2006 | Zhuang et al., 2006 |
| --- | --- | --- | --- | --- |
| Objective | Sentiment classification | Sentiment classification | Sentiment classification | Summarization |
| Techniques used | Machine learning | Machine learning, semantic orientation | Machine learning, semantic orientation | Multi-knowledge based |

However, very few researchers have mentioned the concept of movie spoilers in past studies. Formally, the spoiler effect denotes the phenomenon that a consumer’s interest in consuming a particular narrative is reduced after exposure to a spoiler (Tsang & Yan, 2010). Spoiler exposure creates a satiation effect (Redden, 2008) and an explanation effect (Wilson, Centerbar, Kermer & Gilbert, 2005), which reduce the customer's intent to actually purchase the information-providing product. Among these studies, semantic information regarding spoilers is rarely used. In a study of movie studios' return on investment, textual information in movie spoilers is used merely as a substitute for movie scripts, because manuscripts are difficult to obtain directly in electronic form (Eliashberg, Hui & Zhang, 2007).
Only one study addresses the automatic detection of movie spoilers (Guo & Ramakrishnan, 2010).

Research Gap and Research Question

Guo and Ramakrishnan developed a topic model based on Latent Dirichlet Allocation (LDA) to detect comments and reviews containing spoilers by comparing them with movie synopses from IMDB, and then ranked them according to their similarity to the synopsis. In general, the closer a review is to the synopsis in topic, the more likely it is to contain spoilers. However, this approach only works when a synopsis of the movie exists, which is not always the case. It is important to discover movie spoilers whether or not we have the plot of the movie. Therefore, our approach should be independent of external information. The key is to find the semantic and stylometric clues of spoilers and to characterize them. Hence the research questions are:

Q1: What are the textual features that distinguish evaluation comments from movie spoilers?
Q2: How can we build a classifier that separates movie spoilers from other comments with high accuracy?
Q3: How scalable and robust is such a classifier?

Proposed method

The difficulty of this problem is that movie spoilers often exist at the sentence level (that is, they make up only part of a whole comment), and they cannot be told apart by checking whether a sentence is subjective or objective. Consider the following two sentences, both of which are spoilers:

Objective: In the end the heroine died.
Subjective: The death of the heroine was very sad to me.

This means subjective/objective analysis cannot be used directly to distinguish normal comments from spoilers. Nor can the decision be based on the affect expressed:

Spoiler: The sudden appearance of a zombie in the beginning scared me!
Normal comment: The film is overall scary.

Even worse, whether movie-specific terms are used cannot serve as the critical filter either. Both of the following sentences are spoilers, but only the first contains a movie-specific name:

The criminal was Kelly Brondson.
The criminal was someone you can hardly imagine: the best friend!
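The last pair of example sentences can be checked with a minimal sketch of a single "movie-specific term" cue. It is only an illustration: the tiny `COMMON_WORDS` set is a hypothetical stand-in for a real dictionary, the function name `has_movie_specific_term` is not from any cited work, and the second sentence is lightly reworded.

```python
import re

# Hypothetical stand-in for a real English dictionary (assumption for this
# sketch); a full word list would be used in practice.
COMMON_WORDS = {
    "the", "criminal", "was", "someone", "you", "can", "hardly",
    "imagine", "best", "friend",
}


def has_movie_specific_term(sentence, dictionary=COMMON_WORDS):
    """Return True if any word of the sentence is missing from the dictionary."""
    words = re.findall(r"[A-Za-z]+", sentence.lower())
    return any(word not in dictionary for word in words)


# Both sentences are spoilers, but only the first contains a name.
spoiler_with_name = "The criminal was Kelly Brondson."
spoiler_without_name = "The criminal was someone you can hardly imagine: the best friend!"

flags = [has_movie_specific_term(s)
         for s in (spoiler_with_name, spoiler_without_name)]
print(flags)
```

Under these assumptions the cue fires on the first spoiler but misses the second, so it cannot be used as the only filter.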
Therefore, a single sentence-level feature cannot distinguish spoilers from normal comments. Similarly, a single technique cannot solve this problem either. Hence the first step should be to find a set of features or clues that are potentially useful in classifying movie reviews.

Feature sets are usually constructed to facilitate classification. In particular, large feature sets allow greater discriminatory potential across larger data sets, increasing scalability and classification performance. For example, 23 features grouped into 5 groups are used in text authorship analysis (Abbasi & Chen 2008). Another example is the large set of cues employed in detecting fake websites (Abbasi, Zhang, Zimbra, Chen & Nunamaker 2010). However, to the best of our knowledge, almost no researchers have identified feature sets that can be used to distinguish movie spoilers. Guo and Ramakrishnan’s approach was based on LDA topic distribution similarity (Guo & Ramakrishnan 2010), so the only feature they employed can be viewed as the topic proportion of a review comment, which is also not at the sentence level.

In this study, we will construct a set of clues that may be useful in classification. They are summarized in Table 2.1.

Table 2.1 (Preliminary) Sentence-level cues for movie spoiler detection

| Sentence-level clue | Description | Example & implication |
| --- | --- | --- |
| POS tag n-grams | Sequential combinations of part-of-speech tags in a sentence | Noun-be-noun (criminal was Watson) and noun-verb (heroine died) patterns are likely to be spoilers |
| Sentiment polarity classification | Positive or negative | To be examined |
| Subjectivity | Whether the sentence is subjective or objective | An objective sentence is more likely to be a spoiler |
| Tense of verbs | Past tense, future tense, etc. | Spoilers tend to be in the past tense, while normal comments are not (I like the movie) |
| Movie-specific terms | Whether the sentence contains terms specific to the movie, such as names. We only need to check whether it contains a word that cannot be found in a dictionary. | Spoilers tend to contain movie-specific terms (Donbrido died at last) |
| Feature frequency | The frequency of the subject that appears in a sentence. It can be extracted with the method of Hu and Liu (2004). | To be examined |
| Average word length | Average length of the words | To be examined |
| Sentence length | Length of the sentence | To be examined |

The approach we are going to use in this study consists of three steps. The datasets will come from IMDB or Yahoo movie reviews.

Feature Selection: In step 1, we will manually collect and pick out spoilers in a sample review set (about 500 comments). Then the features above will be extracted from both spoilers and non-spoilers and compared between the two groups. In this step, we will find out which features are most useful for our task.

Classifier Training: In step 2, after determining the feature sets to use, we will train an SVM or an SVM ensemble on these features and the spoiler tags.

Evaluation: In step 3, we will evaluate the classifier on the rest of the data sets. The evaluation metrics are the percentage of spoilers not detected and the percentage of normal comments tagged as spoilers.

These steps are modeled in Picture 3.1.

Picture 3.1 Proposed model. Stages: Feature Selection, Classifier Training, Evaluation. Components: movie reviews, manually tagged reviews, feature extractor, feature sets (POS tags, sentiment polarity, tense, etc.), SVM / ensemble SVM, spoiler classifier.

Potential Outcomes of Research

This research might successfully create an IT artifact that can detect movie spoilers in movie review comments. With this tool, when a user writes and submits a comment, the system will automatically detect the spoiler sentences in the comment and hide them from other users. These sentences become visible only when a user clicks ’I want to see spoilers’.

References

Guo S., Ramakrishnan N. 2010.
Finding the Storyteller: Automatic Spoiler Tagging using Linguistic Cues. Proceedings of the 23rd International Conference on Computational Linguistics, pp. 412–420.
Chaovalit P., Zhou L. 2005. Movie Review Mining: a Comparison between Supervised and Unsupervised Classification Approaches. Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS'05) - Track 4, pp. 112.3.
Pang B., Lee L., Vaithyanathan S. 2002. Thumbs up?: sentiment classification using machine learning techniques. Proceedings of the ACL-02 conference on Empirical methods in natural language processing, pp. 79-86.
Kennedy A., Inkpen D. 2006. Sentiment classification of movie reviews using contextual valence shifters. Computational Intelligence, vol. 22, pp. 110-125.
Zhuang L., Jing F., Zhu X. 2006. Movie review mining and summarization. Proceedings of the 15th ACM international conference on Information and knowledge management.
Eliashberg J., Hui S.K., Zhang Z.J. 2007. From story line to box office: A new approach for green-lighting movie scripts. Management Science, 53(6), 881–893.
Redden J.P. 2008. Reducing Satiation: The Role of Categorization Level. Journal of Consumer Research, 34(5), 624-34.
Wilson T.D., Centerbar D., Kermer D., Gilbert D. 2005. The Pleasure of Uncertainty: Prolonging Positive Moods in Ways People do Not Anticipate. Journal of Personality and Social Psychology, 88(1), 5-21.
Cheung K.W., Kwok T., Law M.H., Tsui K. 2003. Mining customer product ratings for personalized marketing. Decision Support Systems, vol. 35, pp. 231-243.
Turney P.D. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In ACL 2002, pp. 417–424.
Jindal N., Liu B. 2006. Identifying comparative sentences in text documents. Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval.
Lim E.P., Nguyen V.A., Jindal N., Liu B., Lauw H.W. 2010. Detecting Product Review Spammers using Rating Behaviours. Working paper.
Hu M., Liu B. 2004. Mining and Summarizing Customer Reviews. KDD04, Seattle, WA.
Liu B., Hu M., Cheng J. 2005. Opinion observer: analyzing and comparing opinions on the Web. Proceedings of the 14th international conference on World Wide Web.
Nasukawa T., Yi J. 2003. Sentiment Analysis: Capturing Favorability Using Natural Language Processing. Presented at K-CAP'03, Sanibel Island, Florida.
Lee T. 2004. Use-centric mining of customer reviews. In WITS.
Dave K., Lawrence S., Pennock D.M. 2003. Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. In WWW12, pp. 519–528.
Pang B., Lee L. 2008. Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval, v.2 n.1-2, pp. 1-135.
Abbasi A., Zhang Z., Zimbra D., Chen H., Nunamaker J.F. 2010. Detecting Fake Websites: The Contribution of Statistical Learning Theory. MIS Quarterly, 34(3), pp. 435-461.
Abbasi A., Chen H. 2005. Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5), 67–75.