Computational Models of Discourse Analysis
Carolyn Penstein Rosé
Language Technologies Institute / Human-Computer Interaction Institute

Warm Up Discussion
- Look at the analysis I have passed out
- Note: inscribed sentiment is underlined, invoked sentiment is italicized, and relatively frequent words that appear in either of these types of expressions have been marked in bold
- Do you see any sarcastic comments here?
- What connection, if any, do you see between sentiment and sarcasm?
- Keeping in mind the style of templates you read about, do you see any snippets of text in these examples that you think would make good templates?

Patterns I see
- Inscribed sentiment
  - About 24% of words in underlined segments are relatively high frequency
  - 3 "useful" patterns out of 18 underlined portions
  - Examples: Like; Good; More CW than CW
- Invoked sentiment
  - About 39% of words were relatively high frequency
  - About 7 possibly useful patterns out of 17, but only 3 really look unambiguous
  - Examples: CW like one CW the CW of the CW; Makes little CW to CW to the CW; CW and CW of an CW; Like CW on CW; Leave you CW a little CW and CW; CW more like a CW

Unit 3 Plan
- The 3 papers we will discuss all give ideas for using context (at different grain sizes):
  - Local patterns without syntax (using bootstrapping)
  - Local patterns with syntax (using a parser)
  - Rhetorical patterns within documents (using a statistical modeling technique)
- The first two papers introduce techniques that could feasibly be used in your Unit 3 assignment

Student Comment: Point of Discussion
- "To improve performance, language technologies seem to approach the task in one of two ways. First, approaches attempt to generate a better abstract model that provides the translation mechanism between a string of terms (a sentence) and our human mental model of sentiment in language. Alternatively, some start with a baseline and try to find a corpus or dictionary of terms that provides evidence for sentiment. Please clarify."

Connection between Appraisal and Sarcasm
- A sarcastic example of invoked negative sentiment from Martin and White, p. 72
- Student Comment: "I'm not exactly sure how one would go about applying appraisal theory to something as elusive as sarcasm."

Inscribed versus Invoked
- Do we see signposts that tell us how to interpret invoked appraisals?

Overview of Approach
- Start with a small amount of labeled data
- Generate patterns from the examples
  - Select those that appear in the training data more than once and don't appear in both a 1-star and a 5-star labeled example
- Expand the data through search, using examples from the labeled data as queries (take the top 50 snippet results)
- Represent the data in terms of templatized patterns
- Use a modified kNN classification approach

How could you do this with SIDE?
1. Build a feature extractor to generate the set of patterns
2. Use search to set up the expanded set of data
3. Apply the generated patterns to the expanded set of data
4. Use kNN classification

Pattern Generation
- Classify words into high frequency words (HFWs) versus content words (CWs)
  - HFWs occur at least 100 times per million words
  - CWs occur no more than 1,000 times per million words
  - Also add [product], [company], and [title] as additional HFWs
- Constraints on patterns: 2-6 HFWs, 1-6 slots for CWs, and patterns must start and end with HFWs (a code sketch of this step follows below)
- Would Appraisal theory suggest other categories of words?

Expand Data: "Great for Insomniacs…"
- What could they have done instead?
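Below is a minimal Python sketch of the pattern-generation step from the Pattern Generation slide above. The HFW/CW frequency thresholds, the [product]/[company]/[title] additions, and the 2-6 HFW / 1-6 CW constraints are taken from the slide; the function names, the contiguous sliding-window enumeration, the choice to treat mid-frequency words as HFWs, and the toy frequencies in the example are my own illustrative assumptions, not the authors' code.

    HFW_MIN = 100   # HFWs: at least 100 occurrences per million words (per the slide)
    CW_MAX = 1000   # CWs: no more than 1,000 per million; mid-range words are simply treated as HFWs here
    EXTRA_HFWS = {"[product]", "[company]", "[title]"}

    def classify(token, freq_per_million):
        """Label a token as a high-frequency word (HFW) or a content word (CW)."""
        if token in EXTRA_HFWS or freq_per_million.get(token, 0) >= HFW_MIN:
            return "HFW"
        return "CW"

    def candidate_patterns(tokens, freq_per_million):
        """Enumerate templatized patterns from one sentence: contiguous spans that
        start and end with an HFW and contain 2-6 HFWs and 1-6 CW slots."""
        labels = [classify(t, freq_per_million) for t in tokens]
        patterns = set()
        for i in range(len(tokens)):
            for j in range(i + 2, len(tokens) + 1):
                span = labels[i:j]
                if span[0] != "HFW" or span[-1] != "HFW":
                    continue
                if 2 <= span.count("HFW") <= 6 and 1 <= span.count("CW") <= 6:
                    # keep HFWs verbatim, abstract each CW into a generic slot
                    patterns.add(" ".join(t if lab == "HFW" else "CW"
                                          for t, lab in zip(tokens[i:j], span)))
        return patterns

    # Example with made-up per-million frequencies:
    freqs = {"like": 5000, "a": 25000, "of": 30000, "the": 50000}
    print(candidate_patterns("reads more like a parody of the genre".split(), freqs))
    # -> {'a CW of', 'a CW of the', 'like a CW of', 'like a CW of the'} (set order may vary)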
Pattern Selection
- Approach: select those that appear in the training data more than once and don't appear in both a 1-star and a 5-star labeled example
- Could instead have used an attribute selection technique like Chi-squared attribute evaluation
- What do you see as the trade-offs between these approaches?

Representing Data as a Vector
- Most of the features were the generated patterns
- Also included punctuation-based features
  - Number of !, number of ?, number of quotes, number of capitalized words
- What other features would you use?
- What modifications to feature weights would you propose?

Modified kNN
- Is there a simpler approach?
- Weighted average so that majority-class matches count more (a code sketch appears at the end of this section)

Evaluation
- Student Comment: "I am … rather wary of the effectiveness of their approach because it seems that they cherry-picked a heuristic 'star-sentiment' baseline to compare their results to in Table 3 but do not offer a similar baseline for Table 2."
- Baseline technique: count as positive examples those that have a highly negative star rating but lots of positive words
- Is this really a strong baseline? Look at the examples from the paper.

Evaluation
- What do you conclude from this?
- What surprises you?

Revisit: Overview of Approach
- Start with a small amount of labeled data
- Generate patterns from the examples
  - Select those that appear in the training data more than once and don't appear in both a 1-star and a 5-star labeled example
- Expand the data through search, using examples from the labeled data as queries (take the top 50 snippet results)
- Represent the data in terms of templatized patterns
- Use a modified kNN classification approach

How could you do this with SIDE?
1. Build a feature extractor to generate the set of patterns
2. Use search to set up the expanded set of data
3. Apply the generated patterns to the expanded set of data
4. Use kNN classification

What would it take to achieve inter-rater reliability?
- You can find definitions and examples on the website, just like in the book, but it's not enough…
- Strategies (are there distinctions that don't buy us much anyway?):
  - Add constraints
  - Identify borderline cases
  - Use decision trees
  - Simplify

What would it take to achieve inter-rater reliability?
- Look at Beka and Elijah's analyses in comparison with mine
- What were our big disagreements? How would we resolve them?

Questions?
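For the Modified kNN slide above, here is a minimal sketch of one way to read "weighted average so that majority-class matches count more": neighbors are weighted by their similarity to the test vector, and neighbors whose label matches the majority label in the neighborhood get extra weight. The cosine similarity, the default k=5, the 2x majority boost, and the toy data are my own illustrative assumptions and not necessarily the paper's exact weighting scheme.

    import math
    from collections import Counter

    def cosine_sim(u, v):
        """Cosine similarity between two equal-length feature vectors."""
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def weighted_knn_predict(x, train_vectors, train_labels, k=5):
        """Predict a numeric label (e.g. a 1-5 score) for x as a
        similarity-weighted average of its k nearest neighbors' labels,
        up-weighting neighbors that agree with the neighborhood's majority label."""
        neighbors = sorted(
            ((cosine_sim(x, v), y) for v, y in zip(train_vectors, train_labels)),
            key=lambda pair: pair[0],
            reverse=True,
        )[:k]
        majority = Counter(y for _, y in neighbors).most_common(1)[0][0]
        num = den = 0.0
        for sim, y in neighbors:
            weight = sim * (2.0 if y == majority else 1.0)  # majority-class matches count more
            num += weight * y
            den += weight
        return num / den if den else float(majority)

    # Toy usage: three labeled pattern/punctuation feature vectors with 1-5 labels
    train_vectors = [[1, 0, 2], [0, 1, 0], [1, 1, 1]]
    train_labels = [5, 1, 5]
    print(weighted_knn_predict([1, 0, 1], train_vectors, train_labels, k=3))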