Natural Language Processing
COMPSCI 423/723
Rohit Kate

Discourse Processing
Reference: Jurafsky & Martin book, Chapter 21

Basic Steps of Natural Language Processing
Sound waves → (Phonetics) → Words → (Syntactic processing) → Parses → (Semantic processing) → Meaning → (Discourse processing) → Meaning in context
• This is a conceptual pipeline; humans or computers may process multiple stages simultaneously

Discourse
• So far we have always analyzed one sentence in isolation, syntactically and/or semantically
• Natural languages are spoken or written as a collection of sentences
• In general, a sentence cannot be understood in isolation:
– Today was Jack's birthday. Penny and Janet went to the store. They were going to get presents. Janet decided to get a kite. "Don't do that," said Penny. "Jack has a kite. He will make you take it back."
• Discourse is a coherent, structured group of sentences
– Examples: monologues (including reading passages), dialogues
• Very little work has been done on understanding beyond a sentence, i.e.
understanding a whole paragraph or an entire document together
• Important tasks in processing a discourse
– Discourse Segmentation
– Determining Coherence Relations
– Anaphora Resolution
• Ideally, deep understanding is needed to do well on these tasks, but so far shallow methods have been used

Discourse Segmentation
• Discourse Segmentation: Separating a document into a linear sequence of subtopics
– For example: scientific articles are segmented into Abstract, Introduction, Methods, Results, Conclusions
– This is often a simplification of a higher-level structure of a discourse
• Applications of automatic discourse segmentation:
– For Summarization: Summarize each segment separately
– For Information Retrieval or Information Extraction: Apply to an appropriate segment
• Related task: Paragraph segmentation, for example of a speech transcript

Unsupervised Discourse Segmentation
• Given raw text, segment it into multi-paragraph subtopics
• Unsupervised: No training data is given for the task
• Cohesion-based approach: Segment into subtopics in which sentences/paragraphs are cohesive with each other; a dip in cohesion marks a subtopic boundary

Cohesion
• Cohesion: Links between text units due to linguistic devices
• Lexical Cohesion: Use of the same or similar words to link text units
– Today was Jack's birthday. Penny and Janet went to the store. They were going to get presents. Janet decided to get a kite. "Don't do that," said Penny. "Jack has a kite.
He will make you take it back."
• Non-lexical Cohesion: For example, using anaphora

Cohesion-based Unsupervised Discourse Segmentation
• TextTiling algorithm (Hearst, 1997)
– compare adjacent blocks of text
– look for shifts in vocabulary
• Do pre-processing: tokenization, stop-word removal, stemming
• Divide text into pseudo-sentences of equal length (say 20 words)

TextTiling Algorithm contd.
• Compute a lexical cohesion score at each gap between pseudo-sentences
• Lexical cohesion score: Similarity of the words before and after the gap (take, say, 10 pseudo-sentences before and 10 pseudo-sentences after)
• Similarity: Cosine similarity between the word-count vectors of the two blocks (high if the blocks share many words)

TextTiling Algorithm contd.
• Plot the similarity at each gap and compute the depth score of each “similarity valley”, (a-b)+(c-b), where b is the similarity at the bottom of the valley and a and c are the peaks on either side
• Assign a segment boundary if the depth score is larger than a threshold (e.g. one standard deviation deeper than the mean valley depth)

TextTiling Algorithm contd.
[Figure from (Hearst, 1994)]

Supervised Discourse Segmentation
• Easy to get supervised data for some segmentation tasks
– For example, paragraph segmentation
– Useful for finding paragraphs in speech recognition output
• Model as a classification task: Classify whether each sentence boundary is a paragraph boundary
– Use any classifier: SVM, Naïve Bayes, Maximum Entropy, etc.
• Or model as a sequence labeling task: Label each sentence boundary as “paragraph boundary” or “not a paragraph boundary”

Supervised Discourse Segmentation
• Features:
– Cohesion features: word overlap, word cosine similarity, anaphora, etc.
– Additional features: discourse markers or cue words
• Discourse marker or cue phrase/word: A word or phrase that signals discourse structure
– For example, “good evening”, “joining us now” in broadcast news
– “Coming up next” at the end of a segment, “Company Incorporated” at the beginning of a segment, etc.
– Either hand-code them or determine them automatically by feature selection

Discourse Segmentation Evaluation
• Not a good idea to measure precision, recall and F-measure because they are not sensitive to near misses
• One good metric: WindowDiff (Pevzner & Hearst, 2002)
• Slide a window of length k across the reference (correct) and the hypothesized segmentations and count the number of segmentation boundaries in each
• WindowDiff metric: The fraction of window positions at which the two boundary counts differ (0 is a perfect match)

Text Coherence
• A collection of independent sentences does not make a discourse because it lacks coherence
• Coherence: Meaning relation between two units of text; explains how the meanings of different units of text combine to build the meaning of the larger unit (in contrast, cohesion is links between units)
– Explanation: John hid Bill’s car keys. He was drunk.
– ???: John hid Bill’s car keys. He likes spinach.
• Humans try to find coherence between sentences all the time

Coherence Relations
• Coherence Relations: Set of connections between units in a discourse
• A few more such relations, Hobbs (1979):
– Result: The Tin Woodman was caught in the rain. His joints rusted.
– Parallel: The scarecrow wanted some brains. The Tin Woodman wanted a heart.
– Elaboration: Dorothy was from Kansas. She lived in the midst of the great Kansas prairies.
– Occasion: Dorothy picked up the oil-can. She oiled the Tin Woodman’s joints.

Discourse Structure
• Discourse Structure: The hierarchical structure of a discourse according to the coherence relations
– John went to the bank to deposit his paycheck. He then took a train to Bill’s car dealership. He needed to buy a car. The company he works for now isn’t near any public transportation. He also wanted to talk to Bill about their softball league.
• The same passage as a tree of coherence relations:
Occasion
  John went to the bank to deposit his paycheck.
  Explanation
    He then took a train to Bill’s car dealership.
    Parallel
      Explanation
        He needed to buy a car.
        The company he works for now isn’t near any public transportation.
      He also wanted to talk to Bill about their softball league.
• Analogous to syntactic tree structure
• A node in the tree represents a set of locally coherent sentences: a discourse segment (hierarchical, not linear)

Discourse Structure
• What are the uses of discourse structure?
– Summarization systems may skip or merge segments connected by an Elaboration relation
– Question-answering systems can search in segments with Explanation relations
– Information extraction systems need not merge information from segments not linked by relations
– A semantic parser may build a larger meaning representation of the whole discourse

Discourse Parsing
• Coherence Relation Assignment: Automatically determining the coherence relations between units of a discourse
• Discourse Parsing: Automatically finding the discourse structure of an entire discourse
• Both are largely unsolved problems, but some shallow methods work to some degree, for example, using cue phrases (or discourse markers)

Automatic Coherence Assignment
Shallow cue-phrase-based algorithm:
1. Identify cue phrases in a text
2. Segment text into discourse segments using cue phrases
3. Assign coherence relations between consecutive discourse segments

1. Identify Cue Phrases
• Phrases that signal discourse structure, e.g. “joining us now”, “coming up next”, etc.
• Connectives: “because”, “although”, “example”, “with”, “and”
• However, their occurrence is not always indicative of a discourse relation: they are ambiguous
– With its distant orbit, Mars exhibits frigid weather conditions
– We can see Mars with an ordinary telescope
• Use some simple heuristics, e.g. capitalization of “with”, but in general use techniques similar to word sense disambiguation

2.
Segment Text into Discourse Segments
• Segments are usually sentences, so sentence segmentation may suffice
• However, clauses are often more appropriate
– With its distant orbit, Mars exhibits frigid weather conditions (Explanation relation between the two clauses)
• Use hand-written rules or utilize syntactic parses to get such segments

3. Classify Relation between Neighboring Segments
• Use rules based on the cue phrases and connectives
– For example, a sentence beginning with “Because” indicates an Explanation relation with the next segment
• Train classifiers using appropriate features

Drawback of Cue-phrase-based Algorithm
• Sometimes relations are not signaled by cue phrases but are implicit through syntax, words, negation, etc.:
– Contrast: I don’t want a truck. I’d prefer a convertible.
• Difficult to encode such rules manually or to get labeled training examples
• One solution: Automatically find easy examples with cue phrases, then remove the cue phrases to generate difficult supervised training examples
– With cue phrase: I don’t want a truck although I’d prefer a convertible.
– Cue phrase removed: I don’t want a truck. I’d prefer a convertible.
• Train using words, word pairs, POS tags, etc.
as features

Penn Discourse Treebank
• Recently released corpus that is likely to lead to better systems for discourse processing
• Encodes the coherence relations associated with discourse connectives
• Linked to the Penn Treebank
• http://www.seas.upenn.edu/~pdtb/

Reference Resolution
• Reference Resolution: The task of determining what entities are referred to by which linguistic expressions
• To understand any discourse it is necessary to know which entities are being talked about at which point
– Mr. Obama visited the city. The president talked about Milwaukee’s economy. He mentioned new jobs.
– “Mr. Obama”, “The president” and “He” are referring expressions for the referent “Barack Obama”, and they corefer
– Anaphora: When a referring expression refers to a previously introduced entity (antecedent), the referring expression is called anaphoric, e.g. “The president”, “He”
– Cataphora: When a referring expression refers to an entity which is introduced later, the referring expression is called cataphoric, e.g. “the city”

Two Reference Resolution Tasks
• Coreference Resolution: The task of finding referring expressions that refer to the same entity, i.e. finding coreference chains
– In the previous example the coreference chains are: {Mr. Obama, The president, He}, {the city, Milwaukee’s}
• Pronominal Anaphora Resolution: The task of finding the antecedent for a single pronoun
– In the previous example, “He” refers to “Mr. Obama”
• A lot of work has been done on these tasks in the last 15 or so years [Ng, 2010]

Supervised Pronominal Anaphora Resolution
• Given a pronoun and an entity mentioned earlier, classify whether the pronoun refers to that entity or not, given the surrounding context
– Mr. Obama visited the city. The president talked about Milwaukee’s economy. He mentioned new jobs. (Which earlier entity does “He” refer to?)
• First filter out pleonastic pronouns, like the “It” in “It is raining.”, using hand-written rules
• Use any classifier; obtain positive examples from training data, and generate negative examples by pairing each pronoun with other (incorrect) entities

Features for Pronominal Anaphora Resolution
• Constraints:
– Number agreement
  • Singular pronouns (it/he/she/his/her/him) refer to singular entities and plural pronouns (we/they/us/them) refer to plural entities
– Person agreement
  • He/she/they etc. must refer to a third-person entity
– Gender agreement
  • He -> John; she -> Mary; it -> car
– Certain syntactic constraints
  • John bought himself a new car. [himself -> John]
  • John bought him a new car. [him cannot be John]
• Preferences:
– Recency: More recently mentioned entities are more likely to be referred to
  • John went to a movie. Jack went as well. He was not busy.
– Grammatical Role: Entities in the subject position are more likely to be referred to than entities in the object position
  • John went to a movie with Jack. He was not busy.
– Parallelism:
  • John went with Jack to a movie. Joe went with him to a bar.
– Verb Semantics: Certain verbs seem to bias whether subsequent pronouns refer to their subjects or their objects
  • John telephoned Bill. He lost the laptop.
  • John criticized Bill. He lost the laptop.
– Selectional Restrictions: Restrictions because of semantics
  • John parked his car in the garage after driving it around for hours.
• Encode all these, and maybe more, as features

Coreference Resolution
• Can be done analogously to pronominal anaphora resolution: Given an anaphor and a potential antecedent, classify as true or false
• Some approaches also do clustering on the referring expressions instead of doing binary classification
• Additional features to incorporate aliases, variations in names, etc., e.g. Mr.
Obama, Barack Obama; Megabucks, Megabucks Inc.
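
Example sketch: pronoun-antecedent pairs as classifier input
The supervised setup for pronominal anaphora resolution (pair a pronoun with each earlier entity, label the pair, and describe it with agreement and preference features) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `PRONOUNS` lexicon, the `Mention` fields, and the example sentence are hypothetical, and a real system would derive them from a parser and a labeled corpus before passing the feature dictionaries to any classifier.

```python
from dataclasses import dataclass

# Hypothetical mini-lexicon: pronoun -> (number, gender). Illustrative only.
PRONOUNS = {
    "he": ("sg", "masc"), "him": ("sg", "masc"), "his": ("sg", "masc"),
    "she": ("sg", "fem"), "her": ("sg", "fem"),
    "it": ("sg", "neut"),
    "they": ("pl", None), "them": ("pl", None),
}


@dataclass(frozen=True)
class Mention:
    text: str
    sentence: int      # index of the sentence containing the mention
    number: str        # "sg" or "pl"
    gender: str        # "masc", "fem", or "neut"
    is_subject: bool   # grammatical-role preference feature


def pair_features(pronoun, pronoun_sentence, candidate):
    """Constraint and preference features for one (pronoun, candidate) pair."""
    number, gender = PRONOUNS[pronoun.lower()]
    return {
        "number_agree": number == candidate.number,
        "gender_agree": gender is None or gender == candidate.gender,
        "sentence_distance": pronoun_sentence - candidate.sentence,  # recency
        "candidate_is_subject": candidate.is_subject,
    }


def training_pairs(pronoun, pronoun_sentence, candidates, gold):
    """One (features, label) pair per candidate; only the gold antecedent is positive."""
    return [(pair_features(pronoun, pronoun_sentence, c), c == gold)
            for c in candidates]


# Made-up example: "John went to a movie with Mary. She was not busy."
john = Mention("John", 0, "sg", "masc", True)
mary = Mention("Mary", 0, "sg", "fem", False)
movie = Mention("a movie", 0, "sg", "neut", False)
pairs = training_pairs("She", 1, [john, mary, movie], gold=mary)
```

Here the gender-agreement feature alone separates the positive pair from the two negatives; in harder cases (e.g. two male candidates) the recency and grammatical-role features carry the decision.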
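
Example sketch: TextTiling-style segmentation
Returning to the TextTiling steps described earlier (pseudo-sentences, cosine similarity at each gap, depth scores (a-b)+(c-b), and a threshold of one standard deviation above the mean depth), a minimal sketch might look like this. It is not Hearst's implementation: the stop-word list is a tiny placeholder, stemming is skipped, and all function names are chosen here for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use a fuller list).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "was", "it", "that"}


def pseudo_sentences(text, size=20):
    """Tokenize, drop stop words, and cut into fixed-size token blocks."""
    tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOP_WORDS]
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def cosine(c1, c2):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm = (math.sqrt(sum(v * v for v in c1.values()))
            * math.sqrt(sum(v * v for v in c2.values())))
    return dot / norm if norm else 0.0


def gap_similarities(blocks, k=10):
    """Similarity at each gap, comparing up to k blocks before and after it."""
    sims = []
    for gap in range(1, len(blocks)):
        before = Counter(w for b in blocks[max(0, gap - k):gap] for w in b)
        after = Counter(w for b in blocks[gap:gap + k] for w in b)
        sims.append(cosine(before, after))
    return sims


def depth_scores(sims):
    """Depth (a - b) + (c - b): climb to the nearest peak on each side of b."""
    depths = []
    for i, b in enumerate(sims):
        left = i
        while left > 0 and sims[left - 1] >= sims[left]:
            left -= 1
        right = i
        while right < len(sims) - 1 and sims[right + 1] >= sims[right]:
            right += 1
        depths.append((sims[left] - b) + (sims[right] - b))
    return depths


def boundaries(sims):
    """Indices of gaps whose depth exceeds mean + one standard deviation."""
    depths = depth_scores(sims)
    if not depths:
        return []
    mean = sum(depths) / len(depths)
    std = math.sqrt(sum((d - mean) ** 2 for d in depths) / len(depths))
    return [i for i, d in enumerate(depths) if d > mean + std]
```

A typical call chain would be `boundaries(gap_similarities(pseudo_sentences(text)))`; index i in the result refers to the gap between pseudo-sentences i and i+1.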
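
Example sketch: WindowDiff
The WindowDiff evaluation for discourse segmentation mentioned earlier can also be sketched in a few lines. The version below follows the usual reading of Pevzner & Hearst (2002), in which a window counts as an error whenever the reference and hypothesis contain different numbers of boundaries. The 0/1 gap encoding and the normalization by the number of window positions are assumptions of this sketch, not details taken from the slides.

```python
def window_diff(reference, hypothesis, k):
    """Fraction of length-k windows whose boundary counts differ.

    Each segmentation is a list of 0/1 flags, one per gap between
    adjacent units (1 = boundary at that gap).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must cover the same number of gaps")
    n_windows = len(reference) - k + 1
    if k < 1 or n_windows < 1:
        raise ValueError("window size k must be between 1 and the number of gaps")
    errors = 0
    for i in range(n_windows):
        ref_count = sum(reference[i:i + k])
        hyp_count = sum(hypothesis[i:i + k])
        if ref_count != hyp_count:
            errors += 1
    return errors / n_windows


# A near miss is penalized less than a boundary placed far away.
ref = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
near = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
far = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
near_score = window_diff(ref, near, k=3)
far_score = window_diff(ref, far, k=3)
```

Lower is better: an identical segmentation scores 0, and the near miss above scores lower than the far miss, which is the sensitivity that precision and recall lack.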