CSA3180: Natural Language Processing
Information Extraction 2
• Named Entities
• Question Answering
• Anaphora Resolution
• Co-Reference

Introduction
• Slides partially based on a talk by Lucian Vlad Lita
• Sheffield GATE multilingual extraction slides based on Diana Maynard's talks
• Anaphora resolution slides based on Dan Cristea's slides, with additional input from Gabriela-Eugenia Dima, Oana Postolache and Georgiana Puşcaşu

References
• FASTUS system documentation
• Robert Gaizauskas, "IE Perspective on Text Mining"
• Daniel Bikel's "Nymble: A High-Performance Learning Name-Finder"
• Helena Ahonen-Myka's notes on FSTs
• Javelin system documentation
• MUC-7 overview and results

Named Entities
• Person name: Colin Powell, Frodo
• Location name: Middle East, Aiur
• Organization: UN, DARPA
• Domain-specific vs. open-domain recognition

Nymble (BBN Corporation)
• State-of-the-art system
• Near-human performance: ~90% accuracy
• Statistical system
• Approach: Hidden Markov Model (HMM)

Nymble (BBN Corporation)
• Noisy channel paradigm
• Originally, entities were marked in the raw text
• After passing through the noisy channel, the annotation is lost
• Find the most likely sequence of name classes (NC) given a sequence of words (W):
  Pr(NC|W) = Pr(W,NC) / Pr(W)
  Since the a priori probability of the word sequence can be considered constant for any given sentence, it is enough to maximize the numerator Pr(W,NC).

Nymble (BBN Corporation)
[State diagram: Start of Sentence → {Person, Organization, five other name classes, Not-A-Name} → End of Sentence]

Automatic Content Extraction
• DARPA ACE Program
• Identify entities
  – Named: Bilbo, San Diego, UNICEF
  – Nominal: the president, the hobbit
  – Pronominal: she
• Reference resolution
  – Clinton → the president → he

Question Answering
The over-used pipeline paradigm:
Question → Question Analysis → Information Retrieval → Answer Extraction → Answer Merging → Answer

Question Answering
• Feedback loops can be present, for constraint relaxation purposes
• Not all QA systems adhere to the pipeline architecture
• Question answering flavors
  – Factoid vs. complex: "Who invented paper?" vs. "Which of Mr. Bush's friends are Black Sabbath fans?"
  – Closed vs. open domain

Answer Extraction
The over-used pipeline paradigm:
Question → Question Analysis → Information Retrieval → Answer Extraction → Answer Merging → Answer
• Focus here: open-domain, factoid question answering

Practical Issues
• Web spell checking
  – Mispling, nucular
• Infrequent forms
  – Niagra vs. Niagara, Filenes vs. Filene's
• Google QA
• Genome, video, games

Practical Issues
• Traditional information extraction
• Either expert-built or statistical
• Specific strategies for specific question types
• Person-bio vs. location question types
• Ability to generalize to new questions and new question types

Practical Issues
• Who invented Blah?
  – Blah was invented by PersonName
  – Blah was Verb by PersonName, where Verb is a synonym of invented
  – Blah VerbPhrase by PersonName
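The answer patterns above translate directly into a surface-pattern matcher. Below is a minimal Python sketch, assuming a small hand-written pattern list; the synonym set for "invented" is an illustrative stand-in for a WordNet-style expansion, and the PersonName capture is just a capitalised-word heuristic, not a real NE tagger.

```python
import re

# Hand-written surface patterns for "Who invented X?" questions.
# The synonym list is an assumption standing in for a WordNet expansion.
INVENT_PATTERNS = [
    r"(?i:{x})\s+was\s+(?:invented|created|devised|discovered)\s+by\s+"
    r"(?P<answer>[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)",
    r"(?P<answer>[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s+invented\s+(?i:{x})",
]

def extract_inventors(keyword, passages):
    """Return candidate PersonName strings matched by the surface patterns."""
    candidates = []
    for pattern in INVENT_PATTERNS:
        regex = re.compile(pattern.format(x=re.escape(keyword)))
        for passage in passages:
            for match in regex.finditer(passage):
                candidates.append(match.group("answer"))
    return candidates

print(extract_inventors("paper",
      ["Paper was invented by Cai Lun during the Han dynasty."]))
# ['Cai Lun']
```

A real system would apply such patterns to passages returned by the retrieval stage and then pass the candidates on to answer merging.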
Popular Resources
• Experts and/or learning algorithms
• Gazetteers
• NE taggers
• Part-of-speech taggers
• Parsers
• WordNet
• Stopword list
• Stemmer

Javelin Answer Extraction
• Statistically based answer extraction (IX)
  – Decision tree
  – KNN
• XML wrapper
• Simple features
  – POS
  – NE tagger
  – Lexical items

Statistical IX Input
• Raw question: Where is Frodo from?
• Analyzed question
  – KEYWORD: Frodo
  – ATYPE: Location
  – QTYPE: WhereIsFrom
• Relevant document set
  – NewYorkTimes 203214
  – AssociatedPress 273545

Statistical IX Output
• Set of candidate answers
• Corresponding passages
• Confidence scores
  – PASSAGE 1: Frodo grew up in Pittsburgh working in the steel mills …
  – ANSWER 1: Pittsburgh
  – CONFIDENCE 1: .0924
  – PASSAGE 2: …she met Frodo, the French chef, on Tech Street …
  – ANSWER 2: French
  – CONFIDENCE 2: .0493

IX Decision Tree
[Decision tree diagram: tests include "Relevant verb present?" (yes/no), "Average distance between QTerms and ATerms" (compared against 7) and "More than 50% of QTerms present?" (yes/no)]

IX Training Data
• Positive examples – some correct question and answer pairs
• Negative examples?

IX Training Data
• Positive examples – some correct question and answer pairs
• Negative examples: all other sentences that do not contain the answer to the question
  – Q: In what state can you find "Stop Except When Right Turn" signs?
  – A: Pennsylvania has SEWRD traffic signs.

IX Training Data
• Positive examples – some correct question and answer pairs
• Negative examples: all other sentences that do not contain the answer to the question
• Also: sentences that contain the answer but do not actually answer the question
  – Q: In what state can you find "Stop Except Right Turn" signs?
  – A: The New York born driver resented SERT traffic signs.

Text Quality and Comparisons
• Sentence splitting
• Word casing
• Distortion: negation, opinion, past event, ordering
• Relaxation
• Ambiguity

Text Quality and Comparisons
• Sentence splitting
  – End-of-sentence markers: period, exclamation mark, question mark
  – Ambiguities: Calif., Mr.
  – Deeper ambiguities: No., A.
  – Even assuming abbreviations are detected:
    I'd like to live in Southern Calif. where it never rains.
    I'd like to live in Southern Calif. It never rains there.
  – Rule-based sentence splitters and statistical models
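A minimal sketch of the rule-based side of sentence splitting, assuming a tiny hand-made abbreviation list (a real splitter would use a much larger, domain-tuned list or a trained model). For ambiguous abbreviations like "Calif." it simply checks whether the next token is capitalised, which is exactly the kind of guess that statistical splitters try to improve on.

```python
# Abbreviations that (almost) never end a sentence, and ones that may or may not.
# Both lists are illustrative, not exhaustive.
NON_TERMINAL = {"mr.", "mrs.", "dr.", "prof.", "vs.", "e.g.", "i.e."}
AMBIGUOUS = {"calif.", "no.", "a.", "inc."}

def split_sentences(text):
    """Whitespace-tokenised, rule-based sentence splitting."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if tok[-1] not in ".!?":
            continue
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok.lower() in NON_TERMINAL:
            continue                                  # "Mr." never closes a sentence here
        if tok.lower() in AMBIGUOUS and not nxt[:1].isupper():
            continue                                  # "Calif. where ..." stays in one sentence
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("I'd like to live in Southern Calif. where it never rains."))
print(split_sentences("I'd like to live in Southern Calif. It never rains there."))
# ["I'd like to live in Southern Calif. where it never rains."]
# ["I'd like to live in Southern Calif.", 'It never rains there.']
```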
Text Quality and Comparisons
• Word casing
  – News stories – news source and style
    … Prime Minister Blair …
    … prime minister Blair …
  – Headlines, titles, teasers
    Six Fired In New York Q-Mart Scandal
  – Broadcast news transcription errors
    President George bush has …

Text Quality and Comparisons
• Distortions
  – Negation
    Frodo was not skiing in the Aspens last winter.
    Common Misconceptions about Warcraft III
  – Opinion
    I really believe that Santa Claus exists.
  – Past event
    A long time ago, people did indeed live in caves.
  – Ordinal numerals
    Aretha Franklin was the third woman to invent the soul.

Text Quality and Comparisons
• Relaxation
  – Reason
    Not enough documents
    No answer found
  – Method
    Query expansion
    Synonymy-based pattern expansion

Text Quality and Comparisons
• Relaxation
  – Pros: invent can be extended to create and discover
  – Cons: invent can be extended to Martha Stewart if enough documents say that she re-invented herself

Text Quality and Comparisons
• Ambiguity
  – Word level
    Raptors can be found in Pittsburgh on Forbes Ave. (velociraptors? motorcycles?)
  – Reference resolution
    Frodo blah blah Sam blah blah, who blah blah

Multi-Source and Multi-Lingual IE
• With traditional query engines, getting at the facts can be hard and slow
  – Where has the Queen visited in the last year?
  – Which places on the East Coast of the US have had cases of West Nile Virus?
• Constructing a database through IE and linking it back to the documents can provide a valuable alternative search tool
• Even if the results are not always accurate, they can be valuable if linked back to the original text

Multi-Source and Multi-Lingual IE
• For access to news: identify major relations and event types (e.g. within foreign affairs or business news)
• For access to scientific reports: identify the principal relations of a scientific subfield (e.g. pharmacology, genomics)

Application Example – KIM
[Screenshot: Ontotext's KIM query and results]

Application Example – GATE
[Screenshot: GATE]

Complex Problems in NE
• Issues of style, structure, domain, genre etc.
• Punctuation, spelling, spacing, formatting
  Dept. of Computing and Maths
  Manchester Metropolitan University
  Manchester
  United Kingdom

  > Tell me more about Leonardo
  > Da Vinci

Approaches
Knowledge engineering
• rule based
• developed by experienced language engineers
• make use of human intuition
• require only a small amount of training data
• development can be very time consuming
• some changes may be hard to accommodate
Learning systems
• use statistics or other machine learning
• developers do not need LE expertise
• require large amounts of annotated training data
• some changes may require re-annotation of the entire training corpus

Shallow Parsing
• Internal evidence – names often have internal structure; these components can be either stored or guessed, e.g.
  Location: Cap. Word + {City, Forest, Center, River}, e.g. Sherwood Forest
            Cap. Word + {Street, Boulevard, Avenue, Crescent, Road}, e.g. Portobello Street

Shallow Parsing
• Ambiguously capitalised words (first word in sentence)
  [All American Bank] vs. All [State Police]
• Semantic ambiguity
  "John F. Kennedy" = airport (location)
  "Philip Morris" = organisation
• Structural ambiguity
  [Cable and Wireless] vs. [Microsoft] and [Dell]
  [Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith]
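The internal-evidence patterns above are easy to operationalise. A minimal sketch, assuming a whitespace-tokenised input and a toy trigger-word list rather than a full gazetteer:

```python
# Illustrative "internal evidence" rules for guessing location names,
# in the spirit of the patterns above (Cap. Word + trigger word).
LOCATION_TRIGGERS = {"City", "Forest", "Center", "River",
                     "Street", "Boulevard", "Avenue", "Crescent", "Road"}

def guess_locations(tokens):
    """Return two-word spans 'Xxxx Trigger' that look like location names."""
    guesses = []
    for first, second in zip(tokens, tokens[1:]):
        if first[:1].isupper() and first[1:].islower() and second in LOCATION_TRIGGERS:
            guesses.append(f"{first} {second}")
    return guesses

print(guess_locations("They met in Sherwood Forest near Portobello Street .".split()))
# ['Sherwood Forest', 'Portobello Street']
```

The same skeleton extends to the ambiguous cases on the last slide above, which is where the context-based patterns of the next slides come in.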
Shallow Parsing + Context
• Use of context-based patterns is helpful in ambiguous cases
• "David Walton" and "Goldman Sachs" are indistinguishable in isolation
• But with the phrase "David Walton of Goldman Sachs" and the Person entity "David Walton" recognised, we can use the pattern "[Person] of [Organization]" to identify "Goldman Sachs" correctly

Shallow Parsing + Context
• Use a KWIC index and concordancer to find windows of context around entities
• Search for repeated contextual patterns of either strings, other entities, or both
• Manually post-edit the list of patterns, and incorporate useful patterns into new rules
• Repeat with the new entities

Context Patterns
• [PERSON] earns [MONEY]
• [PERSON] joined [ORGANIZATION]
• [PERSON] left [ORGANIZATION]
• [PERSON] joined [ORGANIZATION] as [JOBTITLE]
• [ORGANIZATION]'s [JOBTITLE] [PERSON]
• [ORGANIZATION] [JOBTITLE] [PERSON]
• the [ORGANIZATION] [JOBTITLE]
• part of the [ORGANIZATION]
• [ORGANIZATION] headquarters in [LOCATION]
• price of [ORGANIZATION]
• sale of [ORGANIZATION]
• investors in [ORGANIZATION]
• [ORGANIZATION] is worth [MONEY]
• [JOBTITLE] [PERSON]
• [PERSON], [JOBTITLE]

Context Patterns
• Patterns are only indicators based on likelihood
• Can set priorities based on frequency thresholds
• Need training data for each domain
• More semantic information would be useful (e.g. to cluster groups of verbs)

Case Study: MUSE
• MUSE: MUlti-Source Entity Recognition
• An IE system developed within GATE
• Performs NE recognition and coreference on different text types and genres
• Uses a knowledge engineering approach with hand-crafted rules
• Performance rivals that of machine learning methods
• Easily adaptable

MUSE Modules
• Document format and genre analysis
• Tokenisation
• Sentence splitting
• POS tagging
• Gazetteer lookup
• Semantic grammar
• Orthographic coreference
• Nominal and pronominal coreference

Switching Controller
• Rather than have a fixed chain of processing resources, choices can be made automatically about which modules to use
• Texts are analysed for certain identifying features which are used to trigger different modules
• For example, texts with no case information may need a different POS tagger or different gazetteer lists
• Not all modules are language-dependent, so some can be reused directly

Multilingual MUSE
• MUSE has been adapted to deal with different languages
• Currently systems for English, French, German, Romanian, Bulgarian, Russian, Cebuano, Hindi, Chinese, Arabic
• Separation of language-dependent and language-independent modules and submodules
• Annotation projection experiments

IE in Surprise Languages
• Adaptation to an unknown language in a very short timespan
• Cebuano:
  – Latin script, capitalisation, words are spaced
  – Few resources and little work already done
  – Medium difficulty
• Hindi:
  – Non-Latin script, different encodings used, no capitalisation, words are spaced
  – Many resources available
  – Medium difficulty

Multilingual IE Requirements
• Extensive support for non-Latin scripts and text encodings, including conversion utilities
  – Automatic recognition of encoding
  – Occupied up to 2/3 of the TIDES Hindi effort
• Bilingual dictionaries
• Annotated corpus for evaluation
• Internet resources for gazetteer list collection (e.g. phone books, yellow pages, bilingual pages)
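Automatic encoding recognition, the first requirement above, is routinely handled with an off-the-shelf charset detector. A minimal sketch using the third-party chardet library as one example (any equivalent detector would do; the fallback encoding is an assumption for illustration):

```python
import chardet  # third-party charset detector; install with: pip install chardet

def read_with_detected_encoding(path):
    """Read a file whose encoding is unknown, guessing the charset first."""
    raw = open(path, "rb").read()
    guess = chardet.detect(raw)                 # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
    encoding = guess["encoding"] or "utf-8"     # fall back if detection fails
    return raw.decode(encoding, errors="replace")
```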
Multilingual Data Editing
• GATE Unicode Kit (GUK) complements Java's facilities
• Support for defining Input Methods (IMs)
• Currently 30 IMs for 17 languages
• Pluggable into other applications (e.g. JEdit)

Multilingual IE Processing
• All processing, visualisation and editing tools use GUK

Anaphora Resolution
[Diagram of the development loop: unprocessed text goes both to an annotation tool, producing a gold-standard AR annotation, and to the AR engine, producing an automatically AR-annotated text; the two are compared and evaluated, and the results drive fine-tuning of the engine]

Anaphora Resolution
• Text:
  – Nature of discourse
  – Anaphoric phenomena
• Anaphora resolution engines:
  – Models
  – General AR frameworks
  – Knowledge sources

Anaphora Resolution
"Anaphora represents the relation between a 'proform' (called an 'anaphor') and another term (called an 'antecedent'), when the interpretation of the anaphor is in a certain way determined by the interpretation of the antecedent."
Barbara Lust, Introduction to Studies in the Acquisition of Anaphora, D. Reidel, 1986

Anaphora Example
"It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him." (Orwell, 1984)
(The slide marks antecedent–anaphor pairs in the passage, e.g. Winston Smith → his, him.)

• Pronouns (personal, demonstrative, ...)
  – Full pronouns
  – Clitics (RO: dă-mi-l, IT: dammelo)
• Nouns
  – Definite
  – Indefinite
• Adjectives, numerals (generally associated with an ellipsis)
  – In this the play is expressionist1 in its approach to theme.
  – But it is also so1 in its use of unfamiliar devices...

Referential Expressions
• Mark the noun phrases
• For each NP, ask a question about it
• Keep as REs those NPs that can be naturally referenced in the question
  The policeman got in the car in a hurry in order to catch the run-away thief.

Referential Expressions
a. John was going down the street looking for Bill's house.
b. He found it at the first corner.

Referential Expressions
a. John was going down the street looking for Bill's house.
b. He met him at the first corner.
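The first step above, marking the noun phrases as candidate referential expressions, can be sketched with a tiny POS-pattern chunker. The tagging below is supplied by hand for illustration (and the example sentence is abridged); a real system would run a POS tagger and a shallow parser first. The question test from the slide would then filter out NPs like "a hurry" that cannot be naturally referenced.

```python
# Collect spans matching (DT|PRP$)? JJ* NN+ as candidate referential expressions.
NP_START = {"DT", "PRP$"}          # determiners, possessive pronouns
NP_MOD = {"JJ"}                    # adjectives
NP_NOUN = {"NN", "NNS", "NNP"}     # nouns

def candidate_res(tagged):
    res, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] in NP_START:
            j += 1
        while j < len(tagged) and tagged[j][1] in NP_MOD:
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1] in NP_NOUN:
            k += 1
        if k > j:                              # at least one noun head
            res.append(" ".join(w for w, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return res

sentence = [("The", "DT"), ("policeman", "NN"), ("got", "VBD"), ("in", "IN"),
            ("the", "DT"), ("car", "NN"), ("in", "IN"), ("a", "DT"),
            ("hurry", "NN"), ("to", "TO"), ("catch", "VB"),
            ("the", "DT"), ("run-away", "JJ"), ("thief", "NN")]
print(candidate_res(sentence))
# ['The policeman', 'the car', 'a hurry', 'the run-away thief']
```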
Referential Expressions
The empty anaphor:
  Gianni diede una mela a Michele. Più tardi, gli diede un'arancia. [Not & Zancanaro, 1996]
  John gave an apple to Michelle. Later on, gave her an orange.

Textual Ellipsis
The functional (bridge) anaphora:
  The state of the accumulator is indicated to the user. 30 minutes before the complete discharge, the computer signals for 5 seconds. [Strube & Hahn, 1996]

Events, States, Descriptions
  He left without eating1. Because of this1, he was starving in the evening.
  But, he adds, Priestley is more interested in Johnson living than in Johnson dead1. In this1 the play is expressionist in its approach to theme. [Halliday & Hasan, 1976]

Definite/Indefinite NPs
  Once upon a time, there was a king and a queen. And the king one day went hunting.
  Apollo took out his bow...
  Take the elevator to the 4th floor.

Anaphora Resolution
• State of the art in anaphora resolution:
  – Identity: 65–80%
  – Other: much less…

What is so difficult?
Nothing – everything is so simple!
  John1 has just arrived. He1 seems tired.
  The girl1 leaves the trash on the table and wants to go away. The boy2 tries to hold her1 by the arm3; she1 escapes and runs; he2 calls her1 back. (Caragiale, At the Mansion)

What is so difficult?
Nothing indeed, but imagine letting the machine go wrong...
  There's a pile of inflammable trash next to your car. You'll have to get rid of it.
  If the baby does not thrive on the raw milk, boil it. [Hobbs, 1997]

What is so difficult?
Semantic restrictions:
  Jeff1 helped Dick2 wash the car. He1 washed the windows as Dick2 waxed the car. He1 soaped a pane.
  Jeff1 helped Dick2 wash the car. He1 washed the windows as Dick2 waxed the car. He2 buffed the hood. [Walker, Joshi & Prince, 1997]

What is so difficult?
Semantic correlates:
  An elephant1 hit the car with the trunk. The animal1 had to be taken away so as not to cause further damage.
  * An animal1 hit the car with the trunk. The elephant1 had to be taken away so as not to cause further damage.

What is so difficult?
Long-distance recovery (pronominalisation):
1. His re-entry into Hollywood came with the movie "Brainstorm", but its completion and release has been delayed by the death of co-star Natalie Wood.
2. He plays Hugh Hefner of Playboy magazine in Bob Fosse's "Star 80."
3. It's about Dorothy Stratton, the Playboy Playmate who was killed by her husband.
4. He also stars in the movie "Class."
(Los Angeles Times, July 18, 1983, cited in [Fox, 1986])

What is so difficult?
Gender mismatches:
  Mr. Chairman..., what is her position upon this issue? (political correctness!!)
Number mismatches:
  The government discussed ... They ...

What is so difficult?
Distributed antecedents:
  John1 invited Mary2 to the cinema. After the movie ended they3={1,2} went to a restaurant.

What is so difficult?
Empty/non-empty anaphors:
  John gave an apple to Michelle. Later on, gave her an orange.
  John gave an apple to Michelle. Later on, he gave her an orange.
  John gave an apple to Michelle. Later on, this one asks him for an orange.

Semantics are Essential
  Police ... They
  Teacher ... She/He
  A car ... The automobile
  A Mercedes ... The car
  A lamp ... The bulb

Semantics are not all
• Pronouns have poor semantic features:
  he    [+animate, +male, +singular]
  she   [+animate, +female, +singular]
  it    [+inanimate, +singular]
  they  [+plural]
• Gender in Romance languages:
  Ro. maşină = ea (feminine)
  Ro. automobil = el (masculine)
• Anaphora resolution by concord rules:
  Un camion a heurté une voiture. Celle-ci a été complètement détruite.
  (A truck hit a car. It was completely destroyed.)
  Celle-ci (feminine): gender match with une voiture, gender mismatch with un camion.
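Concord rules of this kind are typically implemented as a morphological agreement filter over candidate antecedents. A minimal sketch, with illustrative feature dictionaries standing in for a real lexicon:

```python
def agrees(anaphor, candidate):
    """A candidate survives if no shared morphological feature conflicts."""
    return all(candidate.get(k) == v for k, v in anaphor.items() if k in candidate)

def filter_candidates(anaphor_feats, candidates):
    return [name for name, feats in candidates.items() if agrees(anaphor_feats, feats)]

# Pronoun feature table from the slide above (illustrative values).
PRONOUNS = {
    "he":   {"animate": True,  "gender": "masc", "number": "sg"},
    "she":  {"animate": True,  "gender": "fem",  "number": "sg"},
    "it":   {"animate": False, "number": "sg"},
    "they": {"number": "pl"},
}

# The French concord example: "celle-ci" is feminine singular.
celle_ci = {"gender": "fem", "number": "sg"}
print(filter_candidates(celle_ci, {
    "un camion":   {"gender": "masc", "number": "sg"},
    "une voiture": {"gender": "fem",  "number": "sg"},
}))
# ['une voiture']

# The same filter with the English pronoun table:
print(filter_candidates(PRONOUNS["she"], {
    "John": {"animate": True, "gender": "masc", "number": "sg"},
    "Mary": {"animate": True, "gender": "fem",  "number": "sg"},
}))
# ['Mary']
```

As the gender- and number-mismatch slides show, such filters are only heuristics; a strict filter would wrongly discard "the government" as an antecedent of "they".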
Anaphora Resolution
[Charniak, 1972]: "In order to do AR, one has to be able to do everything else. Once everything else is done, AR comes for free."

Anaphora Resolution
Most current anaphora resolution systems implement a pipeline architecture with three modules:
• Collect: determines the list of potential antecedents (LPA) a1, a2, a3, … an for the referential expression under scrutiny
• Filter: eliminates from the LPA the candidates that are incompatible with the referential expression
• Preference: determines the most likely antecedent on the basis of an ordering policy

Anaphora Resolution Models
• [Hobbs, 1976] (pronominal anaphora)
  Naïve algorithm:
  – requires a surface parse tree
  – navigates the syntactic tree of the anaphor's sentence and of the preceding sentences, in order of recency, each tree in a left-to-right, breadth-first manner
  A semantic approach:
  – requires a semantic representation of the sentences (logical expressions)
  – a collection of semantic operations (inferences)
  – the type of pronoun is important

Anaphora Resolution Models
• [Lappin & Leass, 1994] (pronominal anaphora)
  – syntactic structures
  – intrasentential syntactic filtering
  – morphological filter (person, number, gender)
  – detection of pleonastic pronouns
  – salience parameters (grammatical role, parallelism of grammatical roles, frequency of mention, proximity, sentence recency)

Anaphora Resolution Models
• [Sidner, 1981], [Grosz & Sidner, 1986]
  – focus/attention based
  – give more salience to those semantic entities that are in focus
  – define where to look for an antecedent in the semantic structure of the preceding text (a stack in Grosz & Sidner's model)

AR Models: Centering
• [Grosz, Joshi, Weinstein, 1983, 1995]; [Brennan, Friedman and Pollard, 1987]
• Cf(u) = <e1, e2, ... ek> – an ordered list
• Cb(u) = ei
• Cp(u) = e1
• Transitions:
                     Cb(u) = Cb(u-1)    Cb(u) ≠ Cb(u-1)
  Cb(u) = Cp(u)      CONTINUING         SMOOTH SHIFT
  Cb(u) ≠ Cp(u)      RETAINING          ABRUPT SHIFT
• Preference: CON > RET > SSH > ASH

AR Models: Centering
a. I haven't seen Jeff for several days.
   Cf = (I=[I], [Jeff])     Cb = [I]
b. Carl thinks he's studying for his exams.
   Cf = ([Carl], he=[Jeff], [Jeff's exams])     Cb = [Jeff]
c. I think he? went to the Cape with Linda.
   [Grosz, Joshi & Weinstein, 1983]

AR Models: Centering
b. Carl thinks he's studying for his exams.
   Cf = ([Carl], he=[Jeff], [Jeff's exams])     Cb = [Jeff]
c. I think he? went to the Cape with Linda.
   If he = [Jeff]:  Cf = (I=[I], he=[Jeff], [the Cape], [Linda])   Cb = [Jeff]   RETAINING
   If he = [Carl]:  Cf = (I=[I], he=[Carl], [the Cape], [Linda])   Cb = [Carl]   ABRUPT SHIFT

Anaphora Resolution Models
• [Mitkov, 1998] – knowledge-poor approach
  – POS tagger, noun phrase rules
  – looks in the 2 previous sentences
  – antecedent indicators: definiteness, givenness, lexical reiteration, section heading preference, distance, terms of the field, etc.
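The knowledge-poor idea can be sketched as a simple salience scorer over candidates from the current and two previous sentences. The indicators and weights below are illustrative only, not Mitkov's actual indicator set or scores; the point is that cheap, parser-free evidence is summed and the best-scoring candidate wins.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    sentence_distance: int   # 0 = same sentence as the pronoun, 1, 2, ...
    definite: bool           # definite NP
    mentions: int            # lexical reiteration: times mentioned so far
    in_heading: bool         # appears in the section heading

def score(c: Candidate) -> float:
    s = 0.0
    s += {0: 2.0, 1: 1.0, 2: 0.5}.get(c.sentence_distance, 0.0)   # distance/recency
    s += 1.0 if c.definite else 0.0                                # definiteness
    s += 1.0 if c.mentions > 1 else 0.0                            # lexical reiteration
    s += 1.0 if c.in_heading else 0.0                              # section heading preference
    return s

def resolve(candidates):
    return max(candidates, key=score)

cands = [
    Candidate("the printer", sentence_distance=1, definite=True, mentions=3, in_heading=True),
    Candidate("a new cartridge", sentence_distance=0, definite=False, mentions=1, in_heading=False),
]
print(resolve(cands).text)   # 'the printer'
```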
General Framework
Build a framework capable of easily accommodating any of the existing AR models, fine-tune them, and experiment with them to enhance performance (learning), eventually obtaining a better model.

General Framework
[Diagram: the text is fed to an AR engine, which can be configured with any of several AR models (AR-model1, AR-model2, AR-model3)]

Co-References
• Halliday and Hasan: co-reference is a semantic relation, not a textual one
[Diagram: on the text layer, expressions a and b stand in a co-referential anaphoric relation; on the semantic layer, both a and b evoke the same center, center_a]

Time and Discourse
• Discourse has a dynamic nature
[Diagram: three time axes – real time, discourse time and story time – on which the same events (1, 2) can be ordered differently; the story-time axis carries dates such as 800, 920, 1000, 1030]

Resolution Moment
  Police officer David Cheshire went to Dillard's home. Putting his ear next to Dillard's head, Cheshire heard the music also. [Tanaka, 1999]
  (Mentions in order: Cheshire, Dillard, his, Dillard, Cheshire – at the moment "his" is read, both Cheshire and Dillard are still plausible antecedents.)

Resolution Delay
• Sanford and Garrod (1989)
  – initiation point
  – completion point
• Information is kept in a temporary location of memory

Cataphora – What is there?
• The element referred to is anticipated by the referring element
• Theories
  – scepticism
  – syntactic reality
  "From the corner of the divan of Persian saddle-bags on which he was lying, smoking, as was his custom, innumerable cigarettes, Lord Henry Wotton could just catch the gleam of the honey-sweet and honey-coloured blossoms of a laburnum…" (Oscar Wilde, The Picture of Dorian Gray)

No right reference needed in discourse processing
• Introduction of an empty discourse entity
• Addition of new features as the discourse unfolds
• Pronoun anticipation in Romanian
  I taught Gabriel to read. = Ro. L-am învăţat pe Gabriel să citească.

Unique directionality in interpretation
[Diagram: in anaphora (John … he), the pronoun's feature structure (gender = masc, number = sg, sem = person) can be completed with name = John, which has already been introduced; in cataphora (he … John), at the moment the pronoun is read the name slot is still unknown (?) and is filled only when John appears]

Automatic Interpretation
• Necessity for an intermediate level
[Diagram: a referential expression a on the text layer projects a feature structure fs_a on the restriction layer, and fs_a evokes center_a on the semantic layer]

Three Layer Approach to AR
1. John sold his bicycle
2. although Bill would have wanted it.
[Diagram: on the text layer, "his bicycle" and "it"; on the restriction layer, "his bicycle" projects (no = sg, sem = bicycle, det = yes) and "it" projects (no = sg, sem = ¬human); on the semantic layer both evoke the same discourse entity (no = sg, sem = bicycle, det = yes)]
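A minimal sketch of the three-layer idea: each referential expression projects a feature structure on the restriction layer, and an anaphor is linked to a discourse entity whose features do not conflict with its own. Feature names and values here are illustrative assumptions (sem = ¬human from the slide is encoded as human = False).

```python
def compatible(fs_anaphor, fs_entity):
    """Two feature structures unify if no shared attribute has conflicting values."""
    return all(fs_entity[k] == v for k, v in fs_anaphor.items() if k in fs_entity)

# Restriction layer: projections of "his bicycle" and "it"
fs_bicycle = {"number": "sg", "sem": "bicycle", "det": True}
fs_it      = {"number": "sg", "human": False}

# Semantic layer: discourse entities evoked so far
entities = {
    "de_John":    {"number": "sg", "sem": "person", "human": True},
    "de_bicycle": {"number": "sg", "sem": "bicycle", "det": True, "human": False},
}

# "it" evokes the entity it is compatible with
print([name for name, fs in entities.items() if compatible(fs_it, fs)])
# ['de_bicycle']  -- de_John is ruled out by the human feature
```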
Delayed Interpretation
  Police officer David Cheshire went to Dillard's home. Putting his ear next to Dillard's head, Cheshire heard the music also.
[Diagram: at times t0–t3 the text layer introduces Cheshire, Dillard, his, Dillard; each projects a feature structure (fs_Cheshire, fs_Dillard, fs_his, …) on the restriction layer; for "his" the restriction layer keeps candidates = {Cheshire, Dillard} until later evidence resolves it; the semantic layer holds the entities Cheshire and Dillard]

Delayed Interpretation
  From the corner of the divan of Persian saddle-bags on which he was lying, smoking, as was his custom, innumerable cigarettes, Lord Henry Wotton could just catch the gleam of the honey-sweet and honey-coloured blossoms of a laburnum…
[Diagram: at t0 "he" projects (gender = masc, number = sing, sem = person, name = ?) and initiates the evoking of a new entity; at t1 "his" adds to the same projection; at t2 "Lord Henry Wotton" completes the projection with name = Lord Henry Wotton and the evoking is completed]

The case of Cataphora
1. Although Bill would have wanted it,
2. John sold his bicycle to somebody else.
[Diagram: on the text layer, "it" and "his bicycle"; on the restriction layer, "it" projects (no = sg, sem = ¬human) and "his bicycle" projects (no = sg, sem = bicycle, det = yes); on the semantic layer both evoke the same discourse entity (no = sg, sem = bicycle, det = yes)]

AR Models
• a set of primary attributes
• a set of knowledge sources
• a set of evocation heuristics or rules
• a set of rules that configure the domain of referential accessibility

AR Models
[Diagram: referential expressions REa … REx on the text layer are projected – using the knowledge sources and the primary attributes attr_x – onto the projection layer, and evocation heuristics/rules link the projections to discourse entities DE1 … DEm on the semantic layer, searching within the domain of referential accessibility]

Set of Primary Attributes
a. morphological
  – number
  – lexical gender
  – person

Set of Primary Attributes
b. syntactical
  – full syntactic description of REs as constituents of a syntactic tree: [Lappin and Leass, 1994], CT-based approaches [Grosz, Joshi and Weinstein, 1995], [Brennan, Friedman and Pollard, 1987], syntactic-domain based approaches [Chomsky, 1981], [Reinhart, 1981], [Gordon and Hendrick, 1998], [Kennedy and Boguraev, 1996]
  – quality of being adjunct, embedded or complement of a preposition [Kennedy and Boguraev, 1996]
  – inclusion or not in an existential construction [Kennedy and Boguraev, 1996]
  – syntactic patterns in which the RE is involved: syntactic parallelism [Kennedy and Boguraev, 1996], [Mitkov, 1997]

Set of Primary Attributes
c. semantic
  – position of the head of the RE in a conceptual hierarchy (animacy, sex (or natural gender), concreteness): WordNet-based models [Poesio, Vieira and Teufel, 1997]
  – inclusion in a synonymy class
  – semantic roles, from which selectional restrictions, inferential links, pragmatic limitations, semantic parallelism and object preference can be verified

Set of Primary Attributes
d. positional
  – offset of the first token of the RE in the text [Kennedy and Boguraev, 1996]
  – inclusion in an utterance, sentence or clause, considered as a discourse unit [Hobbs, 1987], [Azzam, Humphreys and Gaizauskas, 1998], [Cristea et al., 2000]

Set of Primary Attributes
e. surface realisation (type)
  The domain of this feature contains: zero pronoun, clitic pronoun, full pronoun, reflexive pronoun, possessive pronoun, demonstrative pronoun, reciprocal pronoun, expletive "it", bare noun (undetermined NP), indefinite determined NP, definite determined NP, proper noun (name) [Gordon and Hendrick, 1998], [Cristea et al., 2000]

Set of Primary Attributes
f. other
  – inclusion or not of the RE in a specific lexical field ("domain concept") [Mitkov, 1997]
  – frequency of the term in the text [Mitkov, 1997]
  – occurrence of the term in a heading [Mitkov, 1997]
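One natural way to hold the primary attributes (a)–(f) above is a single record per referential expression. The field names and value sets below are illustrative assumptions, not a fixed inventory from any of the cited systems.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class REAttributes:
    # a. morphological
    number: Optional[str] = None           # "sg" / "pl"
    gender: Optional[str] = None
    person: Optional[int] = None
    # b. syntactic
    grammatical_role: Optional[str] = None  # e.g. subject, object, adjunct
    in_existential: bool = False
    # c. semantic
    sem_class: Optional[str] = None         # position in a conceptual hierarchy
    animate: Optional[bool] = None
    # d. positional
    token_offset: Optional[int] = None
    discourse_unit: Optional[int] = None
    # e. surface realisation
    realisation: Optional[str] = None       # "full pronoun", "definite NP", "proper noun", ...
    # f. other
    domain_concept: bool = False
    frequency: int = 0
    in_heading: bool = False

she = REAttributes(number="sg", gender="fem", person=3,
                   grammatical_role="subject", sem_class="person",
                   animate=True, token_offset=42, realisation="full pronoun")
```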
Knowledge Sources
• Type of process: incremental
• A knowledge source: a (virtual) processor able to fill in values for attributes on the restriction layer
• Minimum set: POS tagger + shallow parser

Knowledge Sources
• [Kennedy and Boguraev, 1996]: a marker of syntactic function and a set of patterns that recognise the expletive "it" (near specific sets of verbs, or as subject of adjectives with clausal complements)
• [Azzam, Humphreys and Gaizauskas, 1998]: a syntactic analyser, a semantic analyser, and an elementary-events finder
• [Gordon and Hendrick, 1998]: a surface realisation identifier and a syntactic parser
• [Hobbs, 1978]: a syntactic analyser, a surface realisation identifier and a set of axioms to determine semantic roles and relations of lexical items

Heuristics/Rules
• Demolishing rules (applied first): rule out a possible candidate
• Promoting/demoting rules: increase/decrease a salience factor associated with an attribute

Heuristics/Rules
• [Kennedy and Boguraev, 1996]: a pronoun cannot corefer with a constituent (NP) which contains it (in "the child of his brother", "his" is neither the child nor the brother); the remaining candidates are sorted by weighting a set of attribute–value pairs (linguistically and experimentally justified)
• [Gordon and Hendrick, 1997]: the antecedent's syntactic prominence (a notion related to relative distance in a syntactic tree) influences the selection of the co-referential candidate
• [Gordon and Hendrick, 1998]: the salience of the relations between names and pronouns is calculated using a graduation of surface realisation pairs: name–pronoun > name–name > pronoun–name

Referential Accessibility Domain
• Linear: Dorepaal, Mitkov, ...
• Hierarchical: Grosz & Sidner; Cristea, Ide & Romary, ...

Algorithm
• Consider the DEs in the order given by component 4 (the domain of referential accessibility)
• For each attribute of the projected FS of the current anaphor and each candidate DE, use the rules of component 3 to update a preference score for linking the anaphor to that DE as its antecedent
• Sort the candidates in descending order of these scores
• Use thresholds to either propose a new DE, link the anaphor to an existing DE, or postpone the decision
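A sketch of this resolution loop, under illustrative assumptions: score_rules plays the role of component 3 (the promoting/demoting rules), the order of the candidate list plays the role of component 4 (the accessibility domain), and the two thresholds are arbitrary demo values.

```python
LINK_THRESHOLD = 2.0      # scores at or above this: link to an existing DE
NEW_THRESHOLD = 0.5       # best score below this: propose a new DE

def resolve(anaphor_fs, accessible_des, score_rules):
    """Return ('link', de), ('new', None) or ('postpone', scored_candidates)."""
    scored = []
    for de in accessible_des:                       # order given by the accessibility domain
        score = 0.0
        for attr, value in anaphor_fs.items():      # each attribute of the projected FS
            score += score_rules(attr, value, de)   # promoting/demoting rules
        scored.append((score, de))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    if not scored or scored[0][0] < NEW_THRESHOLD:
        return ("new", None)                        # no acceptable antecedent: new DE
    if scored[0][0] >= LINK_THRESHOLD:
        return ("link", scored[0][1])               # confident: link to the best DE
    return ("postpone", scored)                     # in between: delay the decision

# Toy rule: reward attributes that match on the candidate DE.
def toy_rules(attr, value, de):
    return 1.0 if de.get(attr) == value else 0.0

print(resolve({"number": "sg", "gender": "fem"},
              [{"name": "Mary", "number": "sg", "gender": "fem"},
               {"name": "John", "number": "sg", "gender": "masc"}],
              toy_rules))
# ('link', {'name': 'Mary', 'number': 'sg', 'gender': 'fem'})
```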