CSA3180: Natural Language Processing
Information Extraction 2
• Named Entities
• Question Answering
• Anaphora Resolution
• Co-Reference
December 2005
Introduction
• Slides partially based on a talk by Lucian Vlad Lita
• Sheffield GATE multilingual extraction slides based on Diana Maynard's talks
• Anaphora resolution slides based on Dan Cristea's slides, with additional input from Gabriela-Eugenia Dima, Oana Postolache and Georgiana Puşcaşu
References
• Fastus System Documentation
• Robert Gaizauskas “IE Perspective on
Text Mining”
• Daniel Bikel’s “Nymble: A High
Performance Learning Name Finder”
• Helena Ahonen-Myka’s notes on FSTs
• Javelin system documentation
• MUC 7 Overview & Results
Named Entities
• Named Entities
• Person Name: Colin Powell, Frodo
• Location Name: Middle East, Aiur
• Organization: UN, DARPA
• Domain Specific vs. Open Domain
Nymble (BBN Corporation)
• State of the art system
• Near-human performance (~90% accuracy)
• Statistical system
• Approach: Hidden Markov Model (HMM)
Nymble (BBN Corporation)
• Noisy channel paradigm
• Originally, entities were marked in the raw text
• After passing through the noisy channel, the annotation is lost
• Find the most likely sequence of name classes (NC) given a sequence of words (W):
Pr(NC | W) = Pr(W, NC) / Pr(W)
Since the a priori probability of the word sequence can be considered constant for any given sentence, it suffices to maximize just the numerator Pr(W, NC).
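The decoding step can be made concrete with a toy Viterbi sketch. This is not the actual Nymble implementation (which conditions emissions on word features and the previous word, with back-off smoothing); the name-class inventory and the probability tables below are illustrative assumptions.

```python
# Toy first-order HMM decoder for name classes (illustrative sketch only).
import math

def viterbi(words, classes, start, trans, emit):
    """Return argmax_NC Pr(W, NC) for word sequence W under a simple HMM."""
    def lp(p):                       # safe log-probability
        return math.log(p) if p > 0 else float("-inf")

    V = [{c: lp(start.get(c, 0)) + lp(emit[c].get(words[0], 1e-6)) for c in classes}]
    back = [{}]
    for t in range(1, len(words)):
        V.append({}); back.append({})
        for c in classes:
            prev, score = max(((p, V[t - 1][p] + lp(trans[p].get(c, 0))) for p in classes),
                              key=lambda x: x[1])
            V[t][c] = score + lp(emit[c].get(words[t], 1e-6))
            back[t][c] = prev
    last = max(V[-1], key=V[-1].get)            # best final name class
    path = [last]
    for t in range(len(words) - 1, 0, -1):      # trace back the best path
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Hypothetical miniature model with three name classes only.
classes = ["PERSON", "ORG", "NOT-A-NAME"]
start = {"PERSON": 0.2, "ORG": 0.2, "NOT-A-NAME": 0.6}
trans = {c: {"PERSON": 0.2, "ORG": 0.2, "NOT-A-NAME": 0.6} for c in classes}
emit = {"PERSON": {"Colin": 0.3, "Powell": 0.3},
        "ORG": {"DARPA": 0.5, "UN": 0.4},
        "NOT-A-NAME": {"visited": 0.2, "the": 0.3}}
print(viterbi(["Colin", "Powell", "visited", "DARPA"], classes, start, trans, emit))
```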
Nymble (BBN Corporation)
HMM state diagram (figure): a Start-of-Sentence state, an End-of-Sentence state, and one state per name class: Person, Organization, five other name classes, and Not-A-Name.
Automatic Content Extraction
• DARPA ACE Program
• Identify Entities
– Named: Bilbo, San Diego, UNICEF
– Nominal: the president, the hobbit
– Pronominal: she
• Reference resolution
– Clinton → the president → he
Question Answering
The over-used pipeline paradigm:
Question → Question Analysis → Information Retrieval → Answer Extraction → Answer Merging → Answer
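A skeletal version of such a pipeline, purely to make the data flow concrete; every stage body is a placeholder, not any particular system's API.

```python
# Skeletal QA pipeline; each stage is a stub standing in for a real module.
from dataclasses import dataclass, field

@dataclass
class QAState:
    question: str
    analysis: dict = field(default_factory=dict)    # e.g. keywords, answer type
    documents: list = field(default_factory=list)
    candidates: list = field(default_factory=list)  # (answer, confidence) pairs
    answer: str = ""

def question_analysis(s: QAState) -> QAState:
    s.analysis = {"keywords": s.question.rstrip("?").split(), "atype": "UNKNOWN"}
    return s

def information_retrieval(s: QAState) -> QAState:
    s.documents = []          # would query an index with s.analysis["keywords"]
    return s

def answer_extraction(s: QAState) -> QAState:
    s.candidates = []         # would score passages against the expected answer type
    return s

def answer_merging(s: QAState) -> QAState:
    s.answer = max(s.candidates, default=("NO ANSWER", 0.0), key=lambda c: c[1])[0]
    return s

def run_pipeline(question: str) -> str:
    state = QAState(question)
    for stage in (question_analysis, information_retrieval,
                  answer_extraction, answer_merging):
        state = stage(state)  # feedback loops could re-enter earlier stages here
    return state.answer

print(run_pipeline("Who invented paper?"))
```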
Question Answering
• Feedback loops can be present for
constraint relaxation purposes
• Not all QA systems adhere to the pipeline
architecture
• Question answering flavors
– Factoid vs. complex
– Who invented paper? vs. Which of Mr. Bush’s friends are
Black Sabbath fans?
– Closed vs. open domain
Answer Extraction
The over-used pipeline paradigm:
Question → Question Analysis → Information Retrieval → Answer Extraction → Answer Merging → Answer
• Focus on open domain, factoid question answering
Practical Issues
• Web Spell Checking
– Mispling
– nucular
• Infrequent forms:
– Niagra vs. Niagara, Filenes vs. Filene’s
• Google QA
• Genome, video, games
Practical Issues
• Traditional Information Extraction
• Either expert built or statistical
• Specific strategies for specific question
types
• Person Bio vs. Location question types
• Ability to generalize to new questions and
new question types
Practical Issues
• Who invented Blah?
– Blah was invented by PersonName
– Blah was Verb by PersonName, where Verb is a synonym of invented
– Blah VerbPhrase by PersonName
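A minimal sketch of this kind of pattern-based answer extraction, using regular expressions and a hand-picked synonym list; the synonym set and the capitalised-name heuristic are assumptions for illustration.

```python
# Minimal pattern-based answer extraction for "Who invented X?" questions.
import re

INVENT_SYNONYMS = ["invented", "created", "discovered", "devised"]  # assumed list

def who_invented(thing: str, passages: list[str]) -> list[str]:
    verbs = "|".join(INVENT_SYNONYMS)
    # Matches "<thing> was <verb> by <Capitalised Name>"
    pattern = re.compile(
        rf"{re.escape(thing)}\s+was\s+(?:{verbs})\s+by\s+((?:[A-Z][a-z]+\s?)+)")
    answers = []
    for passage in passages:
        for match in pattern.finditer(passage):
            answers.append(match.group(1).strip())
    return answers

print(who_invented("paper", ["Historians say paper was invented by Cai Lun."]))
```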
Popular Resources
• Experts and/or Learning Algorithms
• Gazetteers
• NE taggers
• Part of Speech taggers
• Parsers
• WordNet
• Stopword list
• Stemmer
Javelin Answer Extraction
• Statistics-based system (IX)
– Decision Tree
– KNN
• XML wrapper
• Simple features
– POS
– NE tagger
– Lexical items
Statistical IX Input
• Raw question
• Where is Frodo from?
• Analyzed question
• KEYWORD: Frodo
• ATYPE: Location
• QTYPE: WhereIsFrom
• Relevant document set
• NewYorkTimes 203214
• AssociatedPress 273545
Statistical IX Output
• Set of candidate answers
• Corresponding passages
• Confidence score

• PASSAGE 1: Frodo grew up in Pittsburgh working in the steel mills …
• ANSWER 1: Pittsburgh
• CONFIDENCE 1: .0924
• PASSAGE 2: …she met Frodo, the French chef, on Tech Street …
• ANSWER 2: French
• CONFIDENCE 2: .0493
IX Decision Tree
Decision tree (figure): nodes test whether a relevant verb is present, whether the average distance between QTerms and ATerms falls within a threshold of 7, and whether more than 50% of the QTerms are present; each test branches on Yes/No.
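A hand-written sketch of the same kind of decision rules. The thresholds, feature functions and the tree shape below are assumptions; a real system would learn the tree from training data.

```python
# Hand-coded stand-in for a learned answer-extraction decision tree.
def score_passage(has_relevant_verb: bool,
                  avg_qterm_aterm_distance: float,
                  qterm_coverage: float) -> bool:
    """Return True if the passage is predicted to contain the answer."""
    if has_relevant_verb:
        # Question terms close to the answer candidate -> likely answer-bearing.
        return avg_qterm_aterm_distance <= 7.0
    # No relevant verb: fall back on how many question terms the passage covers.
    return qterm_coverage > 0.5

print(score_passage(True, 4.2, 0.3))    # True
print(score_passage(False, 9.0, 0.8))   # True
```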
IX Training Data
• Positive examples – some correct
question and answer pairs
• Negative examples?
IX Training Data
• Positive examples – some correct
question and answer pairs
• Negative examples
• All other sentences that do not contain the answer
to the question.
– Q: In what state can you find “Stop Except When Right
Turn” signs?
– A: Pennsylvania has SEWRD traffic signs.
IX Training Data
• Positive examples – some correct
question and answer pairs
• Negative examples
• All other sentences that do not contain the answer
to the question.
• Sentences that contain the answer but do not
actually answer the question
– Q: In what state can you find “Stop Except Right Turn”
signs?
– A: The New York born driver resented SERT traffic signs.
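A sketch of how such positive and negative training pairs might be assembled; the sentence-level heuristics here are assumptions, not the Javelin procedure.

```python
# Build (question, sentence, label) training triples for answer extraction.
def build_training_data(question: str, answer: str, sentences: list[str],
                        answer_bearing: set[int]):
    """answer_bearing holds the indices of sentences that genuinely answer the question."""
    data = []
    for i, sent in enumerate(sentences):
        if i in answer_bearing:
            data.append((question, sent, 1))   # positive example
        else:
            # Negative example: either lacks the answer string entirely, or
            # contains it without actually answering the question (harder case).
            data.append((question, sent, 0))
    return data

sents = ["Pennsylvania has SEWRD traffic signs.",
         "The New York born driver resented SERT traffic signs."]
print(build_training_data("In what state can you find SEWRD signs?",
                          "Pennsylvania", sents, {0}))
```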
Text Quality and Comparisons
• Sentence Splitting
• Word Casing
• Distortion: negation, opinion, past event,
ordering
• Relaxation
• Ambiguity
Text Quality and Comparisons
• Sentence Splitting
– Period, exclamation mark, question mark
– Ambiguities: Calif., Mr.
– Deeper ambiguities: No., A.
– Even assuming abbreviations are detected:
– I'd like to live in Southern Calif. where it never rains.
– I'd like to live in Southern Calif. It never rains there.
– Rule-based sentence splitters and statistical models
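A rule-based splitter of the kind alluded to can be sketched as follows; the abbreviation list and the "next word is capitalised" heuristic are simplifying assumptions.

```python
# Naive rule-based sentence splitter illustrating the Calif./Mr. ambiguity.
ABBREVIATIONS = {"calif.", "mr.", "no.", "a.", "dr.", "dept."}   # assumed list

def split_sentences(text: str) -> list[str]:
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if not tok.endswith((".", "!", "?")):
            continue
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        is_abbrev = tok.lower() in ABBREVIATIONS
        # Split on sentence punctuation, except after an abbreviation that is
        # followed by a lowercase word.  Note this still over-splits "Mr. Smith".
        if not is_abbrev or nxt == "" or nxt[0].isupper():
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("I'd like to live in Southern Calif. where it never rains."))
print(split_sentences("I'd like to live in Southern Calif. It never rains there."))
```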
Text Quality and Comparisons
• Word Casing
– News Stories – news source & style
• … Prime Minister Blair …
• … prime minister Blair …
– Headlines, Titles, Teasers
• Six Fired In New York Q-Mart Scandal
– Broadcast news transcription errors
• President George bush has …
Text Quality and Comparisons
• Distortions
– Negation
• Frodo was not skiing in the Aspens last winter.
• Common Misconceptions about Warcraft III
– Opinion
• I really believe that Santa Claus exists.
– Past Event
• Long time ago, people did indeed live in caves.
– Ordinal numerals
• Aretha Franklin was the third woman to invent the soul.
Text Quality and Comparisons
• Relaxation
– Reason
• Not enough documents
• No answer found
– Method
• Query expansion
• Synonymy based pattern expansion
Text Quality and Comparisons
• Relaxation
– Reason
• Not enough documents
• No answer found
– Method
• Query expansion
• Synonymy based pattern expansion
– Pros:
• Invent can be extended to create and discover
– Cons
• Invent can be extended to Martha Stewart if enough
documents say that she re-invented herself 
Text Quality and Comparisons
• Ambiguity
– Word level
• Raptors can be found in Pittsburgh on Forbes Ave.
– Velociraptors
– Motorcycles
– Reference resolution
• Frodo blah blah Sam blah blah, who blah blah
Multi-Source and Multi-Lingual IE
• With traditional query engines, getting the facts can be hard and slow
– Where has the Queen visited in the last year?
– Which places on the East Coast of the US have had cases of West Nile Virus?
• Constructing a database through IE and linking it back to the documents can provide a valuable alternative search tool
• Even if results are not always accurate, they can be valuable if linked back to the original text
Multi-Source and Multi-Lingual IE
• For access to news
– identify major relations and event types (e.g. within foreign affairs or business news)
• For access to scientific reports
– identify principal relations of a scientific subfield (e.g. pharmacology, genomics)
Application Example - KIM
Ontotext’s KIM query and results
Application Example - GATE
Complex Problems in NE
• Issues of style, structure, domain, genre, etc.
• Punctuation, spelling, spacing, formatting
Dept. of Computing and Maths
Manchester Metropolitan University
Manchester
United Kingdom
> Tell me more about Leonardo
> Da Vinci
Approaches
• Knowledge Engineering
– rule based
– developed by experienced language engineers
– make use of human intuition
– require only small amount of training data
– development can be very time consuming
– some changes may be hard to accommodate
• Learning Systems
– use statistics or other machine learning
– developers do not need LE expertise
– require large amounts of annotated training data
– some changes may require re-annotation of the entire training corpus
Shallow Parsing
• Internal evidence – names often have internal structure. These components can be either stored or guessed, e.g. location:
Cap. Word + {City, Forest, Center, River}
e.g. Sherwood Forest
Cap. Word + {Street, Boulevard, Avenue, Crescent, Road}
e.g. Portobello Street
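A sketch of how such internal-evidence patterns might be expressed; the keyword sets are taken from the slide, everything else is an illustrative assumption.

```python
# Guess location names from internal evidence: a capitalised word followed
# by a location-indicating keyword (keyword sets taken from the slide).
import re

LOC_KEYWORDS = {"City", "Forest", "Center", "River",
                "Street", "Boulevard", "Avenue", "Crescent", "Road"}

LOCATION_PATTERN = re.compile(r"\b([A-Z][a-z]+)\s+(" + "|".join(LOC_KEYWORDS) + r")\b")

def guess_locations(text: str) -> list[str]:
    return [" ".join(m.groups()) for m in LOCATION_PATTERN.finditer(text)]

print(guess_locations("They met on Portobello Street, near Sherwood Forest."))
# ['Portobello Street', 'Sherwood Forest']
```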
Shallow Parsing
• Ambiguously capitalised words (first word in sentence)
[All American Bank] vs. All [State Police]
• Semantic ambiguity
"John F. Kennedy" = airport (location)
"Philip Morris" = organisation
• Structural ambiguity
[Cable and Wireless] vs.
[Microsoft] and [Dell]
[Center for Computational Linguistics] vs.
message from [City Hospital] for [John Smith]
Shallow Parsing + Context
• Use of context-based patterns is helpful in ambiguous cases
• In isolation, "David Walton" and "Goldman Sachs" are indistinguishable
• But with the phrase "David Walton of Goldman Sachs" and the Person entity "David Walton" recognised, we can use the pattern "[Person] of [Organization]" to identify "Goldman Sachs" correctly.
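A toy version of this contextual disambiguation over already-tagged entities; the tag format and the single pattern used here are assumptions for illustration.

```python
# Apply a "[Person] of [Organization]" context pattern to partially tagged
# text in order to type an otherwise ambiguous capitalised name.
import re

PERSON_OF_X = re.compile(r"(\[PER [^\]]+\])\s+of\s+((?:[A-Z][a-zA-Z]+\s)*[A-Z][a-zA-Z]+)")

def tag_unknown_orgs(text: str) -> str:
    """If a recognised [PER ...] is followed by 'of <Capitalised Name>', tag the name as ORG."""
    return PERSON_OF_X.sub(lambda m: f"{m.group(1)} of [ORG {m.group(2)}]", text)

print(tag_unknown_orgs("[PER David Walton] of Goldman Sachs said the deal was off."))
# -> [PER David Walton] of [ORG Goldman Sachs] said the deal was off.
```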
Shallow Parsing + Context
• Use KWIC index and concordancer to find
windows of context around entities
• Search for repeated contextual patterns of
either strings, other entities, or both
• Manually post-edit list of patterns, and
incorporate useful patterns into new rules
• Repeat with new entities
Context Patterns
• [PERSON] earns [MONEY]
• [PERSON] joined [ORGANIZATION]
• [PERSON] left [ORGANIZATION]
• [PERSON] joined [ORGANIZATION] as [JOBTITLE]
• [ORGANIZATION]'s [JOBTITLE] [PERSON]
• [ORGANIZATION] [JOBTITLE] [PERSON]
• the [ORGANIZATION] [JOBTITLE]
• part of the [ORGANIZATION]
• [ORGANIZATION] headquarters in [LOCATION]
• price of [ORGANIZATION]
• sale of [ORGANIZATION]
• investors in [ORGANIZATION]
• [ORGANIZATION] is worth [MONEY]
• [JOBTITLE] [PERSON]
• [PERSON], [JOBTITLE]
Context Patterns
• Patterns are only indicators based on
likelihood
• Can set priorities based on frequency
thresholds
• Need training data for each domain
• More semantic information would be
useful (e.g. to cluster groups of verbs)
Case Study: MUSE
• MUSE: MUlti-Source Entity Recognition
• An IE system developed within GATE
• Performs NE and coreference on different
text types and genres
• Uses knowledge engineering approach
with hand-crafted rules
• Performance rivals that of machine
learning methods
• Easily adaptable
MUSE Modules
• Document format and genre analysis
• Tokenisation
• Sentence splitting
• POS tagging
• Gazetteer lookup
• Semantic grammar
• Orthographic coreference
• Nominal and pronominal coreference
Switching Controller
• Rather than have a fixed chain of processing
resources, choices can be made automatically
about which modules to use
• Texts are analysed for certain identifying
features which are used to trigger different
modules
• For example, texts with no case information may need a different POS tagger or different gazetteer lists
• Not all modules are language-dependent, so
some can be reused directly
Multilingual MUSE
• MUSE has been adapted to deal with
different languages
• Currently systems for English, French,
German, Romanian, Bulgarian, Russian,
Cebuano, Hindi, Chinese, Arabic
• Separation of language-dependent and
language-independent modules and submodules
• Annotation projection experiments
IE in Surprise Languages
• Adaptation to an unknown language in a very
short timespan
• Cebuano:
– Latin script, capitalisation, words are spaced
– Few resources and little work already done
– Medium difficulty
• Hindi:
– Non-Latin script, different encodings used, no
capitalisation, words are spaced
– Many resources available
– Medium difficulty
Multilingual IE Requirements
• Extensive support for non-Latin scripts and text
encodings, including conversion utilities
– Automatic recognition of encoding
– Occupied up to 2/3 of the TIDES Hindi effort
• Bilingual dictionaries
• Annotated corpus for evaluation
• Internet resources for gazetteer list collection
(e.g., phone books, yellow pages, bi-lingual
pages)
Multilingual Data Editing
GATE Unicode Kit (GUK) complements Java's facilities:
• Support for defining Input Methods (IMs)
• Currently 30 IMs for 17 languages
• Pluggable in other applications (e.g. JEdit)
Multilingual IE Processing
All processing, visualisation and editing tools use GUK
Anaphora Resolution
Evaluation loop (figure): unprocessed text is annotated both manually (annotation tool, producing an AR gold standard) and automatically (AR engine, producing AR annotated text); comparison & evaluation of the two outputs feeds back into fine-tuning the engine.
Anaphora Resolution
• Text:
– Nature of discourse
– Anaphoric phenomena
• Anaphora Resolution Engines:
– Models
– General AR Frameworks
– Knowledge Sources
Anaphora Resolution
Anaphora represents the relation between a "proform" (called an "anaphor") and another term (called an "antecedent"), when the interpretation of the anaphor is in a certain way determined by the interpretation of the antecedent.
Barbara Lust, Introduction to Studies in the Acquisition of Anaphora, D. Reidel, 1986
Anaphora Example
It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him.
Orwell, 1984
(Antecedent: Winston Smith; anaphors: his, him)
Anaphora
• pronouns (personal, demonstrative, ...)
– full pronouns
– clitics (RO: dă-mi-l, IT: dammelo)
• nouns
– definite
– indefinite
• adjectives, numerals (generally associated with an ellipsis)
• In this the play is expressionist1 in its approach to theme.
• But it is also so1 in its use of unfamiliar devices...
Referential Expressions
• mark the noun phrases
• for each NP ask a question about it
• keep as REs those NPs that can be
naturally referenced in the question
The policeman got in the car in a hurry
in order to catch the run-away thief.
Referential Expressions
a. John was going down the street looking
for Bill‘s house.
b. He found it at the first corner.
Referential Expressions
a. John was going down the street looking
for Bill‘s house.
b. He met him at the first corner.
Referential Expressions
The empty anaphor
Gianni diede una mela a Michele. Più tardi, Ø gli diede un'arancia.
[Not & Zancanaro, 1996]
John gave an apple to Michelle. Later on, Ø gave her an orange.
Textual Ellipsis
The functional (bridge) anaphora
The state of the accumulator is indicated to the user. 30 minutes before the complete discharge, the computer signals for 5 seconds.
[Strube & Hahn, 1996]
Events, States, Descriptions
He left without eating1. Because of this1, he was starving in the evening.
But, he adds, Priestley is more interested in Johnson living than in Johnson dead1.
In this1 the play is expressionist in its approach to theme.
[Halliday & Hasan, 1976]
Definite/Indefinite NPs
Once upon a time, there was a king and a
queen. And the king one day went hunting.
Apollo took out his bow...
Take the elevator to the 4th floor.
Anaphora Resolution
• State of the art in Anaphora Resolution:
– Identity: 65-80%
– Other: much less…
What is so difficult?
Nothing – everything is so simple!
John1 has just arrived. He1 seems tired.
The girl1 leaves the trash on the table and wants
to go away. The boy2 tries to hold her1 by the
arm31; she1 escapes and runs; he2 calls her1
back.
Caragiale, At the Mansion
What is so difficult?
Nothing indeed, but imagine letting the
machine go wrong...
There‘s a pile of inflammable trash next to your
car. You‘ll have to get rid of it.
If the baby does not thrive on the raw milk, boil
it.
[Hobbs, 1997]
What is so difficult?
Semantic restrictions
Jeff1 helped Dick2 wash the car.
He1 washed the windows as Dick2 waxed the car.
He1 soaped a pane.
Jeff1 helped Dick2 wash the car.
He1 washed the windows as Dick2 waxed the car.
He2 buffed the hood.
[Walker, Joshi & Prince, 1997]
What is so difficult?
Semantic correlates
An elephant1 hit the car with the trunk. The animal1
had to be taken away not to produce other damages.
* An animal1 hit the car with the trunk. The elephant1
had to be taken away not to produce other
damages.
What is so difficult?
Long distance recovery (pronominalization)
1. His re-entry into Hollywood came with the movie "Brainstorm",
2. but its completion and release has been delayed by the death of co-star Natalie Wood.
3. He plays Hugh Hefner of Playboy magazine in Bob Fosse's "Star 80."
4. It's about Dorothy Stratton, the Playboy Playmate who was killed by her husband.
5. He also stars in the movie "Class."
Los Angeles Times, July 18, 1983, cited in [Fox, 1986]
What is so difficult?
Gender mismatches
Mr. Chairman..., what is her position upon
this issue? (political correctness!!)
Number mismatches
The government discussed ... They ...
What is so difficult?
Distributed antecedents
John1 invited Mary2 to the cinema. After the
movie ended they3={1,2} went to a
restaurant.
What is so difficult?
Empty/non-empty anaphors
John gave an apple to Michelle.
Later on, Ø gave her an orange.
John gave an apple to Michelle.
Later on, he gave her an orange.
John gave an apple to Michelle.
Later on, this one asks him for an orange.
Semantics are Essential
Police ... They
Teacher... She/He
A car... The automobile
A Mercedes... The car
A lamp... The bulb
Semantics are not all
• Pronouns - poor semantic features
he    [+animate, +male, +singular]
she   [+animate, +female, +singular]
it    [+inanimate, +singular]
they  [+plural]
• Gender in Romance languages
Ro. maşină = ea (feminine)
Ro. automobil = el (masculine)
• Anaphora resolution by concord rules
Un camion a heurté une voiture. Celle-ci a été complètement détruite.
(A truck hit a car. It was completely destroyed.)
Celle-ci (feminine) agrees with voiture (gender match) but not with camion (gender mismatch).
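A minimal agreement filter over such feature bundles; the feature inventory follows the slide, while the representation and the candidate lists are illustrative assumptions.

```python
# Filter antecedent candidates by morphological agreement with a pronoun.
PRONOUN_FEATURES = {
    "he":   {"animate": True,  "gender": "male",   "number": "sg"},
    "she":  {"animate": True,  "gender": "female", "number": "sg"},
    "it":   {"animate": False, "number": "sg"},
    "they": {"number": "pl"},
}

def compatible(pronoun: str, antecedent_feats: dict) -> bool:
    """An antecedent survives if it contradicts none of the pronoun's features."""
    required = PRONOUN_FEATURES[pronoun]
    return all(antecedent_feats.get(k, v) == v for k, v in required.items())

candidates = {"John": {"animate": True, "gender": "male", "number": "sg"},
              "the table": {"animate": False, "number": "sg"}}
print([name for name, f in candidates.items() if compatible("he", f)])   # ['John']
print([name for name, f in candidates.items() if compatible("it", f)])   # ['the table']
```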
Anaphora Resolution
[Charniak, 1972]
In order to do AR, one has to be able to do everything else. Once everything else is done, AR comes for free.
Anaphora Resolution
Most current anaphora resolution systems implement a pipeline architecture with three modules, applied to the referential expressions a1, a2, a3, … an:
• Collect: determines the List of Potential Antecedents (LPA).
• Filter: eliminates from the LPA the referees that are incompatible with the referential expression under scrutiny.
• Preference: determines the most likely antecedent on the basis of an ordering policy.
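The three modules can be sketched as follows; the candidate representation, the window size and the scoring policy are assumptions, and real systems plug in the filters and preferences of a particular AR model.

```python
# Collect / Filter / Preference skeleton for pronoun resolution.
def collect(anaphor, discourse_entities, window=20):
    """List of Potential Antecedents: entities mentioned recently enough."""
    return [de for de in discourse_entities
            if 0 < anaphor["position"] - de["position"] <= window]

def filter_lpa(anaphor, lpa):
    """Drop candidates that disagree in number or gender with the anaphor."""
    return [de for de in lpa
            if de.get("number") == anaphor.get("number")
            and de.get("gender") in (anaphor.get("gender"), None)]

def prefer(anaphor, lpa):
    """Order the survivors; here simply by recency (closest mention wins)."""
    return min(lpa, key=lambda de: anaphor["position"] - de["position"], default=None)

def resolve(anaphor, discourse_entities):
    return prefer(anaphor, filter_lpa(anaphor, collect(anaphor, discourse_entities)))

entities = [{"name": "John", "position": 1, "number": "sg", "gender": "masc"},
            {"name": "Mary", "position": 4, "number": "sg", "gender": "fem"}]
he = {"position": 7, "number": "sg", "gender": "masc"}
print(resolve(he, entities)["name"])   # John
```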
Anaphora Resolution Models
• [Hobbs, 1976] (pronominal anaphora)
Naïve algorithm:
- implies a surface parse tree
- navigation of the syntactic tree of the anaphor's sentence and of the preceding sentences in order of recency, traversing each tree left-to-right, breadth-first
A semantic approach:
- implies a semantic representation of the sentences (logical expressions)
- a collection of semantic operations (inferences)
- type of pronoun is important
Anaphora Resolution Models
• [Lappin & Leass, 1994] (pronominal anaphora)
- syntactic structures
- an intrasentential syntactic filter
- morphological filter (person, number, gender)
- detection of pleonastic pronouns
- salience parameters (grammatical role, parallelism of grammatical roles, frequency of mention, proximity, sentence recency)
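A sketch of the salience-weighting idea; the weights and the halving-per-sentence decay below are illustrative placeholders, not the values published by Lappin and Leass.

```python
# Toy salience scoring in the spirit of Lappin & Leass (weights are made up).
SALIENCE_WEIGHTS = {          # assumed, illustrative values
    "sentence_recency": 100,
    "subject": 80,
    "direct_object": 50,
    "indirect_object": 40,
    "head_noun": 80,
    "grammatical_role_parallelism": 35,
}

def salience(candidate_factors: set, sentences_back: int) -> float:
    score = sum(SALIENCE_WEIGHTS.get(f, 0) for f in candidate_factors)
    return score / (2 ** sentences_back)       # degrade salience with distance

# "John" was a subject one sentence back; "the car" a direct object two back.
print(salience({"sentence_recency", "subject"}, 1))        # 90.0
print(salience({"sentence_recency", "direct_object"}, 2))  # 37.5
```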
Anaphora Resolution Models
• [Sidner, 1981], [Grosz&Sidner, 1986]
- focus/attentional based
- give more salience to those semantic
entities that are in focus
- define where to look for an antecedent in
the semantic structure of the preceding
text (a stack in G&S‘s model)
AR Models: Centering
• [Grosz, Joshi, Weinstein, 1983, 1995]
• [Brennan, Friedman and Pollard, 1987]
• Cf(u) = <e1, e2, ... ek> - an ordered list
• Cb(u) = ei
• Cp(u) = e1
                   Cb(u) = Cb(u-1)    Cb(u) ≠ Cb(u-1)
Cb(u) = Cp(u)      CONTINUING         SMOOTH SHIFT
Cb(u) ≠ Cp(u)      RETAINING          ABRUPT SHIFT
• CON > RET > SSH > ASH
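The transition table and the preference ordering CON > RET > SSH > ASH can be written down directly; this is a small sketch that assumes the entity ranking within Cf is already given.

```python
# Classify centering transitions between consecutive utterances.
TRANSITION_PREFERENCE = ["CONTINUING", "RETAINING", "SMOOTH SHIFT", "ABRUPT SHIFT"]

def transition(cb_u, cb_prev, cp_u):
    """cb_u: backward-looking center of u; cb_prev: Cb(u-1); cp_u: preferred center Cp(u)."""
    if cb_u == cb_prev:
        return "CONTINUING" if cb_u == cp_u else "RETAINING"
    return "SMOOTH SHIFT" if cb_u == cp_u else "ABRUPT SHIFT"

# The two readings of "I think he went to the Cape with Linda" after Cb = Jeff:
print(transition(cb_u="Jeff", cb_prev="Jeff", cp_u="I"))   # RETAINING
print(transition(cb_u="Carl", cb_prev="Jeff", cp_u="I"))   # ABRUPT SHIFT
```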
AR Models: Centering
a. I haven’t seen Jeff for several days.
Cf = (I=[I], [Jeff])
Cb = [I]
b. Carl thinks he’s studying for his exams.
Cf = ([Carl], he=[Jeff], [Jeff´s exams])
Cb = [Jeff]
c. I think he? went to the Cape with Linda.
[Grosz, Joshi & Weinstein, 1983]
AR Models: Centering
b. Carl thinks he's studying for his exams.
   Cf = ([Carl], he=[Jeff], [Jeff's exams])
   Cb = [Jeff]
c. I think he? went to the Cape with Linda.
   he = [Jeff]: Cf = (I=[I], he=[Jeff], [the Cape], [Linda]), Cb = [Jeff] → RETAINING
   he = [Carl]: Cf = (I=[I], he=[Carl], [the Cape], [Linda]), Cb = [Carl] → ABRUPT SHIFT
Anaphora Resolution Models
• [Mitkov, 1998]
- knowledge-poor approach
- POS tagger, noun phrase rules
- 2 previous sentences
- definiteness, givenness, lexical reiteration, section heading preference, distance, terms of the field, etc.
General Framework
Build a framework capable of easily accommodating any of the existing AR models, fine-tune them, and experiment with them to enhance performance (learning), eventually obtaining a better model.
General Framework
Figure: text is fed to the AR engine, which can be instantiated with AR-model1, AR-model2, AR-model3, ...
Co-References
• Halliday and Hasan: a semantic relation, not a textual one
Co-referential anaphoric relation (figure): on the text layer, expressions a and b are linked anaphorically; on the semantic layer, a evokes centera and b evokes the same centera.
Time and Discourse
• Discourse has a dynamic nature
Time axes (figure): the same two events appear on three axes, real time (1, 2), discourse time (1, 2) and story time, where event 2 precedes event 1 (axis labelled 800, 920, 1000, 1030).
Resolution Moment
Police officer David Cheshire went to Dillard's home.
Putting his ear next to Dillard's head, Cheshire heard the
music also.
[Tanaka, 1999]
(Figure annotates the mention chain: Cheshire, Dillard, his, Dillard, Cheshire.)
Resolution Delay
• Sanford and Garrod (1989)
– initiation point
– completion point
• Information is kept in a temporary location of
memory
Cataphora – What is there?
• The element referred to is anticipated by the referring
element
• Theories
– scepticism
– syntactic reality
From the corner of the divan of Persian saddle-bags on
which he was lying, smoking, as was his custom,
innumerable cigarettes, Lord Henry Wotton could just catch
the gleam of the honey-sweet and honey-coloured
blossoms of a laburnum…
Oscar Wilde, The Picture of Dorian Gray
No right reference needed in
discourse processing
• Introduction of an empty discourse entity
• Addition of new features as discourse unfolds
• Pronoun anticipation in Romanian
I taught Gabriel to read. = Ro. L-am învăţat pe Gabriel să citească.
Unique directionality in
interpretation
Figure: in anaphora, "John … he", the pronoun projects {gender = masc, number = sg, sem = person} and inherits name = John from the antecedent. In cataphora, "he … John", the same features are projected when the pronoun is read, but the name slot is still unknown (?) until "John" appears.
Automatic Interpretation
• necessity for an intermediate level
Figure: the text layer contains REs a and b; RE a projects a feature structure fsa on the restriction layer; fsa evokes centera on the semantic layer.
Three Layer Approach to AR
1. John sold his bicycle
2. although Bill would have wanted it.

The text layer: "his bicycle", "it".
The restrictions layer: "his bicycle" projects {no = sg, sem = bicycle, det = yes}; "it" projects {no = sg, sem = ¬human}.
The semantic layer: both evoke the same discourse entity {no = sg, sem = bicycle, det = yes}.
Delayed Interpretation
Police officer David Cheshire went to Dillard's home. Putting his ear next to
Dillard's head, Cheshire heard the music also.
The text layer (t0 … t3): Cheshire, Dillard, his, Dillard.
The restriction layer: fsCheshire, fsDillard, fshis with candidates = {Cheshire, Dillard}, fsDillard.
The semantic layer: the discourse entities Cheshire and Dillard; the interpretation of "his" is delayed until later context disambiguates.
Delayed Interpretation
From the corner of the divan of Persian saddle-bags on which he was lying,
smoking, as was his custom, innumerable cigarettes, Lord Henry Wotton could just
catch the gleam of the honey-sweet and honey-coloured blossoms of a laburnum…
The text layer (in time order): he (t0), his (t1), Lord Henry Wotton (t2).
The restriction layer: the pronouns project {gender = masc, number = sing, sem = person} with the name slot still open (?); the proper name contributes name = Lord Henry Wotton.
The semantic layer: the pronoun's projection initiates the evoking of a discourse entity, which the later proper name completes.
The case of Cataphora
1. Although Bill would have wanted it,
2. John sold his bicycle to somebody else.
The text layer: "it" (clause 1), "his bicycle" (clause 2).
The restrictions layer: "it" projects {no = sg, sem = ¬human}; "his bicycle" projects {no = sg, sem = bicycle, det = yes}.
The semantic layer: both evoke the same discourse entity {no = sg, sem = bicycle, sem = ¬human, det = yes}.
AR Models
• a set of primary attributes
• a set of knowledge sources
• a set of evocation heuristics or rules
• a set of rules that configure the domain of referential accessibility
AR Models
Figure: the text layer contains REa, REb, REc, REd, …, REx; knowledge sources project them onto the projection layer, where primary attributes (attrx) are filled in; heuristics/rules, applied within the domain of referential accessibility, link the projections to discourse entities DE1, …, DEj, …, DEm on the semantic layer.
Set of Primary Attributes
a. morphological
number
lexical gender
person
Set of Primary Attributes
b. syntactical
- full syntactic description of REs as constituents of a syntactic tree [Lappin and Leass, 1994]; CT-based approaches [Grosz, Joshi and Weinstein, 1995], [Brennan, Friedman and Pollard, 1987]; syntactic-domain-based approaches [Chomsky, 1981], [Reinhart, 1981], [Gordon and Hendrick, 1998], [Kennedy and Boguraev, 1996]
- quality of being adjunct, embedded or complement of a preposition [Kennedy and Boguraev, 1996]
- inclusion or not in an existential construction [Kennedy and Boguraev, 1996]
- syntactic patterns in which the RE is involved: syntactic parallelism [Kennedy and Boguraev, 1996], [Mitkov, 1997]
Set of Primary Attributes
c. semantic
-position of the head of the RE in a conceptual
hierarchy (animacy, sex (or natural gender),
concreteness)
WordNet based models [Poesio, Vieira and
Teufel, 1997]
-inclusion in a synonymy class
-semantic roles, out of which selectional
restrictions, inferential links, pragmatic
limitations, semantic parallelism and object
preference can be verified
Set of Primary Attributes
d. positional
- offset of the first token of the RE in the text [Kennedy and Boguraev, 1996]
- inclusion in an utterance, sentence or clause, considered as a discourse unit [Hobbs, 1987], [Azzam, Humphreys and Gaizauskas, 1998], [Cristea et al., 2000]
Set of Primary Attributes
e. surface realisation (type)
The domain of this feature contains: zero-pronoun, clitic pronoun, full pronoun, reflexive pronoun, possessive pronoun, demonstrative pronoun, reciprocal pronoun, expletive "it", bare noun (undetermined NP), indefinite determined NP, definite determined NP, proper noun (name)
[Gordon and Hendrick, 1998], [Cristea et al., 2000]
Set of Primary Attributes
f. other
- inclusion or not of the RE in a specific lexical field ("domain concept") [Mitkov, 1997]
- frequency of the term in the text [Mitkov, 1997]
- occurrence of the term in a heading [Mitkov, 1997]
Knowledge Sources
• Type of process: incremental
• A knowledge source: a (virtual) processor able to supply values for attributes on the restriction layer
• Minimum set: POS tagger + shallow parser
Knowledge Sources
• [Kennedy and Boguraev, 1996]: a marker of syntactic function and a set of patterns to recognise the expletive "it" (near specific sets of verbs or as subject of adjectives with clausal complements).
• [Azzam, Humphreys and Gaizauskas, 1998]: a syntactic analyser, a semantic analyser, and an elementary events finder.
• [Gordon and Hendrick, 1998]: a surface realisation identifier and a syntactic parser.
• [Hobbs, 1978]: a syntactic analyser, a surface realisation identifier and a set of axioms to determine semantic roles and relations of lexical items.
Heuristics/Rules
• Demolishing rules (applied first): rule out a possible candidate.
• Promoting/demoting rules: increase/decrease a salience factor associated with an attribute.
Heuristics/Rules
• [Kennedy and Boguraev, 1996]: a pronoun cannot co-refer with a constituent (NP) which contains it (in "the child of his brother", "his" is neither the child nor the brother). The remaining candidates are sorted by weighing a set of attribute-value pairs (linguistically and experimentally justified).
• [Gordon and Hendrick, 1997]: the antecedent's syntactic prominence (a notion related to relative distance in a syntactic tree) influences the selection of the co-referential candidate.
• [Gordon and Hendrick, 1998]: the salience of the relations between names and pronouns is calculated using a graduation of surface realisation pairs: name-pronoun > name-name > pronoun-name.
Referential Accessibility Domain
• Linear: Dorepaal, Mitkov, ...
• Hierarchical: Grosz & Sidner; Cristea, Ide & Romary, ...
Algorithm
• consider the DEs in the order given by component 4 (the domain of referential accessibility)
• for each attribute of the projected FS of the current anaphor and each candidate DE, use the rules of component 3 to update a preference score linking the anaphor to that DE as an antecedent
• sort the candidates in descending order of these scores
• use thresholds to either propose a new DE, link the anaphor to an existing DE, or postpone the decision
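A sketch of that decision procedure; the score update, the threshold values and the data representation are all illustrative assumptions.

```python
# Skeleton of the threshold-based linking decision described above.
NEW_ENTITY_THRESHOLD = 0.3     # assumed values
POSTPONE_THRESHOLD = 0.5

def resolve_anaphor(anaphor_fs, candidate_des, rules):
    scores = []
    for de in candidate_des:                      # DEs come in accessibility order
        score = 0.0
        for attr, value in anaphor_fs.items():
            score += rules(attr, value, de)       # component-3 promoting/demoting rules
        scores.append((score, de))
    if not scores:
        return ("new-DE", None)
    best_score, best_de = max(scores, key=lambda s: s[0])
    if best_score < NEW_ENTITY_THRESHOLD:
        return ("new-DE", None)                   # no candidate is good enough
    if best_score < POSTPONE_THRESHOLD:
        return ("postpone", None)                 # wait for more context
    return ("link", best_de)
```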