Core Technologies for IE
Günter Neumann & Feiyu Xu
{neumann, feiyu}@dfki.de
Language Technology-Lab
DFKI, Saarbrücken
Course: Intelligent Information Extraction
Neumann & Xu
Esslli Summer School 2004
Outline
• Overview
• Shallow technologies
• More NL understanding
• Hybrid approaches
• Additional Topics
Types of IE Tasks
Extraction of ...
• Topics
• Terms
• Named Entities
• Simple Relations
• Complex Relations (template filling)
• Answers to ad-hoc questions (QA systems)
• Some tasks require the merging of information from
different parts of a document (e.g. template merging)
and/or the fusion of information from several
documents (information fusion).
IE Core Technologies
• A wide range of technologies for information extraction has emerged.
• These technologies range from non-linguistic approaches, e.g., methods exploiting formatting and other formal properties of semi-structured documents, all the way to truly linguistic approaches.
IE Core Technologies
• Unstructured data in the sense of computer science usually exhibits a complex linguistic structure. NLP methods are employed to detect and exploit this structure.
• Depending on the amount of structural analysis, NLP methods may be classified as shallow or deep methods.
• Whereas deep methods try to analyze most or all of the linguistic structure, shallow methods derive much less structure.
IE Core Technologies
• Discrete methods are based on symbolic rules, patterns, and principles such as rewriting rules or regular expressions.
• Non-discrete methods are based on statistical/probabilistic techniques or neural networks.
• The choice of method depends on a number of criteria:
• Task: what should be extracted? (topics, names, terms, relations, template fillers, answers)
• What are the performance criteria of the application? (recall, precision, efficiency)
• What are the design properties of the system: maintainability, adaptivity, interoperability, scalability, etc.?
• Deep methods are accurate but lack efficiency and often also coverage (recall).
• Shallow methods are generally more efficient than deep ones and usually also support wider coverage and therefore higher recall.
• Many discrete methods are well-suited for the manual design of rule or pattern sets.
• Non-discrete methods offer the advantage of modelling soft regularities, especially the weighted contributions of several factors to classification decisions. For many non-discrete methods, very effective machine-learning schemes exist, supporting adaptivity and scalability.
A two-dimensional classification of methods:

              discrete (symbolic)        non-discrete (statistical)
  shallow     FSTs, chunk parser         weighted FSTs, HMMs, SVMs
  deep        CFG parser, HPSG parser    PCFG parser, HPSG parser with
                                         statistical parse selection
Hybrid approaches combine cells of this classification, e.g., linking shallow components (FSTs, SVMs) with a deep HPSG parser.
Shallow NLP
• Coarse-grained linguistic analysis tailored to a specific domain
• Less structure
• Direct mapping from linguistic to domain structure
• Widely used approach
• Bottom-up chunk recognition (named entities, NPs, verb groups)
• Template filling on the basis of domain-specific verb frames or other patterns

Example: Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a Japanese trading house to produce golf clubs to be supplied to Japan.

Recognized as a joint-venture event by applying the template pattern:
[COMPANY] [SET-UP] [JOINT-VENTURE] (others)* with [COMPANY]
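As a minimal sketch of how such a template pattern can be matched over a chunked sentence (the chunk labels and the `match_joint_venture` helper are illustrative assumptions, not FASTUS or SPPC output):

```python
import re

# Each input item is a (chunk_label, text) pair, as a shallow analyzer
# might produce them; labels here are hypothetical.
chunks = [
    ("COMPANY", "Bridgestone Sports Co."),
    ("SET-UP", "had set up"),
    ("JOINT-VENTURE", "a joint venture"),
    ("OTHER", "in Taiwan"),
    ("WITH", "with"),
    ("COMPANY", "a Japanese trading house"),
]

def match_joint_venture(chunks):
    """Match [COMPANY] [SET-UP] [JOINT-VENTURE] (others)* with [COMPANY]
    over the label sequence and fill a small template."""
    # Encode the label sequence as a string so a regular expression can
    # express the pattern, including the (others)* gap.
    labels = " ".join(label for label, _ in chunks)
    pattern = r"COMPANY SET-UP JOINT-VENTURE (?:OTHER )*WITH COMPANY"
    if not re.fullmatch(pattern, labels):
        return None
    companies = [text for label, text in chunks if label == "COMPANY"]
    return {"event": "JOINT-VENTURE",
            "partner_1": companies[0],
            "partner_2": companies[1]}

template = match_joint_venture(chunks)
```

Encoding the labels as a string lets ordinary regular-expression machinery play the role of the finite-state pattern matcher.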
Shallow Techniques
• Use of finite-state transducers which map from surface form to IE-relevant concepts and relations
• Systems and platforms
  • SRI technologies: FASTUS
  • Sheffield: GATE
  • DFKI IE core technologies: IE from real-world German free texts; SProUT
Why Finite State Technology?
• most of the relevant local phenomena can be easily expressed as finite-state devices
• time & space efficiency
• existence of efficient determinization, minimization and optimization transformations
• there exist much more powerful formalisms like context-free grammars or unification grammars, but application developers prefer more pragmatic solutions
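The determinization transformation mentioned above can be sketched via the classic subset construction; this compact version (state and transition encoding are assumptions for illustration) turns a nondeterministic automaton into an equivalent deterministic one:

```python
from collections import deque

def determinize(nfa, start, alphabet):
    """Subset construction: turn an NFA (dict: state -> symbol -> set of
    states) into an equivalent DFA whose states are frozensets of NFA
    states. Final-state bookkeeping is omitted for brevity."""
    dfa = {}
    start_set = frozenset([start])
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        if current in dfa:
            continue
        dfa[current] = {}
        for sym in alphabet:
            # Union of all NFA successors under `sym`.
            target = frozenset(s for q in current
                               for s in nfa.get(q, {}).get(sym, ()))
            if target:
                dfa[current][sym] = target
                queue.append(target)
    return dfa, start_set

# Nondeterministic keyword matcher: in state 0, 'a' may stay or advance.
nfa = {0: {"a": {0, 1}, "b": {0}}, 1: {"b": {2}}}
dfa, q0 = determinize(nfa, 0, "ab")
```

Each DFA state is the set of NFA states the machine could be in, which is exactly why deterministic search over keywords is fast: one transition per input symbol.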
DFKI Finite State Tools: Overview
• efficiency-oriented implementation
• architecture and functionality are mainly based on the tools developed by AT&T
• representation: textual format vs. compressed binary format

  Example of the textual format (one transition per line: source state, target state, input label, output label, weight; final states are listed with their final weight):

    0 1 a b 1.0
    0 2 a c 2.0
    1 3.0
    2 4.0

• most of the provided operations are based on recent approaches (Mohri, Pereira, Roche, Schabes)
DFKI Finite State Tools: Overview
•
operations are divided into four main pools:
converting operations:
- converting textual representation into binary format and vice versa
- creating graph representation for FSMs, etc.
rational and combination operations:
- union
- reversion
- concatenation
- intersection
- closure
- inversion
- local extension
- composition
equivalence transformations:
- determinization
- epsilon removal
- bunch of minimization algorithms
- trimming
other:
- extending input/output alphabet
- collecting arcs with identical labels
- determinicity test
- information about an FSM
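The rational and combination operations in the list above can be illustrated at the language level with ordinary regular expressions (a sketch only; inversion, local extension and composition are transducer operations with no direct regex counterpart):

```python
import re

# Rational operations on regular languages, expressed by composing
# regular-expression patterns.
def union(p, q):
    """L(p) ∪ L(q)"""
    return f"(?:{p})|(?:{q})"

def concatenation(p, q):
    """L(p) · L(q)"""
    return f"(?:{p})(?:{q})"

def closure(p):
    """L(p)* (Kleene closure)"""
    return f"(?:{p})*"

def accepts(pattern, s):
    """Membership test for the regular language denoted by `pattern`."""
    return re.fullmatch(pattern, s) is not None
```

Equivalence transformations such as determinization and minimization then operate on the automata compiled from such expressions without changing the accepted language.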
DFKI Finite State Tools: Examples
• keywords recognition: automata search
• keywords recognition: deterministic search
• keywords recognition: minimal deterministic search
(The automata figures for these three search variants are omitted.)
DFKI Finite State Tools: Examples
• contextual part-of-speech filtering rule: if the previous word form is a determiner and the next word form is a noun, then filter out the verb reading
  example: ... die bekannten Bilder ... ("the known pictures")
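A minimal executable version of this filtering rule (the reading-set representation is an assumption; the DFKI tools implement such rules as finite-state transducers):

```python
def pos_filter(tagged):
    """Contextual POS filtering: each token carries a set of candidate
    readings; drop the verb reading of a word standing between a
    determiner and a noun. A sketch of the slide's rule, not DFKI code."""
    result = []
    for i, (word, readings) in enumerate(tagged):
        prev_tags = tagged[i - 1][1] if i > 0 else set()
        next_tags = tagged[i + 1][1] if i + 1 < len(tagged) else set()
        if ("verb" in readings and len(readings) > 1
                and prev_tags == {"det"} and "noun" in next_tags):
            readings = readings - {"verb"}
        result.append((word, readings))
    return result

# "die bekannten Bilder": "bekannten" is adjective ("known") or verb
# ("confessed"); the context forces the adjective reading.
sentence = [("die", {"det"}), ("bekannten", {"adj", "verb"}),
            ("Bilder", {"noun"})]
filtered = pos_filter(sentence)
```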
Traditional IE Architecture

Main stages: Tokenization; Local Text Analysis; Morphological and Lexical Processing; Parsing; Discourse Analysis.

Substages include: text sectionizing and filtering; part-of-speech tagging; word sense tagging; fragment processing; fragment combination; scenario pattern matching; coreference resolution; inference; template merging.
FASTUS
FASTUS
• Acronym for Finite State Automaton Text Understanding System
• Developed at SRI International, Menlo Park, California
• Participated in MUC-4 (92), MUC-5 (93), MUC-6 (95), MUC-7 (97)
• English and Japanese
• Inspired by
  • the performance in MUC-3 that the group at the University of Massachusetts got out of a simple system [Lehnert et al., 1991]
  • Pereira's work on finite-state approximations of grammars [Pereira, 1990]
• Works as a set of cascaded and nondeterministic finite-state automata
FASTUS Architecture
Text
→ Complex Words (tokenization): e.g. "joint venture", "set up", "Bridgestone Sports Taiwan Co."
→ Recognizing Phrases: basic phrases (noun groups, verb groups, e.g. "had set up") and complex phrases (appositive attachment: "Peter, the director", "a local concern"; pp attachment: "production of iron and wood")
→ Domain Event Patterns (parsing)
→ Merging Structures (discourse analysis)
Example of Phrase Recognition
Salvadoran President-elect Alfredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of the crime.

Noun Group:   Salvadoran President-elect
Name:         Alfredo Cristiani
Verb Group:   condemned
Noun Group:   the terrorist killing
Preposition:  of
Noun Group:   Attorney General
Name:         Roberto Garcia Alvarado
Conjunction:  and
Verb Group:   accused
Noun Group:   the Farabundo Marti National Liberation Front (FMLN)
Preposition:  of
Noun Group:   the crime
Example of Domain Event Pattern Recognition
[Salvadoran President-elect Alfredo Cristiani]2 condemned [the terrorist killing of Attorney General Roberto Garcia Alvarado]1 and [accused the Farabundo Marti National Liberation Front (FMLN) of the crime]2

Two patterns are recognized:
1. <Perpetrator> killing of <HumanTarget>
2. <GovtOfficial> accused <PerpOrg> of <Incident>

Two incident structures are constructed:

Incident:      KILLING
Perpetrator:   "terrorist"
Confidence:    __
Human Target:  "Roberto Garcia Alvarado"

Incident:      KILLING
Perpetrator:   FMLN
Confidence:    Accused by Authorities
Human Target:  __
“pseudo-syntax”
• The material between the end of the subject noun group and the beginning of the main verb group must be read over, for example:
  - read over prepositional phrases and relative clauses
    1. Subject {Preposition NounGroup}* VerbGroup
    2. Subject Relpro {NounGroup | Other}* VerbGroup
    The mayor, who was kidnapped yesterday, was found dead today.
• Conjoined verb phrases: skip over the first conjunct and associate the subject with the verb group in the second conjunct
    Subject VerbGroup {NounGroup | Other}* Conjunction VerbGroup
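The two "read over" patterns can be encoded as regular expressions over a chunk-tag sequence; the tag names below (NG, VG, PREP, RELPRO, O) are illustrative assumptions, not FASTUS's actual labels:

```python
import re

# Pattern 1: Subject {Preposition NounGroup}* VerbGroup
# Pattern 2: Subject Relpro {NounGroup | Other}* VerbGroup
PATTERNS = [
    re.compile(r"NG (?:PREP NG )*VG"),
    re.compile(r"NG RELPRO (?:(?:NG|O) )*VG"),
]

def main_verb_index(tags):
    """Return the index of the verb group to associate with the
    sentence-initial subject, or None if no pattern applies."""
    seq = " ".join(tags)
    for pat in PATTERNS:
        m = pat.match(seq)
        if m:
            # Index of the VG chunk = number of tags consumed before it.
            return len(m.group(0).split()) - 1
    return None
```

For tags like `["NG", "PREP", "NG", "VG"]` the intervening prepositional phrase is read over and the subject is linked to the verb group at index 3.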
Problem of “pseudo-syntax”
• The same semantic content can be realized in different forms:
  GM manufactures cars.
  Cars are manufactured by GM.
  GM, which manufactures cars, ...
  ... cars, which are manufactured by GM, ...
  ... cars manufactured by GM ...
  GM is to manufacture cars.
  Cars are to be manufactured by GM.
  GM is a car manufacturer.
• Question: how many rules are needed to extract all relevant patterns? Why not use a linguistic theory?
  • performance vs. competence
  • TACITUS needed hours to process the MUC messages; FASTUS needed only minutes for the same set
Example of Template Merging
Salvadoran President-elect Alfredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of the crime.

Two incident structures are merged:

Incident:      KILLING
Perpetrator:   "terrorist"
Confidence:    __
Human Target:  "Roberto Garcia Alvarado"

Incident:      KILLING
Perpetrator:   FMLN
Confidence:    Accused by Authorities
Human Target:  __

Merged result:

Incident:      KILLING
Perpetrator:   FMLN
Confidence:    Accused by Authorities
Human Target:  "Roberto Garcia Alvarado"
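A slot-wise merge of this kind can be sketched as follows; the `compatible` mechanism (declaring that a generic filler like "terrorist" may unify with the entity it refers to) is an assumption standing in for FASTUS's coreference handling:

```python
def merge_templates(t1, t2, compatible=frozenset()):
    """Slot-wise merge of two incident templates: a gap (None) unifies with
    anything, equal fillers unify, and fillers declared compatible unify
    to the more specific one. Returns None if any slot clashes.
    A sketch, not the FASTUS implementation."""
    merged = {}
    for slot in t1.keys() | t2.keys():
        v1, v2 = t1.get(slot), t2.get(slot)
        if v1 is None or v1 == v2:
            merged[slot] = v2 if v2 is not None else v1
        elif v2 is None:
            merged[slot] = v1
        elif (v1, v2) in compatible:
            merged[slot] = v2      # keep the more specific filler
        elif (v2, v1) in compatible:
            merged[slot] = v1
        else:
            return None            # incompatible fillers: do not merge
    return merged

killing_1 = {"Incident": "KILLING", "Perpetrator": "terrorist",
             "Confidence": None, "HumanTarget": "Roberto Garcia Alvarado"}
killing_2 = {"Incident": "KILLING", "Perpetrator": "FMLN",
             "Confidence": "Accused by Authorities", "HumanTarget": None}
merged = merge_templates(killing_1, killing_2,
                         compatible=frozenset({("terrorist", "FMLN")}))
```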
GATE
GATE: a General Architecture for Text Engineering
• An architecture: a macro-level organisational picture for LE software systems.
• A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment: for language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.
• Some free components ... and wrappers for other people's components
• Tools for: evaluation; visualisation/editing; persistence; IR; IE; dialogue; ontologies; etc.
• Free software (LGPL). Download at http://gate.ac.uk/download/
DFKI IE Core Technologies
Information Extraction from
real-world German text
SMES
Domain Adaptive IE Architecture
MAIN GOALS:
• systematic treatment of domain-independent and domain-specific knowledge
• robust and fast shallow text processor (mildly deep)
• abstract level of linking between linguistic and domain knowledge

IE core system: Text → Shallow Text Processor → UFDs → Template Generation → Instantiated Templates, with both processing steps drawing on domain knowledge (ontology, lexicon).
Underspecified (Partial) Functional Descriptions (UFDs)

UFD: flat dependency-based structure; only upper bounds for attachment and scoping.

[PN Die Siemens GmbH] [V hat] [year 1988] [NP einen Gewinn] [PP von 150 Millionen DM], [Comp weil] [NP die Auftraege] [PP im Vergleich] [PP zum Vorjahr] [Card um 13%] [V gestiegen sind].

("The Siemens company has made a revenue of 150 million marks in 1988, since the orders increased by 13% compared to last year.")

Resulting UFD (sketched from the slide's tree):
  hat:     Subj = Siemens, Obj = Gewinn, PPs = {1988, von(150M)}, SC = weil
  steigen: Subj = Auftrag, PPs = {im(Vergleich), zum(Vorjahr), um(13%)}
Major Shallow Components
• C++ toolkit for weighted finite-state transducers (WFST)
  • e.g., determinization, minimization, concatenation, local extension (following advanced approaches developed at AT&T, Xerox)
  • compact and efficient internal representations
• Dynamic tries for lexical & morphological processing
  • recursive traversal (e.g., for compound & derivation analysis)
  • robust retrieval (e.g., shortest/longest suffix/prefix)
• Parameterizable XML output interface
• Both tools are portable across different platforms (Unix, Linux & Windows NT)
Domain-independent shallow text processing components

Example input: "rund ... bis ... Prozent der Steigerungsrate" ("about ... to ... percent growth rate")

ASCII documents pass through a finite-state cascade, all components drawing on a shared linguistic knowledge pool and writing their results to a text chart:
• Tokenizer: token classes, word forms
• Lexical Processor: online compounds (Steigerungsrate → steigerung[s]+rate), hyphen coordination
• POS Filtering: e.g. bis: prep|adv → bis: prep; contextual rules plus Brill's TBL; Schabes/Roche approach; subgrammars
• Named Entity Finder: e.g. "rund ... bis ... Prozent" → percentage NP; dynamic lexicon; reference resolution
• Phrase Recognizer / Clause Recognizer: generic NPs, PPs, VGs; divide-and-conquer chunk parser
• XML Output Interface (parametrizable) → XML documents
Tokenizer
• The goal of the TOKENIZER is to:
  - map sequences of consecutive characters into word-like units (tokens)
  - identify the type of each token, disregarding the context
  - perform word segmentation when necessary (e.g., splitting contractions into multiple tokens)
• overall more than 50 classes (proved to simplify processing on higher stages), e.g.:
  - NUMBER_WORD_COMPOUND ("...er")
  - ABBREVIATION (CANDIDATE_FOR_ABBREVIATION)
  - COMPLEX_COMPOUND_FIRST_CAPITAL ("ATT-Chief")
  - COMPLEX_COMPOUND_FIRST_LOWER_DASH ("d'Italia-Chefs")
• represented as a single WFSA (406 KB)
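A toy version of such a context-free token classifier; the class names echo the slide, but the regular expressions and the example strings are simplified guesses, not SPPC's definitions:

```python
import re

# Ordered token-class definitions; the first matching class wins.
TOKEN_CLASSES = [
    ("NUMBER_WORD_COMPOUND", re.compile(r"\d+er")),            # e.g. "50er" (hypothetical)
    ("CANDIDATE_FOR_ABBREVIATION", re.compile(r"[A-Za-z]+\.")),
    ("COMPLEX_COMPOUND_FIRST_CAPITAL", re.compile(r"[A-ZÄÖÜ]\S*-\S+")),
    ("FIRST_CAPITAL_WORD", re.compile(r"[A-ZÄÖÜ][a-zäöüß]+")),
    ("LOWER_WORD", re.compile(r"[a-zäöüß]+")),
    ("NUMBER", re.compile(r"\d+")),
]

def tokenize(text):
    """Split on whitespace and classify each token out of context."""
    tokens = []
    for surface in text.split():
        for name, pattern in TOKEN_CLASSES:
            if pattern.fullmatch(surface):
                tokens.append((surface, name))
                break
        else:
            tokens.append((surface, "OTHER"))
    return tokens
```

In the real system the whole classifier is compiled into one weighted finite-state automaton rather than tried class by class.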
Lexical Processor
• Tasks of the LEXICAL PROCESSOR:
  - retrieval of lexical information
  - recognition of compounds ("Autoradiozubehör": car radio equipment)
  - hyphen coordination ("Leder-, Glas-, Holz- und Kunststoffbranche": leather, glass, wooden and synthetic materials industry)
• the lexicon currently contains more than 700,000 German full-form words (tries)
• each reading is represented as a triple <STEM, INFLECTION, POS>
  example: "wagen" (to dare vs. a car)
STEM: „wag“
INFL: (GENDER: m,CASE: nom, NUMBER: sg)
(GENDER: m,CASE: akk, NUMBER: sg)
(GENDER: m,CASE: dat, NUMBER: sg)
(GENDER: m,CASE: nom, NUMBER: pl)
(GENDER: m,CASE: akk, NUMBER: pl)
(GENDER: m,CASE: dat, NUMBER: pl)
(GENDER: m,CASE: gen, NUMBER: pl)
POS: noun
STEM: „wag“
INFL: (FORM: infin)
(TENSE: pres, PERSON: anrede, NUMBER: sg)
(TENSE: pres, PERSON: anrede, NUMBER: pl)
(TENSE: pres, PERSON: 1, NUMBER: pl)
(TENSE: pres, PERSON: 3, NUMBER: pl)
(TENSE: subjunct-1, PERSON: anrede,
NUMBER: sg)
(TENSE: subjunct-1, PERSON: anrede,
NUMBER: pl)
(TENSE: subjunct-1, PERSON: 1, NUMBER: pl)
(TENSE: subjunct-1, PERSON: 3, NUMBER: pl)
(FORM: imp, PERSON: anrede)
POS: verb
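The trie-based compound recognition mentioned above can be sketched as recursive traversal with backtracking; the tiny trie, the lexicon entries and the linker inventory ("Fugenelemente") are illustrative assumptions:

```python
class Trie:
    """Minimal dynamic trie; the slides use tries for lexical retrieval
    and recursive compound analysis. A sketch, not the DFKI tries."""
    def __init__(self, words=()):
        self.root = {}
        for w in words:
            node = self.root
            for ch in w:
                node = node.setdefault(ch, {})
            node["$"] = True  # end-of-word marker

    def __contains__(self, word):
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node

def split_compound(word, trie, linkers=("", "s")):
    """Recursive compound analysis: split `word` into lexicon words,
    allowing an optional linking element between segments."""
    if word in trie:
        return [word]
    for i in range(len(word) - 1, 0, -1):   # prefer long first segments
        if word[:i] not in trie:
            continue
        for linker in linkers:
            if word[i:].startswith(linker):
                tail = split_compound(word[i + len(linker):], trie, linkers)
                if tail:
                    return [word[:i]] + tail
    return None

lexicon = Trie(["steigerung", "rate", "auto", "radio", "zubehoer"])
```

The same traversal, run over suffixes instead of prefixes, supports the robust shortest/longest suffix retrieval listed among the components.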
Part-of-Speech Filtering
• The task of the POS FILTER is to filter out implausible readings of ambiguous word forms
• a large share of German word forms is ambiguous (20% in the test corpus)
• contextual filtering rules (ca. 100)
  - example: "Sie bekannten, die bekannten Bilder gestohlen zu haben"
    (They confessed to having stolen the famous pictures)
    "bekannten": to confess vs. famous
    FILTERING RULE: if the previous word form is a determiner and the next word form is a noun, then filter out the verb reading of the current word form
• supplementary rules determined by Brill's tagger in order to achieve broader coverage
• rules represented as FSTs, plus hard-coded rules (filtering out rare readings)
Named Entity Finder
• The task of the NAMED ENTITY FINDER is the identification of:
  - entities: organizations, persons, locations
  - temporal expressions: time, date
  - quantities: monetary values, percentages, numbers
• Identification in two steps:
  - recognition patterns expressed as WFSAs are used to identify phrases containing potential candidates for named entities
  - additional constraints (depending on the type of a candidate) are used for validating the candidates, and an appropriate extraction rule is applied in order to recover the named entity
  example: "von knapp neun Milliarden auf über 43 Milliarden Spanische Pesetas"
  (from almost nine billion to more than 43 billion Spanish pesetas)
  TYPE: monetary
  SUBTYPE: monetary-prepositional-phrase
• Longest-match strategy
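The two-step recognize-then-validate scheme with longest match can be sketched as follows; the candidate regex and the determiner constraint are simplified assumptions standing in for the system's WFSAs:

```python
import re

# Step 1: a recognition pattern (a plain regex standing in for a WFSA)
# proposes candidate phrases; step 2: a validation constraint is applied
# and an extraction rule recovers the entity.
CANDIDATE = re.compile(r"(?:[A-ZÄÖÜ&][\wäöüß&.]*\s+)+(?:GmbH|AG|Co\.|Corp\.)")

def find_companies(text):
    entities = []
    pos = 0
    while pos < len(text):
        m = CANDIDATE.search(text, pos)
        if not m:
            break
        # Greedy matching gives the longest candidate at this position.
        words = m.group(0).split()
        # Validation constraint: disallow a determiner as the first word.
        if words[0] in {"Die", "Der", "Das"}:
            words = words[1:]
        entities.append(" ".join(words))
        pos = m.end()
    return entities

found = find_companies("Die Braun GmbH & Co. kooperiert mit der Marietta GmbH.")
```

This reproduces the slide's example: the candidate "Die Braun GmbH & Co." is trimmed to the entity "Braun GmbH & Co.".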
Named Entity Finder (cont.)
• Arcs of the WFSAs are predicates on lexical items:
  (a) STRING(s): holds if the surface string mapped by the current lexical item is of the form s
  (b) STEM(s): holds if the current lexical item has a preferred reading with stem s, or if it has no preferred reading but at least one reading with stem s
  (c) TOKEN(x): holds if the token type of the surface string mapped by the current lexical item is x
• Example: a simple automaton for the recognition of company names
  additional constraint: disallow a determiner reading for the first word
  candidate: "Die Braun GmbH & Co."; extracted: "Braun GmbH & Co."
Named Entity Finder (cont.)
• Additional lexica for geographical names, first names (persons) and company names, compiled as WFSAs (new token classes)
• Named entities may appear without designators (companies, persons)
• Dynamic lexicon for storing named entities without designators
• Candidates for named entities, example:
  "Da flüchten sich die einen ins Ausland, wie etwa der Münchner Strickwarenhersteller März GmbH oder der badische Strumpffabrikant Arlington Socks GmbH. Ab kommendem Jahr strickt März knapp drei Viertel seiner Produktion in Ungarn."
• Resolution of type ambiguity using the dynamic lexicon: if an expression can be a person name or a company name ("Martin Marietta Corp."), then use the type of the last entry inserted into the dynamic lexicon to make the decision
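The dynamic lexicon and its last-entry-wins heuristic amount to a small mapping from surface forms to the most recently assigned type (a sketch; the class name and API are assumptions):

```python
class DynamicLexicon:
    """Stores named entities seen with a designator so that later
    designator-less mentions can be classified; type ambiguity is
    resolved with the last-entry-wins heuristic from the slide."""
    def __init__(self):
        self.entries = {}   # surface form -> entity type (last insertion wins)

    def insert(self, name, etype):
        self.entries[name] = etype

    def classify(self, name):
        """Return the stored type, or None for unknown names."""
        return self.entries.get(name)

lex = DynamicLexicon()
# "Martin Marietta" could be a person or a company; the designator
# "Corp." made the last inserted entry a company, so the later bare
# mention is classified as a company.
lex.insert("Martin Marietta", "person")
lex.insert("Martin Marietta", "company")   # from "Martin Marietta Corp."
```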
Performance of Shallow Text Processing for German
• Basis: corpus of the German business magazine "Wirtschaftswoche" (1.2 MB, 197,118 tokens)
• Performance: ~32 sec. (~6160 words/sec; Pentium III, 500 MHz, 128 MB RAM)
• Evaluation (20,000 tokens):

                                Recall    Precision
  compound analysis             98.53%    99.29%
  part-of-speech filtering      74.50%    96.36%
  named entities (incl. NE reference resolution; ~85% R, 95.77% P overall):
    person names                81.27%    95.92%
    companies                   67.34%    96.69%
    locations                   75.11%    88.20%
    total                       73.94%    94.10%
  fragments (NPs, PPs)          76.11%    91.94%
Chunk parsing strategies for the improvement of robustness and coverage on the sentence level

Current chunk parser: Text (morph. analysed) → phrase recognition → stream of phrases → clause recognition → stream of sentences → grammatical function recognition

• bottom-up: first phrases, then sentence structure
• main problem: even the recognition of simple sentence structure depends on the performance of phrase recognition
• example: complex NPs (nominalization style), relative pronouns
  [Die vom Bundesgerichtshof und den Wettbewerbern als Verstoss gegen das Kartellverbot gegeisselte zentrale TV-Vermarktung] ist gängige Praxis.
  ([central television marketing, censured by the German Federal High Court and the guards against unfair competition as an act of contempt against the cartel ban,] is common practice)
A New Chunk Parser
Divide-and-conquer strategy:
• first compute the topological structure of the sentence
• second apply phrase recognition to the fields

Pipeline: Text (morph. analysed) → Field Recognizer → topological structures → Phrase Recognizer → sentence structures → Grammatical Functions → functional descriptions

Example:
[Diese Angaben konnte der Bundesgrenzschutz aber nicht bestätigen.] [Kinkel sprach von Horrorzahlen, [denen er keinen Glauben schenke]].
(This information couldn't be verified by the Border Police. Kinkel spoke of horrible figures that he didn't believe.)

Evaluation (400 sentences, 6306 words):
  verb groups:     98.59% F
  clause struct.:  91.62% F
  all components:  87.14% F

(more details in the Master's thesis of C. Braun, and in Neumann, Braun & Piskorski, ANLP)
The divide-and-conquer parser: a series of finite-state grammars

Stream of morph-syn words → Named Entities → Topological Structure → Verb Groups → Base Clauses → Clause Combination → Main Clauses → Phrase Recognition → underspecified dependency trees

Example:
Weil die Siemens GmbH, die vom Export lebt, Verluste erlitt, mußte sie Aktien verkaufen.
(Because the Siemens Corp., which strongly depends on exports, suffered from losses, they had to sell some shares.)

Verb groups:   Weil die Siemens GmbH, die vom Export Verb-FIN, Verluste Verb-FIN, Modv-FIN sie Aktien FV-Inf.
Base clauses:  Weil die Siemens GmbH, Rel-Clause, Verluste Verb-FIN, Modv-FIN sie Aktien FV-Inf.
Main clauses:  Subconj-Clause, Modv-FIN sie Aktien FV-Inf.
Advanced Multilingual Information Extraction with SProUT
(Shallow Processing with Unification and Typed feature structures)
Motivation for SProUT
• one platform for the development of multilingual STP systems
  • well-documented programming API
  • flexible interface to different processing modules
  • development and testing environment for resources
• TFSs as abstract interchange format
  • XML-based representation
• find a good trade-off between efficiency and expressiveness
  • word-based finite-state devices
  • typed unification-based grammars
SProUT XTDL Formalism
→ Combines typed feature structures (TFS) and regular expressions, including coreferences and functional application
→ XTDL grammar rules: production part on the LHS and output description on the RHS
→ A couple of standard regular operators:
  concatenation
  optionality           ?
  disjunction           |
  Kleene star           *
  Kleene plus           +
  n-fold repetition     {n}
  m-n span repetition   {m,n}
  negation              ~
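Most of these operators have direct counterparts in conventional regex syntax; the sketch below applies them to plain character strings rather than typed feature structures, purely to illustrate the operator set (negation ~ has no direct regex counterpart and is usually approximated with a negative lookahead):

```python
import re

# XTDL operator inventory mapped onto ordinary regex syntax.
OPERATORS = {
    "concatenation":       (r"ab",     ["ab"]),
    "optionality ?":       (r"ab?",    ["a", "ab"]),
    "disjunction |":       (r"a|b",    ["a", "b"]),
    "Kleene star *":       (r"a*",     ["", "a", "aaa"]),
    "Kleene plus +":       (r"a+",     ["a", "aaa"]),
    "n-fold repetition":   (r"a{3}",   ["aaa"]),
    "m-n span repetition": (r"a{1,2}", ["a", "aa"]),
}

def matching_operators():
    """Return the operators whose example strings all match."""
    return [name for name, (pat, strings) in OPERATORS.items()
            if all(re.fullmatch(pat, s) for s in strings)]
```

In XTDL the atoms under these operators are typed feature structures matched by unifiability, not characters, which is what distinguishes it from plain regular expressions.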
SProUT Architecture

(Architecture figure: the grammar development environment, where an XTDL grammar is compiled by a regular compiler over linguistic resources, feeds the online processing side; there, lexical processing turns input data into a stream of text items that the XTDL interpreter, built on JTFS and the finite-state machine toolkit, maps to structured output data.)
SProUT XTDL Formalism (example: a PP rule)

pp :> morph & [POS Prep, SURFACE #prep, INFL [CASE #c]]
      (morph & [POS Det, INFL [CASE #c, NUMBER #n, GENDER #g]]) ?
      (morph & [POS Adjective, INFL [CASE #c, NUMBER #n, GENDER #g]]) *
      morph & [POS Noun, SURFACE #noun_1, INFL [CASE #c, NUMBER #n, GENDER #g]]
      (morph & [POS Noun, SURFACE #noun_2, INFL [CASE #c, NUMBER #n, GENDER #g]]) ?
  -> phrase & [CAT pp,
               PREP #prep,
               AGR agr & [CASE #c, NUMBER #n, GENDER #g],
               CORE_NP #core_np],
  where #core_np = Append(#noun_1, " ", #noun_2).
SProUT XTDL Grammar Processing
→ Three-step processing:
  • matching of regular patterns using unifiability (LHS)
  • rule instance creation
  • unification of the rule instance and the matched input stream
→ Longest-match strategy
→ Ambiguities allowed
→ Optimization techniques:
  • problem: TFSs are treated as symbolic values by the FSM toolkit
  • sorting outgoing transitions from selected states (transition hierarchy under subsumption)
→ Rule prioritization
→ Output merging
SProUT Processing Resources
→ Tokenization
  • fine-grained token classification over a large set of classes
  • domain- and language-specific token subclassification:
    [SURFACE: "Berlin",
     TYPE: first_capital_word,
     SUBTYPE: city_suffix]
  • token post-segmentation
  • supports almost all European character sets
  • minor adjustments necessary when porting to a new language (e.g., a word_with_apostrophe class: "it's" → "it" + "'s")
SProUT Processing Resources
→ Morphology
  • full-form lexica (MMorph) vs. external morphology components
  • online shallow compound recognition
  • current coverage: full-form lexica for English, German, French, Dutch, Italian, Spanish, Polish and Czech, with compound recognition for some of them; word segmentation and a POS tagger for Japanese and Chinese; morphological disambiguation
SProUT Processing Resources
→ Gazetteer
  • allows associating entries with a list of arbitrary attribute-value pairs:
    Argentyny | GAZ_TYPE: country | CONCEPT: Argentyna | FULL_NAME: Republika Argentyny | CASE: genitive | CAPITAL: Buenos Aires | CONTINENT: ...
  • reuse of available resources across languages
  • multiple readings, e.g. for "Washington":
    [SURFACE: "Washington", TYPE: gaz_city, CONCEPT: c_washington_city]
    [SURFACE: "Washington", TYPE: gaz_state, CONCEPT: c_washington_state]
    [SURFACE: "Washington", TYPE: gaz_surname]
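A gazetteer with multiple readings per surface form reduces, at its core, to a mapping from strings to lists of attribute-value sets; the data layout below is an assumption for illustration:

```python
# A gazetteer entry: surface form -> list of readings, each an arbitrary
# set of attribute-value pairs (attribute names follow the slide).
GAZETTEER = {
    "Washington": [
        {"TYPE": "gaz_city", "CONCEPT": "c_washington_city"},
        {"TYPE": "gaz_state", "CONCEPT": "c_washington_state"},
        {"TYPE": "gaz_surname"},
    ],
    "Argentyny": [
        {"GAZ_TYPE": "country", "CONCEPT": "Argentyna",
         "FULL_NAME": "Republika Argentyny", "CASE": "genitive",
         "CAPITAL": "Buenos Aires"},
    ],
}

def lookup(surface):
    """Return all readings for a surface form (empty list if unknown),
    each reading extended with its SURFACE attribute."""
    return [dict(reading, SURFACE=surface)
            for reading in GAZETTEER.get(surface, [])]
```

Because a reading is just an attribute-value set, the same resource can be reused across languages by swapping the surface keys.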
SProUT Processing Resources
→ Reference Matcher
  • finds identity relations between recognized entities and unconsumed text fragments:
    "... CEO Prof. Dr. Martin Marietta mentioned new takeover ...
     ... Martin Marietta GmbH plans to ...
     ... Marietta said ..."
  • grammarians define how variants are built:
    person :> first_name last_name title
      -> [FIRST_NAME #first_name, LAST_NAME #last_name,
          TITLE #title, VARIANT #variant],
      where #variant = {Conc(#title, #last_name), #last_name}
  • parametrizable contextual frame
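The variant-building idea can be sketched as follows; the exact variant inventory and the matching policy are assumptions, not SProUT's grammar:

```python
def variants(first_name, last_name, title=None):
    """Build the variant set for a person entry, mirroring the rule
    above: the bare last name, the title + last name combination, and
    the full first + last name form."""
    out = {last_name, f"{first_name} {last_name}"}
    if title:
        out.add(f"{title} {last_name}")
    return out

def match_references(entities, fragments):
    """Link unconsumed text fragments to previously recognized entities
    whose variant set contains the fragment."""
    links = {}
    for frag in fragments:
        for entity, vs in entities.items():
            if frag in vs:
                links[frag] = entity
    return links

entities = {"Prof. Dr. Martin Marietta":
            variants("Martin", "Marietta", title="Prof. Dr.")}
links = match_references(entities, ["Marietta", "Martin Marietta"])
```

The parametrizable contextual frame of the real system would additionally restrict how far from the full mention such variant matches are accepted.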
SProUT IDE / Development Environment
(Screenshots of the integrated development environment were shown on these slides.)
SProUT: Implementation & Availability
→ Implementation: core components
  • FSM Toolkit (C++, JAVA APIs)
  • JTFS (JAVA)
  • XML-based general-purpose regular compiler (JAVA, C++)
  • linguistic processing components (JAVA)
  • integrated development environment (JAVA)
→ Availability
  • IDE & core components: Windows, Linux
  • JAVA APIs for integration into other frameworks
  • freeware(?)
SProUT: Applications
→ Industrial projects
  • CCA (Customer Care Automation, Telekom): QA system with dialog facilities for the telecommunication domain
  • AGROServer (Celi SRL): monitoring and mining customer orientation; http://www.celi.it
→ Internal projects
  • Extralink: information system combining information extraction and hyperlinking for the touristic domain
SProUT: Applications
→ EU-funded projects
  • AIRFORCE (AIR FOReCast in Europe): tools for forecasting air traffic in Europe
  • MEMPHIS (Multilingual Content for Flexible Format Internet Premium Services): platform for cross-lingual premium content services for thin clients; http://www.ist-memphis.org
  • DEEPTHOUGHT: hybrid system combining shallow and deep methods for knowledge-intensive information extraction; http://www.eurice.de/deepthought
Combining Deep and Shallow NLP
for IE
Isn't SNLP enough?
• Many simple NL tasks only need SNLP
  • e.g., text clustering, named entity recognition, term extraction, binary relation extraction
• Information extraction: a 60% f-measure barrier in the case of scenario template extraction (e.g., management succession). Complex phenomena:
  • grammatical function/deep case recognition
  • free word order in German
  • multiple sentences
  • reference resolution for template merging
• Content-oriented web services / Semantic Web
  • more precise structural extraction needed
  • e.g., ontology extraction, question answering systems
• Shallow systems are getting deeper, requesting more and more accurate linguistic information
  • integration of deep NL components
Possible Integration Scenario
The parallel approach:
• run SNLP and DNLP in parallel; prefer results from DNLP
• problem: different processing speeds when processing large quantities of text
  • tuned for precision: the slowest component determines the overall speed
  • tuned for speed: the overall precision is hardly distinguishable from a shallow-only system, because DNLP always times out

The Whiteboard approach:
• integrated, flexible architecture where components can play at their strengths
• call DNLP on the results of SNLP
  • results from SNLP are used to identify relevant candidates for on-demand use of DNLP (domain-specific criteria)
  • robustness is handled by SNLP to increase the coverage of DNLP
  • SNLP is used to guide DNLP towards the most likely syntactic analysis, leading to improved speed-up
Combining the Best of the Two Worlds for IE
• utilization of SNLP as the primary linguistic resource for
  • robustness, efficiency and domain-specific interpretation of local relationships
• integration of DNLP, if the analysis exists and is relevant, for extracting
  • relationships embedded in complex linguistic constructions, e.g., free word order, long-distance dependencies, control and raising, or passive
• combination of finite-state technology with a unification-based formalism for
  • definition of template-filling rules, scenario templates and template elements
  • using unification and subsumption check as basic operations for template merging
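The subsumption check used for template merging can be sketched with dictionaries standing in for typed feature structures (a simplification: real TFSs also carry a type hierarchy):

```python
def subsumes(general, specific):
    """A feature structure `general` subsumes `specific` if every
    attribute it constrains is present in `specific` with an equal or
    recursively subsumed value."""
    for attr, value in general.items():
        if attr not in specific:
            return False
        other = specific[attr]
        if isinstance(value, dict) and isinstance(other, dict):
            if not subsumes(value, other):
                return False
        elif value != other:
            return False
    return True

def merge(t1, t2):
    """Template merging via subsumption: if one template subsumes the
    other, keep the more specific one; otherwise no merge."""
    if subsumes(t1, t2):
        return t2
    if subsumes(t2, t1):
        return t1
    return None

general = {"Pred": "take over"}
specific = {"Pred": "take over", "Agent": "Mary Hopp",
            "Theme": "development sector"}
result = merge(general, specific)
```

Unification would go one step further and combine two templates that mutually extend each other, rather than only keeping the more specific of the two.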
A Simple Example

After the retirement of [Peter Smith]1, [Mary Hopp]2 was persuaded to take over [the development sector]3.

shallow: "retirement of [1: PN]" → [Person_Out: 1]
deep:    Pred: "take over", Agent: 2 ("Mary Hopp"), Theme: 3 ("development sector")
         → [Person_In: 2, Sector: 3]
Head-Driven Phrase Structure Grammar
(HPSG)
• A constraint-based, lexicalist approach to grammatical
theory
• Model languages as systems of constraints
• Linguistic units and their relations are expressed by
typed feature structures
• No transformations for unbounded dependencies
• Multiple inheritance hierarchies for lexicon organization
• For more information, see
• http://lingo.stanford.edu/erg.html
• http://www.coli.uni-sb.de/%7Ehansu/courses/hpsg_arch.html
• http://www.coli.uni-sb.de/%7Ehansu/psfiles/hpsg.ps
A Phrase Structure Example
HPSG
WHIE: A Hybrid IE Approach
• hybrid information extraction architecture built on
top of the Whiteboard Annotation Machine (WHAM)
• domain modeling for hybrid architecture
• semi-automatic construction of template filling rules
• heuristics for domain/relevance-driven access to
different levels of WHAM
• evaluation
• hybrid approach
• improvement after integration of deep NLP
Integration of Shallow and Deep Analysis

Goal of integrated "hybrid" syntactic processing:
⇒ robustness and efficiency of shallow analysis
⇒ precision and fine-grainedness of deep syntactic analysis

Lexical integration
• SPPC-HPSG interface: building HPSG lexicon entries "on the fly"
  – named entities, open-class categories (nouns, adjectives, adverbs, ...)
• HPSG-GermaNet integration (Siegel, Xu, Neumann 2001)
  – association with HPSG lexical sorts
⇒ coverage and robustness

Phrasal integration for "hybrid" syntactic processing
• robust, stochastic topological field parsing
• integration of shallow topological and deep syntactic parsing
⇒ efficiency and robustness
Integration of Shallow and Deep Analysis
— Problems and Solutions —

Integration of shallow and deep parsing
⇒ guiding deep analysis by shallow (chunk-based) pre-partitioning
⇒ interleaved (hybrid) shallow/deep processing

The deep-shallow mapping problem
• chunk parsing is not isomorphic to deep syntactic structure ("attachments")
Template Extraction System
(Xu & Krieger, 2003)
(architecture diagram) WHAM provides chunks and predicate-argument structures (PredArgs); shallow template filling (SProUT) applies pattern-based TF rules, while deep template filling applies PredArg-based TF rules; sentential and discourse-level template merging then apply template merging rules, using subsumption check and unification over a type hierarchy.
Typed Feature Structures (TFS) for the IE task

• as a formal language for modeling the domain vocabulary and its relations in the type hierarchy:
  – template elements
  – scenario templates
  – linguistic units
  – domain ontologies
  – rules for template filling and template merging
• dynamic online construction of templates
• the TFS unifier supports well-formed unification of TFSs and subsumption check
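The two basic operations named on the slide, unification and subsumption check, can be sketched over plain attribute-value structures. This is an illustrative Python approximation, not the typed-feature-structure engine used in the system; real TFSs additionally carry types ordered in a type hierarchy.

```python
# Feature structures as nested dicts; unification merges compatible
# structures, subsumption tests whether one is more general than another.

def unify(a, b):
    """Return the most specific structure compatible with both, or None."""
    if a is None or b is None:
        return None
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feat, val in b.items():
            if feat in result:
                u = unify(result[feat], val)
                if u is None:
                    return None          # feature clash: unification fails
                result[feat] = u
            else:
                result[feat] = val
        return result
    return a if a == b else None         # atomic values must match exactly

def subsumes(a, b):
    """True if a is more general than (or equal to) b."""
    if isinstance(a, dict) and isinstance(b, dict):
        return all(f in b and subsumes(v, b[f]) for f, v in a.items())
    return a == b

# Complementary shallow and deep template fillers merge without clashes:
shallow = {"person_out": "Peter Müller"}
deep = {"person_in": "Hans Becker", "division": "development sector"}
merged = unify(shallow, deep)
```

After the merge, `subsumes(shallow, merged)` holds: the merged template is more specific than either input, which is exactly the property template merging exploits.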
Template filling with deep and shallow analysis
(Example)
Interaction of passive, control, and multiple named entities:

"Because of the retirement von Peter Müller, Hans Becker was persuaded to take over the development sector."

Pattern-based TF rule (shallow TF):
    retirement • von • PN[1]  ⇒  PERSON_OUT [1] "Peter Müller"

Deep TF rule:
    DEEP:  PRED "take over", AGENT [1] "Hans Becker", THEME [2] "development sector"
    ⇒  PERSON_IN [1], DIVISION [2]

Merged template:
    PERSON_OUT:   "Peter Müller"
    PERSON_IN:    "Hans Becker"
    DIVISION:     "development sector"
    (the ORGANISATION and POSITION slots remain unfilled)
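The shallow side of the example can be approximated with a pattern matcher. A minimal sketch in Python: systems such as SProUT match patterns like "retirement • von • PN" over named-entity-annotated token streams; here a plain regex over raw text stands in, with a capitalized two-word sequence as a crude placeholder for a recognized person name (PN).

```python
import re

# Hypothetical stand-in for the slide's shallow TF rule:
# retirement . von . PN  =>  [person_out: PN]
PERSON_OUT_RULE = re.compile(
    r"retirement\s+von\s+(?P<pn>[A-ZÄÖÜ][\wäöüß]+\s+[A-ZÄÖÜ][\wäöüß]+)"
)

def shallow_fill(sentence):
    """Apply the pattern-based rule, returning partially filled templates."""
    return [{"person_out": m.group("pn")}
            for m in PERSON_OUT_RULE.finditer(sentence)]

sent = ("Because of the retirement von Peter Müller, Hans Becker was "
        "persuaded to take over the development sector.")
print(shallow_fill(sent))   # [{'person_out': 'Peter Müller'}]
```

Note what the shallow rule cannot see: the PERSON_IN and DIVISION slots require the deep analysis of the passive-plus-control construction in the second clause.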
Semi-Automatic Construction of Template Filling Rules

• learning relevant terms (Xu et al. 2002)
  – a KFIDF term-classification method
  – verb-noun collocations
• adaptation of relevant terms to the hybrid architecture
  – Shallow: adjectives and nouns as triggers for pattern-based template filling rules, e.g.,
      "The previous president of car supplier Kolbenschmidt, Heinrich Binder"
      "Successor of the officeholder Hans Guenter Merk"
  – Deep: verbs as triggers for lexicalized unification-based template filling rules, e.g.,
      "General manager Eugen Krammer (59), ..., will resign from his office on May 31, 1997"
⇒ if a domain is dominated by relevant verbs, it is helpful to integrate DNLP
⇒ domains can be characterized by the distribution of terms with respect to POSs
Single-Word Term Extraction

• a specific TFIDF-style measure: KFIDF
• a word is relevant if it occurs more frequently than other words in a certain category and rarely in other categories.

  KFIDF(w, cat) = docs(w, cat) × log( n × |cats| / cats(w) + 1 )

  docs(w, cat) = number of documents in the category cat containing the word w
  n            = smoothing factor
  cats(w)      = the number of categories in which the word w occurs
  |cats|       = the total number of categories

A similar measure is also used in [Buitelaar & Sacaleanu].
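The KFIDF measure as reconstructed above is easy to compute from category-labeled documents. A small sketch in Python; the corpus format and the choice of smoothing factor n are assumptions for illustration:

```python
import math
from collections import defaultdict

def kfidf_scores(corpus, n=1.0):
    """KFIDF(w, cat) = docs(w, cat) * log(n * |cats| / cats(w) + 1).
    corpus: list of (category, set_of_words) documents."""
    docs_in_cat = defaultdict(lambda: defaultdict(int))  # docs(w, cat)
    cats_of_word = defaultdict(set)                      # for cats(w)
    categories = set()
    for cat, words in corpus:
        categories.add(cat)
        for w in words:
            docs_in_cat[w][cat] += 1
            cats_of_word[w].add(cat)
    scores = {}
    for w, per_cat in docs_in_cat.items():
        for cat, d in per_cat.items():
            scores[(w, cat)] = d * math.log(
                n * len(categories) / len(cats_of_word[w]) + 1)
    return scores

corpus = [
    ("stock",      {"aktie", "kurs", "gewinner"}),
    ("stock",      {"aktie", "verlierer"}),
    ("succession", {"berufen", "kurs"}),
]
s = kfidf_scores(corpus)
# "aktie" occurs in two stock documents and only in the stock category,
# so it outscores "kurs", which is spread over both categories.
```

This reflects the intuition stated on the slide: frequency inside one category raises the score, spread across categories lowers it.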
Term Classification
Experiment
• Input: classified documents
• Domains: stock market, management
succession and narcotics
Stock Market

• most relevant terms are nouns:

  Aktienbörse   [stock exchange]
  Veränderung   [change]
  Gewinner      [winner]
  Verlierer     [loser]
  Hochtief      [up and down]
  Tief          [low]
  Carbon        [carbon]
  Aktie         [stock]
  Kurs          [stock price]
Management Succession

• most relevant terms are verbs:

  berufen        [appoint to]
  wählen         [choose]
  übernehmen     [accept]
  bestellen      [nominate]
  verlassen      [leave]
  wechseln       [change]
  ausscheiden    [resign]
  nachfolgen     [succeed]
  zurücktreten   [resign]
  antreten       [assume office]
Learning Relevant Verb-Noun Collocations

• verb-noun collocations, e.g.,
  – Ruhestand [retirement] + treten [step]
  – Leitung [leadership] + übernehmen [take over]
• German free word order
  ⇒ not sufficient to take into account only bigrams and trigrams; the word order between verbs and nouns is not fixed and very often discontinuous
  ⇒ consider all possible term pairs in a sentence, ignoring the linear order:
    – verb noun
    – adjective noun
    – noun noun
• using different association measures (Evert & Krenn):
  ⇒ mutual information (Church & Hanks, 1989)
  ⇒ t-test (Daille, 1996)
  ⇒ log-likelihood (Manning & Schütze, 1999)
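As a sketch of one of the listed association measures, pointwise mutual information can be computed over exactly the candidate pairs the slide describes: all lemma pairs within a sentence, ignoring linear order, which accommodates German free word order and discontinuous verb-noun pairs. The toy corpus and counts-based probability estimates are assumptions for illustration:

```python
import math
from itertools import combinations

def pmi_collocations(sentences):
    """Score unordered lemma pairs by pointwise mutual information."""
    pair_count, word_count, total_pairs = {}, {}, 0
    for lemmas in sentences:
        for w in lemmas:
            word_count[w] = word_count.get(w, 0) + 1
        # all unordered pairs co-occurring in the sentence
        for a, b in combinations(sorted(set(lemmas)), 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1
            total_pairs += 1
    n_words = sum(word_count.values())
    scores = {}
    for (a, b), c in pair_count.items():
        p_ab = c / total_pairs
        p_a = word_count[a] / n_words
        p_b = word_count[b] / n_words
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores

sents = [
    ["ruhestand", "treten", "direktor"],
    ["ruhestand", "treten"],
    ["direktor", "leitung", "uebernehmen"],
    ["leitung", "uebernehmen"],
]
scores = pmi_collocations(sents)
# ("ruhestand", "treten") and ("leitung", "uebernehmen") score highest:
# they always co-occur, while the "direktor" pairs are incidental.
```

In practice, t-test or log-likelihood is often preferred for sparse counts, since plain PMI overrates rare pairs.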
Integrating DNLP on Demand
If a sentence contains relevant verbs and verb-noun collocations in addition to other terms, the sentence should be passed to DNLP; otherwise, SNLP will be sufficient.
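The on-demand heuristic can be sketched as a simple routing decision. The relevance lists below are hypothetical stand-ins for the terms and collocations learned in the previous steps:

```python
# Route a lemmatized sentence to deep (DNLP) or shallow (SNLP) analysis,
# following the heuristic above: trigger DNLP only for sentences with
# relevant verbs or verb-noun collocations.

RELEVANT_VERBS = {"zurücktreten", "übernehmen", "berufen"}
RELEVANT_COLLOCATIONS = {("ruhestand", "treten"), ("leitung", "übernehmen")}

def needs_dnlp(lemmas):
    lemma_set = set(lemmas)
    if lemma_set & RELEVANT_VERBS:
        return True
    # a collocation fires if both members occur, regardless of order/distance
    return any(n in lemma_set and v in lemma_set
               for n, v in RELEVANT_COLLOCATIONS)

def route(lemmas):
    return "DNLP" if needs_dnlp(lemmas) else "SNLP"

print(route(["ruhestand", "treten", "direktor"]))  # DNLP
print(route(["aktie", "kurs", "steigen"]))         # SNLP
```

This keeps the expensive deep parser off the bulk of the text while reserving it for the sentences most likely to carry template-relevant relations.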
Evaluation of Template Extraction
• evaluation at the sentential level
• proof of
  • the usefulness of the hybrid architecture
  • how much improvement deep NLP helps to achieve
  • whether domain modeling gives the right prediction
Evaluation at Sentential Level:
Corpus

• construction of an evaluation corpus based on the following criterion:
  ⇒ the more relevant verbs and nouns a sentence contains, the more relevant the sentence is in this domain.

  RS = NV · lg( Σ_{i=1}^{NV} RVT_i ) + NN · lg( Σ_{j=1}^{NN} RNT_j ),  where NV > 0 and NN > 0

  RS:    relevance of a sentence
  NV:    number of relevant verbs in a sentence
  NN:    number of relevant nouns in a sentence
  RVT_i: the relevance of a verb term i occurring in the sentence
  RNT_j: the relevance of a noun term j occurring in the sentence

• selection of the 50 most relevant sentences in the management succession domain
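The sentence-relevance score RS as reconstructed above is a direct computation once term relevances are available (e.g., from KFIDF). A minimal sketch; the base of "lg" is not fixed by the slide, so base 2 is assumed here, and the relevance values are made up for illustration:

```python
import math

def sentence_relevance(verb_relevances, noun_relevances):
    """RS = NV * lg(sum of verb relevances) + NN * lg(sum of noun relevances),
    defined only for NV > 0 and NN > 0 (lg taken as log base 2 here)."""
    nv, nn = len(verb_relevances), len(noun_relevances)
    if nv == 0 or nn == 0:
        raise ValueError("RS is only defined for NV > 0 and NN > 0")
    return (nv * math.log2(sum(verb_relevances))
            + nn * math.log2(sum(noun_relevances)))

# A sentence with two relevant verbs and one relevant noun:
rs = sentence_relevance([2.6, 1.8], [2.2])
```

Multiplying by NV and NN weights sentences rich in relevant terms more heavily, which is exactly the selection criterion stated above.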
Annotation (manual)

• partially filled templates based on shallow and deep template filling rules
• construction of an idealized case, which is not affected by the performance of the current system

Example sentence: "Dr. Heinz Neckar ... appointed ... manager of ..., as successor of Dr. Herbert Ehrlich, retired."

<sentence id=1>
  <shallow>
    <template name="management_succession">
      <slot name="person_out">Dr. Herbert Ehrlich</slot>
    </template>
  </shallow>
  <deep>
    <template name="management_succession">
      <slot name="person_in">Dr. Heinz Neckar</slot>
    </template>
    <template name="management_succession">
      <slot name="person_out">Dr. Herbert Ehrlich</slot>
    </template>
  </deep>
</sentence>
Evaluation at Sentential Level: Scenarios
• applying sentential template merging to
• shallow templates only
• deep templates only
• fusion of shallow and deep templates
Evaluation Results

                          coverage   recall
  Shallow                   0.56      0.31
  Deep                      0.94      0.68
  union(Shallow, Deep)      0.96      0.92

coverage  = the percentage of sentences to which the template filler rules can be applied
precision = 1.0 (manual annotation)

⇒ verbs play a more important role than nouns and adjectives in the management succession domain ⇒ domain modeling
⇒ information extracted by shallow and deep template filling rules is complementary in most cases ⇒ hybrid architecture
Additional Important Issues
• Build rich and robust semantic representation
for IE
• Domain specific discourse analysis for
template merging
• Anaphora resolution
• Ontology inference
• Evaluation
Download