Relation Extraction

Pierre Bourreau
LSI-UPC
PLN-PTM
Plan

Relation Extraction description

Sampling templates

Reducing deep analysis errors…

Conclusion
Relation Extraction Description


Finding relations between entities in a text
Filling pre-defined template slots:
One value per field
Multi-value

Depends on the analysis:
Chunking
Tokenization
Sentence parsing…
Plan

Relation Extraction description

Sampling templates (Cox, Nicolson, Finkel, Manning)

Reducing deep analysis errors…

Conclusion
First Example: Sampling Templates



Example: workshop announcements
PASCAL corpus

Relations to extract:
Dates of events
Workshop and conference names, acronyms and URLs

Domain knowledge:
Constraints on dates
Constraints on names
PASCAL Corpus: semi-structured corpus

<0.26.4.95.11.09.31.hf08+@andrew.cmu.edu.0>
Type: cmu.andrew.academic.bio
Topic: "MHC Class II: A Target for Specific Immunomodulation of the
Immune Response"
Dates: 3-May-95
Time: <stime>3:30 PM</stime>
Place: <location>Mellon Institute Conference Room</location>
PostedBy: Helena R. Frey on 26-Apr-95 at 11:09 from andrew.cmu.edu
Abstract:
Seminar: Departments of Biological Sciences
Carnegie Mellon and University of Pittsburgh
Name: <speaker>Dr. Jeffrey D. Hermes</speaker>
Affiliation: Department of Autoimmune Diseases Research & Biophysical Chemistry
Merck Research Laboratories
Title: "MHC Class II: A Target for Specific Immunomodulation of the
Immune Response"
Host/e-mail: Robert Murphy, murphy@a.cfr.cmu.edu
Date: Wednesday, May 3, 1995
Time: <stime>3:30 p.m.</stime>
Place: <location>Mellon Institute Conference Room</location>
Sponsor: MERCK RESEARCH LABORATORIES
Schedule for 1995 follows: (as of 4/26/95)
Biological Sciences Seminars 1994-1995
Date      Speaker          Host
April 26  Helen Salz       Javier L~pez
May 3     Jefferey Hermes  Bob Murphy
MERCK RESEARCH LABORATORIES
PASCAL Corpus: semi-structured corpus

<1.21.10.93.17.00.39.rf1u+@andrew.cmu.edu.0>
Type: cmu.andrew.org.heinz.great-lake
Topic: Re: PresentationCC:
Dates: 25-Oct-93
Time: <stime>12:30</stime>
PostedBy: Richard Florida on 21-Oct-93 at 17:00 from andrew.cmu.edu
Abstract:
Folks:
<paragraph> <sentence>Our client has requested that the presentation be postponed until Monday
during regular class-time</sentence>. <sentence>He has been asked to make a presentaion for
the Governor of Michigan and Premier of Ontario tommorrow morning in
Canada, and was afraid he could not catch a plane in time to make our
presentation</sentence>. <sentence>After consulting with Rafael and a sub group of project
managers, it was decided that Monday was the best feasible presentation
alternative</sentence>. <sentence>Greg has been able to secure Room 2503 in Hamburg Hall for
our presentation Monday during regular class-time</sentence>. </paragraph>

<paragraph><sentence>We will meet tommmorow in <location>2110</location> at <stime>12:30</stime> (lunch provided) to finalize
presentation and briefing book</sentence>. <sentence>Also, the client has faxed a list of
reactions and questions for discussion which we should review</sentence>.
<sentence>Thanks very much for your hard work and understanding</sentence>. <sentence>Look forward to
seeing you tommorrow</sentence>.</paragraph>

Richard




Idea

Sampling templates:
Generate all available templates
Assign a probability to each of them

Relational model:
Constraints on dates: order
1. Submission dates
2. Acceptance dates
3. Workshop dates / camera-ready dates
Constraints on names:
Slots: name, acronym, URL
The URL is generated from the acronym
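
The "generate all available templates" step can be pictured as a Cartesian product over the candidate fillers found for each slot, with one extra "empty" option per slot. The sketch below is only an illustration under that assumption: the function name `enumerate_templates` and the candidate strings are made up for the example and do not come from the paper.

```python
from itertools import product

def enumerate_templates(candidates):
    """Yield every template: one candidate (or None) per slot.

    `candidates` maps a slot name to the list of filler strings found for it.
    """
    slots = list(candidates)
    # None stands for "slot left empty"
    options = [[None] + list(candidates[s]) for s in slots]
    for combo in product(*options):
        yield dict(zip(slots, combo))

# Hypothetical candidate fillers for two date slots
candidates = {
    "SUB_DATE": ["1 March", "15 March"],
    "ACC_DATE": ["1 April"],
}
templates = list(enumerate_templates(candidates))
print(len(templates))  # (2 + 1) * (1 + 1) = 6 templates
```

The number of templates grows multiplicatively with the number of candidates per slot, which is one reason to sample templates rather than enumerate them on real documents.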
Baselines

CRF
Maximum clique size: 2
Viterbi algorithm
Token => GATE tokenization

CMM
Same setup
Window of the four previous tokens
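
For readers who have not met the Viterbi algorithm mentioned for the CRF baseline, here is a minimal sketch of Viterbi decoding over a first-order label chain. It uses HMM-style start, transition and emission tables as a stand-in for CRF potentials; all labels, tokens and probabilities are invented for the example.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most probable state sequence for a first-order chain."""
    # best[t][s]: best log-score of any path ending in state s at position t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

lg = math.log
states = ["O", "DATE"]  # hypothetical label set
log_start = {"O": lg(0.8), "DATE": lg(0.2)}
log_trans = {
    "O": {"O": lg(0.7), "DATE": lg(0.3)},
    "DATE": {"O": lg(0.4), "DATE": lg(0.6)},
}
log_emit = {
    "O": {"deadline": lg(0.8), "May": lg(0.1), "3": lg(0.1)},
    "DATE": {"deadline": lg(0.1), "May": lg(0.5), "3": lg(0.4)},
}
labels = viterbi(["deadline", "May", "3"], states, log_start, log_trans, log_emit)
print(labels)  # ['O', 'DATE', 'DATE']
```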
Templates sampling


Tokens => p(L_i | L_{i-1}) or p(L_i | L_{i-1}, …, L_{i-4}) on 100 of documents
Template:
Each slot holds at most one filler value
-> date templates:
SUB_DATE
ACC_DATE
WORK_DATE
CAMREADY_DATE
Templates sampling


Tokens => p(L_i | L_{i-1}) or p(L_i | L_{i-1}, …, L_{i-4}) on 100 of documents
Template:
Each slot holds at most one filler value
-> name templates:
CONF_NAME
CONF_ACRO
CONF_URL
WORK_NAME
WORK_ACRO
WORK_URL
Templates sampling

D: a distribution over these templates, estimated on the training set => LOCAL MODEL (P_L)
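
One way to picture the local model is as an empirical distribution: draw many label sequences from the token-level model, turn each one into a template, and count. The sketch below assumes a toy first-order chain p(L_i | L_{i-1}) that ignores the observed tokens (a real CRF or CMM would condition on them); every name and number in it is hypothetical.

```python
import random
from collections import Counter

def sample_labels(n_tokens, start, trans, rng):
    """Draw one label sequence from a first-order chain p(L_i | L_{i-1})."""
    labels = []
    dist = start
    for _ in range(n_tokens):
        names = list(dist)
        label = rng.choices(names, weights=[dist[k] for k in names])[0]
        labels.append(label)
        dist = trans[label]
    return labels

def to_template(tokens, labels, slots):
    """Keep the first token labelled with each slot; None if the slot is unfilled."""
    template = {s: None for s in slots}
    for tok, lab in zip(tokens, labels):
        if lab in template and template[lab] is None:
            template[lab] = tok
    return tuple(template[s] for s in slots)

def local_model(tokens, start, trans, slots, n_samples=1000, seed=0):
    """Empirical distribution over templates from sampled labelings."""
    rng = random.Random(seed)
    counts = Counter(
        to_template(tokens, sample_labels(len(tokens), start, trans, rng), slots)
        for _ in range(n_samples)
    )
    return {t: c / n_samples for t, c in counts.items()}

# Hypothetical toy model over three tokens and two date slots
slots = ("SUB_DATE", "WORK_DATE")
tokens = ["1-March", "and", "3-May"]
start = {"O": 0.2, "SUB_DATE": 0.7, "WORK_DATE": 0.1}
trans = {
    "O": {"O": 0.4, "SUB_DATE": 0.1, "WORK_DATE": 0.5},
    "SUB_DATE": {"O": 0.8, "SUB_DATE": 0.1, "WORK_DATE": 0.1},
    "WORK_DATE": {"O": 0.8, "SUB_DATE": 0.1, "WORK_DATE": 0.1},
}
p_local = local_model(tokens, start, trans, slots)
```

Each key of `p_local` is one template (a tuple of fillers in slot order), and its value plays the role of P_L for that template.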
Templates scoring: Date Model



P_A/P: probability of each field being present/absent, estimated from the training data
P_o: ordering probability; violations of the ordering constraints are penalized
P_rel = P_A/P * P_o
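
A minimal sketch of the date score as a product of a presence term and an ordering term. The presence probabilities, the penalty value and the exact list of ordered pairs are assumptions chosen for the example; the slides only state that the former are set from training data and that violations are penalized.

```python
from datetime import date

# Pairs (earlier, later) implied by the ordering: submission, then acceptance,
# then workshop / camera-ready (no order assumed between the last two)
BEFORE = [
    ("SUB_DATE", "ACC_DATE"),
    ("ACC_DATE", "WORK_DATE"),
    ("ACC_DATE", "CAMREADY_DATE"),
    ("SUB_DATE", "WORK_DATE"),
    ("SUB_DATE", "CAMREADY_DATE"),
]

def presence_prob(template, p_present):
    """P_A/P: product over slots of P(present) or P(absent)."""
    p = 1.0
    for slot, prob in p_present.items():
        p *= prob if template.get(slot) is not None else 1.0 - prob
    return p

def ordering_prob(template, penalty=0.1):
    """P_o: multiply in a penalty for every violated ordering constraint."""
    p = 1.0
    for earlier, later in BEFORE:
        a, b = template.get(earlier), template.get(later)
        if a is not None and b is not None and a > b:
            p *= penalty
    return p

def relational_score(template, p_present, penalty=0.1):
    """P_rel = P_A/P * P_o."""
    return presence_prob(template, p_present) * ordering_prob(template, penalty)

# Hypothetical presence probabilities and two candidate templates
p_present = {"SUB_DATE": 0.9, "ACC_DATE": 0.8, "WORK_DATE": 0.95, "CAMREADY_DATE": 0.6}
good = {"SUB_DATE": date(2005, 3, 1), "ACC_DATE": date(2005, 4, 1),
        "WORK_DATE": date(2005, 6, 1), "CAMREADY_DATE": None}
bad = dict(good, SUB_DATE=date(2005, 5, 1))  # submission after acceptance

print(relational_score(good, p_present) > relational_score(bad, p_present))  # True
```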
Templates scoring: Name Model




Name->Acronym: independent module (likelihood score, Chang 2002): P_nam->acr
Acronym->URL: empirical probability from training: P_acr->url
Problem: missing entries give an advantage to incomplete templates.
=> P_A/P: weights the templates (in training, most values are filled)
P_rel = P_nam->acr * P_acr->url * P_A/P
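
The sketch below shows why the presence term matters in the name score. The two compatibility functions are toy stand-ins, not the Chang 2002 likelihood or the empirical acronym-to-URL probability; the constants, the rule that a missing entry scores 1.0, and the example URL are all assumptions made for illustration.

```python
def p_name_to_acronym(name, acronym):
    """Toy stand-in for the acronym likelihood module."""
    if name is None or acronym is None:
        return 1.0  # assumption: a missing entry is not penalized here
    initials = "".join(w[0] for w in name.split() if w[0].isupper())
    return 0.9 if initials == acronym.upper() else 0.1

def p_acronym_to_url(acronym, url):
    """Toy stand-in for the empirical acronym -> URL probability."""
    if acronym is None or url is None:
        return 1.0  # assumption: a missing entry is not penalized here
    return 0.8 if acronym.lower() in url.lower() else 0.2

def p_presence(template, p_filled=0.9):
    """P_A/P with one shared, hypothetical probability that a slot is filled."""
    p = 1.0
    for value in template.values():
        p *= p_filled if value is not None else 1.0 - p_filled
    return p

def name_score(template):
    """P_rel = P_nam->acr * P_acr->url * P_A/P."""
    return (p_name_to_acronym(template["name"], template["acronym"])
            * p_acronym_to_url(template["acronym"], template["url"])
            * p_presence(template))

full = {"name": "International Conference on Machine Learning",
        "acronym": "ICML", "url": "http://www.example.org/icml"}
partial = {"name": full["name"], "acronym": None, "url": None}

print(name_score(full) > name_score(partial))  # True
```

Without `p_presence`, the incomplete template would score 1.0 against 0.72 for the complete one; the presence term reverses that ranking.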
Results: 300 documents
Results

No improvement over CRF
CRF accepts variation (e.g. in names)
The relational model does not improve on the CRF (not shown on the graph)
=> lower recall
Small CRF window => less information in the distribution
Substantial improvement over CMM (5%)
Plan

Relation Extraction description

Sampling templates

Reducing deep analysis errors (Zhao, Grishman)

Conclusion
Problem

Use different levels of syntactic analysis for the task:
Tokenization
Chunking
Sentence parsing
…
The more information they give, the less accurate they are.
=> Combine them to correct errors
ACE task… remember

Entities:
PERson, ORGanisation, FACility, GeoPoliticalEntity, LOCation, WEApon, VEHicle

Mentions:
NAM (proper), NOM (nominal), PRO (pronoun)

Relations:
EMP-ORG, PHYS, GPE-AFF, PER-SOC, DISC, ART, Other
Kernel, SVM … nice properties

Kernel:
A function that replaces the scalar (dot) product of two vectors
Lets us map a problem into a higher-dimensional space to solve it
Sums and products of kernels are again kernels

SVM:
An SVM can pick out the features that give the best separation
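
A small worked check of the kernel properties listed above, using the homogeneous quadratic kernel on 2-D vectors as the example (chosen here for illustration; it is not one of the kernels of the paper). The kernel value equals a dot product in an explicit 3-D feature space, and sums and products of kernels can be built as plain function combinations.

```python
import math

def dot(x, y):
    """Ordinary scalar product."""
    return sum(a * b for a, b in zip(x, y))

def poly2(x, y):
    """Homogeneous quadratic kernel: k(x, y) = (x . y)^2."""
    return dot(x, y) ** 2

def phi(x):
    """Explicit feature map for poly2 on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def k_sum(k1, k2):
    """The sum of two kernels, itself a kernel."""
    return lambda x, y: k1(x, y) + k2(x, y)

def k_prod(k1, k2):
    """The product of two kernels, itself a kernel."""
    return lambda x, y: k1(x, y) * k2(x, y)

x, y = (1.0, 2.0), (3.0, 4.0)
print(poly2(x, y))                   # 121.0
print(k_prod(dot, dot)(x, y))        # 121.0, same kernel built as a product
print(k_sum(dot, poly2)(x, y))       # 132.0
```

`dot(phi(x), phi(y))` gives the same value as `poly2(x, y)` up to floating-point rounding, which is the "higher-dimensional space" view of the kernel.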
The relational model

R=(arg1, arg2, seq, link, path)
arg1, arg2: the two entities to compare
seq=(t1, …, tn): sequence of intervening tokens
link=(t1, …, tm): like seq, but keeping only the important words
path: a dependency path…

T=(word, pos, base)
pos: part-of-speech tag
base: morphological base form

E=(tk, type, subtype, mtype)
type: the ACE entity type
subtype: a refinement of the type
mtype: the way the entity is mentioned

DT=(T, dseq)
dseq=(arc1, …, arcn)

ARC=(w, dw, label, e)
w: current token
dw: token connected to w
label: role label of this arc
e: direction of the arc
The relational model: example
arg1=((“areas”, “NNS”, “area”, dseq), “LOC”, “region”, “NOM”)
arg1.dseq=((OBJ, areas, in, 1), (OBJ, areas, controlled, 1))
path=((OBJ, areas, controlled, 1), (SBJ, controlled, troops, 0))
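
The tuples of the relational model map naturally onto small record types. The sketch below encodes the example with Python dataclasses; the class names are invented, the token triple T is flattened into the dependency token, and the reading of each arc as (label, w, dw, e) follows the tuple order of the example and is an assumption. Only arg1 and path are built, since the example gives nothing else.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Arc:
    label: str  # role label of the arc
    w: str      # current token
    dw: str     # token connected to w
    e: int      # direction of the arc

@dataclass(frozen=True)
class DepToken:
    """DT = (T, dseq), with T = (word, pos, base) flattened in."""
    word: str
    pos: str
    base: str
    dseq: Tuple[Arc, ...] = ()

@dataclass(frozen=True)
class Entity:
    """E = (tk, type, subtype, mtype)."""
    tk: DepToken
    type: str
    subtype: str
    mtype: str

areas = DepToken(
    word="areas", pos="NNS", base="area",
    dseq=(Arc("OBJ", "areas", "in", 1), Arc("OBJ", "areas", "controlled", 1)),
)
arg1 = Entity(tk=areas, type="LOC", subtype="region", mtype="NOM")
path = (Arc("OBJ", "areas", "controlled", 1), Arc("SBJ", "controlled", "troops", 0))

# The first arc of the path is also one of arg1's local dependency arcs
print(path[0] in arg1.tk.dseq)  # True
```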
Kernels
1. Argument kernel:
Matches two tokens, comparing each of their attributes (word, pos, type…)
2. Bigram kernel:
Matches tokens within a window of size 1
3. Link sequence kernel:
Relations often occur in a short context.
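
One plausible reading of the argument kernel is a count of matching attributes, summed over the two arguments. The sketch below implements that reading with plain tuples; it is not claimed to be the exact formula of the paper, and every entity other than "areas" is hypothetical.

```python
def indicator(a, b):
    """1 if the two values are equal, else 0."""
    return 1 if a == b else 0

def token_kernel(t1, t2):
    """Count matching attributes among (word, pos, base)."""
    return sum(indicator(a, b) for a, b in zip(t1, t2))

def entity_kernel(e1, e2):
    """Token match plus matches on type, subtype and mtype."""
    (tk1, *attrs1), (tk2, *attrs2) = e1, e2
    return token_kernel(tk1, tk2) + sum(indicator(a, b) for a, b in zip(attrs1, attrs2))

def argument_kernel(r1, r2):
    """Compare arg1 with arg1 and arg2 with arg2."""
    return entity_kernel(r1[0], r2[0]) + entity_kernel(r1[1], r2[1])

# Entities as ((word, pos, base), type, subtype, mtype)
areas = (("areas", "NNS", "area"), "LOC", "region", "NOM")
zones = (("zones", "NNS", "zone"), "LOC", "region", "NOM")   # hypothetical
troops = (("troops", "NNS", "troop"), "PER", "group", "NOM")  # hypothetical values

r1 = (areas, troops)
r2 = (zones, troops)
print(argument_kernel(r1, r2))  # 10: 4 for areas/zones, 6 for troops/troops
print(argument_kernel(r1, r1))  # 12: every attribute matches
```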
Kernels (2)
4. Dependency path kernel:
How similar are two paths?
5. Local dependency kernel:
Like the path kernel, but more informative.
Helpful if the dependency path does not exist.
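
To make "how similar are two paths?" concrete, the sketch below counts pairs of arcs that agree on label, direction and connected token. This matching rule is an assumption made for the example, not the definition used in the paper; arcs are written as (label, w, dw, e), following the tuple order of the example slide, and the second path is hypothetical.

```python
def arc_match(a, b):
    """1 if two arcs share label, direction and connected token, else 0."""
    label_a, _, dw_a, e_a = a
    label_b, _, dw_b, e_b = b
    return int(label_a == label_b and e_a == e_b and dw_a == dw_b)

def path_kernel(p1, p2):
    """Sum the arc matches over all pairs of arcs from the two paths."""
    return sum(arc_match(a, b) for a in p1 for b in p2)

path = (("OBJ", "areas", "controlled", 1), ("SBJ", "controlled", "troops", 0))
other = (("OBJ", "zones", "controlled", 1), ("SBJ", "controlled", "rebels", 0))

print(path_kernel(path, path))   # 2
print(path_kernel(path, other))  # 1: only the OBJ arcs agree
print(path_kernel(path, ()))     # 0: no path, no evidence
```

The last call shows the gap that a local dependency kernel can fill: when no dependency path links the two arguments, a path-based score carries no information, while the arcs around each argument are still available.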
Results: adding info into SVM


The more information we give, the better the result.
The link sequence kernel boosts results.
Results: SVM or KNN




SVM performs better overall
The polynomial extension has no effect on KNN.
Training problem in the last three.
… good results on the official ACE task… but those scores are secret, so no comparison is available
Conclusion





Really simple method
Nice properties of kernels/SVMs
This method is generic (tested on annotated text)
SVM seems to perform better for this task…
… but it is hard to compare the two methods, as their goals are different.
References


[1] Template Sampling for Leveraging Domain Knowledge in Information Extraction. Cox, Nicolson, Finkel, Manning, Langley. Stanford University.
[2] Extracting Relations with Integrated Information Using Kernel Methods. Zhao, Grishman. New York University. 2005.