Information Extraction
and Integration: an Overview
William W. Cohen
Carnegie Mellon University
April 26, 2004
Example: The Problem
Martin Baker, a person, is looking for a genomics job; employers post openings through job posting forms.
Example: A Solution
Extracting Job Openings from the Web

Query form: Job Openings: Category = Food Services, Keyword = Baker, Location = Continental U.S.

Extracted record: foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.htm
OtherCompanyJobs: foodscience.com-Job1
IE from Research Papers
What is “Information Extraction”?
As a task:
Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."
Richard Stallman, founder of the Free
Software Foundation, countered saying…
NAME    TITLE    ORGANIZATION
What is “Information Extraction”?
As a task:
Filling slots in a database from sub-segments of text.
(Same news excerpt as above.)
IE

NAME               TITLE     ORGANIZATION
Bill Gates         CEO       Microsoft
Bill Veghte        VP        Microsoft
Richard Stallman   founder   Free Software Foundation
What is “Information Extraction”?
As a family
of techniques:
Information Extraction =
segmentation + classification + association + clustering
(Same news excerpt as above.)
aka “named entity extraction”

Segmented and classified entities: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
What is “Information Extraction”?
As a family
of techniques:
Information Extraction =
segmentation + classification + association + clustering
(Same news excerpt as above.)
Associated entities: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
What is “Information Extraction”?
As a family
of techniques:
Information Extraction =
segmentation + classification + association + clustering
(Same news excerpt as above.)
Entities, with * marking coreferent mentions clustered together:
* Microsoft Corporation, CEO, Bill Gates, * Microsoft, Gates, * Microsoft, Bill Veghte, * Microsoft, VP, Richard Stallman, founder, Free Software Foundation
Tutorial Outline
• IE History
• Landscape of problems and solutions
• Models for named entity recognition:
– Sliding window
– Boundary finding
– Finite state machines
• Overview of related problems and solutions
– Association, Clustering
– Integration with Data Mining
IE History
Pre-Web
• Mostly news articles
– De Jong’s FRUMP [1982]
• Hand-built system to fill Schank-style “scripts” from news wire
– Message Understanding Conference (MUC) DARPA [’87-’95],
TIPSTER [’92-’96]
• Early work dominated by hand-built models
– E.g. SRI’s FASTUS, hand-built FSMs.
– But by 1990’s, some machine learning: Lehnert, Cardie, Grishman and
then HMMs: Elkan [Leek ’97], BBN [Bikel et al ’98]
Web
• AAAI ’94 Spring Symposium on “Software Agents”
– Much discussion of ML applied to Web. Maes, Mitchell, Etzioni.
• Tom Mitchell’s WebKB, ‘96
– Build KB’s from the Web.
• Wrapper Induction
– Initially hand-built, then ML: [Soderland ’96], [Kushmerick ’97],…
– Citeseer; Cora; FlipDog; contEd courses, corpInfo, …
IE History
Biology
• Gene/protein entity extraction
• Protein/protein interaction facts
• Automated curation/integration of databases
– At CMU: SLIF (Murphy et al, subcellular information
from images + text in journal articles)
Email
• EPCA, PAL, RADAR, CALO: intelligent office
assistant that “understands” some part of email
– At CMU: web site update requests, office-space
requests; calendar scheduling requests; social
network analysis of email.
IE is different in different domains!
Example: on web there is less grammar, but more formatting & linking
Newswire
Web
www.apple.com/retail
Apple to Open Its First Retail Store
in New York City
MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in
Manhattan's SoHo district on Thursday, July 18 at
8:00 a.m. EDT. The SoHo store will be Apple's
largest retail store to date and is a stunning example
of Apple's commitment to offering customers the
world's best computer shopping experience.
www.apple.com/retail/soho
www.apple.com/retail/soho/theatre.html
"Fourteen months after opening our first retail store,
our 31 stores are attracting over 100,000 visitors
each week," said Steve Jobs, Apple's CEO. "We
hope our SoHo store will surprise and delight both
Mac and PC users who want to see everything the
Mac can do to enhance their digital lifestyles."
The directory structure, link structure,
formatting & layout of the Web is its own
new grammar.
Landscape of IE Tasks (1/4):
Degree of Formatting
Text paragraphs
without formatting
Grammatical sentences
and some formatting & links
Astro Teller is the CEO and co-founder of
BodyMedia. Astro holds a Ph.D. in Artificial
Intelligence from Carnegie Mellon University,
where he was inducted as a national Hertz fellow.
His M.S. in symbolic and heuristic computation
and B.S. in computer science are from Stanford
University. His work in science, literature and
business has appeared in international media from
the New York Times to CNN to NPR.
Non-grammatical snippets,
rich formatting & links
Tables
Landscape of IE Tasks (2/4):
Intended Breadth of Coverage
Web site specific (formatting): e.g., Amazon.com book pages
Genre specific (layout): e.g., resumes
Wide, non-specific (language): e.g., university names
Landscape of IE Tasks (3/4):
Complexity
E.g. word patterns:

Closed set (e.g., U.S. states):
  He was born in Alabama…
  The big Wyoming sky…

Regular set (e.g., U.S. phone numbers):
  Phone: (413) 545-1323
  The CALD main office can be reached at 412-268-1299

Complex pattern (e.g., U.S. postal addresses):
  University of Arkansas, P.O. Box 140, Hope, AR 71802
  Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210

Ambiguous patterns, needing context and many sources of evidence (e.g., person names):
  …was among the six houses sold by Hope Feldman that year.
  Pawel Opalinski, Software Engineer at WhizBang Labs.
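The “regular set” case is exactly what a regular expression handles. A minimal sketch (the pattern below is illustrative, not a complete grammar of U.S. phone numbers):

```python
import re

# Illustrative pattern: optional area-code parens, space or dash separators.
PHONE = re.compile(r"\(?\b(\d{3})\)?[ -](\d{3})-(\d{4})\b")

def find_phones(text):
    """Return phone numbers normalized to ddd-ddd-dddd."""
    return ["-".join(m.groups()) for m in PHONE.finditer(text)]
```

Closed sets, by contrast, are simple lookups; the ambiguous cases at the bottom of the slide are where learned extractors earn their keep.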
Landscape of IE Tasks (4/4):
Single Field/Record
Jack Welch will retire as CEO of General Electric tomorrow. The top role
at the Connecticut company will be filled by Jeffrey Immelt.
Single entity (“named entity” extraction):
  Person: Jack Welch
  Person: Jeffrey Immelt
  Location: Connecticut

Binary relationship:
  Relation: Person-Title; Person: Jack Welch; Title: CEO
  Relation: Company-Location; Company: General Electric; Location: Connecticut

N-ary record:
  Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt
Landscape of IE Techniques (1/1):
Models
Running example: “Abraham Lincoln was born in Kentucky.”

Lexicons: look up each candidate string in a list (Alabama, Alaska, …, Wisconsin, Wyoming): member?
Classify pre-segmented candidates: a classifier asks “which class?” for each candidate.
Sliding window: a classifier asks “which class?” for each window; try alternate window sizes.
Boundary models: classifiers predict BEGIN and END positions.
Finite state machines: find the most likely state sequence.
Context free grammars: parse the sentence (NNP NNP V P NP; NP, PP, VP combine into S).
Any of these models can be used to capture words, formatting or both.
Sliding Windows
Extraction by Sliding Window
E.g., looking for the seminar location:

GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University

3:30 pm
7500 Wean Hall

Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence
during the 1980s and 1990s. As a result
of its success and growth, machine learning
is evolving into a collection of related
disciplines: inductive concept acquisition,
analytic learning in problem solving (e.g.
analogy, explanation-based learning),
learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.

CMU UseNet Seminar Announcement
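The sliding-window procedure above can be sketched as enumerating every short token span and handing each to a classifier (the classifier itself is stubbed out here; all names are illustrative):

```python
def windows(tokens, min_len=1, max_len=4):
    """Enumerate all spans of min_len..max_len consecutive tokens."""
    for start in range(len(tokens)):
        for length in range(min_len, max_len + 1):
            if start + length <= len(tokens):
                yield start, tokens[start:start + length]

def extract(tokens, is_location, min_len=1, max_len=4):
    """Return every window the (learned) classifier accepts."""
    return [" ".join(w) for _, w in windows(tokens, min_len, max_len)
            if is_location(w)]

tokens = "the talk is in 7500 Wean Hall at 3:30".split()
# Stub classifier standing in for a learned one:
hits = extract(tokens, lambda w: w[:1] == ["7500"] and len(w) == 3)
```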
A “Naïve Bayes” Sliding Window Model
[Freitag 1997]

… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
      [ w_{t-m} … w_{t-1} ][ w_t … w_{t+n} ][ w_{t+n+1} … w_{t+n+m} ]
             prefix              contents             suffix

Estimate Pr(LOCATION | window) using Bayes rule.
Try all “reasonable” windows (vary length, position).
Assume independence for length, prefix words, suffix words, content words.
Estimate from data quantities like: Pr(“Place” in prefix | LOCATION).
If Pr(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.
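A minimal sketch of this scoring, with hand-set toy probabilities (a real system estimates them from labeled windows; every number below is made up for illustration):

```python
import math

# Toy, hand-set estimates -- illustrative only.
prior = {"LOCATION": 0.1, "OTHER": 0.9}
p_prefix = {("Place", "LOCATION"): 0.6, ("Place", "OTHER"): 0.01}
p_content = {("Wean", "LOCATION"): 0.3, ("Wean", "OTHER"): 0.001}

def log_score(label, prefix_words, content_words, default=1e-4):
    """log Pr(label) + sum of log Pr(word | label), assuming independence."""
    s = math.log(prior[label])
    for w in prefix_words:
        s += math.log(p_prefix.get((w, label), default))
    for w in content_words:
        s += math.log(p_content.get((w, label), default))
    return s

loc = log_score("LOCATION", ["Place"], ["Wean"])
other = log_score("OTHER", ["Place"], ["Wean"])
```

A window is extracted when the LOCATION score beats the alternatives by some threshold.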
“Naïve Bayes” Sliding Window Results
Domain: CMU UseNet Seminar Announcements
(Same seminar announcement as above.)
Field          F1
Person Name:   30%
Location:      61%
Start Time:    98%
SRV: a realistic sliding-window-classifier IE system
[Freitag AAAI ’98]
• What windows to consider?
– all windows containing as many tokens as the shortest
example, but no more tokens than the longest example
• How to represent a classifier? It might:
– Restrict the length of window;
– Restrict the vocabulary or formatting used
before/after/inside window;
– Restrict the relative order of tokens;
– Use inductive logic programming techniques to express all
these…
<title>Course Information for CS213</title>
<h1>CS 213 C++ Programming</h1>
SRV: a rule-learner for sliding-window
classification
• Primitive predicates used by SRV:
– token(X,W), allLowerCase(W), numerical(W), …
– nextToken(W,U), previousToken(W,V)
• HTML-specific predicates:
– inTitleTag(W), inH1Tag(W), inEmTag(W),…
– emphasized(W) = “inEmTag(W) or inBTag(W) or …”
– tableNextCol(W,U) = “U is some token in the column
after the column W is in”
– tablePreviousCol(W,V), tableRowHeader(W,T),…
SRV: a rule-learner for sliding-window
classification
• Non-primitive “conditions” used by SRV:
– every(+X, f, c) = for all W in X : f(W)=c
– some(+X, W, <f1,…,fk>, g, c) = exists W: g(fk(…(f1(W))…))=c
– tokenLength(+X, relop, c)
– position(+W, direction, relop, c)
• e.g., tokenLength(X,>,4), position(W,fromEnd,<,2)

Example rule:
courseNumber(X) :-
  tokenLength(X,=,2),
  every(X, inTitle, false),
  some(X, A, <previousToken>, inTitle, true),
  some(X, B, <>, tripleton, true)

Non-primitive conditions make greedy search easier.
Rapier – results vs. SRV
Rule-learning approaches to sliding-window classification: Summary
• SRV, Rapier, and WHISK [Soderland KDD ‘97]
– Representations for classifiers allow restriction of the
relationships between tokens, etc
– Representations are carefully chosen subsets of even more
powerful representations based on logic programming (ILP
and Prolog)
– Use of these “heavyweight” representations is complicated,
but seems to pay off in results
• Some questions to consider:
– Can simpler, propositional representations for classifiers work? (see Roth and Yih)
– What learning methods should we consider? (NB, ILP, boosting, semi-supervised – see Collins & Singer)
– When do we want to use this method vs. fancier ones?
BWI: Learning to detect boundaries
[Freitag & Kushmerick, AAAI 2000]
• Another formulation: learn three probabilistic
classifiers:
– START(i) = Prob( position i starts a field)
– END(j) = Prob( position j ends a field)
– LEN(k) = Prob( an extracted field has length k)
• Then score a possible extraction (i,j) by
START(i) * END(j) * LEN(j-i)
• LEN(k) is estimated from a histogram
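The scoring rule above is a three-way product; a sketch with placeholder probabilities (in BWI, START and END come from learned boundary detectors and LEN from the length histogram; these numbers are invented):

```python
# Toy per-position estimates -- illustrative placeholders only.
START = {3: 0.8, 5: 0.1}   # Prob(position i starts a field)
END = {5: 0.7, 6: 0.2}     # Prob(position j ends a field)
LEN = {1: 0.2, 2: 0.6, 3: 0.2}  # Prob(field has length k)

def score(i, j):
    """Score extraction spanning positions i..j: START(i)*END(j)*LEN(j-i)."""
    return START.get(i, 0.0) * END.get(j, 0.0) * LEN.get(j - i, 0.0)

# Pick the best (start, end) pair by the product score.
best = max(((i, j) for i in START for j in END if j > i),
           key=lambda p: score(*p))
```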
BWI: Learning to detect boundaries
(Results figure not reproduced.)
Problems with Sliding Windows
and Boundary Finders
• Decisions in neighboring parts of the input are made
independently from each other.
– Expensive for long entity names
– Sliding Window may predict a “seminar end time” before the
“seminar start time”.
– It is possible for two overlapping windows to both be above
threshold.
– In a Boundary-Finding system, left boundaries are laid down
independently from right boundaries, and their pairing
happens as a separate step.
Finite State Machines
IE with Hidden Markov Models
Given a sequence of observations:

Yesterday Pedro Domingos spoke this example sentence.

and a trained HMM with states {person name, location name, background}, find the most likely state sequence (Viterbi):

  s* = arg max_s Pr(s, o)

Any words said to be generated by the designated “person name” state are extracted as a person name:

Person name: Pedro Domingos
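The Viterbi search itself is a short dynamic program. A sketch over a toy two-state model (the states, transition, and emission numbers are invented for illustration; a trained HMM supplies real ones):

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p, floor=1e-6):
    """Most likely state sequence arg max_s Pr(s, o) for an HMM."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(obs[0], floor))
          for s in states}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda r: V[-1][r] + math.log(trans_p[r][s]))
            row[s] = (V[-1][prev] + math.log(trans_p[prev][s])
                      + math.log(emit_p[s].get(o, floor)))
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):     # follow backpointers
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy model: a background state and a "person name" state.
states = ["bg", "person"]
start_p = {"bg": 0.8, "person": 0.2}
trans_p = {"bg": {"bg": 0.8, "person": 0.2},
           "person": {"bg": 0.5, "person": 0.5}}
emit_p = {"bg": {"yesterday": 0.3, "spoke": 0.3},
          "person": {"pedro": 0.4, "domingos": 0.4}}
tags = viterbi(["yesterday", "pedro", "domingos", "spoke"],
               states, start_p, trans_p, emit_p)
```

Words tagged “person” are then extracted as the person name.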
HMM for Segmentation
• Simplest Model: One state per entity type
What is a “symbol”?
Cohen => “Cohen”, “cohen”, “Xxxxx”, “Xx”, … ?
4601 => “4601”, “9999”, “9+”, “number”, … ?

Abstraction hierarchy over symbols:
All → Numbers, Words, Others
Numbers → 0..99, 3-digits (000..999), 0000..9999, 5-digits (00000..99999), …
Words → Multi-letter (aa..), Chars (A.. ..z)
Others → Delimiters (. , / - + ? #)

Datamold: choose the best abstraction level using a holdout set.
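One of the abstraction levels above (“Xxxxx”/“Xx”, “9999”/“9+”) is a word-shape function; a compact sketch of the run-collapsed variant:

```python
import re

def shape(word):
    """Map a token to a coarse symbol: A-Z -> X, a-z -> x, 0-9 -> 9,
    then collapse repeated characters ('Xxxxx' -> 'Xx+', '9999' -> '9+')."""
    s = re.sub(r"[A-Z]", "X", word)
    s = re.sub(r"[a-z]", "x", s)
    s = re.sub(r"[0-9]", "9", s)
    return re.sub(r"(.)\1+", r"\1+", s)
```

Choosing among such levels per symbol is exactly the holdout-set decision Datamold makes.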
HMM Example: “Nymble”
[Bikel, et al 1998], [BBN “IdentiFinder”]

Task: Named Entity Extraction
States: start-of-sentence, end-of-sentence, Person, Org, Other (plus five other name classes).
Train on ~500k words of news wire text.

Transition probabilities: P(st | st-1, ot-1), backing off to P(st | st-1), then P(st).
Observation probabilities: P(ot | st, st-1) or P(ot | st, ot-1), backing off to P(ot | st), then P(ot).

Results:
Language   Case    F1
English    Mixed   93%
English    Upper   91%
Spanish    Mixed   90%

Other examples of shrinkage for HMMs in IE: [Freitag and McCallum ‘99]
What is a symbol?
Bikel et al mix symbols from two abstraction levels
What is a symbol?
Ideally we would like to use many, arbitrary, overlapping features of words:
identity of word; ends in “-ski”; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; …

(State sequence … St-1, St, St+1 …; observations … Ot-1, Ot, Ot+1 …; features of Ot might include: is “Wisniewski”, part of noun phrase, ends in “-ski”, ….)

Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, …
What is a symbol?
(Same features and state/observation structure as above.)

Idea: replace the generative model in the HMM with a maxent model, where the state depends on observations:

  Pr(st | xt) = …
What is a symbol?
Idea: replace the generative model in the HMM with a maxent model, where the state depends on observations and the previous state:

  Pr(st | xt, st-1) = …
What is a symbol?
Idea: replace the generative model in the HMM with a maxent model, where the state depends on observations and the previous state history:

  Pr(st | xt, st-1, st-2, …) = …
Ratnaparkhi’s MXPOST
• Sequential learning problem:
predict POS tags of words.
• Uses MaxEnt model
described above.
• Rich feature set.
• To smooth, discard features
occurring < 10 times.
Conditional Markov Models (CMMs) aka
MEMMs aka Maxent Taggers vs HMMS
HMM (generative; each observation Ot depends on its state St):

  Pr(s, o) = Π_i Pr(si | si-1) Pr(oi | si)

MEMM (conditional; each state St depends on the previous state and the observation):

  Pr(s | o) = Π_i Pr(si | si-1, oi)
Label Bias Problem
• Consider this MEMM, and enough training data to perfectly model it:

Pr(0123|rib) = 1
Pr(0453|rob) = 1

Pr(0123|rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3 = 0.5 * 1 * 1
Pr(0453|rib) = Pr(4|0,r)/Z1’ * Pr(5|4,i)/Z2’ * Pr(3|5,b)/Z3’ = 0.5 * 1 * 1
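The arithmetic can be replayed in code: because each (state, observation) distribution is locally normalized and states 1 and 4 each have a single successor, the middle letter cannot affect the path probability. The toy tables below follow the slide's states:

```python
# Pr(next | state, observed letter), locally normalized per (state, letter).
trans = {
    (0, "r"): {1: 0.5, 4: 0.5},              # the only real choice point
    (1, "i"): {2: 1.0}, (1, "o"): {2: 1.0},  # one successor: letter is ignored
    (4, "i"): {5: 1.0}, (4, "o"): {5: 1.0},
    (2, "b"): {3: 1.0}, (5, "b"): {3: 1.0},
}

def pr(path, word):
    """Probability of a state path given a word under the toy MEMM."""
    p, s = 1.0, path[0]
    for nxt, ch in zip(path[1:], word):
        p *= trans[(s, ch)][nxt]
        s = nxt
    return p
```

So the model cannot prefer 0123 for “rib” over “rob”, despite the training data; a CRF's single global normalization avoids this.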
Another view of label bias [Sha & Pereira]
So what’s the alternative?
CMMs to CRFs

Old model (CMM):

  Pr(y1…yn | x1…xn) = Π_j Pr(yj | yj-1, xj) = Π_j exp(Σ_i λi fi(xj, yj, yj-1)) / Z_λ(xj)

New model (CRF):

  Pr(y | x) = exp(Σ_i λi Fi(x, y)) / Z_λ(x),  where Fi(x, y) = Σ_j fi(xj, yj, yj-1)
What’s the new model look like? What’s independent?

  exp(Σ_i λi Fi(x, y)) / Z_λ(x) = exp(Σ_i λi Σ_j fi(xj, yj, yj-1)) / Z_λ(x)

(Graph: a chain y1 – y2 – y3 over the labels, with each yj also connected to the observations x1, x2, x3.)
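A tiny runnable sketch of the model above: sequence scores are exponentiated sums of global feature counts, normalized once per input by Z_λ(x) over all label sequences. The two features and their weights are invented for illustration:

```python
import math
from itertools import product

LABELS = ("O", "NAME")

def F(x, y):
    """Global feature counts F_i(x, y) = sum_j f_i(x_j, y_j, y_{j-1})."""
    counts = {"cap->NAME": 0, "NAME->NAME": 0}
    prev = "START"
    for word, label in zip(x, y):
        if word[0].isupper() and label == "NAME":
            counts["cap->NAME"] += 1
        if prev == "NAME" and label == "NAME":
            counts["NAME->NAME"] += 1
        prev = label
    return counts

def pr(y, x, w):
    """Pr(y | x) = exp(sum_i w_i F_i(x, y)) / Z(x), Z by brute-force sum."""
    score = lambda y_: math.exp(sum(w[k] * v for k, v in F(x, y_).items()))
    Z = sum(score(y_) for y_ in product(LABELS, repeat=len(x)))
    return score(tuple(y)) / Z

w = {"cap->NAME": 2.0, "NAME->NAME": 1.0}
x = ("Abraham", "Lincoln", "was")
```

Real CRFs compute Z(x) with forward-backward rather than enumeration.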
Graphical comparison among
HMMs, MEMMs and CRFs
HMM
MEMM
CRF
Sha & Pereira results
CRF beats MEMM
(McNemar’s test); MEMM
probably beats voted
perceptron
Sha & Pereira results: training times in minutes, 375k examples.
Combining Sliding Windows and Finite-State Models
Problems with Tagging-based NER
• Decisions are made token-by-token
• Feature engineering can be awkward:
– How long is the extracted name?
– Does the extracted name appear in a dictionary?
– Is the extracted name similar to (an acronym for, a
short version of...) a name appearing earlier in the
document?
• “Tag engineering” is also crucial:
– Do we use one state per entity? a special state for
beginning an entity? a special state for end?
Semi-Markov models for IE
with Sunita Sarawagi, IIT Bombay

• Train on sequences of labeled segments, not labeled words: S = (start, end, label).
• Build a probability model of segment sequences, not word sequences.
• Define features f of segments, e.g.:
  f(S) = words xt…xu, length, previous words, case information, …, distance to known name
• (Approximately) optimize feature weights on training data:

  maximize Σ_{i=1..m} log Pr(Si | xi)
Segments vs tagging

Tagging (features f(xt, yt) per token):
t  x          y
1  Fred       Person
2  please     other
3  stop       other
4  by         other
5  my         Loc
6  office     Loc
7  this       other
8  afternoon  Time

Segments (features f(xj, yj) per segment):
t,u          x                y
t1=u1=1      Fred             Person
t2=2, u2=4   please stop by   other
t3=5, u3=6   my office        Loc
t4=u4=7      this             other
t5=u5=8      afternoon        Time
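The segment view can be sketched as enumerating candidate segmentations (up to a maximum segment length) and computing features per segment rather than per token; the helper names here are illustrative:

```python
def segmentations(tokens, max_len=3):
    """Yield every split of tokens into segments of length <= max_len."""
    if not tokens:
        yield []
        return
    for k in range(1, min(max_len, len(tokens)) + 1):
        for rest in segmentations(tokens[k:], max_len):
            yield [tokens[:k]] + rest

def segment_features(seg):
    """Segment-level features: length and the whole phrase, not just tokens."""
    return {"length": len(seg), "phrase": " ".join(seg)}

segs = list(segmentations(["stop", "by", "my", "office"], 2))
```

Each candidate segmentation is then scored with segment features, which is what lets a feature like “distance to known name” apply to a whole phrase.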
A training algorithm for CSMM’s (1)
Review: Collins’ perceptron training algorithm
Correct tags
Viterbi tags
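Collins’ update itself is compact: decode with the current weights, then add the features of the correct tags and subtract those of the predicted tags. A sketch (the feature map is illustrative; decoding is assumed done elsewhere, e.g. by Viterbi):

```python
from collections import Counter

def feats(words, tags):
    """Sparse feature counts over emissions and tag transitions."""
    f = Counter()
    prev = "START"
    for word, tag in zip(words, tags):
        f[("word", word, tag)] += 1
        f[("trans", prev, tag)] += 1
        prev = tag
    return f

def perceptron_update(w, words, gold, predicted):
    """w += f(x, gold) - f(x, predicted); a no-op when the decode is correct."""
    w.update(feats(words, gold))
    w.subtract(feats(words, predicted))

w = Counter()
perceptron_update(w, ["Fred", "spoke"], ["Person", "other"], ["other", "other"])
```

The voted/averaged variant additionally keeps a running average of w across updates.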
A training algorithm for CSMM’s (2)
Variant of Collins’ perceptron training algorithm: a voted-perceptron learner for T_TRANS, applied like Viterbi.
A training algorithm for CSMM’s (3)
Variant of Collins’ perceptron training algorithm: a voted-perceptron learner for T_SEGTRANS, applied like Viterbi.
Results
Results: vs CRF
Broader Issues in IE
Broader View
Up to now we have been focused on segmentation and classification
Create ontology
Spider
Filter by relevance
IE
Segment
Classify
Associate
Cluster
Load DB
Document
collection
Train extraction models
Label training data
Database
Query,
Search
Data mine
Broader View
Now touch on some other issues in that pipeline:
(1) Associate, (2) Cluster (within IE: Tokenize, Segment, Classify, Associate, Cluster), (3) Create ontology, (4) Train extraction models / label training data, (5) Data mine.
(1) Association as Binary Classification
Christos Faloutsos [Person] conferred with Ted Senator [Person], the KDD 2003 General Chair [Role].

Person-Role(Christos Faloutsos, KDD 2003 General Chair) → NO
Person-Role(Ted Senator, KDD 2003 General Chair) → YES

Do this with SVMs and tree kernels over parse trees.
[Zelenko et al, 2002]
(1) Association with Finite State Machines
[Ray & Craven, 2001]
… This enzyme, UBC6, localizes to the endoplasmic reticulum, with the catalytic domain facing the cytosol. …

POS:   DET  N      N    V         PREP ART ADJ         N         PREP ART ADJ       N      V      ART N
Words: this enzyme ubc6 localizes to   the endoplasmic reticulum with the catalytic domain facing the cytosol

Extracted: Subcellular-localization(UBC6, endoplasmic reticulum)
(1) Association with Graphical Models
[Roth & Yih 2002]

Capture arbitrary-distance dependencies among predictions:
– a random variable over the class of each entity, e.g. over {person, location, …}
– a random variable over the class of the relation between entity #2 and #1, e.g. over {lives-in, is-boss-of, …}
– local language models contribute evidence to entity classification and to relation classification
– dependencies between classes of entities and relations!
Inference with loopy belief propagation.
(1) Association with Graphical Models
[Roth & Yih 2002]

The same model also captures long-distance dependencies among predictions, e.g. entity #1 = person, relation = lives-in, entity #2 = location.
When do two extracted strings
refer to the same object?
(2) Information Integration
[Minton, Knoblock, et al 2001], [Doan, Domingos, Halevy 2001],
[Richardson & Domingos 2003]
Goal might be to merge results of two IE systems, e.g. two sets of extracted course records describing the same courses:

Record set A: Name: Introduction to Computer Science; Number: CS 101; Teacher: M. A. Kludge; Time: 9-11am. Name: Data Structures in Java; Room: 5032 Wean Hall; Start time: 9:10 AM.

Record set B: Title: Intro. to Comp. Sci.; Num: 101; Dept: Computer Science; Teacher: Dr. Klüdge; TA: John Smith. Topic: Java Programming.
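A simple token-overlap similarity is a common first cut for deciding when two such strings co-refer; a sketch using Jaccard overlap (the threshold for declaring a match is left to the application):

```python
import re

def tokens(s):
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def jaccard(a, b):
    """|A & B| / |A | B| over token sets; 1.0 means identical token sets."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta or tb else 0.0
```

[Cohen, Ravikumar, Fienberg, 2003] compare many such metrics, including edit-distance and TF-IDF hybrids.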
(2) Other Information Integration Issues
• Distance metrics for text – which work well?
– [Cohen, Ravikumar, Fienberg, 2003]
• Finessing integration by soft database
operations based on similarity
– [Cohen, 2000]
• Integration of complex structured databases:
(capture dependencies among multiple
merges)
– [Cohen, MacAllister, Kautz KDD 2000; Pasula,
Marthi, Milch, Russell, Shpitser, NIPS 2002;
McCallum and Wellner, KDD WS 2003]
IE Resources
• Data
– RISE, http://www.isi.edu/~muslea/RISE/index.html
– Linguistic Data Consortium (LDC)
• Penn Treebank, Named Entities, Relations, etc.
– http://www.biostat.wisc.edu/~craven/ie
• Papers, tutorials, lectures, code
– http://www.cs.cmu.edu/~wcohen/10-707
References
[Bikel et al 1997] Bikel, D.; Miller, S.; Schwartz, R.; and Weischedel, R. Nymble: a high-performance learning name-finder. In
Proceedings of ANLP’97, p194-201.
[Califf & Mooney 1999], Califf, M.E.; Mooney, R.: Relational Learning of Pattern-Match Rules for Information Extraction, in
Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).
[Cohen, Hurst, Jensen, 2002] Cohen, W.; Hurst, M.; Jensen, L.: A flexible learning system for wrapping tables and lists in HTML
documents. Proceedings of The Eleventh International World Wide Web Conference (WWW-2002)
[Cohen, Kautz, McAllester 2000] Cohen, W; Kautz, H.; McAllester, D.: Hardening soft information sources. Proceedings of the
Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000).
[Cohen, 1998] Cohen, W.: Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual
Similarity, in Proceedings of ACM SIGMOD-98.
[Cohen, 2000a] Cohen, W.: Data Integration using Similarity Joins and a Word-based Information Representation Language,
ACM Transactions on Information Systems, 18(3).
[Cohen, 2000b] Cohen, W. Automatically Extracting Features for Concept Learning from the Web, Machine Learning:
Proceedings of the Seventeenth International Conference (ML-2000).
[Collins & Singer 1999] Collins, M.; and Singer, Y. Unsupervised models for named entity classification. In Proceedings of the
Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
[De Jong 1982] De Jong, G. An Overview of the FRUMP System. In: Lehnert, W. & Ringle, M. H. (eds), Strategies for Natural
Language Processing. Lawrence Erlbaum, 1982, 149-176.
[Freitag 98] Freitag, D: Information extraction from HTML: application of a general machine learning approach, Proceedings of the
Fifteenth National Conference on Artificial Intelligence (AAAI-98).
[Freitag, 1999], Freitag, D. Machine Learning for Information Extraction in Informal Domains. Ph.D. dissertation, Carnegie Mellon
University.
[Freitag 2000], Freitag, D: Machine Learning for Information Extraction in Informal Domains, Machine Learning 39(2/3): 99-101
(2000).
[Freitag & Kushmerick, 2000] Freitag, D.; Kushmerick, N.: Boosted Wrapper Induction. Proceedings of the Seventeenth National
Conference on Artificial Intelligence (AAAI-2000).
[Freitag & McCallum 1999] Freitag, D. and McCallum, A. Information extraction using HMMs and shrinkage. In Proceedings of the
AAAI-99 Workshop on Machine Learning for Information Extraction. AAAI Technical Report WS-99-11.
[Kushmerick, 2000] Kushmerick, N: Wrapper Induction: efficiency and expressiveness, Artificial Intelligence, 118(pp 15-68).
[Lafferty, McCallum & Pereira 2001] Lafferty, J.; McCallum, A.; and Pereira, F., Conditional Random Fields: Probabilistic Models
for Segmenting and Labeling Sequence Data, In Proceedings of ICML-2001.
[Leek 1997] Leek, T. R. Information extraction using hidden Markov models. Master’s thesis. UC San Diego.
[McCallum, Freitag & Pereira 2000] McCallum, A.; Freitag, D.; and Pereira. F., Maximum entropy Markov models for information
extraction and segmentation, In Proceedings of ICML-2000
[Miller et al 2000] Miller, S.; Fox, H.; Ramshaw, L.; Weischedel, R. A Novel Use of Statistical Parsing to Extract Information from
Text. Proceedings of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL), p. 226 - 233.
References
[Muslea et al, 1999] Muslea, I.; Minton, S.; Knoblock, C. A.: A Hierarchical Approach to Wrapper Induction. Proceedings of
Autonomous Agents-99.
[Muslea et al, 2000] Muslea, I.; Minton, S.; and Knoblock, C. Hierarchical wrapper induction for semistructured information
sources. Journal of Autonomous Agents and Multi-Agent Systems.
[Nahm & Mooney, 2000] Nahm, Y.; and Mooney, R. A mutually beneficial integration of data mining and information extraction. In
Proceedings of the Seventeenth National Conference on Artificial Intelligence, pages 627--632, Austin, TX.
[Punyakanok & Roth 2001] Punyakanok, V.; and Roth, D. The use of classifiers in sequential inference. Advances in Neural
Information Processing Systems 13.
[Ratnaparkhi 1996] Ratnaparkhi, A., A maximum entropy part-of-speech tagger, in Proc. Empirical Methods in Natural Language
Processing Conference, p133-141.
[Ray & Craven 2001] Ray, S.; and Craven, M. Representing Sentence Structure in Hidden Markov Models for Information
Extraction. Proceedings of the 17th International Joint Conference on Artificial Intelligence, Seattle, WA. Morgan Kaufmann.
[Soderland 1997]: Soderland, S.: Learning to Extract Text-Based Information from the World Wide Web. Proceedings of the Third
International Conference on Knowledge Discovery and Data Mining (KDD-97).
[Soderland 1999] Soderland, S. Learning information extraction rules for semi-structured and free text. Machine Learning,
34(1/3):233-277.