Information Extraction and Integration: an Overview William W. Cohen Carnegie Mellon University

advertisement
Information Extraction
and Integration: an Overview
William W. Cohen
Carnegie Mellon University
Jan 12, 2004
Administrivia
• Course web page:
– www.cs.cmu.edu/~wcohen/10-707/ (or Google me)
• No class 1/28 or 2/2.
– Unless I get a volunteer?
• Classwork:
– Write ½ page on each paper being discussed.
• Starting next week.
– Present 1 or 2 “optional” papers.
– Do a course project.
Today’s lecture
• Overview of the task of information extraction.
• Overview of some methods for “named entity
extraction”:
– Sliding windows, boundary-finding: reduce NE to
classification.
– HMM, CMM, CRF: reduce NE to sequential classification
(an independently interesting problem)
• Overview of some methods for associating,
(grouping, clustering, querying, using, …) extracted
data.
Example: The Problem
Martin Baker, a person
Genomics job
Employers job posting form
Example: A Solution
Extracting Job Openings from the Web
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.htm
OtherCompanyJobs: foodscience.com-Job1
Category = Food Services
Keyword = Baker
Location = Continental U.S.
Job Openings:
Data Mining the Extracted Job Information
IE from Research Papers
What is “Information Extraction”
As a task:
Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the opensource concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access.“
Richard Stallman, founder of the Free
Software Foundation, countered saying…
NAME
TITLE
ORGANIZATION
What is “Information Extraction”
As a task:
Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the opensource concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access.“
Richard Stallman, founder of the Free
Software Foundation, countered saying…
IE
NAME
Bill Gates
Bill Veghte
Richard Stallman
TITLE
ORGANIZATION
CEO
Microsoft
VP
Microsoft
founder Free Soft..
What is “Information Extraction”
As a family
of techniques:
Information Extraction =
segmentation + classification + clustering + association
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the opensource concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access.“
Richard Stallman, founder of the Free
Software Foundation, countered saying…
Microsoft Corporation
CEO
Bill Gates
Microsoft
aka “named entity
Gates
extraction”
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
What is “Information Extraction”
As a family
of techniques:
Information Extraction =
segmentation + classification + association + clustering
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the opensource concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access.“
Richard Stallman, founder of the Free
Software Foundation, countered saying…
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
What is “Information Extraction”
As a family
of techniques:
Information Extraction =
segmentation + classification + association + clustering
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the opensource concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access.“
Richard Stallman, founder of the Free
Software Foundation, countered saying…
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
What is “Information Extraction”
As a family
of techniques:
Information Extraction =
segmentation + classification + association + clustering
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the opensource concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.
"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access.“
Richard Stallman, founder of the Free
Software Foundation, countered saying…
* Microsoft Corporation
CEO
Bill Gates
* Microsoft
Gates
* Microsoft
Bill Veghte
* Microsoft
VP
Richard Stallman
founder
Free Software Foundation
IE in Context
Create ontology
Spider
Filter by relevance
IE
Segment
Classify
Associate
Cluster
Load DB
Document
collection
Train extraction models
Label training data
Database
Query,
Search
Data mine
Tutorial Outline
• IE History
• Landscape of problems and solutions
• Parade of models for segmenting/classifying:
–
–
–
–
Sliding window
Boundary finding
Finite state machines
Trees
• Overview of related problems and solutions
– Association, Clustering
– Integration with Data Mining
• Where to go from here
IE History
Pre-Web
• Mostly news articles
– De Jong’s FRUMP [1982]
• Hand-built system to fill Schank-style “scripts” from news wire
– Message Understanding Conference (MUC) DARPA [’87-’95],
TIPSTER [’92-’96]
• Early work dominated by hand-built models
– E.g. SRI’s FASTUS, hand-built FSMs.
– But by 1990’s, some machine learning: Lehnert, Cardie, Grishman and
then HMMs: Elkan [Leek ’97], BBN [Bikel et al ’98]
Web
• AAAI ’94 Spring Symposium on “Software Agents”
– Much discussion of ML applied to Web. Maes, Mitchell, Etzioni.
• Tom Mitchell’s WebKB, ‘96
– Build KB’s from the Web.
• Wrapper Induction
– Initially hand-build, then ML: [Soderland ’96], [Kushmeric ’97],…
– Citeseer; Cora; FlipDog; contEd courses, corpInfo, …
IE History
Biology
• Gene/protein entity extraction
• Protein/protein fact interaction
• Automated curation/integration of databases
– At CMU: SLIF (Murphy et al, subcellular information
from images + text in journal articles)
Email
• EPCA, PAL, RADAR, CALO: intelligent office
assistant that “understands” some part of email
– At CMU: web site update requests, office-space
requests; calendar scheduling requests; social
network analysis of email.
IE is different in different domains!
Example: on web there is less grammar, but more formatting & linking
Newswire
Web
www.apple.com/retail
Apple to Open Its First Retail Store
in New York City
MACWORLD EXPO, NEW YORK--July 17, 2002-Apple's first retail store in New York City will open in
Manhattan's SoHo district on Thursday, July 18 at
8:00 a.m. EDT. The SoHo store will be Apple's
largest retail store to date and is a stunning example
of Apple's commitment to offering customers the
world's best computer shopping experience.
www.apple.com/retail/soho
www.apple.com/retail/soho/theatre.html
"Fourteen months after opening our first retail store,
our 31 stores are attracting over 100,000 visitors
each week," said Steve Jobs, Apple's CEO. "We
hope our SoHo store will surprise and delight both
Mac and PC users who want to see everything the
Mac can do to enhance their digital lifestyles."
The directory structure, link structure,
formatting & layout of the Web is its own
new grammar.
Landscape of IE Tasks (1/4):
Degree of Formatting
Text paragraphs
without formatting
Grammatical sentences
and some formatting & links
Astro Teller is the CEO and co-founder of
BodyMedia. Astro holds a Ph.D. in Artificial
Intelligence from Carnegie Mellon University,
where he was inducted as a national Hertz fellow.
His M.S. in symbolic and heuristic computation
and B.S. in computer science are from Stanford
University. His work in science, literature and
business has appeared in international media from
the New York Times to CNN to NPR.
Non-grammatical snippets,
rich formatting & links
Tables
Landscape of IE Tasks (2/4):
Intended Breadth of Coverage
Web site specific
Formatting
Amazon.com Book Pages
Genre specific
Layout
Resumes
Wide, non-specific
Language
University Names
Landscape of IE Tasks (3/4):
Complexity
E.g. word patterns:
Closed set
Regular set
U.S. states
U.S. phone numbers
He was born in Alabama…
Phone: (413) 545-1323
The big Wyoming sky…
The CALD main office can be
reached at 412-268-1299
Complex pattern
U.S. postal addresses
University of Arkansas
P.O. Box 140
Hope, AR 71802
Headquarters:
1128 Main Street, 4th Floor
Cincinnati, Ohio 45210
Ambiguous patterns,
needing context and
many sources of evidence
Person names
…was among the six houses
sold by Hope Feldman that year.
Pawel Opalinski, Software
Engineer at WhizBang Labs.
Landscape of IE Tasks (4/4):
Single Field/Record
Jack Welch will retire as CEO of General Electric tomorrow. The top role
at the Connecticut company will be filled by Jeffrey Immelt.
Single entity
Binary relationship
Person: Jack Welch
Relation: Person-Title
Person: Jack Welch
Title:
CEO
Person: Jeffrey Immelt
Location: Connecticut
“Named entity” extraction
Relation: Company-Location
Company: General Electric
Location: Connecticut
N-ary record
Relation:
Company:
Title:
Out:
In:
Succession
General Electric
CEO
Jack Welsh
Jeffrey Immelt
Evaluation of Single Entity Extraction
TRUTH:
Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke.
PRED:
Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke.
# correctly predicted segments
Precision =
2
=
# predicted segments
6
# correctly predicted segments
Recall
=
2
=
# true segments
4
1
F1
=
Harmonic mean of Precision & Recall =
((1/P) + (1/R)) / 2
State of the Art Performance: a sample
• Named entity recognition from newswire text
– Person, Location, Organization, …
– F1 in high 80’s or low- to mid-90’s
• Binary relation extraction
– Contained-in (Location1, Location2)
Member-of (Person1, Organization1)
– F1 in 60’s or 70’s or 80’s
• Wrapper induction
– Extremely accurate performance obtainable
– Human effort (~10min?) required on each site
Landscape of IE Techniques (1/1):
Models
Classify Pre-segmented
Candidates
Lexicons
Abraham Lincoln was born in Kentucky.
member?
Alabama
Alaska
…
Wisconsin
Wyoming
Boundary Models
Abraham Lincoln was born in Kentucky.
Abraham Lincoln was born in Kentucky.
Sliding Window
Abraham Lincoln was born in Kentucky.
Classifier
Classifier
which class?
which class?
Try alternate
window sizes:
Finite State Machines
Abraham Lincoln was born in Kentucky.
Context Free Grammars
Abraham Lincoln was born in Kentucky.
BEGIN
Most likely state sequence?
NNP
NNP
V
V
P
Classifier
PP
which class?
VP
NP
BEGIN
END
BEGIN
NP
END
VP
S
Any of these models can be used to capture words, formatting or both.
Landscape
Pattern complexity
Pattern feature domain
Pattern scope
Pattern combinations
Models
closed set
words
regular
complex
words + formatting
site-specific
formatting
genre-specific
entity
binary
lexicon
regex
ambiguous
general
n-ary
window
boundary
FSM
Sliding Windows
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
E.g.
Looking for
seminar
location
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence
during the 1980s and 1990s.
As a result
of its success and growth, machine learning
is evolving into a collection of related
disciplines: inductive concept acquisition,
analytic learning in problem solving (e.g.
analogy, explanation-based learning),
learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.
CMU UseNet Seminar Announcement
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
E.g.
Looking for
seminar
location
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence
during the 1980s and 1990s.
As a result
of its success and growth, machine learning
is evolving into a collection of related
disciplines: inductive concept acquisition,
analytic learning in problem solving (e.g.
analogy, explanation-based learning),
learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.
CMU UseNet Seminar Announcement
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
E.g.
Looking for
seminar
location
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence
during the 1980s and 1990s.
As a result
of its success and growth, machine learning
is evolving into a collection of related
disciplines: inductive concept acquisition,
analytic learning in problem solving (e.g.
analogy, explanation-based learning),
learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.
CMU UseNet Seminar Announcement
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
E.g.
Looking for
seminar
location
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence
during the 1980s and 1990s.
As a result
of its success and growth, machine learning
is evolving into a collection of related
disciplines: inductive concept acquisition,
analytic learning in problem solving (e.g.
analogy, explanation-based learning),
learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.
CMU UseNet Seminar Announcement
A “Naïve Bayes” Sliding Window Model
[Freitag 1997]
…
00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
w t-m
w t-1 w t
w t+n
w t+n+1
w t+n+m
prefix
contents
suffix
Estimate Pr(LOCATION|window) using Bayes rule
Try all “reasonable” windows (vary length, position)
Assume independence for length, prefix words, suffix words, content words
Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)
If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.
Other examples of sliding window: [Baluja et al 2000]
(decision tree over individual words & their context)
“Naïve Bayes” Sliding Window Results
Domain: CMU UseNet Seminar Announcements
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence during
the 1980s and 1990s.
As a result of its
success and growth, machine learning is
evolving into a collection of related
disciplines: inductive concept acquisition,
analytic learning in problem solving (e.g.
analogy, explanation-based learning),
learning theory (e.g. PAC learning), genetic
algorithms, connectionist learning, hybrid
systems, and so on.
Field
Person Name:
Location:
Start Time:
F1
30%
61%
98%
SRV: a realistic sliding-window-classifier
IE system
[Frietag AAAI ‘98]
• What windows to consider?
– all windows containing as many tokens as the shortest
example, but no more tokens than the longest example
• How to represent a classifier? It might:
– Restrict the length of window;
– Restrict the vocabulary or formatting used
before/after/inside window;
– Restrict the relative order of tokens;
– Use inductive logic programming techniques to express all
these…
<title>Course Information for CS213</title>
<h1>CS 213 C++ Programming</h1>
SRV: a rule-learner for sliding-window
classification
• Primitive predicates used by SRV:
– token(X,W), allLowerCase(W), numerical(W), …
– nextToken(W,U), previousToken(W,V)
• HTML-specific predicates:
– inTitleTag(W), inH1Tag(W), inEmTag(W),…
– emphasized(W) = “inEmTag(W) or inBTag(W) or …”
– tableNextCol(W,U) = “U is some token in the column
after the column W is in”
– tablePreviousCol(W,V), tableRowHeader(W,T),…
SRV: a rule-learner for sliding-window
classification
• Non-primitive “conditions” used by SRV:
–
–
–
–
every(+X, f, c) = for all W in X : f(W)=c
some(+X, W, <f1,…,fk>, g, c)= exists W: g(fk(…(f1(W)…))=c
tokenLength(+X, relop, c):
position(+W,direction,relop, c):
• e.g., tokenLength(X,>,4), position(W,fromEnd,<,2)
courseNumber(X) :tokenLength(X,=,2),
every(X, inTitle, false),
some(X, A, <previousToken>, inTitle, true),
some(X, B, <>. tripleton, true)
Non-primitive
conditions
make greedy
search easier
Rapier: an alternative approach
A bottom-up rule learner:
[Califf & Mooney, AAAI ‘99]
initialize RULES to be one rule per example;
repeat {
randomly pick N pairs of rules (Ri,Rj);
let {G1…,GN} be the consistent pairwise generalizations;
let G* = Gi that optimizes “compression”
let RULES = RULES + {G*} – {R’: covers(G*,R’)}
}
where compression(G,RULES) = size of RULES- {R’: covers(G,R’)} and
“covers(G,R)” means every example matching G matches R
Rapier: results – precision/recall
Rule-learning approaches to slidingwindow classification: Summary
• SRV, Rapier, and WHISK [Soderland KDD ‘97]
– Representations for classifiers allow restriction of the
relationships between tokens, etc
– Representations are carefully chosen subsets of even more
powerful representations based on logic programming (ILP
and Prolog)
– Use of these “heavyweight” representations is complicated,
but seems to pay off in results
• Some questions to consider:
– Can simpler, propositional representations for classifiers
work (see Roth and Yih)
– What learning methods to consider (NB, ILP, boosting,
semi-supervised – see Collins & Singer)
– When do we want to use this method vs fancier ones?
BWI: Learning to detect boundaries
[Freitag & Kushmerick, AAAI 2000]
• Another formulation: learn three probabilistic
classifiers:
– START(i) = Prob( position i starts a field)
– END(j) = Prob( position j ends a field)
– LEN(k) = Prob( an extracted field has length k)
• Then score a possible extraction (i,j) by
START(i) * END(j) * LEN(j-i)
• LEN(k) is estimated from a histogram
BWI: Learning to detect boundaries
Field
Person Name:
Location:
Start Time:
F1
30%
61%
98%
Problems with Sliding Windows
and Boundary Finders
• Decisions in neighboring parts of the input
are made independently from each other.
– Naïve Bayes Sliding Window may predict a
“seminar end time” before the “seminar start time”.
– It is possible for two overlapping windows to both
be above threshold.
– In a Boundary-Finding system, left boundaries are
laid down independently from right boundaries,
and their pairing happens as a separate step.
Finite State Machines
Hidden Markov Models
HMMs are the standard sequence modeling tool in
genomics, music, speech, NLP, …
Graphical model
Finite state model
S t-1
St
S t+1
...
...
observations
...
Generates:
State
sequence
Observation
sequence
transitions
O
Ot
t -1
O t +1

|o|
o1
o2
o3
o4
o5
o6
o7
o8
 
P( s , o )   P( st | st 1 ) P(ot | st )
S={s1,s2,…}
Start state probabilities: P(st )
Transition probabilities: P(st|st-1 )
t 1
Parameters: for all states
Usually a multinomial over
Observation (emission) probabilities: P(ot|st ) atomic, fixed alphabet
Training:
Maximize probability of training observations (w/ prior)
IE with Hidden Markov Models
Given a sequence of observations:
Yesterday Pedro Domingos spoke this example sentence.
and a trained HMM:
person name
location name
background
 
Find the most likely state sequence: (Viterbi) arg max s P( s , o )
Yesterday Pedro Domingos spoke this example sentence.
Any words said to be generated by the designated “person name”
state extract as a person name:
Person name: Pedro Domingos
HMM Example: “Nymble”
[Bikel, et al 1998],
[BBN “IdentiFinder”]
Task: Named Entity Extraction
Person
start-ofsentence
end-ofsentence
Org
Other
Train on ~500k words of news wire text.
Case
Mixed
Upper
Mixed
Observation
probabilities
P(st | st-1, ot-1 )
P(ot | st , st-1 )
or
(Five other name classes)
Results:
Transition
probabilities
Language
English
English
Spanish
P(ot | st , ot-1 )
Back-off to:
Back-off to:
P(st | st-1 )
P(ot | st )
P(st )
P(ot )
F1 .
93%
91%
90%
Other examples of shrinkage for HMMs in IE: [Freitag and McCallum ‘99]
We want More than an Atomic View of Words
Would like richer representation of text:
many arbitrary, overlapping features of the words.
S t-1
identity of word
ends in “-ski”
is capitalized
is part of a noun phrase
is “Wisniewski”
is in a list of city names
is under node X in WordNet
part of
ends in
is in bold font
noun phrase
“-ski”
is indented
O t 1
is in hyperlink anchor
last person name was female
next two words are “and Associates”
St
S t+1
…
…
Ot
O t +1
Problems with Richer Representation
and a Generative Model
These arbitrary features are not independent.
– Multiple levels of granularity (chars, words, phrases)
– Multiple dependent modalities (words, formatting, layout)
– Past & future
Two choices:
Model the dependencies.
Each state would have its own
Bayes Net. But we are already
starved for training data!
Ignore the dependencies.
This causes “over-counting” of
evidence (ala naïve Bayes).
Big problem when combining
evidence, as in Viterbi!
S t-1
St
S t+1
S t-1
St
S t+1
O
Ot
O t +1
O
Ot
O t +1
t -1
t -1
Conditional Sequence Models
• We prefer a model that is trained to maximize a
conditional probability rather than joint probability:
P(s|o) instead of P(s,o):
– Can examine features, but not responsible for generating
them.
– Don’t have to explicitly model their dependencies.
– Don’t “waste modeling effort” trying to generate what we are
given at test time anyway.
Conditional Markov Models (CMMs) vs
HMMS
St-1
St
St+1
...
Pr( s, o)   Pr( si | si 1 ) Pr(oi | si 1 )
i
Ot-1
Ot
St-1
Ot+1
St
St+1
...
Pr( s | o)   Pr( si | si 1 , oi 1 )
i
Ot-1
Lots of ML ways to estimate Pr(y | x)
Ot
Ot+1
Conditional Finite State Sequence Models
[McCallum, Freitag & Pereira, 2000]
[Lafferty, McCallum, Pereira 2001]
From HMMs to CRFs

s  s1 , s2 ,...sn

o  o1 , o2 ,...on
St-1
St
St+1
...

|o|
Joint
 
P( s , o )   P( st | st 1 ) P(ot | st )
t 1
Conditional
 
P( s | o ) 
Ot-1
Ot
Ot+1
...

|o|
1
  P( st | st 1 ) P(ot | st )
P(o ) t 1
St-1
St
St+1
...

|o|
1
    s ( st , st 1 ) o (ot , st )
Z (o ) t 1

where  o (t )  exp 
Arbitrary features of s,o, and t

Ot-1
Ot
Ot+1
...

  k f k ( s t , ot ) 
k

(A super-special case of
Conditional Random Fields.)
Feature Functions

Example f k ( st , st 1 , o, t ) :
1 if Capitalize d(ot )  s i  st 1  s j  st

f Capitalized,si , s j  ( st , st 1 , o, t )  
0 otherwise
o = Yesterday Pedro Domingos spoke this example sentence.
o1
o2
s1
s2
s3
s4
o3
o4
o5
o6

f Capitalized , s1 , s3  ( s1 , s2 , o ,2)  1
o7
Learning Parameters of CRFs
Maximize log-likelihood of parameters   {k} given training data D

|o |
 1

2k
 
L   log    exp   k f k ( st , st 1 , o, t )     2
s , o D
 k
  k 2
 Z (o ) t 1
Log-likelihood gradient:
L

k
k
 
  (i )
  (i )
 #k (s , o )   P (s '| o ) #k (s ' , o )  2
 
s , o D
i


s'
 

where # k ( s , o )   f k (st 1 , st , o , t )
t
Methods:
• iterative scaling (quite slow – 2000 iterations from good start)
• gradient, conjugate gradient (faster)
• limited-memory quasi-Newton methods (“super fast”)
[Sha & Pereira 2002] & [Malouf 2002]
Voted Perceptron Sequence Models
 
Given trai ning data : { o , s
(i )
}
Initialize parameters to zero : k  k  0
[Collins 2001; also
Hofmann 2003, Taskar et al
2003]
Iterate to convergenc e :
for all training instances, i


 
sViterbi  arg max s  exp    k f k ( st , st 1 , o , t ) 
t
 k

 


k :  k   Ck ( s ( i ) , o ( i ) )  Ck ( sViterbi, o ( i ) )


 

 where Ck ( s , o )   f k (st 1 , st , o , t ) as before 
t


Avoids the tricky math; very fast; uses “pseudonegative” examples of sequences; approximates a
margin classifier for “good” vs “bad” sequences
Analogous to
the gradient
for this one
training instance
Broader Issues in IE
Broader View
Up to now we have been focused on segmentation and classification
Create ontology
Spider
Filter by relevance
IE
Segment
Classify
Associate
Cluster
Load DB
Document
collection
Train extraction models
Label training data
Database
Query,
Search
Data mine
Broader View
Now touch on some other issues
3 Create ontology
Spider
Filter by relevance
Tokenize
1
2
IE
Segment
Classify
Associate
Cluster
Load DB
Document
collection
4 Train extraction models
Label training data
Database
Query,
Search
5 Data mine
(1) Association as Binary Classification
Christos Faloutsos conferred with Ted Senator, the KDD 2003 General Chair.
Person
Person
Role
Person-Role (Christos Faloutsos, KDD 2003 General Chair)  NO
Person-Role (
Ted Senator,
KDD 2003 General Chair)  YES
Do this with SVMs and tree kernels over parse trees.
[Zelenko et al, 2002]
(1) Association with Finite State Machines
[Ray & Craven, 2001]
… This enzyme, UBC6,
localizes to the endoplasmic
reticulum, with the catalytic
domain facing the cytosol. …
DET
N
N
V
PREP
ART
ADJ
N
PREP
ART
ADJ
N
V
ART
N
this
enzyme
ubc6
localizes
to
the
endoplasmic
reticulum
with
the
catalytic
domain
facing
the
cytosol
Subcellular-localization (UBC6, endoplasmic reticulum)
(1) Association with Graphical Models
Capture arbitrary-distance
dependencies among
predictions.
Random variable
over the class of
entity #2, e.g. over
{person, location,…}
[Roth & Yih 2002]
Random variable
over the class of
relation between
entity #2 and #1,
e.g. over {lives-in,
is-boss-of,…}
Local language
models contribute
evidence to relation
classification.
Local language
models contribute
evidence to entity
classification.
Dependencies between classes
of entities and relations!
Inference with loopy belief propagation.
(1) Association with Graphical Models
[Roth & Yih 2002]
Also capture long-distance
dependencies among
predictions.
Random variable
over the class of
entity #1, e.g. over
{person, location,…}
person
lives-in
person?
Local language
models contribute
evidence to entity
classification.
Random variable
over the class of
relation between
entity #2 and #1,
e.g. over {lives-in,
is-boss-of,…}
Local language
models contribute
evidence to relation
classification.
Dependencies between classes
of entities and relations!
Inference with loopy belief propagation.
(1) Association with Graphical Models
[Roth & Yih 2002]
Also capture long-distance
dependencies among
predictions.
Random variable
over the class of
entity #1, e.g. over
{person, location,…}
person
lives-in
location
Local language
models contribute
evidence to entity
classification.
Random variable
over the class of
relation between
entity #2 and #1,
e.g. over {lives-in,
is-boss-of,…}
Local language
models contribute
evidence to relation
classification.
Dependencies between classes
of entities and relations!
Inference with loopy belief propagation.
Broader View
Now touch on some other issues
3 Create ontology
Spider
Filter by relevance
Tokenize
1
2
IE
Segment
Classify
Associate
Cluster
Load DB
Document
collection
4 Train extraction models
Label training data
Database
Query,
Search
5 Data mine
When do two extracted strings
refer to the same object?
(2) Learning a Distance Metric Between Records
[Borthwick, 2000; Cohen & Richman, 2001; Bilenko & Mooney, 2002, 2003]
Learn Pr ({duplicate, not-duplicate} | record1, record2)
with a Maximum Entropy classifier.
Do greedy agglomerative clustering using this Probability as a distance metric.
(2) String Edit Distance
• distance(“William Cohen”, “Willliam Cohon”)
s
W I
L L I
t
W I
L L L I
A M _ C O H O N
C
C
C
op
cost
C
C
I
A M _ C O H E N
C
C
C
C
C
C
S
C
0 0 0 0 1 1 1 1 1 1 1 1 2 2
(2) Computing String Edit Distance
D(i,j) = min
D(i-1,j-1) + d(si,tj) //subst/copy
D(i-1,j)+1
//insert
D(i,j-1)+1
//delete
C
A trace indicates
where the min
value came from,
and can be used to
find edit
operations and/or
a best alignment
M
(may be more than 1)
C
C
1
1
2
learn
these
parameters
O
H
E
N
2
3
4
5
2
3
4
5
3
3
4
5
3
4
5
O
3
2
H
4
3
N
5
4
2
3
3
3
4
3
(2) String Edit Distance Learning
[Bilenko & Mooney, 2002, 2003]
Precision/recall for MAILING dataset duplicate detection
(2) Information Integration
[Minton, Knoblock, et al 2001], [Doan, Domingos, Halevy 2001],
[Richardson & Domingos 2003]
Goal might be to merge results of two IE systems:
Name:
Number:
Introduction to
Computer Science
Title:
Intro. to Comp. Sci.
Num:
101
Dept:
Computer Science
Teacher:
Dr. Klüdge
CS 101
Teacher:
M. A. Kludge
Time:
9-11am
TA:
John Smith
Name:
Data Structures in
Java
Topic:
Java Programming
Room:
5032 Wean Hall
Start time:
9:10 AM
(2) Other Information Integration Issues
• Distance metrics for text – which work well?
– [Cohen, Ravikumar, Fienberg, 2003]
• Finessing integration by soft database
operations based on similarity
– [Cohen, 2000]
• Integration of complex structured databases:
(capture dependencies among multiple
merges)
– [Cohen, MacAllister, Kautz KDD 2000; Pasula,
Marthi, Milch, Russell, Shpitser, NIPS 2002;
McCallum and Wellner, KDD WS 2003]
Broader View
Now touch on some other issues
3 Create ontology
Spider
Filter by relevance
Tokenize
1
2
IE
Segment
Classify
Associate
Cluster
Load DB
Document
collection
4 Train extraction models
Database
Query,
Search
5 Data mine
Label training data
1
(5) Working with IE Data
• Some special properties of IE data:
– It is based on extracted text
– It is “dirty”, (missing extraneous facts, improperly normalized
entity names, etc.)
– May need cleaning before use
• What operations can be done on dirty, unnormalized
databases?
– Datamine it directly.
– Query it directly with a language that has “soft joins” across
similar, but not identical keys. [Cohen 1998]
– Use it to construct features for learners [Cohen 2000]
– Infer a “best” underlying clean database
[Cohen, Kautz, MacAllester, KDD2000]
Download