Information Extraction

Can we automatically extract this information from the text (instead of depending on creators to provide automated annotations)?
What is “Information Extraction”
As a task:
Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME              TITLE    ORGANIZATION
Slides from Cohen & McCallum
IE result:
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
Tapping into the Collective Unconscious
How can we possibly do this without full NLP?
• Another thread of exciting research is driven by the realization that the Web is not random at all!
– It is written by humans
– …so analyzing its structure and content allows us to tap into the collective unconscious ..
• Meaning can emerge from syntactic notions such as “co-occurrences” and
“connectedness”
• Examples:
– Analyzing term co-occurrences in the web-scale corpora to capture
semantic information (today’s paper)
– Analyzing the link-structure of the web graph to discover communities
• DoD and NSA are very much into this as a way of breaking terrorist cells
– Analyzing the transaction patterns of customers (collaborative filtering)
“(Un)wrapping the wrapped results..”
Fielded IE Systems: Citeseer, Google Scholar; Libra
How do they do it? Why do they fail?

IE in Context
Create ontology
Document collection -> Spider -> Filter by relevance -> IE -> Load DB -> Database -> Query, Search; Data mine
IE: Segment, Classify, Associate, Cluster
Label training data -> Train extraction models -> IE
What is “Information Extraction”
As a family of techniques:
Information Extraction = segmentation + classification + association + clustering
Segments found in the same news article (* marks the mentions clustered together as one organization):
* Microsoft Corporation
CEO
Bill Gates
* Microsoft
Gates
* Microsoft
Bill Veghte
* Microsoft
VP
Richard Stallman
founder
Free Software Foundation
Information Extraction vs. NLP?
• Information extraction attempts to find some of the structure and meaning in (hopefully template-driven) web pages.
• As IE becomes more ambitious and text becomes more free-form, IE ultimately becomes equal to NLP.
• The Web does give one particular boost to NLP
– Massive corpora..
MUC
• DARPA funded significant efforts in IE in the early to mid 1990s.
• Message Understanding Conference (MUC)
was an annual event/competition where
results were presented.
• Focused on extracting information from news
articles:
– Terrorist events
– Industrial joint ventures
– Company management changes
• Information extraction is of particular interest to the intelligence community (CIA, NSA).
What makes IE from the Web Different?
Less grammar, but more formatting & linking
Newswire:
Apple to Open Its First Retail Store in New York City
MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience.
"Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week," said Steve Jobs, Apple's CEO. "We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles."

Web:
www.apple.com/retail
www.apple.com/retail/soho
www.apple.com/retail/soho/theatre.html

The directory structure, link structure, formatting & layout of the Web is its own new grammar.
Landscape of IE Tasks (1/4):
Pattern Feature Domain
• Text paragraphs without formatting
  Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.
• Grammatical sentences and some formatting & links
• Non-grammatical snippets, rich formatting & links
• Tables
Landscape of IE Tasks (2/4):
Pattern Scope
Scope               Pattern type  Example
Web site specific   Formatting    Amazon.com Book Pages
Genre specific      Layout        Resumes
Wide, non-specific  Language      University Names
Landscape of IE Tasks (3/4):
Pattern Complexity
E.g. word patterns:
• Closed set: U.S. states
– He was born in Alabama…
– The big Wyoming sky…
• Regular set: U.S. phone numbers
– Phone: (413) 545-1323
– The CALD main office can be reached at 412-268-1299
• Complex pattern: U.S. postal addresses
– University of Arkansas, P.O. Box 140, Hope, AR 71802
– Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210
• Ambiguous patterns, needing context + many sources of evidence: Person names
– …was among the six houses sold by Hope Feldman that year.
– Pawel Opalinski, Software Engineer at WhizBang Labs.
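The two simplest pattern classes above can be sketched with regular expressions. This is only an illustration: the state list `STATES` is truncated on purpose and the phone pattern `PHONE_RE` is a loose assumption that covers just the two formats shown on the slide.

```python
import re

# Closed set: membership in a fixed list (truncated here for illustration).
STATES = ["Alabama", "Alaska", "Wisconsin", "Wyoming"]
STATE_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, STATES)) + r")\b")

# Regular set: U.S. phone numbers such as (413) 545-1323 or 412-268-1299.
PHONE_RE = re.compile(r"(?:\(\d{3}\)\s*|\b\d{3}-)\d{3}-\d{4}\b")


def find_states(text):
    """Return every state name from the closed set that occurs in text."""
    return STATE_RE.findall(text)


def find_phones(text):
    """Return every substring that looks like a U.S. phone number."""
    return PHONE_RE.findall(text)
```

Postal addresses and person names do not yield to patterns this small, which is the point of the complexity scale: they need context and several sources of evidence.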
Landscape of IE Tasks (4/4):
Pattern Combinations
Jack Welch will retire as CEO of General Electric tomorrow. The top role
at the Connecticut company will be filled by Jeffrey Immelt.
Single entity (“named entity” extraction)
• Person: Jack Welch
• Person: Jeffrey Immelt
• Location: Connecticut
Binary relationship
• Relation: Person-Title; Person: Jack Welch; Title: CEO
• Relation: Company-Location; Company: General Electric; Location: Connecticut
N-ary record
• Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt
Evaluation of Single Entity Extraction
TRUTH:
Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke.
PRED:
Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke.
$$\text{Precision} = \frac{\#\ \text{correctly predicted segments}}{\#\ \text{predicted segments}} = \frac{2}{6}$$

$$\text{Recall} = \frac{\#\ \text{correctly predicted segments}}{\#\ \text{true segments}} = \frac{2}{4}$$

$$F_1 = \text{harmonic mean of Precision and Recall} = \frac{1}{\left((1/P) + (1/R)\right)/2}$$
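The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch: segments are compared as exact strings, and the predicted set below is hypothetical, chosen only so that 2 of 6 predictions match the 4 true segments as in the slide's numbers.

```python
def prf1(true_segments, predicted_segments):
    """Precision, recall and F1 over exactly matching segments."""
    true_set, pred_set = set(true_segments), set(predicted_segments)
    correct = len(true_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(true_set) if true_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean: 1 / (((1/P) + (1/R)) / 2) == 2PR / (P + R)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


truth = {"Michael Kearns", "Sebastian Seung", "Richard M. Karpe", "Martin Cooke"}
# Hypothetical predictions: six segments, two of them exactly right.
predicted = {"Michael Kearns", "Sebastian Seung", "Monday", "Richard", "Karpe", "Martin"}

precision, recall, f1 = prf1(truth, predicted)
```

With P = 2/6 and R = 2/4 the harmonic mean works out to 0.4, noticeably closer to the lower of the two scores than the arithmetic mean would be.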
State of the Art Performance
• Named entity recognition
– Person, Location, Organization, …
– F1 in high 80’s or low- to mid-90’s
• Binary relation extraction
– Contained-in (Location1, Location2)
Member-of (Person1, Organization1)
– F1 in 60’s or 70’s or 80’s
• Wrapper induction
– Extremely accurate performance obtainable
– Human effort (~30min) required on each site
Landscape of IE Techniques (1/1):
Models
Example sentence: Abraham Lincoln was born in Kentucky.
• Lexicons: is a candidate a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)?
• Classify pre-segmented candidates: a classifier answers “which class?” for each candidate
• Sliding window: a classifier answers “which class?” for each window; try alternate window sizes
• Boundary models: classifiers mark BEGIN and END boundaries
• Finite state machines: most likely state sequence?
• Context free grammars: parse with tags and phrases such as NNP, V, P, NP, PP, VP, S
…and beyond
Any of these models can be used to capture words, formatting or both.
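As an illustration of the sliding-window idea, the sketch below slides windows of several sizes over a token list and asks a classifier "which class?" for each one. The classifier is a stand-in (a lexicon lookup against the hypothetical `LEXICON`), not a trained model; any function from a token window to a label could take its place.

```python
# Hypothetical, truncated lexicon used as a stand-in classifier.
LEXICON = {"kentucky", "new york", "virginia", "west virginia"}


def classify(window_tokens):
    """Stand-in classifier: label a window STATE if it is in the lexicon."""
    return "STATE" if " ".join(window_tokens).lower() in LEXICON else None


def sliding_window(tokens, max_size=3):
    """Try alternate window sizes, longest first, skipping overlaps."""
    found, covered = [], set()
    for size in range(max_size, 0, -1):
        for start in range(len(tokens) - size + 1):
            span = range(start, start + size)
            if covered.intersection(span):
                continue  # a longer window already claimed these tokens
            label = classify(tokens[start:start + size])
            if label:
                found.append((start, start + size, label))
                covered.update(span)
    return sorted(found)


spans = sliding_window("Abraham Lincoln was born in Kentucky".split())
```

Trying longer windows first is one possible design choice; it keeps "West Virginia" from also being reported as "Virginia".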
Three Examples
• (Un)wrappers
– That use path expressions on DOM trees
• Pattern extractors
– That use path expressions on parse trees
• Context-based slot fillers
– That annotate words into an ontology with the help of context surrounding them
Extraction from Templated Text
• Many web pages are generated automatically from an
underlying database.
• Therefore, the HTML structure of pages is fairly
specific and regular (semi-structured).
• However, output is intended for human consumption,
not machine interpretation.
• An IE system for such generated pages allows the web
site to be viewed as a structured database.
• An extractor for a semi-structured web site is
sometimes referred to as a wrapper.
• The process of extracting from such pages is sometimes referred to as screen scraping.
Templated Extraction using DOM Trees
• Web extraction may be aided by first
parsing web pages into DOM trees.
• Extraction patterns can then be specified as
paths from the root of the DOM tree to the
node containing the text to extract.
• May still need regex patterns to identify the proper portion of the final CharacterData node.
Sample DOM Tree Extraction
HTML (Element)
  HEADER
  BODY
    B
      CharacterData: Age of Spiritual Machines
    FONT
      CharacterData: by
      A
        CharacterData: Ray Kurzweil

Title: HTML -> BODY -> B -> CharacterData
Author: HTML -> BODY -> FONT -> A -> CharacterData

Can be “semi-automated”: users show examples and the program remembers the path expressions.
Wrapper maintenance? Cheap labor…
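A path-based wrapper for the sample tree can be sketched with Python's standard library. The sketch assumes the page is well-formed enough to parse as XML, which real HTML often is not; the `PAGE` string and the `PATHS` table are hypothetical stand-ins for a fetched page and for the path expressions a user would demonstrate.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page mirroring the sample tree.
PAGE = """<html>
  <head><title>Book page</title></head>
  <body>
    <b>Age of Spiritual
       Machines</b>
    <font>by <a href="#">Ray
       Kurzweil</a></font>
  </body>
</html>"""

# Path expressions from the root element to the node holding the text.
PATHS = {"title": "./body/b", "author": "./body/font/a"}


def clean(text):
    """Regex post-processing of the character data: collapse whitespace."""
    return re.sub(r"\s+", " ", text or "").strip()


def extract(page, paths):
    """Follow each path from the root and return the cleaned text found."""
    root = ET.fromstring(page)
    record = {}
    for field, path in paths.items():
        node = root.find(path)
        record[field] = clean(node.text) if node is not None else None
    return record


record = extract(PAGE, PATHS)
```

If the site changes its layout, the paths silently stop matching and fields come back as `None`, which is the wrapper maintenance problem in miniature.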
If there is cooperation from the source, an API can be established, removing the need for wrappers.
Basis for many startups like Junglee, Flipdog, etc.
Three Examples
• (Un)wrappers
– That use path expressions on DOM trees
• Pattern extractors
– That use path expressions on parse trees
• Context-based slot fillers
– That annotate words into an ontology with the help of context surrounding them
Extraction from Free Text involves Natural Language Processing
• If extracting from automatically generated web pages, simple regex patterns usually work.
• If extracting from more natural, unstructured, human-written text, some NLP may help.
– Part-of-speech (POS) tagging
• Mark each word as a noun, verb, preposition, etc.
– Syntactic parsing
• Identify phrases: NP, VP, PP
– Semantic word categories (e.g. from WordNet)
• KILL: kill, murder, assassinate, strangle, suffocate
• Off-the-shelf software available to do this!
– The “Brill” tagger
• Extraction patterns can use POS or phrase tags.
– Analogy to regex patterns on DOM trees for structured text
I. Generate-and-Test Architecture
Template-driven extraction (where the template is in terms of the syntax tree)
Generic extraction patterns (Hearst ’92):
• “…Cities such as Boston, Los Angeles, and Seattle…”
(“C such as NP1, NP2, and NP3”) => IS-A(each(head(NP)), C), …
• “Detailed information for several countries such as maps, …”
ProperNoun(head(NP))
• “I listen to pretty much all music but prefer country such as Garth Brooks”
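The "C such as NP1, NP2, and NP3" pattern can be approximated with a regular expression. This is a rough sketch, not the parser-based approach the slide describes: runs of capitalized words (`PROPER`) stand in for proper-noun NP heads, which is an assumption made only to keep the example self-contained.

```python
import re

# Assumption: a run of capitalized words approximates a proper-noun NP head.
PROPER = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
SUCH_AS = re.compile(
    rf"\b(?P<cls>[A-Za-z]+) such as "
    rf"(?P<items>{PROPER}(?:, {PROPER})*(?:,? and {PROPER})?)"
)


def hearst_such_as(sentence):
    """Return (instance, class) pairs, i.e. IS-A(instance, class) candidates."""
    facts = []
    for match in SUCH_AS.finditer(sentence):
        cls = match.group("cls")
        items = re.split(r",? and |, ", match.group("items"))
        facts.extend((item, cls) for item in items if item)
    return facts


cities = hearst_such_as("Cities such as Boston, Los Angeles, and Seattle are large.")
maps = hearst_such_as("Detailed information for several countries such as maps")
music = hearst_such_as("I listen to pretty much all music but prefer country such as Garth Brooks")
```

The proper-noun constraint rejects "maps", but the last sentence still yields the wrong candidate (Garth Brooks, country), which is why generated candidates need a separate test step.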
Assessing the fact accuracy
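Recall “water flows upwards”
Assess candidate extractions using Mutual Information (PMI-IR) (Turney ’01).

$$\mathrm{PMI}(\text{Seattle}, \text{City}) = \frac{|\mathrm{Hits}(\text{Seattle} \wedge \text{City})|}{|\mathrm{Hits}(\text{Seattle})|} = \frac{24.7\text{M}}{107\text{M}} \approx 23\%$$

$$\mathrm{PMI}(\text{Seattle}, \text{Tomato}) = \frac{1.5\text{M}}{107\text{M}} \approx 1\%$$

Seattle is 20 times more likely to be a city than a tomato!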
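The PMI score on this slide is a ratio of search-engine hit counts. A minimal sketch using the counts quoted on the slide follows; in practice the counts would come from a search engine, which is not modeled here, and the function name `pmi` is just a label for the ratio.

```python
def pmi(hits_instance_and_discriminator, hits_instance):
    """Fraction of the instance's hits that also contain the discriminator."""
    if hits_instance == 0:
        return 0.0
    return hits_instance_and_discriminator / hits_instance


HITS_SEATTLE = 107e6  # hit count for "Seattle" quoted on the slide

pmi_city = pmi(24.7e6, HITS_SEATTLE)    # Seattle together with "city"
pmi_tomato = pmi(1.5e6, HITS_SEATTLE)   # Seattle together with "tomato"
```

Only the relative sizes matter here: the score for "city" is more than an order of magnitude above the score for "tomato".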
..but many things indicate “city”ness
Discriminator phrases f_i:
• “x is a city”
• “x has a population of”
• “x is the capital of y”
• “x’s baseball team…”

$$\mathrm{PMI}(I, D) = \frac{|\mathrm{Hits}(I \wedge D)|}{|\mathrm{Hits}(I)|}$$

• PMI = frequency of I & D co-occurrence
• 5-50 discriminators D_i
• Each PMI for D_i is a feature f_i
• Naïve Bayes evidence combination:

$$P(\phi \mid f_1, f_2, \ldots, f_n) = \frac{P(\phi) \prod_i P(f_i \mid \phi)}{P(\phi) \prod_i P(f_i \mid \phi) + P(\neg\phi) \prod_i P(f_i \mid \neg\phi)}$$

Keep the probabilities with the extracted facts.
PMI is used for feature selection. NBC is used for learning. Hits are used for assessing PMI as well as conditional probabilities.
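The Naïve Bayes combination can be sketched as follows, writing phi for the event "the extracted fact is correct". The prior and the per-feature likelihoods below are made-up numbers for illustration; on the slide's account they would be estimated from hit counts on training examples.

```python
from math import prod


def naive_bayes_posterior(prior, likelihoods):
    """Combine independent features into P(phi | f_1, ..., f_n).

    likelihoods holds one (P(f_i | phi), P(f_i | not phi)) pair per
    observed feature value.
    """
    positive = prior * prod(p for p, _ in likelihoods)
    negative = (1.0 - prior) * prod(q for _, q in likelihoods)
    total = positive + negative
    return positive / total if total else 0.0


# Hypothetical likelihoods for two discriminators that both fired.
posterior = naive_bayes_posterior(0.5, [(0.9, 0.1), (0.8, 0.2)])
```

Two moderately reliable discriminators already push the posterior above 0.97, and this is the probability that would be stored alongside the extracted fact.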
Some Sources of ambiguity
• Time: “Clinton is the president” (in 1996).
• Context: “common misconceptions..”
• Opinion: Elvis…
• Multiple word senses: Amazon, Chicago, Chevy Chase, etc.
– Dominant senses can mask recessive ones!
– Approach: unmasking. ‘Chicago –City’
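Chicago: City sense vs. Movie sense

$$\mathrm{PMI}(I, D, C) = \frac{|\mathrm{Hits}(I \wedge D \mid C)|}{|\mathrm{Hits}(I \mid C)|}$$

Chicago Unmasked: City sense vs. Movie sense

$$\frac{|\mathrm{Hits}(\text{Chicago} \wedge \text{Movie} - \text{City})|}{|\mathrm{Hits}(\text{Chicago} - \text{City})|}$$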
Impact of Unmasking on PMI
Name         Recessive  Original  Unmask  Boost
Washington   city       0.50      0.99    96%
Casablanca   city       0.41      0.93    127%
Chevy Chase  actor      0.09      0.58    512%
Chicago      movie      0.02      0.21    972%
Three Examples
• (Un)wrappers
– That use path expressions on DOM trees
• Pattern extractors
– That use path expressions on parse trees
• Context-based slot fillers
– That annotate words into an ontology with the help of context surrounding them
Annotate base facts, given text and ontology
Annotation
“The Chicago Bulls announced yesterday that
Michael Jordan will. . . ”
“The <resource ref="http://tap.stanford.edu/BasketballTeam_Bulls">Chicago Bulls</resource> announced yesterday that <resource ref="http://tap.stanford.edu/AthleteJordan,_Michael">Michael Jordan</resource> will...”
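Producing this kind of markup can be sketched as a dictionary lookup followed by a substitution. The sketch below is an assumption-laden simplification: the `LABELS` table holds only the two URIs shown on the slide, matching is by exact string, and no disambiguation is attempted.

```python
import re

# Label -> ontology URI, using the two resources from the example above.
LABELS = {
    "Chicago Bulls": "http://tap.stanford.edu/BasketballTeam_Bulls",
    "Michael Jordan": "http://tap.stanford.edu/AthleteJordan,_Michael",
}


def annotate(text, labels):
    """Wrap every known label in a <resource> tag pointing at its URI."""
    # Longest labels first, so a longer label wins over any label it contains.
    ordered = sorted(labels, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(label) for label in ordered))

    def wrap(match):
        label = match.group(0)
        return f'<resource ref="{labels[label]}">{label}</resource>'

    return pattern.sub(wrap, text)


annotated = annotate(
    "The Chicago Bulls announced yesterday that Michael Jordan will...", LABELS
)
```

Exact matching is the weak point: a bare "Jordan" or an ambiguous "Bulls" needs the context-based disambiguation discussed in the SemTag slides.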
Semantic Annotation
Named Entity Identification
The simplest task of meta-data extraction on NL is to establish a “type” relation between entities in the NL resources and concepts in ontologies.
Picture from http://lsdis.cs.uga.edu/courses/SemWebFall2005/courseMaterials/CSCI8350-Metadata.ppt
Annotation
Current practice of annotation for knowledge identification and extraction
• is time consuming
• needs annotation by experts
• is complex
Reduce burden of text annotation for Knowledge Management
www.racai.ro/EUROLAN-2003/html/presentations/SheffieldWilksBrewsterDingli/Eurolan2003AlexieiDingli.ppt
SemTag & Seeker


• WWW-03 Best Paper Prize
• Seeded with TAP ontology (72k concepts)
– And ~700 human judgments
• Crawled 264 million web pages
• Extracted 434 million semantic tags
– Automatically disambiguated
SemTag
• Uses broad, shallow knowledge base
• TAP – lexical and taxonomic information
about popular objects
– Music
– Movies
– Sports
– Etc.
SemTag
• Problem:
– No write access to original document, so how
do you annotate?
• Solution:
– Store annotations in a web-available
database
SemTag
• Semantic Label Bureau
– Separate store of semantic annotation
information
– HTTP server that can be queried for
annotation information
– Example
• Find all semantic tags for a given document
• Find all semantic tags for a particular object
SemTag
• Methodology
SemTag
• Three phases
1. Spotting Pass:
– Tokenize the document
– All instances plus 20 word window
2. Learning Pass:
– Find corpus-wide distribution of terms at each internal node of taxonomy
– Based on a representative sample
3. Tagging Pass:
– Scan windows to disambiguate each reference
– Finally determined to be a TAP object
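The spotting pass can be sketched in a few lines. Two simplifying assumptions are made here: labels are single lowercase words, and the 20 word window is read as 10 words on each side of the spot (the `half_window` parameter). Neither detail is stated on the slide.

```python
import re


def spotting_pass(text, labels, half_window=10):
    """Tokenize text and return (label, context_words) for every spot."""
    tokens = re.findall(r"\w+", text.lower())
    spots = []
    for i, token in enumerate(tokens):
        if token in labels:
            left = tokens[max(0, i - half_window):i]
            right = tokens[i + 1:i + 1 + half_window]
            spots.append((token, left + right))
    return spots


spots = spotting_pass(
    "The new Jaguar has a powerful engine", {"jaguar"}, half_window=2
)
```

Each (label, context) pair is what the later learning and tagging passes would consume.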
SemTag
• Solution:
– Taxonomy Based Disambiguation (TBD)
• TBD expectation:
– Human tuned parameters used in small, critical sections
– Automated approaches deal with bulk of information
SemTag
• TBD methodology:
– Each node in the taxonomy is associated with
a set of labels
• Cats, Football, Cars all contain “jaguar”
– Each label in the text is stored with a window
of 20 words – the context
– Each node has an associated similarity
function mapping a context to a similarity
• Higher similarity -> more likely to contain a reference
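One way to picture a per-node similarity function is cosine similarity between the context window and a bag of terms attached to each node. This is an illustrative choice, not necessarily the similarity SemTag uses, and the `NODE_TERMS` profiles below are invented for the "jaguar" example.

```python
from collections import Counter
from math import sqrt

# Hypothetical term profiles for three taxonomy nodes that contain "jaguar".
NODE_TERMS = {
    "Cats": Counter("cat predator jungle prey fur hunts".split()),
    "Football": Counter("team quarterback season touchdown nfl game".split()),
    "Cars": Counter("engine sedan luxury dealer drive model".split()),
}


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_node(context_words, node_terms):
    """Score the context against every node and return the best match."""
    context = Counter(context_words)
    scores = {node: cosine(context, terms) for node, terms in node_terms.items()}
    return max(scores, key=scores.get), scores


context = "the jaguar is a big cat that hunts prey in the jungle".split()
node, scores = best_node(context, NODE_TERMS)
```

Here the context shares four terms with the Cats profile and none with the others, so Cats gets the highest similarity.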
SemTag
Is a context c appropriate for a node v?
References inside the taxonomy vs.
References outside the taxonomy
Multiple nodes: b = r  b != p(v)
Summary
• Information extraction can be motivated either as explicating more structure from the data or as an automated way to get to the Semantic Web
• Extraction complexity depends on whether the
text you have is “templated” or “free-form”
– Extraction from templated text can be done by regular
expressions
– Extraction from free form text requires NLP
• Can be done in terms of part-of-speech tagging
• “Annotation” involves connecting terms in a free
form text to items in the background knowledge
– It too can be automated