From Unstructured to Structured Data

From Unstructured Information to Linked Data
Axel Ngonga
Head of SIMBA@AKSW
University of Leipzig
IASLOD, August 15/16th 2012
Motivation
• Where does the LOD Cloud come from?
• Structured data
• Triplify, D2R
• Semi-structured data
• DBpedia
• Unstructured data
• ???
• Unstructured data make up 80% of the Web
• How do we extract Linked Data from unstructured data
sources?
Overview
1. Problem Definition
2. Named Entity Recognition
• Algorithms
• Ensemble Learning
NB: Will be mainly concerned with the newest developments.
3. Relation Extraction
• General approaches
• OpenIE approaches
4. Entity Disambiguation
• URI Lookup
• Disambiguation
5. Conclusion
Overview
1. Problem Definition
2. Named Entity Recognition
• Algorithms
• Ensemble Learning
3. Relation Extraction
• General approaches
• OpenIE approaches
4. Entity Disambiguation
• URI Lookup
• Disambiguation
5. Conclusion
Problem Definition
• Simple(?) problem: given a text fragment, automatically retrieve
• all entities and
• the relations between these entities, plus
• "ground" them in an ontology
• Also termed Knowledge Extraction
John Petrucci was born in New York.
:John_Petrucci dbo:birthPlace :New_York .
Problems
1. Finding entities → Named Entity Recognition
2. Finding relation instances → Relation Extraction
3. Finding URIs → URI Disambiguation
Overview
1. Problem Definition
2. Named Entity Recognition
• Algorithms
• Ensemble Learning
3. Relation Extraction
• General approaches
• OpenIE approaches
4. Entity Disambiguation
• URI Lookup
• Disambiguation
5. Conclusion
Named Entity Recognition
• Problem definition: Given a set of classes, find all strings
that are labels of instances of these classes within a text
fragment
John Petrucci was born in New York.
[John Petrucci, PER] was born in [New York, LOC].
Named Entity Recognition
• Problem definition: Given a set of classes, find all strings
that are labels of instances of these classes within a text
fragment
• Common sets of classes
• CoNLL03: Person, Location, Organization, Miscellaneous
• ACE05: Facility, Geo-Political Entity, Location, Organisation,
Person, Vehicle, Weapon
• BioNLP2004: Protein, DNA, RNA, cell line, cell type
• Several approaches
• Direct solutions (single algorithms)
• Ensemble Learning
NER: Overview of approaches
• Dictionary-based
• Hand-crafted Rules
• Machine Learning
• Hidden Markov Model (HMMs)
• Conditional Random Fields (CRFs)
• Neural Networks
• k Nearest Neighbors (kNN)
• Graph Clustering
• Ensemble Learning
• Veto-Based (Bagging, Boosting)
• Neural Networks
NER: Dictionary-based
• Simple Idea
1. Define mappings between words and classes, e.g., Paris → Location
2. Try to match each token of each sentence
3. Return the matched entities
✓ Time-efficient at runtime
× Manual creation of gazetteers
× Low precision (Paris = Person, Location)
× Low recall (esp. for Persons and Organizations as the number of instances grows)
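A minimal Python sketch of this dictionary-based lookup (the toy gazetteer and the longest-match strategy are illustrative assumptions, not a specific tool):

# Minimal sketch of dictionary-based NER (toy gazetteer, purely illustrative)
gazetteer = {"paris": "LOC", "new york": "LOC", "john petrucci": "PER"}

def dictionary_ner(sentence, gazetteer, max_len=3):
    tokens = sentence.split()
    entities, i = [], 0
    while i < len(tokens):
        match = None
        # try the longest phrase first
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower().strip(".,")
            if phrase in gazetteer:
                match = (" ".join(tokens[i:i + n]).strip(".,"), gazetteer[phrase])
                i += n
                break
        if match:
            entities.append(match)
        else:
            i += 1
    return entities

print(dictionary_ner("John Petrucci was born in New York.", gazetteer))
# [('John Petrucci', 'PER'), ('New York', 'LOC')]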
NER: Rule-based
• Simple Idea
1. Define a set of rules to find entities, e.g., [PERSON] was born in [LOCATION].
2. Try to match each sentence against one or several rules
3. Return the matched entities
✓ High precision
× Manual creation of rules is very tedious
× Low recall (finite number of patterns)
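A minimal sketch of the rule-based idea, assuming a single hand-crafted regular-expression rule over capitalized phrases (purely illustrative):

# Minimal sketch of rule-based NER with one hand-crafted pattern (illustrative only)
import re

# "[PERSON] was born in [LOCATION]." approximated as a capitalized-phrase regex
RULE = re.compile(r"([A-Z][a-z]+(?: [A-Z][a-z]+)*) was born in ([A-Z][a-z]+(?: [A-Z][a-z]+)*)")

def rule_ner(sentence):
    entities = []
    for person, location in RULE.findall(sentence):
        entities.append((person, "PER"))
        entities.append((location, "LOC"))
    return entities

print(rule_ner("John Petrucci was born in New York."))
# [('John Petrucci', 'PER'), ('New York', 'LOC')]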
NER: Markov Models
• Stochastic process satisfying the Markov property:
P(X_{t+1} = x_{t+1} | X_1 = x_1, …, X_t = x_t) = P(X_{t+1} = x_{t+1} | X_t = x_t)
• Equivalent to a finite-state machine
• Formally consists of
• a set S of states S_1, …, S_n
• a transition matrix M such that m_{ij} = P(X_{t+1} = S_j | X_t = S_i)
NER: Hidden Markov Models
• Extension of Markov Models
• States are hidden and assigned an output function
• Only the output is observed
• Transitions are learned from training data
• How do they work?
• Input: discrete sequence of features (e.g., POS tags, word stems, etc.)
• Goal: find the best sequence of states that represents the input
• Output: (hopefully) the right classification of each token
[Figure: hidden state sequence S0, S1, …, Sn emitting tags such as PER, _, LOC]
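A compact Viterbi-decoding sketch for such a tagger; the states, transition and emission probabilities below are toy values set by hand, whereas a real HMM tagger would estimate them from labelled training data:

# Minimal Viterbi sketch for an HMM tagger with toy, hand-set probabilities
import math

states = ["PER", "LOC", "O"]
start_p = {"PER": 0.3, "LOC": 0.2, "O": 0.5}
trans_p = {"PER": {"PER": 0.4, "LOC": 0.1, "O": 0.5},
           "LOC": {"PER": 0.1, "LOC": 0.4, "O": 0.5},
           "O":   {"PER": 0.2, "LOC": 0.2, "O": 0.6}}
emit_p = {"PER": {"john": 0.5, "petrucci": 0.5},
          "LOC": {"new": 0.5, "york": 0.5},
          "O":   {"was": 0.3, "born": 0.3, "in": 0.4}}

def viterbi(tokens):
    # best log-probability and path ending in each state, one column per token
    V = [{s: (math.log(start_p[s]) + math.log(emit_p[s].get(tokens[0], 1e-6)), [s])
          for s in states}]
    for tok in tokens[1:]:
        row = {}
        for s in states:
            best_prev = max(states, key=lambda p: V[-1][p][0] + math.log(trans_p[p][s]))
            score = (V[-1][best_prev][0] + math.log(trans_p[best_prev][s])
                     + math.log(emit_p[s].get(tok, 1e-6)))
            row[s] = (score, V[-1][best_prev][1] + [s])
        V.append(row)
    return max(V[-1].values())[1]

print(viterbi("john petrucci was born in new york".split()))
# e.g. ['PER', 'PER', 'O', 'O', 'O', 'LOC', 'LOC']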
NER: k Nearest Neighbors
• Idea
• Describe each token q from a labelled training data set with a set of features (e.g., left and right neighbors)
• Each new token t is described with the same features
• Assign t the class of its k nearest neighbors
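A toy sketch of this kNN idea, assuming left/right neighbors as features and a similarity-weighted vote among the k most similar training tokens (the training data is made up for illustration):

# Minimal kNN sketch: classify a token by the (similarity-weighted) vote of its k most
# similar training tokens, using left/right neighbors as features (toy data only)
from collections import Counter

# (left neighbor, token, right neighbor) -> class
train = [(("<s>", "john", "petrucci"), "PER"),
         (("john", "petrucci", "was"), "PER"),
         (("in", "new", "york"), "LOC"),
         (("new", "york", "</s>"), "LOC"),
         (("petrucci", "was", "born"), "O"),
         (("was", "born", "in"), "O")]

def similarity(a, b):
    # number of feature positions (left, token, right) that agree
    return sum(1 for x, y in zip(a, b) if x == y)

def knn_classify(features, k=3):
    neighbors = sorted(train, key=lambda ex: similarity(features, ex[0]), reverse=True)[:k]
    votes = Counter()
    for feats, label in neighbors:
        votes[label] += similarity(features, feats)
    return votes.most_common(1)[0][0]

print(knn_classify(("in", "new", "jersey")))  # 'LOC' on this toy data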
NER: So far …
• "Simple approaches"
• Apply one algorithm to the NER problem
• Bound to be limited by assumptions of model
• Implemented by a large number of tools
• Alchemy
• Stanford NER
• Illinois Tagger
• Ontos NER Tagger
• LingPipe
•…
NER: Ensemble Learning
• Intuition: Each algorithm has its strengths and
weaknesses
• Idea: Use ensemble learning to merge results of different
algorithms so as to create a meta-classifier of higher
accuracy
NER: Ensemble Learning
• Idea: Merge the results of several approaches for
improving results
• Simplest approaches:
• Voting
• Weighted voting
[Figure: Input → System 1 … System n → Merger → Output]
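A minimal sketch of (weighted) voting over the per-token outputs of several systems; the example tag sequences and weights are made up for illustration:

# Minimal sketch of (weighted) voting over per-token predictions from several NER systems
from collections import Counter

def weighted_vote(predictions, weights=None):
    """predictions: list of tag sequences, one per system, all of equal length."""
    weights = weights or [1.0] * len(predictions)
    merged = []
    for token_tags in zip(*predictions):
        votes = Counter()
        for tag, w in zip(token_tags, weights):
            votes[tag] += w
        merged.append(votes.most_common(1)[0][0])
    return merged

system_outputs = [["PER", "PER", "O", "O", "O", "LOC", "LOC"],   # e.g. a CRF-based tagger
                  ["PER", "O",   "O", "O", "O", "LOC", "LOC"],   # e.g. a dictionary-based tagger
                  ["PER", "PER", "O", "O", "O", "O",   "LOC"]]   # e.g. a rule-based tagger
print(weighted_vote(system_outputs, weights=[2.0, 1.0, 1.5]))
# ['PER', 'PER', 'O', 'O', 'O', 'LOC', 'LOC']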
NER: Ensemble Learning
• When does it work?
• Accuracy
• Need for existing solutions to be "good"
• Merging random results leads to random results
• Given: current approaches reach 80% F-score
• Diversity
• Need for the smallest possible amount of correlation between approaches
• E.g., merging two HMM-based taggers won't help
• Given: large number of approaches for NER
NER: FOX
• Federated Knowledge Extraction Framework
• Idea: Apply ensemble learning to NER
• Classical approach: Voting
• Does not make use of systematic errors
• Partly difficult to train
• Use neural networks instead
• Can make use of systematic errors
• Easy to train
• Converge fast
• http://fox.aksw.org
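A rough sketch of the underlying idea of a neural meta-classifier over tool outputs; this is not the FOX code, and scikit-learn, the one-hot encoding and the toy training data are assumptions made purely for illustration:

# Rough sketch of a neural meta-classifier over NER tool outputs (not the FOX
# implementation; scikit-learn and the toy data are assumptions for illustration)
from sklearn.neural_network import MLPClassifier

TAGS = ["O", "PER", "LOC"]

def encode(tool_tags):
    # one-hot encode the tag that each of the three tools assigned to a token
    vec = []
    for tag in tool_tags:
        vec += [1 if tag == t else 0 for t in TAGS]
    return vec

# toy training data: (tag from tool 1, tool 2, tool 3) -> gold tag
X = [encode(t) for t in [("PER", "PER", "O"), ("O", "O", "O"), ("LOC", "O", "LOC"),
                         ("PER", "O", "O"), ("O", "LOC", "LOC"), ("O", "O", "PER")]]
y = ["PER", "O", "LOC", "PER", "LOC", "O"]

clf = MLPClassifier(solver="lbfgs", hidden_layer_sizes=(8,), max_iter=2000,
                    random_state=0).fit(X, y)
print(clf.predict([encode(("LOC", "O", "LOC"))]))  # likely ['LOC'] after training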
NER: FOX (figure)
NER: FOX on MUC7 (evaluation figures)
NER: FOX on Website Data (evaluation figures)
NER: FOX on Companies and Countries
✓ No runtime issues (parallel implementation)
✓ NN overhead is small
× Overfitting
NER: Summary
• Large number of approaches
• Dictionaries
• Hand-Crafted rules
• Machine Learning
• Hybrid
• …
✓ Combining approaches leads to better results than single algorithms
Overview
1. Problem Definition
2. Named Entity Recognition
• Algorithms
• Ensemble Learning
3. Relation Extraction
• General approaches
• OpenIE approaches
4. Entity Disambiguation
• URI Lookup
• Disambiguation
5. Conclusion
RE: Problem Definition
• Find the relations between NEs if such relations exist.
• NEs not always given a-priori (open vs. closed RE)
John Petrucci was born in New York.
[John Petrucci, PER] was born in [New York, LOC].
bornIn ([John Petrucci, PER], [New York, LOC]).
RE: Approaches
• Hand-crafted rules
• Pattern Learning
• Coupled Learning
RE: Pattern-based
• Hearst patterns [Hearst: COLING‘92]
• POS-enhanced regular expression matching in natural-language text
NP0 {,} such as {NP1, NP2, … (and|or) }{,} NPn
NP0 {,}{NP1, NP2, … NPn-1}{,} or other NPn
“The bow lute, such as the Bambara ndang, is plucked and has an individual curved neck for each string.”
→ isA(“Bambara ndang”, “bow lute”)
✓ Time-efficient at runtime
× Very low recall
× Not adaptable to other relations
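A minimal sketch of the first Hearst pattern over raw text; real implementations match POS-tagged noun phrases, so the single-word hypernym used here is a deliberate simplification:

# Minimal sketch of "NP0 such as NP1, NP2 ... (and|or) NPn" over plain text
import re

PATTERN = re.compile(r"(\w+),? such as ([\w ,]+)")

def hearst_isa(text):
    facts = []
    for hypernym, np_list in PATTERN.findall(text):
        for hyponym in re.split(r", | and | or ", np_list):
            facts.append(("isA", hyponym.strip(), hypernym))
    return facts

print(hearst_isa("works by composers such as Bach, Mozart and Beethoven"))
# [('isA', 'Bach', 'composers'), ('isA', 'Mozart', 'composers'), ('isA', 'Beethoven', 'composers')]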
RE: DIPRE
• DIPRE = Dual Iterative Pattern Relation Extraction
• Semi-supervised, iterative gathering of facts and patterns
• Positive & negative examples as seeds for a given target relation, e.g., +(Hillary, Bill); +(Carla, Nicolas); –(Larry, Google)
• Various tuning parameters for pruning low-confidence patterns and facts
• Extended to SnowBall / QXtract
• Example patterns: "X and her husband Y", "X and Y on their honeymoon", "X and Y and their children", "X has been dating with Y", "X loves Y"
• Example facts: (Hillary, Bill), (Carla, Nicolas), (Angelina, Brad), (Victoria, David), (Larry, Google), …
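A toy sketch of the DIPRE-style loop, keeping only the text between the two arguments as the pattern; negative seeds and the confidence-based pruning mentioned above are omitted, and the corpus is made up for illustration:

# Minimal sketch of DIPRE-style bootstrapping: alternate between inducing patterns
# from seed pairs and extracting new pairs with those patterns (toy corpus only)
import re

corpus = ["Hillary and her husband Bill appeared together.",
          "Carla and her husband Nicolas visited Berlin.",
          "Angelina and Brad on their honeymoon were photographed.",
          "Victoria and David on their honeymoon stayed in Paris."]

def induce_patterns(pairs, sentences):
    """Keep the text between X and Y (the 'middle' context) as the pattern."""
    patterns = set()
    for x, y in pairs:
        for s in sentences:
            if x in s and y in s and s.index(x) < s.index(y):
                patterns.add(s[s.index(x) + len(x):s.index(y)])
    return patterns

def apply_patterns(patterns, sentences):
    """Extract new (X, Y) pairs by matching the induced middles against the corpus."""
    pairs = set()
    for middle in patterns:
        regex = r"(\w+)" + re.escape(middle) + r"(\w+)"
        for s in sentences:
            for m in re.finditer(regex, s):
                pairs.add((m.group(1), m.group(2)))
    return pairs

seeds = {("Hillary", "Bill"), ("Angelina", "Brad")}
print(apply_patterns(induce_patterns(seeds, corpus), corpus))
# includes the new pairs ('Carla', 'Nicolas') and ('Victoria', 'David'), plus noisy pairs
# from the overly generic middle " and ", which is why real systems prune low-confidence patterns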
RE: NELL
• Never-Ending Language Learner (http://rtw.ml.cmu.edu/)
• Open IE with an ontological backbone
• Closed set of categories & typed relations, e.g., athletePlaysForTeam(Athlete, SportsTeam)
• Seeds/counter-seeds (5-10), e.g., athletePlaysForTeam(Alex Rodriguez, Yankees)
• Open set of predicate arguments (instances), e.g., athletePlaysForTeam(Alexander_Ovechkin, Penguins)
• Coupled iterative learners
• Constantly running over a large Web corpus since January 2010 (200 million pages)
• Periodic human supervision
RE: NELL
• Conservative strategy → avoid semantic drift
RE: BOA
• Bootstrapping Linked Data (http://boa.aksw.org)
• Core idea: Use instance data in the Data Web to discover NL patterns and new instances
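A rough sketch of this core idea, assuming the instance pairs were already fetched from a knowledge base (e.g., dbo:birthPlace triples); this is not the BOA implementation, and the corpus and pairs are toy data:

# Rough sketch of the BOA idea: take known (subject, object) label pairs for a relation
# and harvest the natural-language strings that connect them in a corpus
known_pairs = [("John Petrucci", "New York"),      # e.g. dbo:birthPlace instance labels
               ("Angela Merkel", "Hamburg")]

corpus = ["John Petrucci was born in New York.",
          "Angela Merkel was born in Hamburg.",
          "Kurt Masur was born in Brieg."]

def harvest_patterns(pairs, sentences):
    patterns = {}
    for subj, obj in pairs:
        for s in sentences:
            if subj in s and obj in s:
                middle = s[s.index(subj) + len(subj):s.index(obj)]
                patterns[middle] = patterns.get(middle, 0) + 1   # frequency for later scoring
    return patterns

print(harvest_patterns(known_pairs, corpus))
# {' was born in ': 2}  -> the pattern can now be applied to find new pairs such as (Kurt Masur, Brieg)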
RE: BOA
• Follows a conservative strategy
• Only the top pattern
• Frequency threshold
• Score threshold
• Evaluation results (figure)
RE: Summary
• Several approaches
• Hand-crafted rules
• Machine Learning
• Hybrid
✓ Large number of instances available for many relations
✓ Runtime problem → parallel implementations
✓ Many new facts can be found
× Semantic Drift
× Long tail
× Entity Disambiguation
Overview
1. Problem Definition
2. Named Entity Recognition
• Algorithms
• Ensemble Learning
3. Relation Extraction
• General approaches
• OpenIE approaches
4. Entity Disambiguation
• URI Lookup
• Disambiguation
5. Conclusion
ED: Problem Definition
• Given (a) reference knowledge base(s), a text fragment, a list of NEs (incl. positions), and a list of relations, find URIs for each of the NEs and relations
• Very difficult problem
• Ambiguity, e.g., Paris = Paris Hilton? Paris (France)?
• Difficult even for humans, e.g.,
• "Paris' mayor died yesterday"
• Several solutions
• Indexing
• Surface Form
• Graph-based
ED: Problem Definition
John Petrucci was born in New York.
[John Petrucci, PER] was born in [New York, LOC].
bornIn([John Petrucci, PER], [New York, LOC]).
:John_Petrucci dbo:birthPlace :New_York .
ED: Indexing
• More retrieval than disambiguation
• Similar to dictionary-based approaches
• Idea
• Index all labels in reference knowledge base
• Given an input label, retrieve all entities with a similar
label
× Poor recall (unknown surface forms, e.g., "Mme Curie" for "Marie Curie")
× Low precision (Paris = Paris Hilton, Paris (France), …)
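A minimal sketch of such a label index, with toy labels; it also reproduces the two weaknesses listed above:

# Minimal sketch of label indexing for URI lookup (toy label data, illustrative only)
from collections import defaultdict

labels = {":Paris": ["Paris"], ":Paris_Hilton": ["Paris Hilton", "Paris"],
          ":New_York": ["New York", "NYC"], ":Marie_Curie": ["Marie Curie"]}

index = defaultdict(set)
for uri, names in labels.items():
    for name in names:
        index[name.lower()].add(uri)          # normalization: lower-casing only

def lookup(surface_form):
    return index.get(surface_form.lower(), set())

print(lookup("Paris"))        # {':Paris', ':Paris_Hilton'}  -> low precision
print(lookup("Mme Curie"))    # set()                        -> unknown surface form, poor recall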
ED: Type Disambiguation
• Extension of indexing
• Index all labels
• Infer type information
• Retrieve labels from entities of the given type
• Same recall as previous approach
• Higher precision
• Paris[LOC] != Paris[PER]
• Still, Paris (France) vs. Paris (Ontario)
• Need for context
ED: Spotlight
• Known surface forms (http://dbpedia.org/spotlight)
• Based on DBpedia + Wikipedia
• Uses supplementary knowledge including disambiguation
pages, redirects, wikilinks
• Three main steps
• Spotting: Finding possible mentions of DBpedia
resources, e.g.,
John Petrucci was born in New York.
• Candidate Selection: Find possible URIs, e.g.,
John Petrucci → :John_Petrucci
New York → :New_York, :New_York_County, …
• Disambiguation: Map the context to a vector for each candidate resource, e.g.,
New York → :New_York
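A rough sketch of the disambiguation step as cosine similarity between bag-of-words context vectors; the candidate context vectors are toy values, not the actual DBpedia Spotlight model:

# Rough sketch: compare the mention's textual context with a bag-of-words context
# vector per candidate resource and keep the most similar candidate (toy vectors)
import math
from collections import Counter

candidate_contexts = {":New_York": Counter({"city": 3, "born": 2, "manhattan": 2}),
                      ":New_York_County": Counter({"county": 3, "court": 2, "borough": 1})}

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def disambiguate(mention_context, candidates):
    ctx = Counter(mention_context.lower().split())
    return max(candidates, key=lambda uri: cosine(ctx, candidates[uri]))

print(disambiguate("John Petrucci was born in New York", candidate_contexts))
# ':New_York'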
ED: YAGO2
• Joint Disambiguation
♬ Mississippi, one of Bob's later songs, was first recorded by Sheryl on her album.
ED: YAGO2
[Figure: mention-entity graph]
• Mentions of entities: Mississippi, Bob, Sheryl
• Entity candidates: Mississippi (Song), Mississippi (State), Bob Dylan Songs, Sheryl Cruz, Sheryl Lee, Sheryl Crow
• Edge weights: prior(m_l, e_i), sim(cxt(m_l), cxt(e_i)), coh(e_i, e_j)
• Objective: maximize an objective function (e.g., total weight)
• Constraint: keep at least one entity per mention
ED: FOX
• Generic approach
• A-priori score (a): popularity of URIs
• Similarity score (s): similarity of resource labels and text
• Coherence score (z): correlation between URIs
[Figure: candidate graph; nodes scored by a|s, edges weighted by z]
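A rough sketch of combining the three scores into a single objective over candidate assignments, using exhaustive search and hand-set toy scores and weights; this is not the FOX implementation, which relies on the algorithms listed on the next slide:

# Rough sketch: combine a-priori (a), similarity (s) and coherence (z) scores into one
# objective and pick the best joint assignment of URIs to mentions (toy scores only)
from itertools import product

# candidate URIs per mention with (a-priori, similarity) scores
candidates = {"Paris":  {":Paris": (0.8, 0.9), ":Paris_Hilton": (0.6, 0.9)},
              "France": {":France": (0.9, 1.0)}}

# coherence between URI pairs, e.g. derived from links between the resources
coherence = {(":Paris", ":France"): 0.9, (":Paris_Hilton", ":France"): 0.1}

def score(assignment, alpha=1.0, beta=1.0, gamma=1.0):
    # local scores: a-priori and similarity of each chosen URI
    total = 0.0
    for mention, uri in assignment.items():
        a, s = candidates[mention][uri]
        total += alpha * a + beta * s
    # global score: pairwise coherence between the chosen URIs
    uris = list(assignment.values())
    for i in range(len(uris)):
        for j in range(i + 1, len(uris)):
            pair = (uris[i], uris[j])
            total += gamma * coherence.get(pair, coherence.get(pair[::-1], 0.0))
    return total

def best_assignment():
    mentions = list(candidates)
    return max((dict(zip(mentions, combo))
                for combo in product(*(candidates[m] for m in mentions))),
               key=score)

print(best_assignment())
# {'Paris': ':Paris', 'France': ':France'}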
ED: FOX
• Allows the use of several algorithms
• HITS
• PageRank
• Apriori
• Propagation Algorithms
•…
ED: Summary
• Difficult problem even for humans
• Several approaches
• Simple search
• Search with restrictions
• Known surface forms
• Graph-based
✓ Improved F-score for DBpedia (70-80%)
× Low F-Score for generic knowledge bases
× Intrinsically difficult
× Still a lot to do
Overview
1. Problem Definition
2. Named Entity Recognition
• Algorithms
• Ensemble Learning
3. Relation Extraction
• General approaches
• OpenIE approaches
4. Entity Disambiguation
• URI Lookup
• Disambiguation
5. Conclusion
Conclusion
• Discussed basics of …
• Knowledge Extraction problem
• Named Entity Recognition
• Relation Extraction
• Entity Disambiguation
• Still a lot of research necessary
• Ensemble and active Learning
• Entity Disambiguation
• Question Answering …
Thank You!
Questions?