FROM UNSTRUCTURED INFORMATION TO LINKED DATA
Axel Ngonga, Head of SIMBA@AKSW, University of Leipzig
IASLOD, August 15/16, 2012

Motivation
• Where does the LOD Cloud come from?
  • Structured data → Triplify, D2R
  • Semi-structured data → DBpedia
  • Unstructured data → ???
• Unstructured data make up 80% of the Web
• How do we extract Linked Data from unstructured data sources?

Overview
1. Problem Definition
2. Named Entity Recognition: Algorithms, Ensemble Learning
3. Relation Extraction: General approaches, OpenIE approaches
4. Entity Disambiguation: URI Lookup, Disambiguation
5. Conclusion
NB: We will mainly be concerned with the newest developments.

Problem Definition
• Simple(?) problem: given a text fragment, retrieve
  • all entities and
  • the relations between these entities automatically, plus
  • "ground them" in an ontology
• Also coined Knowledge Extraction
• Example: John Petrucci was born in New York. → :John_Petrucci dbo:birthPlace :New_York .

Problems
1. Finding entities → Named Entity Recognition
2. Finding relation instances → Relation Extraction
3. Finding URIs → URI Disambiguation

Named Entity Recognition
• Problem definition: given a set of classes, find all strings that are labels of instances of these classes within a text fragment
• Example: John Petrucci was born in New York. → [John Petrucci, PER] was born in [New York, LOC].
• Common sets of classes
  • CoNLL03: Person, Location, Organization, Miscellaneous
  • ACE05: Facility, Geo-Political Entity, Location, Organisation, Person, Vehicle, Weapon
  • BioNLP2004: Protein, DNA, RNA, cell line, cell type
• Several approaches
  • Direct solutions (single algorithms)
  • Ensemble Learning

NER: Overview of approaches
• Dictionary-based
• Hand-crafted rules
• Machine Learning
  • Hidden Markov Models (HMMs)
  • Conditional Random Fields (CRFs)
  • Neural Networks
  • k Nearest Neighbors (kNN)
  • Graph Clustering
• Ensemble Learning
  • Veto-based (Bagging, Boosting)
  • Neural Networks

NER: Dictionary-based
• Simple idea (a minimal sketch follows the next slide):
  1. Define mappings between words and classes, e.g., Paris → Location
  2. Try to match each token of each sentence
  3. Return the matched entities
• Time-efficient at runtime
• × Manual creation of gazetteers
• × Low precision (Paris = Person, Location)
• × Low recall (especially on Persons and Organizations, as the number of instances grows)

NER: Rule-based
• Simple idea (see the sketch below):
  1. Define a set of rules to find entities, e.g., [PERSON] was born in [LOCATION].
  2. Try to match each sentence to one or several rules
  3. Return the matched entities
• High precision
• × Manual creation of rules is very tedious
• × Low recall (finite number of patterns)
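To make the dictionary- and rule-based ideas concrete, here is a minimal Python sketch. The gazetteer entries, the single "was born in" rule, and all function names are illustrative assumptions and are not taken from any of the tools discussed in these slides.

```python
import re

# Tiny illustrative gazetteer: surface form -> class (entries are assumptions).
GAZETTEER = {
    "John Petrucci": "PER",
    "New York": "LOC",
    "Paris": "LOC",  # ambiguous: could also be a person, which hurts precision
}

# One hand-crafted rule for "[PERSON] was born in [LOCATION]."
BORN_IN_RULE = re.compile(r"^(?P<per>[A-Z][\w. ]+?) was born in (?P<loc>[A-Z][\w. ]+?)\.$")

def dictionary_ner(sentence):
    """Return (surface form, class) pairs found by simple substring lookup."""
    return [(form, cls) for form, cls in GAZETTEER.items() if form in sentence]

def rule_ner(sentence):
    """Return (surface form, class) pairs found by the single hand-crafted rule."""
    match = BORN_IN_RULE.match(sentence)
    if not match:
        return []
    return [(match.group("per"), "PER"), (match.group("loc"), "LOC")]

if __name__ == "__main__":
    sentence = "John Petrucci was born in New York."
    print(dictionary_ner(sentence))  # [('John Petrucci', 'PER'), ('New York', 'LOC')]
    print(rule_ner(sentence))        # [('John Petrucci', 'PER'), ('New York', 'LOC')]
```

Both functions illustrate why these approaches are fast at runtime but limited in recall: they only find what the gazetteer or the (finite) rule set already anticipates.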
NER: Markov Models
• Stochastic process such that (Markov property):
  P(X_{t+1} = x_{t+1} | X_1 = x_1, …, X_t = x_t) = P(X_{t+1} = x_{t+1} | X_t = x_t)
• Equivalent to a finite-state machine
• Formally consists of
  • a set S of states S_1, …, S_n
  • a matrix M such that m_ij = P(X_{t+1} = S_j | X_t = S_i)

NER: Hidden Markov Models
• Extension of Markov Models
• States are hidden and assigned an output function; only the output is seen
• Transitions are learned from training data
• (Diagram: states S_0, S_1, …, S_n emitting outputs such as PER, _, LOC)
• How do they work?
  • Input: discrete sequence of features (e.g., POS tags, word stems, etc.)
  • Goal: find the best sequence of states that represents the input
  • Output: the (hopefully correct) classification of each token

NER: k Nearest Neighbors
• Idea
  • Describe each token q from a labelled training data set with a set of features (e.g., left and right neighbors)
  • Each new token t is described with the same features
  • Assign t the class of its k nearest neighbors

NER: So far …
• "Simple approaches"
  • Apply one algorithm to the NER problem
  • Bound to be limited by the assumptions of the model
• Implemented by a large number of tools
  • Alchemy
  • Stanford NER
  • Illinois Tagger
  • Ontos NER Tagger
  • LingPipe
  • …

NER: Ensemble Learning
• Intuition: each algorithm has its strengths and weaknesses
• Idea: use ensemble learning to merge the results of different algorithms so as to create a meta-classifier of higher accuracy
• Simplest approaches: voting, weighted voting
• (Diagram: Input → System 1, System 2, …, System n → Merger → Output)

NER: Ensemble Learning — when does it work?
• Accuracy
  • The existing solutions need to be "good": merging random results leads to random results
  • Given: current approaches reach 80% F-score
• Diversity
  • Need for the smallest possible correlation between the approaches; e.g., merging two HMM-based taggers won't help
  • Given: large number of approaches for NER

NER: FOX
• Federated Knowledge Extraction Framework (http://fox.aksw.org)
• Idea: apply ensemble learning to NER
• Classical approach: voting
  • Does not make use of systematic error
  • Partly difficult to train
• Use neural networks instead
  • Can make use of systematic error
  • Easy to train, converge fast

NER: FOX — Evaluation
• Results on MUC7, on website data, and on companies and countries (charts not reproduced)
• No runtime issues (parallel implementation)
• NN overhead is small
• × Overfitting

NER: Summary
• Large number of approaches
  • Dictionaries
  • Hand-crafted rules
  • Machine Learning
  • Hybrid
  • …
• Combining approaches leads to better results than single algorithms (see the voting sketch below)
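As a concrete illustration of the simplest ensemble strategy above (plain majority voting over per-token tags, not FOX's neural-network merger), here is a minimal Python sketch; the three tagger outputs are invented for the example.

```python
from collections import Counter

def majority_vote(tag_sequences, default="O"):
    """Merge per-token NER tags from several systems by majority vote.

    tag_sequences: list of equal-length tag lists, one per system.
    A tag is kept only if it has a strict majority; otherwise `default`
    (the 'outside' tag O) is used.
    """
    merged = []
    for token_tags in zip(*tag_sequences):
        tag, freq = Counter(token_tags).most_common(1)[0]
        merged.append(tag if freq > len(token_tags) / 2 else default)
    return merged

if __name__ == "__main__":
    # Hypothetical outputs of three taggers for
    # "John Petrucci was born in New York ."
    system_a = ["PER", "PER", "O", "O", "O", "LOC", "LOC", "O"]
    system_b = ["PER", "PER", "O", "O", "O", "LOC", "LOC", "O"]
    system_c = ["O",   "PER", "O", "O", "O", "ORG", "LOC", "O"]
    print(majority_vote([system_a, system_b, system_c]))
    # ['PER', 'PER', 'O', 'O', 'O', 'LOC', 'LOC', 'O']
```

Weighted voting would replace the raw counts with per-system weights (e.g., validation F-scores); FOX instead learns the merging function with a neural network.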
RE: Problem Definition
• Find the relations between NEs, if such relations exist
• NEs are not always given a priori (open vs. closed RE)
• Example: John Petrucci was born in New York. → [John Petrucci, PER] was born in [New York, LOC]. → bornIn([John Petrucci, PER], [New York, LOC]).

RE: Approaches
• Hand-crafted rules
• Pattern Learning
• Coupled Learning

RE: Pattern-based
• Hearst patterns [Hearst: COLING'92]
• POS-enhanced regular expression matching in natural-language text
  • NP0 {,} such as {NP1, NP2, … (and|or)} {,} NPn
  • NP0 {,} {NP1, NP2, …, NPn-1} {,} or other NPn
• Example: "The bow lute, such as the Bambara ndang, is plucked and has an individual curved neck for each string." → isA("Bambara ndang", "bow lute")
• Time-efficient at runtime (a minimal sketch of the "such as" pattern follows the RE summary below)
• × Very low recall
• × Not adaptable to other relations

RE: DIPRE
• DIPRE = Dual Iterative Pattern Relation Extraction
• Semi-supervised, iterative gathering of facts and patterns
• Positive and negative examples as seeds for a given target relation, e.g., +(Hillary, Bill); +(Carla, Nicolas); −(Larry, Google)
• Bootstrapping cycle (figure): seed pairs such as (Hillary, Bill) and (Carla, Nicolas) yield patterns such as "X and her husband Y" and "X and Y on their honeymoon"; these patterns yield new pairs such as (Angelina, Brad) and (Victoria, David); the new pairs in turn yield further patterns such as "X and Y and their children", "X has been dating with Y", "X loves Y"
• Various tuning parameters for pruning low-confidence patterns and facts
• Extended to Snowball / QXtract

RE: NELL
• Never-Ending Language Learner (http://rtw.ml.cmu.edu/)
• Open IE with an ontological backbone
• Closed set of categories and typed relations, e.g., athletePlaysForTeam(Athlete, SportsTeam)
• Seeds/counter-seeds (5-10), e.g., athletePlaysForTeam(Alex Rodriguez, Yankees)
• Open set of predicate arguments (instances), e.g., athletePlaysForTeam(Alexander_Ovechkin, Penguins)
• Coupled iterative learners
• Constantly running over a large Web corpus since January 2010 (200 million pages)
• Periodic human supervision
• Conservative strategy: avoid semantic drift

RE: BOA
• Bootstrapping Linked Data (http://boa.aksw.org)
• Core idea: use instance data in the Data Web to discover natural-language patterns and new instances
• Follows a conservative strategy
  • Only the top pattern
  • Frequency threshold
  • Score threshold
• Evaluation results (charts not reproduced)

RE: Summary
• Several approaches
  • Hand-crafted rules
  • Machine Learning
  • Hybrid
• A large number of instances is available for many relations
• Many new facts can be found
• Runtime problems can be addressed with parallel implementations
• × Semantic drift
• × Long tail
• × Entity disambiguation
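The following minimal Python sketch illustrates the first "such as" Hearst pattern from the RE: Pattern-based slide. The regex is a deliberately crude approximation that works on raw text instead of POS-tagged noun phrases, so it is an illustration of the idea rather than a faithful reimplementation.

```python
import re

# Simplified "X, such as Y" pattern. A real implementation would match
# POS-tagged noun phrases; here plain word spans are used (an assumption).
HEARST_SUCH_AS = re.compile(
    r"(?P<hypernym>\w+(?: \w+)*?),? such as (?:the )?(?P<hyponym>\w+(?: \w+)*?)[,.]"
)

def strip_determiner(noun_phrase):
    """Drop a leading determiner so the output matches the slide's example."""
    for det in ("The ", "the ", "A ", "a ", "An ", "an "):
        if noun_phrase.startswith(det):
            return noun_phrase[len(det):]
    return noun_phrase

def extract_isa(sentence):
    """Yield (hyponym, hypernym) pairs matched by the simplified pattern."""
    for m in HEARST_SUCH_AS.finditer(sentence):
        yield strip_determiner(m.group("hyponym")), strip_determiner(m.group("hypernym"))

if __name__ == "__main__":
    text = ("The bow lute, such as the Bambara ndang, is plucked and has "
            "an individual curved neck for each string.")
    for hyponym, hypernym in extract_isa(text):
        print(f'isA("{hyponym}", "{hypernym}")')
    # isA("Bambara ndang", "bow lute")
```

DIPRE, NELL, and BOA go one step further: instead of fixing such patterns by hand, they learn new patterns from seed pairs and new pairs from the learned patterns.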
ED: Problem Definition
• Given (a) reference knowledge base(s), a text fragment, a list of NEs (incl. positions), and a list of relations, find URIs for each of the NEs and relations
• Very difficult problem
  • Ambiguity, e.g., Paris = Paris Hilton? Paris (France)?
  • Difficult even for humans, e.g., "Paris' mayor died yesterday"
• Several solutions
  • Indexing
  • Surface forms
  • Graph-based
• Example: John Petrucci was born in New York. → [John Petrucci, PER] was born in [New York, LOC]. → bornIn([John Petrucci, PER], [New York, LOC]). → :John_Petrucci dbo:birthPlace :New_York .

ED: Indexing
• More retrieval than disambiguation; similar to dictionary-based approaches
• Idea
  • Index all labels in the reference knowledge base
  • Given an input label, retrieve all entities with a similar label
• × Poor recall (unknown surface forms, e.g., "Mme Curie" for "Marie Curie")
• × Low precision (Paris = Paris Hilton, Paris (France), …)

ED: Type Disambiguation
• Extension of indexing
  • Index all labels
  • Infer type information
  • Retrieve labels from entities of the given type
• Same recall as the previous approach
• Higher precision: Paris[LOC] != Paris[PER]
• Still, Paris (France) vs. Paris (Ontario) → need for context

ED: Spotlight
• Known surface forms (http://dbpedia.org/spotlight)
• Based on DBpedia + Wikipedia
• Uses supplementary knowledge including disambiguation pages, redirects, and wikilinks
• Three main steps
  • Spotting: find possible mentions of DBpedia resources, e.g., "John Petrucci was born in New York."
  • Candidate selection: find possible URIs, e.g., John Petrucci → :John_Petrucci; New York → :New_York, :New_York_County, …
  • Disambiguation: map the context to a vector for each resource, e.g., New York → :New_York

ED: YAGO2
• Joint disambiguation
• Example: ♬ Mississippi, one of Bob's later songs, was first recorded by Sheryl on her album.
• (Figure: graph of mentions and entity candidates, e.g., Mississippi (Song), Mississippi (State); Bob Dylan; Sheryl Cruz, Sheryl Lee, Sheryl Crow)
• Edge weights: mention-entity prior prior(m_l, e_i), context similarity sim(cxt(m_l), cxt(e_i)), entity-entity coherence coh(e_i, e_j)
• Objective: maximize an objective function (e.g., the total weight)
• Constraint: keep at least one entity per mention

ED: FOX
• Generic approach
  • A-priori score (a): popularity of URIs
  • Similarity score (s): similarity of resource labels and text
  • Coherence score (z): correlation between URIs
• Allows the use of several algorithms
  • HITS
  • PageRank
  • Apriori
  • Propagation algorithms
  • …

ED: Summary
• Difficult problem, even for humans
• Several approaches
  • Simple search
  • Search with restrictions
  • Known surface forms
  • Graph-based
• Improved F-score for DBpedia (70-80%)
• × Low F-score for generic knowledge bases
• × Intrinsically difficult
• Still a lot to do

Conclusion
• Discussed the basics of
  • the Knowledge Extraction problem
  • Named Entity Recognition
  • Relation Extraction
  • Entity Disambiguation
• Still a lot of research necessary
  • Ensemble and active learning
  • Entity Disambiguation
  • Question Answering
  • …

Thank You! Questions?