Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial

Diana Maynard (University of Sheffield)
Julien Nioche (University of Sheffield)
Marta Sabou (Vrije Universiteit Amsterdam)
Johanna Völker (AIFB)
Atanas Kiryakov (Ontotext Lab, Sirma AI)

EKAW 2006

[This work has been supported by SEKT (http://sekt.semanticweb.org/) and KnowledgeWeb (http://knowledgeweb.semanticweb.org/)]
Structure of the Tutorial
1. Motivation, background
2. GATE overview
3. Information Extraction
4. GATE’s HLT components
5. IE and the Semantic Web
6. Ontology learning with Text2Onto
7. Focused ontology learning
8. Massive Semantic Annotation
Aims of this tutorial
• Investigates some technical aspects of HLT for the SW and brings this methodology closer to non-HLT experts
• Provides an introduction to an HLT toolkit (GATE)
• Demonstrates using HLT for automating SW-specific knowledge acquisition tasks such as:
  – Semantic annotation
  – Ontology learning
  – Ontology population
Some Terminology
• Semantic annotation – annotate in the texts all
mentions of instances relating to concepts in the
ontology
• Ontology learning – automatically derive an
ontology from texts
• Ontology population – given an ontology,
populate the concepts with instances derived
automatically from a text
Semantic Annotation: Motivation
• Semantic metadata extraction and annotation is the
glue that ties ontologies into document spaces
• Metadata is the link between knowledge and its
management
• Manual metadata production cost is too high
• State-of-the-art in automatic annotation needs
extending to target ontologies and scale to industrial
document stores and the web
Challenge of the Semantic Web
• The Semantic Web requires machine
processable, repurposable data to complement
hypertext
• Once metadata is attached to documents, they
become much more useful and more easily
processable, e.g. for categorising, finding
relevant information, and monitoring
• Such metadata can be divided into two types of
information: explicit and implicit.
Metadata extraction
• Explicit metadata extraction involves
information describing the document, such as
that contained in the header information of
HTML documents (titles, abstracts, authors,
creation date, etc.)
• Implicit metadata extraction involves semantic
information deduced from the text, i.e.
endogenous information such as names of
entities and relations contained in the text. This
essentially involves Information Extraction
techniques, often with the help of an ontology.
Ontology Learning and Population:
Motivation
• Creating and populating ontologies manually is a
very time-consuming and labour-intensive task
• It requires both domain and ontology experts
• Manually created ontologies are generally not compatible with other ontologies, which reduces interoperability and reuse
• Manual methods are impossible with very large amounts of data
Semantic Annotation vs Ontology
Population
• Semantic Annotation
– Mentions of instances in the text are annotated wrt
concepts (classes) in the ontology.
– Requires that instances are disambiguated.
– It is the text which is modified.
• Ontology Population
– Generates new instances in an ontology from a text.
– Links unique mentions of instances in the text to
instances of concepts in the ontology.
– Instances must not only be disambiguated; co-reference between them must also be established.
– It is the ontology which is modified.
Structure of the Tutorial
1. Motivation, background
2. GATE overview
3. Information Extraction
4. GATE’s HLT components
5. IE and the Semantic Web
6. Ontology learning with Text2Onto
7. Focused ontology learning
8. Massive Semantic Annotation
GATE: an open source framework for HLT
• GATE (General Architecture for Text Engineering) is a framework for language processing (http://gate.ac.uk)
• Open Source (LGPL licence)
• Hosted on SourceForge: http://sourceforge.net/projects/gate
• Ten years old (!), with 1000s of users at 100s of sites
• Current version 3.1
4 sides to the story
• An architecture: A macro-level organisational
picture for HLT software systems.
• A framework: For programmers, GATE is an
object-oriented class library that implements
the architecture.
• A development environment: For language
engineers, computational linguists et al, a
graphical development environment.
• A community of users and contributors
Architectural principles
• Non-prescriptive, theory neutral
(strength and weakness)
• Re-use, interoperation, not reimplementation
(e.g. diverse XML support, integration of Protégé,
Jena, Yale...)
• (Almost) everything is a component, and
component sets are user-extendable
• (Almost) all operations are available both from API
and GUI
All the world’s a Java Bean....
CREOLE: a Collection of REusable Objects for
Language Engineering:
• GATE components: modified Java Beans with
XML configuration
• The minimal component = 10 lines of Java, 10
lines of XML, 1 URL
Why bother?
• Allows the system to load arbitrary language
processing components
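To make the “modified Java Beans” idea concrete, here is a minimal sketch of a CREOLE processing resource, assuming the GATE 3.x embedded API; the class name and output are illustrative, and the matching creole.xml metadata (not shown) would register the class with the framework:

// A minimal CREOLE component sketch (GATE 3.x era API; names illustrative)
package demo;

import gate.creole.AbstractLanguageAnalyser;
import gate.creole.ExecutionException;

public class HelloAnalyser extends AbstractLanguageAnalyser {
    // Called once per document when this PR runs in a pipeline
    public void execute() throws ExecutionException {
        // A real component would read and create annotations here
        System.out.println("Processing document: " + getDocument().getName());
    }
}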
GATE APIs
[Architecture diagram: GATE is organised in layers. A Document Format layer handles PDF, RTF, HTML, XML and email input; the Corpus layer holds Language Resources (documents, corpora, annotation sets with NE and co-reference annotations, content and feature maps); the Processing layer (PRs) holds tokenisers, POS taggers, gazetteers and applications such as ANNIE and OBIE; a Language Resource layer integrates ontologies, Protégé and WordNet; a DataStore layer persists to XML, Oracle, PostgreSQL or serialised files; visual resources (AnnotationDiff, ontology and document viewers) form the IDE/GUI layer above the Application layer.]

NOTES
• everything is a replaceable bean
• all communication via fixed APIs
• low coupling, high modularity, high extensibility
GATE Users
• American National Corpus project
• Perseus Digital Library project, Tufts University, US
• Longman Pearson publishing, UK
• Merck KGaA, Germany
• Canon Europe, UK
• Knight Ridder, US
• BBN (leading HLT research lab), US
• SMEs: Melandra, SG-MediaStyle, ...
• a large number of other UK, US and EU Universities
• UK and EU projects inc. SEKT, PrestoSpace,
KnowledgeWeb, MyGrid, CLEF, Dot.Kom, AMITIES,
CubReporter, …
Past Projects using GATE
• MUMIS: conceptual indexing: automatic semantic
indices for sports video
• MUSE: multi-genre multilingual IE
• HSL: IE in domain of health and safety
• Old Bailey: IE on 17th century court reports
• Multiflora: plant taxonomy text analysis for biodiversity
research in e-science
• EMILLE: creation of S. Asian language corpus
• ACE / TIDES: IE competitions and collaborations in
English, Chinese, Arabic, Hindi
• h-TechSight: ontology-based IE and text mining
Current projects using GATE
• ETCSL: Language tools for Sumerian digital library
• SEKT: Semantic Knowledge Technologies
• PrestoSpace: Preservation of audiovisual data
• KnowledgeWeb: Semantic Web network of excellence
• MEDIACAMPAIGN: Discovering, inter-relating and navigating cross-media campaign knowledge
• TAO: Transitioning Applications to Ontologies
• MUSING: SW-based business intelligence tools
• NEON: Networked Ontologies
GATE
[Screenshot: the GATE development environment]
Structure of the Tutorial
1. Motivation, background
2. GATE overview
3. Information Extraction
4. GATE’s HLT components
5. IE and the Semantic Web
6. Ontology learning with Text2Onto
7. Focused ontology learning
8. Massive Semantic Annotation
IE is not IR
IR pulls documents from large text collections (usually the Web) in response to specific keywords or queries. You analyse the documents.

IE pulls facts and structured information from the content of large text collections. You analyse the facts.
IE for Document Access
• With traditional query engines, getting the
facts can be hard and slow
• Where has the Queen visited in the last year?
• Which places on the East Coast of the US have
had cases of West Nile Virus?
• Which search terms would you use to get this
kind of information?
• How can you specify you want someone’s
home page?
• IE returns information in a structured way
• IR returns documents containing the relevant
information somewhere (if you’re lucky)
HaSIE: an example application
• Application developed by University of Sheffield,
which aims to find out how companies report
about health and safety information
• Answers questions such as:
“How many members of staff died or had accidents
in the last year?”
“Is there anyone responsible for health and
safety?”
“What measures have been put in place to
improve health and safety in the workplace?”
HaSIE
• Identification of such information is too time-consuming and arduous to be done manually.
• Each company report may be hundreds of
pages long.
• IR systems can’t help because they return whole
documents
• System identifies relevant sections of each
document, pulls out sentences about health and
safety issues, and populates a database with
relevant information
• This can then be analysed by an expert
HaSIE
[Screenshot: the HaSIE application]
Named Entity Recognition: the
cornerstone of IE
• Identification of proper names in texts, and their
classification into a set of predefined categories
of interest
• Persons
• Organisations (companies, government
organisations, committees, etc)
• Locations (cities, countries, rivers, etc)
• Date and time expressions
• Various other types as appropriate
Why is NE important?
• NE provides a foundation from which to build
more complex IE systems
• Relations between NEs can provide tracking,
ontological information and scenario building
• Tracking (co-reference): “Dr Smith”, “John Smith”, “John”, “he”
• Ontologies: “Athens, Georgia” vs “Athens, Greece”
Two kinds of approaches
Knowledge Engineering
• rule based
• developed by experienced
language engineers
• make use of human
intuition
• require only small amount
of training data
• development can be very
time consuming
• some changes may be
hard to accommodate
Learning Systems
• use statistics or other
machine learning
• developers do not need
LE expertise
• require large amounts of
annotated training data
• some changes may
require re-annotation of
the entire training corpus
Typical NE pipeline
• Pre-processing (tokenisation, sentence
splitting, morphological analysis, POS tagging)
• Entity finding (gazetteer lookup, NE grammars)
• Coreference (alias finding, orthographic
coreference etc.)
• Export to database / XML
An Example
Ryanair announced yesterday that it will make Shannon its
next European base, expanding its route network to 14 in
an investment worth around €180m. The airline says it will
deliver 1.3 million passengers in the first year of the
agreement, rising to two million by the fifth year.
• Entities: Ryanair, Shannon
• Mentions: it=Ryanair, The airline=Ryanair, it=the airline
• Descriptions: European base
• Relations: Shannon base_of Ryanair
• Events: investment(€180m)
System development cycle
1. Collect corpus of texts
2. Manually annotate gold standard
3. Develop system
4. Evaluate performance against gold standard
5. Return to step 3, until desired performance is reached
Performance Evaluation
2 main requirements:
• Evaluation metric: mathematically defines how
to measure the system’s performance against
human-annotated gold standard
• Scoring program: implements the metric and
provides performance measures
– For each document and over the entire
corpus
– For each type of NE
Evaluation Metrics
• Most common are Precision and Recall
• Precision = correct answers / answers produced
• Recall = correct answers / total possible correct answers
• Trade-off between precision and recall
• F1 (balanced) Measure = 2PR / (P + R)
• Some tasks sometimes use other metrics, e.g. cost-based (good for application-specific adjustment)
• Ontology-based IE requires measures sensitive to the ontology
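As a quick sanity check of the formulas above, a small self-contained sketch (plain Java, no GATE dependency; the counts are invented):

// Precision, Recall and balanced F1 as defined above
public class IEMetrics {
    static double precision(int correct, int answersProduced) {
        return (double) correct / answersProduced;
    }
    static double recall(int correct, int totalPossibleCorrect) {
        return (double) correct / totalPossibleCorrect;
    }
    static double f1(double p, double r) {
        return 2 * p * r / (p + r);   // harmonic mean of P and R
    }
    public static void main(String[] args) {
        double p = precision(8, 10);  // 8 of 10 produced answers are correct
        double r = recall(8, 16);     // 8 of 16 possible answers were found
        System.out.printf("P=%.2f R=%.2f F1=%.2f%n", p, r, f1(p, r));
        // prints P=0.80 R=0.50 F1=0.62
    }
}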
GATE AnnotationDiff Tool
[Screenshot]
Corpus-level Regression Testing
• Need to track system’s performance over time
• When a change is made we want to know
implications over whole corpus
• Why: because an improvement in one case can
lead to problems in others
• GATE offers corpus benchmark tool, which
can compare different versions of the same
system against a gold standard
• This operates on a whole corpus rather than a
single document
Corpus Benchmark Tool
[Screenshot]
Structure of the Tutorial
1. Motivation, background
2. GATE overview
3. Information Extraction
4. GATE’s HLT components
5. IE and the Semantic Web
6. Ontology learning with Text2Onto
7. Focused ontology learning
8. Massive Semantic Annotation
GATE’s Rule-based System - ANNIE
• ANNIE – A Nearly-New IE system
• A version distributed as part of GATE
• GATE automatically deals with document
formats, saving of results, evaluation, and
visualisation of results for debugging
• GATE has a finite-state pattern-action rule
language - JAPE, used by ANNIE
• A reusable and easily extendable set of
components
What is ANNIE?
ANNIE is a vanilla information extraction
system comprising a set of core PRs:
– Tokeniser
– Gazetteers
– Sentence Splitter
– POS tagger
– Semantic tagger (JAPE transducer)
– Orthomatcher (orthographic coreference)
Core ANNIE Components
[Screenshot]
Re-using ANNIE
• Typically a new application will use most of the core
components from ANNIE
• The tokeniser, sentence splitter and orthomatcher are
basically language, domain and application-independent
• The POS tagger is language dependent but domain and
application-independent
• The gazetteer lists and JAPE grammars may act as a
starting point but will almost certainly need to be
modified
• You may also require additional PRs (either existing or
new ones)
DEMO of ANNIE and GATE GUI
• Loading ANNIE
• Creating a corpus
• Loading documents
• Running ANNIE on corpus
• Demo
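The same steps can also be scripted against GATE’s embedded API; a minimal sketch, assuming a GATE 3.x installation whose plugins directory contains the saved ANNIE application (the document URL is just an example):

// Load and run ANNIE from Java (GATE 3.x era API; paths and URL illustrative)
import gate.Corpus;
import gate.Factory;
import gate.Gate;
import gate.creole.SerialAnalyserController;
import gate.util.persistence.PersistenceManager;
import java.io.File;
import java.net.URL;

public class RunAnnie {
    public static void main(String[] args) throws Exception {
        Gate.init();  // initialise the library (GATE home, CREOLE register)
        // load the ANNIE application saved with the distribution
        SerialAnalyserController annie = (SerialAnalyserController)
            PersistenceManager.loadObjectFromFile(new File(
                Gate.getGateHome(), "plugins/ANNIE/ANNIE_with_defaults.gapp"));
        // build a one-document corpus and run the pipeline over it
        Corpus corpus = Factory.newCorpus("demo corpus");
        corpus.add(Factory.newDocument(new URL("http://gate.ac.uk/")));
        annie.setCorpus(corpus);
        annie.execute();  // tokeniser, gazetteer, JAPE transducer, orthomatcher, ...
    }
}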
Gazetteers
• Gazetteers are plain text files containing lists of names (e.g. rivers, cities, people, …)
• Information used by JAPE rules
• Each gazetteer set has an index file listing all the
lists, plus features of each list (majorType,
minorType and language)
• Lists can be modified either internally using
Gaze, or externally in your favourite editor
• Gazetteers can also be mapped to ontologies
• Generates Lookup results of the given kind
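For illustration, a fragment of what an index file and one of its lists might look like (file names and entries invented; each index line maps a list to its majorType and minorType):

lists.def:
  city.lst:location:city
  river.lst:location:river
  person_female.lst:person:female

city.lst:
  London
  Sheffield
  Amsterdam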
JAPE grammars
• JAPE is a pattern-matching language
• The LHS of each rule contains patterns to
be matched
• The RHS contains details of annotations
(and optionally features) to be created
• The patterns in the corpus are identified
using ANNIC
Input specifications
The head of each grammar phase needs to
contain certain information
– Phase name
– Inputs
– Matching style
e.g.
Phase: location
Input: Token Lookup Number
Control: appelt
NE Rule in JAPE
Rule: Company1
Priority: 25
(
  ({Token.orthography == upperInitial})+   // from tokeniser
  {Lookup.kind == companyDesignator}       // from gazetteer lists
):match
-->
:match.NamedEntity = { kind = company, rule = "Company1" }

=> will match “Digital Pebble Ltd”
LHS of the rule
• LHS is expressed in terms of existing
annotations, and optionally features and their
values
• Any annotation to be used must be included in
the input header
• Any annotation not included in the input header
will be ignored (e.g. whitespace)
• Each annotation is enclosed in curly braces
• Each pattern to be matched is enclosed in round
brackets and has a label attached
Macros
• Macros look like the LHS of a rule but have no label
Macro: NUMBER
(({Digit})+)
• They are used in rules by enclosing the macro name in
round brackets
( (NUMBER)+):match
• Conventional to name macros in uppercase letters
• Macros hold across an entire set of grammar phases
Contextual information
• Contextual information can be specified in the
same way, but has no label
• Contextual information will be consumed by the
rule
({Annotation1})
({Annotation2}):match
({Annotation3})
RHS of the rule
• LHS and RHS are separated by -->
• Label matches that on the LHS
• Annotation to be created follows the label
(Annotation1):match
--> :match.NE = {feature1 = value1, feature2 = value2}
Example Rule for Dates
Macro: ONE_DIGIT
({Token.kind == number, Token.length == "1"})
Macro: TWO_DIGIT
({Token.kind == number, Token.length == "2"})

Rule: TimeDigital1
// 20:14:25
(
  (ONE_DIGIT | TWO_DIGIT)
  {Token.string == ":"} TWO_DIGIT
  ({Token.string == ":"} TWO_DIGIT)?
  (TIME_AMPM)?
  (TIME_DIFF)?
  (TIME_ZONE)?
):time
-->
:time.TempTime = {kind = "positive", rule = "TimeDigital1"}
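Once such a rule has run, the resulting annotations can be inspected from the embedded API; a sketch (GATE 3.x era API, pre-generics iteration; doc is a processed gate.Document and the class name is invented):

// List the TempTime annotations created by the rule above
import gate.Annotation;
import gate.AnnotationSet;
import gate.Document;
import java.util.Iterator;

public class PrintTimes {
    public static void print(Document doc) throws Exception {
        AnnotationSet times = doc.getAnnotations().get("TempTime");
        for (Iterator it = times.iterator(); it.hasNext(); ) {
            Annotation a = (Annotation) it.next();
            String text = doc.getContent().getContent(
                a.getStartNode().getOffset(),
                a.getEndNode().getOffset()).toString();
            System.out.println(text + " -> " + a.getFeatures());
        }
    }
}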
Identifying patterns in corpora
• ANNIC – ANNotations In Context
• Provides a keyword-in-context-like interface for
identifying annotation patterns in corpora
• Uses JAPE LHS syntax, except that + and *
need to be quantified
• e.g. {Person}{Token}*3{Organisation} – find all
Person and Organisation annotations within up
to 3 tokens of each other
• To use, pre-process the corpus with ANNIE or
your own components, then query it via the GUI
ANNIC Demo
• Formulating queries
• Finding matches in the corpus
• Analysing the contexts
• Refining the queries
• Demo
Using phases
• Grammars usually consist of several phases, run
sequentially
• A definition phase (conventionally called
main.jape) lists the phases to be used, in order
• Only the definition phase needs to be loaded
• Temporary annotations may be created in early
phases and used as input for later phases
• Annotations from earlier phases may need to be
combined or modified
Matching algorithms and Rule
Priority
• Rules compete within a single phase!
• 3 styles of matching:
– Brill (fire every rule that applies)
– First (shortest rule fires)
– Appelt (use of priorities)
• Appelt priority is applied in the following order
– Starting point of a pattern
– Longest pattern
– Explicit priority (default = -1)
Named Entities in GATE
Using co-reference
• Orthographic co-reference module matches proper
names in a document
• Improves results by assigning entity type to previously
unclassified names, based on relations with classified
entities
• May not reclassify already classified entities
• Classification of unknown entities very useful for
surnames which match a full name, or abbreviations,
e.g. [Bonfield] will match [Sir Peter Bonfield];
[International Business Machines Ltd.] will match
[IBM]
Named Entity Coreference
[Screenshot]
GATE 4.0
• Before end 06
• Faster and leaner!
• Nicer GUI
• ANNIC included
• Improved Machine Learning API (based on YALE)
• and more…
Structure of the Tutorial
1. Motivation, background
2. GATE overview
3. Information Extraction
4. GATE’s HLT components
5. IE and the Semantic Web
6. Ontology learning with Text2Onto
7. Focused ontology learning
8. Massive Semantic Annotation
Information Extraction for the Semantic
Web
• Traditional IE is based on a flat structure, e.g.
recognising Person, Location, Organisation,
Date, Time etc.
• For the Semantic Web, we need information in a
hierarchical structure
• Idea is that we attach semantic metadata to the
documents, pointing to concepts in an ontology
• Information can be exported as an ontology
annotated with instances, or as text annotated
with links to the ontology
Richer NE Tagging
• Attachment of
instances in the text to
concepts in the
domain ontology
• Disambiguation of
instances, e.g.
Cambridge, MA vs
Cambridge, UK
Magpie: an example
• Developed by the Open University
• Plugin for standard web browser
• Automatically associates an ontology-based
semantic layer to web resources, allowing
relevant services to be linked
• Provides means for a structured and informed
exploration of the web resources
• e.g. looking at a list of publications, we can find
information about an author such as projects
they work on, other people they work with, etc.
MAGPIE in action
[Screenshots]
GATE and the Semantic Web
• Supports ontologies as part of IE applications: Ontology-Based IE (OBIE)
• Supports semantic annotation and ontology
population
• Can combine learning and rule-based methods
• Allows combination of IE and IR
• Enables use of large-scale linguistic resources
for IE, such as WordNet
[Screenshots: Ontology Management in GATE; Linking the Text to the Ontology; Exported Database]
Evaluation for OBIE
• Traditional IE is evaluated in terms of Precision,
Recall and F-measure.
• But these are not sufficient for ontology-based
IE, because the distinction between right and
wrong is less obvious
• Recognising a Person as a Location is clearly
wrong, but recognising a Research Assistant as
a Lecturer is not so wrong
• Similarity metrics need to be integrated so that
items closer together in the hierarchy are given a
higher score, if wrong
Augmented Precision and Recall
• Development of a new BDM (Balanced
Distance Metric) which compares key
and response concepts wrt a given
ontology
• In the case of ontological mismatch,
provides an indication of how serious the
error is, and weights it accordingly
• BDM provides a score between 0 and 1 for
each key/response match instead of a
binary measure
Augmented Precision and Recall
BDM is integrated with traditional Precision and Recall in the following way to produce a score at the corpus level:

AR = BDM / (BDM + Missing)
AP = BDM / (BDM + Spurious)
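A hypothetical worked example (numbers invented): suppose a corpus yields 6 key/response matches whose BDM scores sum to 5.2, plus 2 spurious and 4 missing annotations. Then AP = 5.2 / (5.2 + 2) ≈ 0.72 and AR = 5.2 / (5.2 + 4) ≈ 0.57, whereas strict Precision and Recall would count every ontological mismatch among the 6 matches as simply wrong.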
Examples of misclassification

Entity          Response   Key              BDM
Sochi           Location   City             0.724
FBI             Org        GovOrg           0.959
Al-Jazeera      Org        TVCompany        0.783
Islamic Jihad   Company    ReligiousOrg     0.816
Brazil          Object     Country          0.587
Senate          Company    PoliticalEntity  0.826
Structure of the Tutorial
1. Motivation, background
2. GATE overview
3. Information Extraction
4. GATE’s HLT components
5. IE and the Semantic Web
6. Ontology learning with Text2Onto
7. Focused ontology learning
8. Massive Semantic Annotation
Ontology Learning with
Text2Onto
http://ontoware.org/projects/text2onto/
Johanna Völker
voelker@aifb.uni-karlsruhe.de
Institute AIFB
University of Karlsruhe
Agenda
• Ontology Learning
  – Tasks
  – Problems
• Text2Onto
  – Overview
  – Architecture
  – Linguistic preprocessing
  – Ontology learning approaches
  – Summary
Ontology Learning
• Extraction of (domain) ontologies from
natural language text
– Machine learning
– Natural language processing
• Tools: OntoLearn, OntoLT, ASIUM, Mo’K
Workbench, JATKE, TextToOnto, …
Ontology Learning – Tasks
Concept extraction: car, vehicle, person
Concept classification: subclass-of( car, vehicle )
Instance extraction: Peter, his-car
Instance classification: instance-of( Peter, person )
Relation extraction: drive( person, car )
Relation instance extraction: drive( Peter, his-car )
[Screenshots: Text2Onto output examples, e.g. instance-of( Hewlett Packard, organization ); subclass-of( research, activity ); subclass-of( resource, knowledge ); reach( information, people ); address_in( issue, article )]
Ontology Learning – Problems
Text Understanding
• Words are ambiguous
  – ‘A bank is a financial institution. A bank is a piece of furniture.’
  => subclass-of( bank, financial institution ) ?
• Natural Language is informal
  – ‘The sea is water.’
  => subclass-of( sea, water ) ?
• Sentences may be underspecified
  – ‘Mary started the book.’
  => read( Mary, book_1 ) ?
• Anaphora
  – ‘Peter lives in Munich. This is a city in Bavaria.’
  => instance-of( Munich, city ) ?
• Metaphors, …
Ontology Learning – Problems
Knowledge Modeling
• What is an instance / concept?
  – ‘The koala is an animal living in Australia.’
  => instance-of( koala, animal ) or subclass-of( koala, animal ) ?
• How to deal with opinions and quoted speech?
  – ‘Tom thinks that Peter loves Mary.’
  => love( Peter, Mary ) ?
• Knowledge is changing
  – instance-of( Pluto, planet ) ?
Conclusion:
• Ontology learning is difficult.
• What we can learn is fuzzy and uncertain.
• Ontology maintenance is important.
Text2Onto
• Support for (semi-)automatic ontology extraction from
natural language text
• Support for ontology maintenance and data-driven
ontology evolution by incremental ontology learning
• Model of Possible Ontologies (POM)
  => Confidence / relevance values attached to all concepts, instances and relations
• Enhanced user interaction
• Maintenance of multiple modeling alternatives in parallel
• Independence from any particular ontology language
[Screenshot: POM entries with confidence values, e.g. subclass-of( user, human ) / confidence 1.0; subclass-of( document, communication ) / confidence 0.75]
Text2Onto – Evidence, Reference
and Change Management
• Explicit modeling of evidences
– Algorithms provide different types of evidences
– Explanation component
• References for annotation and change detection
• Explicit modeling of changes
– Corpus, evidence, reference and ontology changes
– Future work: ontology change strategies
Text2Onto – Workflow
Workflow composition
• Complex algorithms
– Different types of
algorithms for each
ontology learning task
– Flexible combination of
results
• Combination strategies
– minimum, maximum,
average, linear,
classifier, …
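The combination strategies named above are simple to state precisely; a minimal sketch (plain Java; class and method names invented) of how several algorithms’ confidence values for the same POM object might be combined:

// Combining per-algorithm confidence values for one learned object
public class Combiner {
    static double min(double[] c) {
        double m = 1.0;
        for (double v : c) m = Math.min(m, v);
        return m;
    }
    static double max(double[] c) {
        double m = 0.0;
        for (double v : c) m = Math.max(m, v);
        return m;
    }
    static double average(double[] c) {
        double s = 0.0;
        for (double v : c) s += v;
        return s / c.length;
    }
    // linear: weighted sum of the confidences, weights summing to 1
    static double linear(double[] c, double[] w) {
        double s = 0.0;
        for (int i = 0; i < c.length; i++) s += w[i] * c[i];
        return s;
    }
    public static void main(String[] args) {
        double[] conf = {0.75, 0.9, 0.6};  // three algorithms, one subclass-of relation
        System.out.println(average(conf));                              // 0.75
        System.out.println(linear(conf, new double[]{0.5, 0.3, 0.2})); // 0.765
    }
}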
[Architecture diagram (Text2Onto): an Algorithm Controller built on GATE processes the Corpus and maintains an Evidence Store, a Reference Store and the POM; a Workflow Manager and an API sit above these, with POM visualization on top; OWL, RDFS and F-Logic writers export the resulting Ontology.]
Linguistic Preprocessing
GATE
• Standard ANNIE components for
  – Tokenization
  – Sentence splitting
  – POS tagging
  – Stemming / lemmatizing
• Self-defined JAPE patterns and processing resources for
  – Stop word detection
  – Shallow parsing
• GATE applications for English, German and Spanish
Ontology Learning Approaches
Concept Classification
• Heuristics
  – ‘image processing software’
  => subclass-of( image processing software, software )
• Patterns
  – ‘animals such as dogs’
  – ‘dogs and other animals’
  – ‘a dog is an animal’
  => subclass-of( dog, animal )
JAPE Patterns for Ontology Learning

Rule: Hearst_1
(
  (NounPhrase):superconcept
  {SpaceToken.kind == space}
  {Token.string == "such"}
  {SpaceToken.kind == space}
  {Token.string == "as"}
  {SpaceToken.kind == space}
  (NounPhrasesAlternatives):subconcept
):hearst1
-->
:hearst1.SubclassOfRelation = { rule = "Hearst1" },
:subconcept.Domain = { rule = "Hearst1" },
:superconcept.Range = { rule = "Hearst1" }
Ontology Learning Approaches
Instance Classification
• Context similarity
  ‘Columbus is the capital of the state of Ohio. Columbus has a population of about 700,000 inhabitants.’
• Columbus ( capital (1), state (1), Ohio (1), population (1), inhabitant (1) )
• city ( country (2), state (1), inhabitant (2), mayor (1), attraction (1) )
• explorer ( ship (1), sailor (2), discovery (1) )
=> instance-of( Columbus, city )
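One simple way to operationalise context similarity is cosine similarity over the term-frequency vectors above; a self-contained sketch (plain Java; the vectors are those from the Columbus example, the class name is invented):

// Classify an instance by comparing its context vector to concept context vectors
import java.util.HashMap;
import java.util.Map;

public class ContextSimilarity {
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer bv = b.get(e.getKey());
            if (bv != null) dot += e.getValue() * bv;   // shared context terms
            na += e.getValue() * e.getValue();
        }
        for (int v : b.values()) nb += v * v;
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
    public static void main(String[] args) {
        Map<String, Integer> columbus = new HashMap<String, Integer>();
        columbus.put("capital", 1); columbus.put("state", 1); columbus.put("Ohio", 1);
        columbus.put("population", 1); columbus.put("inhabitant", 1);
        Map<String, Integer> city = new HashMap<String, Integer>();
        city.put("country", 2); city.put("state", 1); city.put("inhabitant", 2);
        city.put("mayor", 1); city.put("attraction", 1);
        Map<String, Integer> explorer = new HashMap<String, Integer>();
        explorer.put("ship", 1); explorer.put("sailor", 2); explorer.put("discovery", 1);
        System.out.println("city:     " + cosine(columbus, city));      // ~0.40, highest
        System.out.println("explorer: " + cosine(columbus, explorer)); // 0.0 (no overlap)
        // => instance-of( Columbus, city )
    }
}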
Ontology Learning Approaches
Relation Extraction
• Subcategorization frames
  – ‘Tina drives a Ford.’
    • instance-of( Tina, person )
    • instance-of( Ford, vehicle )
  – ‘Her father drives a bus.’
    • subclass-of( father, person )
    • subclass-of( bus, vehicle )
subcat: drive( subj: person, obj: vehicle )
=> drive( person, vehicle )
[Screenshot: output for a Spanish corpus, e.g. incluyen( ontologías, definiciones ) / confidence 1.0, i.e. include( ontologies, definitions )]
Other Ontology Learning Approaches
• WordNet
  – Hyponym( ‘bank’, ‘institution’ )
  => subclass-of( bank, institution ) ?
• Google
  – ‘cities such as London’, ‘persons such as London’ …
  – ‘such as London’
  => instance-of( London, city ) ?
• Instance clustering
  – Hierarchical clustering of context vectors
• Formal Concept Analysis (FCA)
  – breathe( animal )
  – breathe( human ), speak( human )
  => subclass-of( human, animal ) ?
Summary
• Ontology Learning is difficult, because
– Language is fuzzy
– Knowledge is changing
• Text2Onto targets these problems
– Model of Possible Ontologies
– Heterogeneous sources of evidence
– Incremental ontology learning
Thanks!
http://www.aifb.de/WBS/jvo/ontology-learning
http://www.ontoware.org/projects/text2onto
Structure of the Tutorial
1. Motivation, background
2. GATE overview
3. Information Extraction
4. GATE’s HLT components
5. IE and the Semantic Web
6. Ontology learning with Text2Onto
7. Focused ontology learning
8. Massive Semantic Annotation
Focused Ontology Learning
with GATE
A Practical Report on Learning Web Service Ontologies
Marta Sabou
Goal of the Talk
The goal of this talk is:
•To describe a Semantic Web relevant task: Focused
Ontology Learning.
•To exemplify this task in the context of Web Services.
•To show how focused ontology learning can be
implemented in GATE.
The focus of the talk is NOT ontology learning but
the elements of GATE that helped to perform this task.
Outline
1) Generic Problem:
* Focused Ontology Learning
(definition and characteristics)
2) Specific Problem:
* Learning Web Service Ontologies
(Context, Problem Scenario)
3) GATE support for:
* writing extraction patterns
* evaluating term extraction performance
Ontology Learning in Restricted Domains
Previous Talk’s conclusion:
Generic Ontology Learning is important but difficult because:
•Language is fuzzy
•Knowledge is changing
However...
The Semantic Web is increasingly used in specialized domains, where:
• Language exhibits (strong) domain characteristics
• e.g., mathematics, medicine
• The Knowledge to be extracted is defined by the task for
which the ontology will be used
• e.g., searching patient records, accessing drug related articles
Focused Ontology Learning:
• is Ontology Learning in a restricted domain, for a well-defined task
• therefore, simpler than Ontology Learning in general
• more and more frequent with the growth of the Semantic Web
Focused Ontology Learning
Focused Ontology Learning characteristics:
1. (Small) corpus with special (domain/context) characteristics;
2. Well defined ontological knowledge to be extracted;
3. An easy to detect correspondence between text characteristics
and ontology elements;
4. Usually an easy solution (adaptation of OL techniques);
5. Implemented/adapted by a non-NLP expert.
What is needed to support domain experts?
• libraries of basic NLP tools/data structures;
• tools to easily adapt/combine these NLP elements;
• intuitive way to create and debug own applications;
• usability plays an important role;
• generic methodologies of ontology learning rather than hard-coded
algorithms.
Outline
1) Generic Problem:
* Focused Ontology Learning
(definition and characteristics)
2) Specific Problem:
* Learning Web Service Ontologies
(Context, Problem Scenario)
3) GATE support for:
* writing extraction patterns (given)
* evaluating term extraction performance (given)
Context - Semantic Web Services
* Semantic WS - semantically annotated WS
* to automate discovery, composition, execution

<do:HotelBooking rdf:ID="WS1">
  <owls:hasInput rdf:resource="do:Hotel"/>
  <owls:hasInput rdf:resource="do:ReservationDates"/>
  <owls:hasOutput rdf:resource="do:HotelReservation"/>
</do:HotelBooking>

=> broad domain coverage
But… an increasing number of web services
A real life story…
• Semantic Grid middleware to support in silico experiments in biology
• Bioinformatics programs are exposed as semantic web services
• [Diagram: a domain expert needed 4 months to describe 150 of 600 services, producing 550 concepts, of which only 125 (23%) were used for SWS tasks]
Our GOAL: support the expert to learn:
1) From more services
2) In less time
3) A “Better” ontology (for SWS descriptions)
FOL Characteristics - 1
1. (Small) corpus with special (domain/context) characteristics
* Data Source:
* short descriptions of service functionalities
* characteristics:
* small corpora (100/200 documents)
* employ specific style (sublanguage)
•Replace or delete sequence sections.
•Find antigenic sites in proteins.
•Cai codon usage statistic.
FOL Characteristics - 2
2. Well defined ontology structure to be extracted
•Web Service Ontologies contain:
•A Data Structure hierarchy
•A Functionality hierarchy
FOL Characteristics - 3
3. An easy to detect correspondence between text characteristics and ontology elements, e.g. in “Replace or delete sequence sections.”, “sequence sections” matches an NP pattern and the verbs with their object match a VB_NP pattern.
FOL Characteristics - 4
4. Usually an easy solution (adaptation of OL techniques). E.g. POS tagging:

Generic solution: Linguistic Analysis
GATE implementation: English Tokeniser, Sentence Splitter, POS Tagger
  |Replace| |or| |delete| |sequence| ….
  Replace or delete sequence sections.
  (VB)      (Prep) (VB)  (NN)     (NNS)

Generic solution: Extraction Patterns
GATE implementation: JAPE Rules
  r1 => (NP)
  r2 => (Funct)

Generic solution: Ontology Building, Ontology Pruning
GATE implementation: OntologyBuilding&Pruning
FOL Characteristics - 4
4. Usually an easy solution (adaptation of OL techniques). E.g. dependency parsing:

Linguistic Analysis: Minipar
Minipar output for “Replace or delete sequence sections.”:
  1 : replace  : replace  : V : * : i :
  2 : or       : or       : U : 1 : lex-mod :
  3 : delete   : delete   : V : 1 : lex-dep :
  4 : sequence : sequence : N : 5 : nn :
  5 : sections : section  : N : 1 : obj :

Extraction Patterns: JAPE Rules
  r1 => (NP)
  r2 => (Funct)

Ontology Building, Ontology Pruning: OntologyBuilding&Pruning
Outline
1) Generic Problem:
* Focused Ontology Learning
(definition and characteristics)
2) Specific Problem:
* Learning Web Service Ontologies
(Context, Problem Scenario)
3) GATE support for:
* writing extraction patterns
* evaluating term extraction performance
GATE Implementation
* Easy to follow extraction (step by step)
* Easy to adapt for domain engineers
Pattern based rules – Example
(
  (DET)*:det
  ((ADJ)|(NOUN))*:mods
  (NOUN):hn
):np
-->
:np.NP = {}

A noun phrase consists of:
• zero or more determiners;
• zero or more modifiers, which can be adjectives or nouns;
• one noun, which is the head noun.

DET, ADJ and NOUN are macros, which make rules more readable.

Macro: ADJ
( {Token.category == JJ, Token.kind == word} |
  {Token.category == JJR, Token.kind == word} |
  {Token.category == JJS, Token.kind == word} )

The ADJ macro identifies any Token tagged as JJ, JJR or JJS.

Example output:
Extract NP(data) from NP(aaindex).
Displays NP(a non-overlapping wordmatch dotplot) of two NP(sequences).
Outline
1) Generic Problem:
* Focused Ontology Learning
(definition and characteristics)
2) Specific Problem:
* Learning Web Service Ontologies
(Context, Problem Scenario)
3) GATE support for:
* writing extraction patterns (given)
* evaluating term extraction performance (given)
Performance Evaluation
Linguistic Analysis
Extraction Patterns
Ontology Building
Ontology Pruning
A set of important terms are extracted.
Terms are indicated by annotations of type:
NP, Funct.
* The correctness of these terms has a direct influence on the
correctness of the OB step => evaluating them is important.
•The Corpus Benchmark Tool of GATE compares annotation
types in 2 corpora, usually:
• the manually annotated Gold Standard corpus and
• the automatically annotated corpus.
• It identifies correct, missed and spurious annotations of a certain type and computes Precision and Recall for each document and for the whole corpus.
Performance Evaluation
Example 1:
Scan a sequence or database with a matrix or profile.

Gold Standard Annotations:    Automatic Annotations:
Funct(scan_sequence)          Funct(scan_sequence)
Funct(scan_database)          Funct(scan_database)
                              Funct(scan_profile)

105_profit.xml; Keys: 2, Responses: 3

Annotation Type   Precision   Recall
Funct             0.666666    1.0

Correct = correctly identified annotations (true positives)
Spurious = incorrect annotations (false positives)
Performance Evaluation
Example 2:
Preprocess the prints database for use with the program pscan.

Gold Standard Annotations:              Automatic Annotations:
Funct(preprocess_prints database)       (none)

104_printsextract.xml; Keys: 1, Responses: 0

Annotation Type   Precision   Recall
Funct             NaN         0.0

Missed = unidentified annotations (false negatives)
Performance Evaluation
Statistics

Annotation Type   Correct   Partially Correct   Missing   Spurious   Precision   Recall    F-Measure
Funct             70        0                   78        3          0.958904    0.47297   0.63348416

[Diagram: extracted terms consist of correct plus spurious annotations; gold standard terms consist of correct plus missed annotations]
Precision = correct / all extracted terms
Recall = correct / all gold standard terms
Performance Evaluation
PROS:
•It is very important when developing term extraction.
•It allows evaluating:
•1) the performance of the linguistic analyses
•2) the coverage of the patterns
•Allows comparing the performance of different tools:
•E.g. two different POS taggers
•Easy to use (both from GUI and command line)
Possible improvement:
* The current textual output does not give direct access to all spurious or all missing annotations (these are important when fine-tuning the extraction).
* We try to improve this usability issue through visualisation.
Summary
• Focused Ontology Learning = OL in a restricted domain.
• Example FOL = OL for Web Services.
• GATE supports the development of FOL in many ways:
• allows easy reuse and combination of basic NLP modules;
• offers software libraries for fundamental NLP data structures
(Documents, Corpora, Annotations);
• incorporates evaluation mechanisms;
• easy to debug and use for non-NLP experts.
Structure of the Tutorial
1. Motivation, background
2. GATE overview
3. Information Extraction
4. GATE’s HLT components
5. IE and the Semantic Web
6. Ontology learning with Text2Onto
7. Focused ontology learning
8. Massive Semantic Annotation
KIM Platform
An Overview
Atanas Kiryakov
Ontotext Lab, Sirma AI
naso@sirma.bg
http://www.ontotext.com/kim/
Semantic Annotation: An example
XYZ was established on 03 November 1978 in London. It opened a plant in Bulgaria in …

[Ontology & KB diagram: in the KB, XYZ has type Company, HQ London and establOn “03/11/1978”; London has type City and is partOf the UK; Bulgaria has type Country; Company, Location, City and Country are classes in the ontology.]
Semantic Annotation of NEs
A Semantic Annotation of the named entities (NEs) in a text includes:
- a recognition of the type of the entities in the text, out of a rich taxonomy of classes (not a flat set of 10 types);
- an identification of the entities, which is also a reference to their semantic description.

The traditional (IE-style) NE recognition approach results in:
<Person>Lama Ole Nydahl</Person>

The Semantic Annotation of NEs results in:
<ReligiousPerson ID="http://..kim/Person111111">
  Lama Ole Nydahl
</ReligiousPerson>
Platforms for Large-Scale Semantic
Annotation
• Allow use of corpus-wide statistics to improve metadata quality, e.g., disambiguation
• Automated alias discovery
• Generate SemWeb output (RDF, OWL)
• Stand-off storage and indexing of metadata
• Use of large instance bases for disambiguation
• Ontology servers for reasoning and access
• Architecture elements:
  – Crawler, onto storage, doc indexing, query, annotators
  – Apps: sem browsers, authoring tools, etc.
The KIM Platform
• A platform offering services and infrastructure for:
– (semi-) automatic semantic annotation and
– ontology population
– semantic indexing and retrieval of content
– query and navigation over the formal knowledge
• Based on an Information Extraction technology
• Aim: to arm Semantic Web applications
- by providing a metadata generation technology
- in a standard, consistent, and scalable
framework
KIM Architecture
[Architecture diagram: the KIM Server exposes Semantic Annotation, Index, Query, Document Persistence and Semantic Repository APIs over RMI; an Annotation Server, News Collector and Entity Ranking component build on the server; clients include a browser plug-in for any web browser, the KIM Web UI, and custom applications with custom IE and custom back-ends.]
PROTON Ontology
- a light-weight upper-level ontology;
- 250 NE classes;
- 100 relations and attributes;
- 200,000 entity descriptions;
- covers mostly NE classes, and ignores general concepts;
- includes classes representing lexical resources.
proton.semanticweb.org
KIM Scaling on Data
• The Semantic Repository is based on Sesame.
• Our practical tests demonstrate a good performance on
top of:
– 1.2M entity descriptions:
– about 15M explicit statements;
– above 30M statements after forward chaining.
• Document and annotation storage and indexing with Lucene:
  – 0.5M docs, processed on a $1000-worth machine;
  – retrieval in milliseconds.
Simple Usage: Highlight, Hyperlink, and …
Simple Usage: … Explore and Navigate
[Screenshots]
How KIM Searches Better
KIM can match a query:
“Documents about a telecom company in Europe, John Smith, and a date in the first half of 2002.”
With a document containing:
“At its meeting on the 10th of May, the board of Vodafone appointed John G. Smith as CTO”
Classical IR could not match:
- Vodafone with a “telecom in Europe”, because:
  - Vodafone is a mobile operator, which is a sort of telecom;
  - Vodafone is in the UK, which is a part of Europe;
- the 10th of May with a “date in first half of 2002”;
- “John G. Smith” with “John Smith”.
[Screenshots: Entity Pattern Search; Pattern Search: Entity Results; Entity Pattern Search: KIM Explorer; Pattern Search, Referring Documents; Document Details]
Summary
KIM is a platform for:
- semantic annotation and ontology population,
- semantic indexing and retrieval,
- providing an API for remote access and integration,
- based on Information Extraction (IE) using GATE.
KIM is:
- Robust
- Scalable
- General-purpose, off the shelf platform!
THANK YOU!
(for not snoring)
The slides:
http://www.gate.ac.uk/sale/talks/ekaw2006/ekaw2006tutorial.ppt

[This work has been supported by SEKT (http://sekt.semanticweb.org/) and KnowledgeWeb (http://knowledgeweb.semanticweb.org/)]