University of Sheffield, NLP

Case study:
GATE in the NeOn project
Diana Maynard
University of Sheffield
Aims of this talk
• Demonstrate how to use GATE to automate Semantic Web-specific tasks such as semantic annotation and
ontology learning from texts
• SARDINE: pattern-based relation extraction in
the fisheries domain
• Adding new concepts and instances to the
ontology
• Finding relations between existing concepts in
the ontology
• SPRAT: generic version of SARDINE
Recap: IE for the Semantic Web
• Traditional IE is based on a flat structure, e.g.
recognising Person, Location, Organisation,
Date, Time etc.
• For the Semantic Web, we need information in a
hierarchical structure
• Idea is that we attach annotations to the
documents, pointing to concepts in an ontology
• Information can be exported as an ontology
annotated with instances
Linking the Text to the Ontology
The NeOn project
• NeOn (Networking Ontologies) is a 4-year 14.7
million Euro EU project involving 14 European
partners.
• Focus on using ontologies for large-scale
semantic applications in distributed organizations
• Handles multiple networked ontologies that
exist in a particular context, are created
collaboratively, and might be highly dynamic
and constantly evolving.
ODd SOFAS
• The Food and Agriculture Organization of the UN (FAO) has
odd sofas...
Wall climbing sofa
Sofa made from bicycle seats
FAO Case Study
• Actually, it’s nothing to do with sofas, or any kind of
seating.
• They do, however, have an Ontology-driven stock
over-fishing alert system
• FAO focuses on the agricultural sector and on information
management for hunger prevention
• Case study aims at management of alerts to avoid
over-fishing in already stretched global waters
• Role of GATE is to analyse textual resources to find
new information such as new fish names, and
relations between ontology elements, e.g. “Atlantic
cod are fished in the Gulf of Maine”
SARDINE
Species Annotation, Recognition and Indexing of Named Entities
• SARDINE identifies mentions of fish species in text
• It identifies
– existing fish names listed in the ontology and their
morphological variants
– potential new fish names not listed in the ontology
– potential relations between fish names
• For the new fish, it attempts to classify them in the ontology,
based on linguistic information such as synonyms and
hyponyms of existing fish
• It may also generate properties for existing fish in the ontology
Using patterns to find new fish
• Synonyms:
– mummichogs (fundulus heteroclitus)
• Names appearing in lists:
– “plankton, herring and clams....”
– “clams, herring and other types of fish”
• More specific fish names:
– Japanese flounder
– Red salmon
– Suberites sponges
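The three pattern types above can be approximated with plain regular expressions. This is only a rough Python sketch to show the idea: SARDINE itself uses JAPE grammars over GATE annotations, and the KNOWN_FISH set and all function names here are hypothetical stand-ins for an ontology lookup.

```python
import re

# Hypothetical seed list standing in for fish names already in the ontology.
KNOWN_FISH = {"herring", "flounder", "salmon", "clams"}

def synonym_candidates(text):
    """Pattern 1: 'name (alternative name)' suggests a synonym pair."""
    return re.findall(r"\b([A-Za-z]+) \(([A-Za-z]+(?: [A-Za-z]+)*)\)", text)

def list_candidates(text):
    """Pattern 2: unknown items in a list that also contains a known fish."""
    candidates = set()
    for match in re.finditer(r"\b\w+(?:, \w+)+,? and \w+", text):
        items = re.split(r",? and |, ", match.group(0))
        if any(item.lower() in KNOWN_FISH for item in items):
            candidates.update(i for i in items if i.lower() not in KNOWN_FISH)
    return candidates

def specific_name_candidates(text):
    """Pattern 3: capitalised modifier + known fish suggests a more specific fish."""
    pattern = r"\b([A-Z][a-z]+) (" + "|".join(sorted(KNOWN_FISH)) + r")\b"
    return [" ".join(pair) for pair in re.findall(pattern, text)]
```

Each function only proposes candidates. Deciding where a candidate belongs in the ontology is a separate step.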
Example of JAPE rule (1)
Example: “Suberites sponges” (where “sponge” is a known
class)
Rule: AdjClass
(
({Token.category == JJ})
({Class}):super
):sub
-->
:sub.SardineSubclass = {rule=AdjClass},
:super.SardineSuperclass = {rule=AdjClass},
…
Example of JAPE rule (2)
Example: “Frogs are a kind of amphibian.”
Rule: Subclass1
(
({NP}):sub
( {Lookup.minorType == be}
{Token.category == DT}
{Lookup.majorType == kind}
)
({NP}):super
) -->
…
…
Annotated text in GATE
Augmenting the Ontology
• The new classes found are linked to existing classes in the
ontology
• For existing fish, and new fish which we identified as a
synonym or hyponym of an existing fish, the link is to an
existing ontology instance
• When we don't identify a link to any existing fish, we create
a new concept
• The changes to the ontology are stored and can be
verified later by human experts
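The decision logic on this slide might look roughly like the following Python sketch. It is not SARDINE's implementation: the Ontology class, its fields and the change tuples are all hypothetical, chosen only to show the branching between linking to an existing entry and creating a new concept, with every change logged for later review.

```python
from dataclasses import dataclass, field

@dataclass
class Ontology:
    # Maps each class name to its parent class (None for top-level classes).
    parents: dict = field(default_factory=dict)
    # Maps each class name to its known alternative names.
    synonyms: dict = field(default_factory=dict)
    # Proposed changes, kept so that a human expert can verify them later.
    change_log: list = field(default_factory=list)

    def add_candidate(self, name, link_type=None, existing=None):
        """Place a candidate name found in text into the ontology."""
        if link_type == "synonym" and existing in self.parents:
            self.synonyms.setdefault(existing, set()).add(name)
            change = ("add_synonym", name, existing)
        elif link_type == "hyponym" and existing in self.parents:
            self.parents[name] = existing
            change = ("add_subclass", name, existing)
        else:
            # No link to an existing fish was found: create a new concept.
            self.parents[name] = None
            change = ("add_new_concept", name, None)
        self.change_log.append(change)
        return change
```

Keeping the change log separate from the ontology data means an expert can accept or reject each proposed change individually.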
Generated “animal” ontology
Recognising components from the ontology
• In addition to the standard IE components, we use some
special ontology components.
• The OntoRootGazetteer enables us to match words or
phrases in the text with classes, instances or properties in
an ontology, in any morphological variant
• Morphological analysis is performed on both text and
ontology, then matching is done between the two at the
root level.
• Text is annotated with features containing the root and
original string(s)
• When new elements are added to the ontology, these
features can be used to regenerate alternative forms
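Root-level matching can be illustrated with a toy Python sketch. The naive_root function below is a deliberately crude assumption (it only strips a trailing "s") and is not the morphological analyser GATE uses; the function names and the feature keys are likewise hypothetical. Only single-word ontology labels are handled.

```python
import re

def naive_root(word):
    """Crude stand-in for morphological analysis: lowercase, strip a plural 's'."""
    word = word.lower()
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def build_root_index(ontology_labels):
    """Map the root form of each (single-word) ontology label to the label."""
    return {naive_root(label): label for label in ontology_labels}

def annotate(text, index):
    """Return (start, end, features) for every token whose root is in the index."""
    annotations = []
    for match in re.finditer(r"[A-Za-z]+", text):
        root = naive_root(match.group(0))
        if root in index:
            features = {
                "root": root,                # root form shared by text and ontology
                "string": match.group(0),    # original string as found in the text
                "label": index[root],        # matching ontology label
            }
            annotations.append((match.start(), match.end(), features))
    return annotations
```

Because both the text token and the ontology label are reduced to a root before comparison, "Sponges" in the text matches a class labelled "Sponge".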
Modifying the ontology
• We developed a special GATE plugin called NEBOnE
(Named Entity Based ONtology Editor)
• This reuses technology taken from CLOnE (Controlled
Language ONtology Editor)
• CLOnE is designed to create new classes, instances etc.
from raw (controlled) text generated by the user
• NEBOnE enables changes to be made to the ontology
based on information extraction from input texts (e.g. web
pages) in natural language
• Morphological analysis enables both root forms and
variants to be added to the ontology (as properties), along
with other variants (e.g. capitalisation)
Finding relations between known elements
• In this case study, we find relations between elements
that already exist in the ontology, e.g. fish
species -- gear type
• We have already annotated all fish species, gear types,
fishing areas and so on in the text, based on ontology
lookup
• A JAPE grammar first finds the subject of the document
(a gear type) and adds the information as a document
feature
• When a species name is found, we create a new
annotation for the relation “gear_used”, with a property
denoting the species, and another property denoting
the ID number of the gear.
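The two steps above (record the gear as a document feature, then create one relation annotation per species mention) can be sketched in Python as follows. The dictionary layout, the "gear_id" and "species" keys, and the function name are assumptions made for illustration; in GATE this logic lives in a JAPE grammar operating on real annotation sets.

```python
def find_gear_relations(doc_features, species_annotations):
    """Create a 'gear_used' relation for each species mention in the document.

    doc_features: dict that holds the document's subject gear under "gear_id".
    species_annotations: list of dicts with "start", "end" and "species" keys.
    """
    gear_id = doc_features.get("gear_id")
    if gear_id is None:
        # No gear type was identified as the document subject: nothing to relate.
        return []
    relations = []
    for ann in species_annotations:
        relations.append({
            "type": "gear_used",
            "start": ann["start"],
            "end": ann["end"],
            "features": {"species": ann["species"], "gear_id": gear_id},
        })
    return relations
```

The sketch assumes, as the slide does, that every species mentioned in a document about one gear type is related to that gear.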
Viewing relations
Using ANNIC to view results
• By running our application on a Lucene
datastore, we can then use ANNIC to view
the results
• Search for the pattern consisting of the name
of the relation annotation (in this case
“gear_used”)
• Show the relevant features (species, gear ID,
gear type)
SPRAT
• Semantic Pattern Recognition and
Annotation Tool
• This is a generic version of SARDINE that
runs on all kinds of texts, not just fisheries
• Does not require a seed ontology
• Useful for building a domain ontology from
scratch
• Tested on Wikipedia pages
How well can we do it?
• Traditional NE recognition on news texts:
~90% precision/recall
• Ontology-based information extraction on
news texts: ~80% precision/recall
• Pattern-based relation extraction on
Wikipedia texts: high precision but low
recall (or vice versa, depending on setup)
• Relation finding between known entities:
~90% precision/recall
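For reference, precision and recall as quoted above are computed from the overlap between the system's output and a gold standard. The sketch below uses the standard definitions; the function name and the set-based representation of annotations are assumptions for illustration, and this is not the GATE evaluation tooling.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for a set of predictions against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: what fraction of the system's output is correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: what fraction of the gold standard the system found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

A cautious pattern set that proposes few relations tends towards high precision and low recall; a looser one shifts the balance the other way.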
More information
NeOn project: http://www.neon-project.org
The NeOn Toolkit is freely available:
http://www.neon-toolkit.org
The SARDINE application can be downloaded from
the GATE website:
http://gate.ac.uk/projects/neon/sardine