Ontotext @ JRC - Language Technology Resources

Ontotext @ JRC
5-6 Oct 2005
Semantic Web
• The Semantic Web is the abstract representation of data on the WWW,
based on RDF and other standards
• The SW is being developed by the W3C, in collaboration with a large number of
researchers and industrial partners
http://www.w3.org/2001/sw/
http://www.SemanticWeb.org
Semantic Web (II)
• "The Semantic Web is an extension of the current web in which information is
given well-defined meaning, better enabling computers and people to work in
cooperation." [Berners-Lee et al. 2001]
The spirit:
• Automatically processable metadata regarding:
– the structure (syntax) and
– the meaning (semantics)
of the content;
• Presented in a standard form;
• Dynamic interpretation for unforeseen purposes.
Semantic Web: Languages
• RDF(S) – the next slides
• SHOE, XOL, etc. – the pioneers
• Topic Maps – a metadata language with limited impact
• OIL – Ontology Interchange Language, the basis of the next two
http://www.ontoknowledge.org/oil/
– Description Logics-based multilayered language
• DAML+OIL – the predecessor of OWL, no longer being developed
• OWL – the W3C standard ontology language for the Semantic Web,
http://www.w3.org/2001/sw/WebOnt/
– Extends RDF(S), but also constrains it
– Has multiple layers (Lite, DL, Full)
– Transitive/symmetric/etc. properties, disjointness, cardinality restrictions
Semantic Web: Problems
• A critical mass of metadata is necessary
• There is still a lack of consensus on many issues (like query languages)
• Lack of practice at the proper scale and complexity
• Lack of robust semantic (at present, RDFS) repositories, which should be:
– As flexible, multi-purpose and easy to use as HTTP servers, and
– As efficient in structured knowledge management as RDBMS
What are Sirma & Ontotext?
• Established in 1992 as a Bulgarian AI Lab.
• Current structure:
– Sirma Group International Corp, Montreal, Canada;
– 8 subsidiary companies; the most important ones follow below.
• Sirma AI, Sofia
– The R&D backbone of the group with two divisions:
– Sirma Solutions: e-Business, banking, C3, e-Publishing, consultancy;
– Ontotext Lab: Knowledge and Language Engineering.
• EngView Systems, Montreal
– CAD/CAM systems and applications.
• WorkLogic.Com, Ottawa
– Web-based collaboration, workflow, e-Gov.
Software Development and Research since 1992
• Track record of success – large companies and government
organizations in the US, Canada, Western Europe and Bulgaria;
• Top-3 Software Company in Bulgaria;
• About 70 developers;
• ISO 2001 Certificate;
• 1999 EIST prize winner;
Sirma Businesses and Domains
Diverse business, ranging from COTS products to custom projects,
consultancy, and outsourcing services.
Major areas:
• AI – expert systems (besides Ontotext);
• B2B marketplaces;
• CAD/CAM (for packaging, quality control)
• e-Government, CSCW, Groupware, Workflow;
• Banking
• C3/C4 Systems (military, airport traffic);
• VOIP billing systems;
• e-Publishing, Proofing tools.
Ontotext Lab
An R&D lab of Sirma for
Knowledge and Language Engineering
Research and core technology development for
knowledge discovery, management, and engineering.
Specialized for applications in Semantic Web, Knowledge Management, and
Web Services.
Aside from the scientific matters, most of us are just professional software
developers.
Leading Semantic Web Technology Provider
Ontotext is a leading Semantic Web technology provider, being:
• the developer of the KIM Semantic Annotation Platform;
• a co-developer of the GATE language engineering platform;
• a co-developer of the Sesame semantic repository and OWLIM high-performance OWL reasoner;
• the developer of the WSMO4J semantic web services API;
• a partner in the SWAN Semantic Web Annotator project.
Ontotext takes part in most of the major European research projects in the field
and is the most successful Bulgarian participant in FP6.
Mission
• A critical mass of research in a number of AI areas has made efficient KM
almost possible.
• The technology on the market is mostly of two sorts:
– Expensive black boxes
– Academic prototypes
Our mission is:
• To develop and popularize open, skillfully engineered tools...
• For Information Extraction and Knowledge Management,
• Which considerably reduce the cost of implementation and use of KM
applications.
Major Research Areas
We focus on building cutting-edge expertise and technology in the following
areas:
• ontology design, management, and alignment;
• knowledge representation, reasoning;
• information extraction (IE), applications in IR;
• semantic web services;
• upper-level ontologies and lexical semantics;
• NLP: POS, gazetteers, co-reference resolution, named entity recognition
(NER);
• machine learning (HMM, NN, etc.).
Academic & Technology Partners
• NLP Group, Sheffield University, UK;
• Digital Enterprise Research Institute (DERI),
Institut für Informatik, Innsbruck, Austria, and
National University of Ireland, Galway;
• Aduna (Aidministrator) b.v., The Netherlands;
• Linguistic Modelling Lab.
CLPOI, Bulgarian Academy of Sciences;
• British Telecommunications Plc (BT), UK;
• Forschungszentrum Informatik (FZI) and Institut AIFB
Karlsruhe, Germany.
Customers
• SemanticEdge GmbH, Berlin, Germany;
• QinetiQ Ltd, UK;
• Fairway Consultants, UK.
Research Projects
We were/are part of a number of FP5 research projects:
• On-To-Knowledge - the project which invented OIL.
Ontology Middleware Module and a DAML+OIL reasoner.
• VISION - Towards Next Generation Knowledge Management.
• OntoWeb - Ontology-based information exchange for knowledge
management ….
• SWWS - Semantic Web enabled Web Services.
Research Projects (II)
FP6 integrated projects that started Jan 2004, durations ~3 years:
• SEKT: Semantic Knowledge Technologies. Targeting a synergy of
Ontology and Metadata Technology, Knowledge Discovery and Human
Language Technology.
• DIP: Data, Information, and Process Integration with Semantic Web
Services.
• PrestoSpace: Preservation towards storage and access. Standardized
Practices for Audiovisual Contents in Europe.
• Infrawebs: Intelligent Framework for Generating Open (Adaptable) Development
Platforms for Web-Service Enabled Applications Using Semantic Web Technologies,
Distributed Decision Support Units and Multi-Agent-Systems.
Introduction to Ontologies
Despite the formal definitions, ontologies are:
• Conceptual models or schemata
– Represented in a formalism which allows:
– Unambiguous "semantic" interpretation
– Inference
• Can be considered a combination of:
– DB schema
– XML Schema
– OO-diagram (e.g. UML)
– Subject hierarchy/taxonomy (think of Yahoo)
– Business logic rules
Introduction to Ontologies (II)
• Imagine a DB storing
"John is a son of Mary".
• It will be able to "answer" just:
– Who are the sons of Mary? Whose son is John?
• Now imagine an ontology with a definition of the family relationships. It could
infer:
– John is a child of Mary (more general);
– Mary is a woman;
– Mary is the mother of John (inverse);
– Mary is a relative of John (generalized inverse).
• The above facts would remain "invisible" to a typical DB, whose
model of the world is limited to data structures of strings and
numbers.
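The family example can be sketched as a tiny forward-chaining loop over triples. This is only an illustration: the predicate names (sonOf, childOf, parentOf, motherOf, relativeOf) and the rules are assumptions made for this sketch, not an actual ontology from the talk. Unlike the slide, the sketch states "Mary is a woman" as an explicit fact instead of deriving it, because deriving it needs further modelling assumptions.

```python
# Forward-chaining sketch over (subject, predicate, object) triples.
# All predicate names and rules are hypothetical illustrations.

def infer(facts):
    """Apply the family rules until no new triples appear (a fixed point)."""
    closure = set(facts)
    while True:
        new = set()
        for s, p, o in closure:
            if p == "sonOf":
                new.add((s, "childOf", o))      # more general relation
            if p == "childOf":
                new.add((o, "parentOf", s))     # inverse relation
                new.add((s, "relativeOf", o))   # generalization
            if p == "relativeOf":
                new.add((o, "relativeOf", s))   # symmetric relation
            if p == "parentOf" and (s, "type", "Woman") in closure:
                new.add((s, "motherOf", o))     # a female parent is a mother
        if new <= closure:                      # nothing new: fixed point reached
            return closure
        closure |= new


# Explicit facts; the second one is an assumption of this sketch.
facts = {("John", "sonOf", "Mary"), ("Mary", "type", "Woman")}
closure = infer(facts)

for triple in sorted(closure - facts):
    print(triple)
```

A plain lookup on the stored row only finds the sonOf fact; the derived triples become visible once the rules have been applied.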
Products
• The Ontology Middleware Module (OMM) is an enterprise back-end for formal KR
and KM applications based on Semantic Web standards
• An extension of the Sesame RDF(S) repository that adds
a Knowledge Control System.
• OMM integration options: Built-In, RMI, SOAP, HTTP.
[Figure: the Knowledge Control System and its components: Meta-Information, Tracking Changes, Access Control, Change Investigation, Current User Info.]
Products
• BOR – a DAML+OIL reasoner.
• Proprietary GATE components:
– Hash Gazetteer. A high-performance lookup tool.
– Hidden Markov Model Learner. A stochastic module for
filtering annotations, disambiguation, etc., based on
confidence measures.
• The News Collector is a web service, collecting and indexing articles from
the top-10 global news wires:
– About 1000 articles/day, annotated and indexed using KIM;
– Used to validate the heuristics and resources of KIM.
Products (II)
• The KIM Platform (the next slides), http://www.ontotext.kim.
• SWWS Studio (http://swws.ontotext.com)
– Semantic Web Service description development environment
– Developed in the course of the SWWS project
– Based on WSMO (http://www.wsmo.org)
• WSMO4J (http://wsmo4j.sourceforge.net)
– A WSMO API and a reference implementation
for building Semantic Web Services applications
– Used in WSMO Studio (http://www.wsmostudio.org/)
– The basis for ORDI, used in OMWG (http://www.omwg.org)
– Used in the DIP, SEKT, and Infrawebs projects
OWLIM
• OWLIM is a high-performance OWL repository
• It is a Storage and Inference Layer (SAIL) for the Sesame RDF database
• OWLIM performs OWL DLP reasoning
• It uses the IRRE (Inductive Rule Reasoning Engine) for forward-chaining
and "total materialization"
• In-memory reasoning and query evaluation
• OWLIM provides reliable persistence, based on RDF N-Triples
• OWLIM can manage millions of statements on desktop hardware
• Extremely fast upload and query evaluation even for huge ontologies and
knowledge bases
Scalability: Upload and Reasoning
[Figure: upload speed (statements/sec, 0 to 20,000) against size of repository (0.5 to 10 million explicit statements), for the configurations 2Xeon1GB, 2Opt3GB, 2Opt5GB and PM512MB, with a 1/log reference curve.]
Scalability: Query Answering
[Figure: evaluation time of Q2 (msec, 0 to 400) against size of repository (0.5 to 10 million explicit statements), for the configurations 2Xeon1GB, 2Opt3GB, 2Opt5GB and PM512MB.]
• Q2: Pattern of 12 statement-joins and LIKE literal constraint
OWLIM under the LUBM Benchmark
• The Lehigh Univ. evaluation (LUBM) is one of the most comprehensive benchmark
experiments published recently (ISWC 2004, JWS 2005)
• Synthetically generated OWL knowledge bases
• The biggest set generated is LUBM(50,0) – 6M explicit statements
• 14 queries, checking different inferences
• OWLIM on LUBM:
– On a desktop machine OWLIM loads LUBM(50,0) in 10 min
– The only other system known to load it takes 12 hours
– All the queries are answered correctly
• Based on this we can claim that:
– OWLIM is the fastest OWL repository in the world!
JOCI
• "Jobs & Contacts Intelligence", Innovantage, Fairway Consultants
• Gathering recruitment-related information from the web sites of UK
organizations
• Offering services on top of this data to recruitment agencies, job portals,
and others.
• JOCI uses KIM for information extraction (IE, text-mining)
• JOCI makes use of a domain ontology to:
– support the IE process,
– structure the knowledge base with the obtained results, and
– facilitate semantic queries.
• Sirma is a shareholder in Fairway Consultants
JOCI Dataflow
[Figure: JOCI dataflow. Components shown: UK Web Space; Focused Crawler (Crawler, Classifier); Information Extraction (KIM Server, Single-Document IE); Object Consolidation; Semantic Repository; Document Store; Web UI.]
JOCI: Vacancy Consolidation/Matching
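[Figure: two vacancies merged into a Consolidated Vacancy. Vacancy 1 and Vacancy 2 each have a hasJobTitle and a locatedIn link; the job titles "IT Applications Support Analyst" and "Support Analyst" are related by sub-string; the locations Glasgow (type City), Scotland and U.K. (type Country) are linked by subRegionOf; City and Country are subClassOf Location.]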
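A matching test of the kind suggested by the figure could be sketched as follows. The sketch assumes that two vacancies are candidates for consolidation when one job title is a sub-string of the other and one location lies within the other along subRegionOf. Which vacancy sits in which location, the third vacancy, and all function names are hypothetical; the actual JOCI logic is not given in the slides.

```python
# Hypothetical region hierarchy, following the subRegionOf links in the figure.
SUB_REGION_OF = {"Glasgow": "Scotland", "Scotland": "U.K."}


def regions_containing(place):
    """Return the place itself plus every region that transitively contains it."""
    chain = [place]
    while chain[-1] in SUB_REGION_OF:
        chain.append(SUB_REGION_OF[chain[-1]])
    return chain


def compatible_location(a, b):
    """True when the locations are equal or one lies inside the other."""
    return a in regions_containing(b) or b in regions_containing(a)


def compatible_title(a, b):
    """True when one job title is a (case-insensitive) sub-string of the other."""
    a, b = a.lower(), b.lower()
    return a in b or b in a


def may_consolidate(v1, v2):
    return (compatible_title(v1["title"], v2["title"])
            and compatible_location(v1["location"], v2["location"]))


vacancy_1 = {"title": "IT Applications Support Analyst", "location": "Glasgow"}
vacancy_2 = {"title": "Support Analyst", "location": "Scotland"}
vacancy_3 = {"title": "Support Analyst", "location": "London"}  # made-up counter-example

print(may_consolidate(vacancy_1, vacancy_2))
print(may_consolidate(vacancy_1, vacancy_3))
```

The location test is where the ontology earns its keep: plain string comparison would never relate "Glasgow" to "Scotland".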
JOCI Statistics
• The figures below are indicative and reflect an old state of the JOCI system:
– The actual figures are to be announced after the launch of JOCI
• Web-sites inspected: 0.5M
• Web-sites with vacancy announcements: 30K
• Extracted vacancies: 100K
The KIM Platform
• A platform offering services and infrastructure for:
– (semi-)automatic semantic annotation and ontology population;
– semantic indexing and retrieval of content;
– query and navigation over the formal knowledge.
• Based on Information Extraction technology
KIM What’s Inside?
The KIM Platform includes:
• Ontologies (PROTON + KIMSO + KIMLO) and KIM World KB
• KIM Server – with a set of APIs for remote access and integration
• Front-ends: Web-UI and plug-in for Internet Explorer.
The AIM of KIM
• Aim: to arm Semantic Web applications
– by providing a metadata generation technology
– in a standard, consistent, and scalable framework
What Does KIM Do?
Semantic Annotation
Simple Usage: Highlight, Hyperlink, and…
Simple Usage: … Explore and Navigate
Simple Usage: … Enjoy a Hyperbolic Tree View
KIM is Based On…
KIM is based on the following open-source platforms:
• GATE – the most popular NLP and IE platform in the world, developed at the
University of Sheffield. Ontotext is its biggest co-developer.
www.gate.ac.uk and www.ontotext.com/gate
• OWLIM – OWL repository, compliant with
Sesame RDF database from Aduna B.V.
www.ontotext.com/owlim
• Lucene – an open-source IR engine by Apache. jakarta.apache.org/lucene/
How KIM Searches Better
KIM can match a query like:
"Documents about a telecom company in Europe, John Smith, and a date in the
first half of 2002."
with a document containing:
"At its meeting on the 10th of May, the board of Vodafone appointed John G.
Smith as CTO."
Classical IR could not match:
– Vodafone with a "telecom company in Europe", because this requires knowing that:
• Vodafone is a mobile operator, which is a sort of telecom;
• Vodafone is in the UK, which is a part of Europe.
– "10th of May" with a "date in the first half of 2002";
– "John G. Smith" with "John Smith".
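The three matches can be sketched in a few lines, assuming the document has already been annotated with entities, person names and resolved dates. The class and region hierarchies, the annotation layout, the resolved year 2002 and all function names are assumptions made for this illustration; this is not the KIM API.

```python
from datetime import date

# Hypothetical background knowledge, standing in for the ontology and KB.
SUBCLASS_OF = {"MobileOperator": "Telecom", "Telecom": "Company"}
PART_OF = {"UK": "Europe"}
ENTITY = {"Vodafone": {"type": "MobileOperator", "locatedIn": "UK"}}


def ancestors(item, relation):
    """Yield item and everything reachable by following relation upwards."""
    while item is not None:
        yield item
        item = relation.get(item)


def is_a(entity, cls):
    return cls in ancestors(ENTITY[entity]["type"], SUBCLASS_OF)


def is_in(entity, region):
    return region in ancestors(ENTITY[entity]["locatedIn"], PART_OF)


def normalise_name(name):
    """Drop middle initials such as 'G.' so 'John G. Smith' matches 'John Smith'."""
    parts = [p for p in name.split() if not (len(p) == 2 and p.endswith("."))]
    return " ".join(parts)


def matches(doc):
    """Query: a telecom in Europe, John Smith, and a date in the first half of 2002."""
    telecom = any(is_a(e, "Telecom") and is_in(e, "Europe") for e in doc["entities"])
    person = any(normalise_name(p) == "John Smith" for p in doc["persons"])
    in_range = any(date(2002, 1, 1) <= d <= date(2002, 6, 30) for d in doc["dates"])
    return telecom and person and in_range


# Annotations a semantic annotator might produce for the example sentence;
# the year of "the 10th of May" is assumed to have been resolved to 2002.
doc = {
    "entities": ["Vodafone"],
    "persons": ["John G. Smith"],
    "dates": [date(2002, 5, 10)],
}
print(matches(doc))
```

Each check is a lookup against structured knowledge, not a string comparison, which is why a keyword index alone cannot answer the query.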
Entity Pattern Search
Pattern Search: Entity Results
Entity Pattern Search: KIM Explorer
Semantic Metadata in KIM…
• Provides a specific metadata schema,
– focusing on named entities (particulars),
– as well as number and time-expressions, addresses, etc.,
– everything "specific", apart from the general concepts.
• Defines specific tasks for generation and usage of the metadata which are
well-understood and measurable.
• Why not metadata about general things (universals)?
– It is too complex…
– but we leave the door open.
• The particulars seem to provide a good 80/20 compromise.
World Knowledge in KIM
Rationale:
• Provide common knowledge about world entities;
• KIM bets on scale and avoids heavy semantics:
minimum modeling of common sense, almost no axioms;
• The ontology is encoded in OWL Lite and RDF.
• In addition, a number of rules (generative axioms) are defined, e.g.:
<X,locatedIn,Y> and <Y,subRegionOf,Z> =>
<X,locatedIn,Z>
• Axioms of this sort are supported by OWLIM and they provide a consistent
mechanism for "custom" extensions to the OWL or RDF(S) semantics with respect
to a particular ontology.
PROTON
• Name. PROTON is an acronym for Proto Ontology
– former names: BULO (basic upper-level ontology), GO (generic ontology);
– not a Russian space rocket;
– "proto" is used in the sense of "primary", "beginning", "giving rise to", rather than "first in
time" or "oldest";
– connotations: positive, fundamental, elemental, "in favour of", even romantic
(like a science-fiction novel from the 60s).
• Intended usage. A basic upper-level ontology like PROTON is used for:
– ontology population;
– knowledge modelling and integration strategy of a KM environment;
– generation of domain, application, and other ontologies.
PROTON Design
• Design principles:
1. domain-independence;
2. light-weight logical definitions;
3. compliance with popular metadata standards;
4. good coverage of concrete and/or named entities (e.g. people,
organizations, numbers);
5. no specific support for general concepts (such as "apple", "love", "walk");
however, the design allows for such extensions.
Some Figures…
• PROTON defines about 250 classes and 100 properties
• It provides coverage of most of the upper-level concepts necessary for semantic
annotation, indexing, and retrieval
• A modular architecture, allowing for great flexibility of usage and extension:
– SYSTEM module - contains a few meta-level primitives (6 classes and 7
properties); introduces the notion of 'entity', which can have aliases;
– TOP module - the highest, most general, conceptual level, consisting of about 20
classes;
– UPPER module - over 200 general classes of entities, which often appear in
multiple domains.
PROTON Ontology Language
• The current version of the ontology is encoded in OWL Lite.
• A few custom entailment rules (axioms) are also defined for usage in tools that support
them, for instance:
Premise:
<xxx, protont:roleHolder, yyy>
<xxx, protont:roleIn, zzz>
<yyy, rdf:type, protont:Agent>
Consequent:
<yyy, protont:involvedIn, zzz>
• Axioms of this sort are interpreted by OWLIM
• PROTON is portable to any OWL (Lite)-compliant tool.
• PROTON can also be used without such axioms.
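How a tool might interpret a premise/consequent rule of this form can be sketched with a small pattern matcher that is run to a fixed point. This is a toy illustration, not OWLIM's engine: the "?x" variable syntax, the function names and the sample triples (job1, JohnSmith, Vodafone, Glasgow) are all assumptions. Only the two rules themselves come from the slides.

```python
def match(pattern, triple, binding):
    """Unify one premise pattern with one triple; return the extended binding or None."""
    binding = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):                      # variable
            if binding.setdefault(term, value) != value:
                return None                           # clashes with an earlier binding
        elif term != value:                           # constant must match exactly
            return None
    return binding


def apply_rule(premises, consequent, triples):
    """Return every consequent triple derivable from the given triples in one step."""
    bindings = [{}]
    for premise in premises:
        bindings = [b2 for b in bindings for t in triples
                    if (b2 := match(premise, t, b)) is not None]
    return {tuple(b.get(term, term) for term in consequent) for b in bindings}


def materialise(rules, triples):
    """Forward-chain all rules until no new triples are produced."""
    closure = set(triples)
    while True:
        derived = set()
        for premises, consequent in rules:
            derived |= apply_rule(premises, consequent, closure)
        if derived <= closure:
            return closure
        closure |= derived


RULES = [
    # The PROTON rule quoted above.
    ([("?x", "protont:roleHolder", "?y"),
      ("?x", "protont:roleIn", "?z"),
      ("?y", "rdf:type", "protont:Agent")],
     ("?y", "protont:involvedIn", "?z")),
    # The locatedIn/subRegionOf rule from the "World Knowledge in KIM" slide.
    ([("?x", "locatedIn", "?y"), ("?y", "subRegionOf", "?z")],
     ("?x", "locatedIn", "?z")),
]

# Hypothetical sample data.
triples = {
    ("job1", "protont:roleHolder", "JohnSmith"),
    ("job1", "protont:roleIn", "Vodafone"),
    ("JohnSmith", "rdf:type", "protont:Agent"),
    ("Glasgow", "locatedIn", "Scotland"),
    ("Scotland", "subRegionOf", "U.K."),
}
closure = materialise(RULES, triples)
print(sorted(closure - triples))
```

Storing the derived triples next to the explicit ones means a query never has to reason at evaluation time, at the cost of a larger repository.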
Other Standards: Relations
• ADL Feature Type Thesaurus and GNS
– the backbone of the Location branch;
– in turn aligned with the geographic feature designators of the GNS database of
NIMA;
– PROTON is more coarse-grained, taking about 80 out of 300 types.
• Dublin Core
– the basic element set is available as properties of the protont:InformationResource and
protont:Document classes;
– the resource type vocabulary is mapped to sub-classes of InformationResource.
• OpenCyc and WordNet – consulted and referred to in glosses.
• ACE (Automatic Content Extraction) annotation types – covered.
• FOAF – easy mapping is assured (e.g. the Account class was added).
• DOLCE, EuroWordNet Top, and others – consulted to various extents.
Other Standards: Compliance
• Other models are not directly imported (for consistency reasons)
• The mapping of the appropriate primitives is easy, on the basis of
– a compliant design, and
– formal notes in the PROTON glosses, which indicate the appropriate
mappings.
• For instance, in PROTON, a protont:inLanguage property is defined
– as an equivalent of the dc:language element in Dublin Core
– with a domain protont:InformationResource
– and a range protont:Language
KIM World KB
A quasi-exhaustive coverage of the most popular entities in the world …
• What a person is expected to have heard about that is beyond the horizons
of their country, profession, and hobbies.
• Entities of general importance … like the ones that appear in the news …
KIM "knows":
• Locations: mountains, cities, roads, etc.
• Organizations of all important sorts: business, international, political,
government, sport, academic…
• Specific people, etc.
KIM World KB: Entity Description
The NEs are represented with their semantic descriptions via:
• Aliases (Florida & FL);
• Relations with other entities (Person hasPosition Position);
• Attributes (latitude & longitude of geographic entities);
• Their proper class.
The Scale of KIM World KB
                      Small KB     Full KB
RDF Statements
- explicit             444,086   2,248,576
- after inference    1,014,409   5,200,017
Instances
- Entity:               40,804     205,287
- Location:             12,528      35,590
- Country:                 261         261
- Province:              4,262       4,262
- City:                  4,400       4,417
- Organization:          8,339     146,969
- Company:               7,848     146,262
- Person:                6,022       6,354
- Alias:                64,589     429,035
KIM IE Pipeline
JAPE Grammars
• JAPE grammars are based on the last MUSE version
• Class/instance information included
• Better class granularity in grammars
• Relation recognition grammars - LocatedIn and
HasPositionWithinOrganization
Disambiguation & Filtering
• Simple disambiguation (longest match), e.g. San Francisco Journal
• Based on the main alias, e.g. "Beijing"
• By priority of the class, instance or relative class priority
– E.g. Brand "Microsoft" vs. Company "Microsoft Corp."
– We assign a priority (1-1000) to each class and instance
– For pairs of classes we define relative priority
– If the difference between the priorities is greater than a certain threshold,
the possible reference to the entity with the lower priority is ignored
• Still to be improved
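The priority-threshold step can be sketched as a filter over the candidate entities for one mention. The priority values, the threshold of 100 and the function name are hypothetical; the slides only state that priorities range from 1 to 1000 and that a candidate is dropped when it trails by more than a threshold.

```python
THRESHOLD = 100  # hypothetical value; the slides do not state the actual threshold


def filter_candidates(candidates, threshold=THRESHOLD):
    """Keep candidates whose priority is within `threshold` of the best candidate.

    `candidates` is a list of (entity, priority) pairs competing for the same
    text span, with priorities in the range 1-1000.
    """
    best = max(priority for _, priority in candidates)
    return [(entity, p) for entity, p in candidates if best - p <= threshold]


# Made-up priorities for the two readings of the mention "Microsoft".
clear_case = [("Company: Microsoft Corp.", 800), ("Brand: Microsoft", 300)]
close_case = [("Company: Microsoft Corp.", 800), ("Brand: Microsoft", 750)]

print(filter_candidates(clear_case))   # the lower-priority reading is ignored
print(filter_candidates(close_case))   # both readings survive for later stages
```

When the priorities are close, the filter deliberately keeps both readings, leaving the decision to other heuristics.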
KIM Scaling on Data
• The Semantic Repository is based on OWLIM
• In our practical tests we observe perfect performance on top of:
– 1.2M of entity descriptions:
– about 15M explicit statements;
– above 30M statements after forward chaining.
• Document and Annotation storage and indexing with Lucene:
– One million docs, processed on a $1000-worth machine;
– retrieval in milliseconds.
Entity Ranking: a sketch for Jan-May 2004
No  Instance                        Label                       Rank
1   Country_T.5                     United States               0.032
4   Country_T.IZ                    Republic of Iraq            0.011
6   Person_T.51                     George W. Bush              0.010
9   Country_T.IS                    State of Israel             0.006
11  DayOfWeek_T.4                   Tuesday                     0.005
12  NewsAgency_T.6                  The Associated Press        0.005
14  InternationalOrganization_T.13  United Nations              0.005
27  Country_T.CH                    People's Republic of China  0.004
32  City_T.3068                     New York                    0.004
36  InternationalOrganization_T.18  European Union              0.004
40  Person_T.115                    Ariel Sharon                0.003
43  Country_T.JA                    Japan                       0.003
44  Country_T.UK                    United Kingdom              0.003
45  CountryCapital_T.93             Baghdad                     0.003
SWAN/KIM Cluster Architecture
• At present, KIM is used for massive semantic annotation in the context of
the SWAN and SEKT projects.
Some of the features of the cluster:
• support for a virtually unlimited number of annotators;
• centralized ontology storage and querying;
• centralized meta-data (annotations) and document storage, indexing, and
querying;
• support for multiple crawlers (or other data sources);
• dynamic reconfiguration of the cluster (e.g. starting new crawlers or
annotators on demand).
SWAN/KIM Cluster Console
SWAN Project:
Semantic Web Annotator
Large-scale annotation of human language for the Semantic Web using Human
Language Technology (HLT).
Hosted by DERI (NUIG, Galway); it also involves:
• the GATE team (from the University of Sheffield's NLP Group) and
• Ontotext Lab.
For more details take a look at http://deri.ie/projects/swan/
The current status:
• KIM Cluster of 7 servers in DERI
• Above 0.5TB shared storage
• 6 AMD64 Opterons, 6 Xeons, 36GB RAM
CoreDB: Name and Goals
• CoreDB is a component of KIM
• Stands for: Co-Occurrence and Ranking of Entities DB
• In a nutshell, it is designed to allow fast queries of the sort:
– Q1: the number of appearances of "UK" in documents during Jan 2005
– Q2: all people co-occurring with John Smith and some bank institution
in documents from the second half of 2003
– Q3: Q2 + where the documents contain "fraud" and the name of the
institution contains "capital"
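Q1 and Q2 can be sketched over a small relational schema with SQLite. The table layout, column names and sample rows are hypothetical and do not reflect the actual CoreDB schema. For brevity the sketch matches the class "Bank" directly and ignores sub-classes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (id INTEGER PRIMARY KEY, label TEXT, class TEXT);
CREATE TABLE document (id INTEGER PRIMARY KEY, published TEXT);  -- ISO dates
CREATE TABLE occurrence (doc_id INTEGER, entity_id INTEGER, n INTEGER);
""")
# Made-up sample data.
conn.executemany("INSERT INTO entity VALUES (?, ?, ?)", [
    (1, "UK", "Country"), (2, "John Smith", "Person"),
    (3, "Capital Bank", "Bank"), (4, "Jane Doe", "Person")])
conn.executemany("INSERT INTO document VALUES (?, ?)", [
    (10, "2005-01-15"), (11, "2003-09-01"), (12, "2003-02-01")])
conn.executemany("INSERT INTO occurrence VALUES (?, ?, ?)", [
    (10, 1, 3), (11, 2, 1), (11, 3, 2), (11, 4, 1), (12, 2, 1), (12, 4, 1)])

# Q1: number of appearances of "UK" in documents during Jan 2005.
q1 = conn.execute("""
    SELECT COALESCE(SUM(o.n), 0)
    FROM occurrence o
    JOIN entity e ON e.id = o.entity_id
    JOIN document d ON d.id = o.doc_id
    WHERE e.label = 'UK'
      AND d.published BETWEEN '2005-01-01' AND '2005-01-31'
""").fetchone()[0]

# Q2: people co-occurring with John Smith and some bank, second half of 2003.
q2 = [row[0] for row in conn.execute("""
    SELECT DISTINCT p.label
    FROM occurrence op
    JOIN entity p ON p.id = op.entity_id
    JOIN document d ON d.id = op.doc_id
    WHERE p.class = 'Person' AND p.label <> 'John Smith'
      AND d.published BETWEEN '2003-07-01' AND '2003-12-31'
      AND EXISTS (SELECT 1 FROM occurrence o2
                  JOIN entity e2 ON e2.id = o2.entity_id
                  WHERE o2.doc_id = op.doc_id AND e2.label = 'John Smith')
      AND EXISTS (SELECT 1 FROM occurrence o3
                  JOIN entity e3 ON e3.id = o3.entity_id
                  WHERE o3.doc_id = op.doc_id AND e3.class = 'Bank')
""")]

print(q1, q2)
```

Q3 would add two more filters to the same query: a keyword condition on the document content and a LIKE condition on the institution's label.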
CoreDB: Functionality
• It allows asking in a structured manner for:
– The number of references to entities in a (sub-)set of documents
– The entities which co-occur with other entities
• Entities can be constrained by:
– Class (and its sub-classes)
– Keyword/token in one of its names/aliases/labels
• Documents can be constrained according to DC-like features:
– Date (range; could be any date in the doc)
– Type (exact match; could be any string)
– Authors
– Title and Sub-title
– Keyword/token in the content, authors or the title fields
The Scale of Ambition
• The major point is to allow such queries in an *efficient* manner over data with
the following cardinality:
– 10^6 entities/terms
– 10^7 documents
– 10^2 entities occurring in an average document
• This means managing and querying efficiently 10^9 entity occurrences
(10^7 documents × 10^2 entities per document)
• We have tested the current implementation with 10^7 occurrences and it
answers the basic queries in milliseconds.
CoreDB Applications
• Detection of "associative" links between entities, based on co-occurrence in documents
– It is an alternative to the detection of strong links based on local context
parsing
• Ranking (measuring the popularity) of an entity over a set of documents
– The ranking is only as good/relevant/representative as the set of documents
is
• Computing timelines (changes over time) for entity ranking or co-occurrence
– "How did our popularity in the IT press change during June?"
(i.e. "What is the effect of this 1.5MEuro media campaign?!?")
– "How does the strength of association between organization X and RDF
change over Q1?"
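Ranking and timelines can be sketched under one simple assumption: the rank of an entity is its share of all entity occurrences in the chosen document set. The slides do not define how rank is actually computed, so this definition, the function names and the sample data are illustrative only.

```python
from collections import Counter, defaultdict

# One (month, entity) pair per entity occurrence; made-up sample data.
occurrences = [
    ("2005-05", "Ontotext"), ("2005-05", "RDF"), ("2005-05", "RDF"), ("2005-05", "OWL"),
    ("2005-06", "Ontotext"), ("2005-06", "Ontotext"), ("2005-06", "RDF"), ("2005-06", "OWL"),
]


def rank(occurrences):
    """Assumed rank: the share of all occurrences that refer to each entity."""
    counts = Counter(entity for _, entity in occurrences)
    total = sum(counts.values())
    return {entity: n / total for entity, n in counts.items()}


def timeline(occurrences, entity):
    """Rank of one entity computed separately for each month."""
    by_month = defaultdict(list)
    for month, e in occurrences:
        by_month[month].append((month, e))
    return {m: rank(occ).get(entity, 0.0) for m, occ in sorted(by_month.items())}


print(rank(occurrences)["Ontotext"])
print(timeline(occurrences, "Ontotext"))
```

Because the rank is relative to the chosen document set, the same entity ranks differently over the IT press than over general news, which is the caveat noted in the list above.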
Implementation
• It is a new component in the architecture of KIM
– Having an API (part of the KIM API) allows different implementations
• There are now a couple of RDBMS-based implementations:
– Derby (free, open-source, 100% Java, formerly Cloudscape from IBM)
– ORACLE (v. 10g)
• The Derby implementation does not allow for efficient searches involving
keywords
• The ORACLE implementation is also used for FTS-style indexing of the document
contents
– Makes possible an efficient combination of semantic and keyword search (which is
already available through the SemanticQuery API)
• In both RDBMS implementations:
– Part of the ontology and the KB are replicated
– The same goes for part of the document- and index-related information
Ontotext Facts
• Founded in 2000
• 14 employees (permanent, not counting shared personnel and associates)
• Daily statistics for http://www.ontotext.com: over 150 visits and 2000 hits
• Number of scientific publications: above 30
• Number of projects running: 9
• More than 20 partners we directly cooperate with on projects
• Average age: about 28
• Number of servers per developer: 0.7
Ontotext Lab
Robust Technology
and Professional Services for
Knowledge and Language Engineering
http://www.ontotext.com