Balancing Expressivity and Implementability in OWL Ontologies for

advertisement
Balancing Expressivity and Implementability
in OWL Ontologies for
Semantic Data Frameworks:
The Journey from 2004 to 2009 and Beyond
Peter Fox
Tetherless World Constellation
RPI
Australia Ontology Workshop 2009
Outline
•
•
•
•
•
•
•
The origins of this effort
Why a framework and not a system?
Semantics in 2004
The design and development methods
Ontologies and the software and production!
Semantics between 2004 and 2009
Discussion of the expressivity and
implementability balance and one more …
• Since it is almost 2010 … what we are up to
Tetherless World Constellation
2
Background
Scientists should be able to access a global, distributed
knowledge base of scientific data that:
• appears to be integrated
• appears to be locally available
But… data is obtained by multiple instruments, using
various protocols, in differing vocabularies, using
(sometimes unstated) assumptions, with inconsistent
(or non-existent) meta-data. It may be inconsistent,
incomplete, evolving, and distributed
And… there exist(ed) significant levels of semantic
heterogeneity, large-scale data, complex data types,
legacy systems, inflexible and unsustainable
implementation technology…
3
Origins
• In 2000-2001 the need for capturing and preserving
knowledge in science data became very clear but the
barriers were high
• In 2004 we started a virtual observatory project based
on semantic technologies
• Use case driven – in solar and solar-terrestrial physics
with an emphasis on instrument-based measurements
and real data pipelines; we needed implementations
• We knew we also needed integration and provenance
(but that came later)
• We aimed to push semantics into our systems to build
new ‘prototypes’ but we ‘failed’ ;-)
Tetherless World Constellation
4
In 2004
•
•
•
•
•
•
2004 – OWL was a W3 recommendation!!
Protégé 2.x and the Protégé-Java-OWL API
SWOOP was a viable editor
Jena and the Jena API were in good shape
Pellet worked
SPARQL was still a twinkle in the RDF working
group’s eye
• Semantics were still the realm of computer
scientists – luckily we had one of the best
Tetherless World Constellation
5
Frameworks vs. Systems
• Prior to 2005, we built systems
• Rough definitions
– Systems have very well-define entry and exit
points. A user tends to know when they are using
one. Options for extensions are limited and
usually require engineering
– Frameworks have many entry and use points. A
user often does not know when they are using
one. Extension points are part of the design
• You don’t have to agree, this was our view
Tetherless World Constellation
6
Ontology Spectrum
Catalog/
ID
Thesauri
“narrower
term”
relation
Terms/
glossary
Informal
is-a
Formal Frames
is-a (properties)
Formal
instance
Value
Restrs.
Selected
Logical
Constraints
(disjointness,
inverse, …)
General
Logical
constraints
Originally from AAAI 1999- Ontologies Panel by Gruninger, Lehmann, McGuinness, Uschold, Welty;
– updated by McGuinness.
Description in: www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-abstract.html
7
Design and Development
• We made a conscious decision only to develop
ontologies that were required to answer
specific use cases
• We made a conscious effort to use whatever
ontologies were available**
• We were pretty sure that rules would be
needed
• We ignored query
Tetherless World Constellation
8
Content: Coupling Energetics
and Dynamics of Atmospheric
Regions
Community data
archive for
observations and
models of Earth's
upper atmosphere
and geophysical
indices and
parameters
needed to
interpret them.
Includes
browsing
capabilities by
periods,
instruments,
models, …
Content: Mauna Loa
Solar Observatory
Near real-time
data from Hawaii
from a variety of
solar instruments.
Source for space
weather, solar
variability, and
basic solar
physics
Other content used
too – CISM – Center
for Integrated Space
Weather Modeling
Virtual Observatories
Make data and tools quickly and easily accessible to a
wide audience.
Operationally, virtual observatories need to find the
right balance of data/model holdings, portals and
client software that researchers can use without
effort or interference as if all the materials were
available on his/her local computer using the user’s
preferred language: i.e. appear to be local and
integrated
Likely to provide controlled vocabularies that may be
used for interoperation in appropriate domains along
with database interfaces for access and storage and
“smart” tools for evolution and maintenance.
11
Early days of VxOs
?
VO2
VO3
VO1
DB1
DB2
DB3
…………
DBn
12
The Astronomy approach;
data-types as a service
Limited
interoperability
VO App1
VOTable
VO App2
VO App3
Simple
Spectrum
Access
Protocol
OGC: {WFS, WCS, WMS} and
SWE {SOS, SPS, SAS}
VO layer
Simple
Image Access
Protocol
Simple Time
Access
Protocol
use the same approach
Lightweight semantics
DB1
DB2
DB3
Limited
coded
meaning,
…………
Limited extensibility
Under review
hard
DBn
13
Science and technical
use cases
Find data which represents the state of the neutral
atmosphere anywhere above 100km and toward the
arctic circle (above 45N) at any time of high
geomagnetic activity.
– Extract information from the use-case - encode knowledge
– Translate this into a complete query for data - inference
and integration of data from instruments, indices and
models
Provide semantically-enabled, smart data query
services via a SOAP web for the Virtual IonosphereThermosphere-Mesosphere Observatory that
retrieve data, filtered by constraints on Instrument,
Date-Time, and Parameter in any order and with
constraints included in any combination.
14
Use Case example
• Plot the neutral temperature from the Millstone-Hill
Fabry Perot, operating in the non-vertical mode during
January 2000 as a time series.
• Plot the neutral temperature from the Millstone-Hill
Fabry Perot, operating in the non-vertical mode during
January 2000 as a time series.
• Objects:
–
–
–
–
–
–
–
Neutral temperature is a (temperature is a) parameter
Millstone Hill is a (ground-based observatory is a) observatory
Fabry-Perot is a interferometer is a optical instrument is a instrument
Non-vertical mode is a instrument operating mode
January 2000 is a date-time range
Time is a independent variable/ coordinate
Time series is a data plot is a data product
15
Knowledge representation
• Statements as triples: {subject-predicate-object}
interferometer is-a optical instrument
Fabry-Perot is-a interferometer
Optical instrument has focal length
Optical instrument is-a instrument
Instrument has instrument operating mode
Instrument has measured parameter
Instrument operating mode has measured parameter
NeutralTemperature is-a temperature
Temperature is-a parameter
• A query*: select all optical instruments which have
operating mode vertical
• An inference: infer operating modes for a Fabry-Perot
Interferometer which measures neutral temperature
16
Added value
Education, clearinghouses, other services, disciplines,
etc.
Semantic mediation layer - mid-upper-level
Semantic
interoperability
VO
Portal
Added value
VO
API
Web
Serv.
Added value
Semantic query,
hypothesis
and
inference
Mediation Layer
• Ontology - capturing concepts of Parameters,
Instruments, Date/Time, Data Product (and
Semantic mediation layer - VSTO - low level
associated classes, properties) and Service Classes
• Maps queries to underlying data
• Generates access requests for metadata,
data schema,
Metadata,
data
• Allows queries, reasoning, analysis, new hypothesis
generation, testing, explanation, etc.
Added value
DB1
DB2
DB3
…………
Fox - APAC 2007, Driving e-research:
Grids and Semantics
Query,
access and
use of data
DBn
17
Semantic
filtering
by
domain or instrument
hierarchy
Partial exposure of
Instrument
class
hierarchy - users seem
to LIKE THIS
Fox - APAC 2007, Driving e-research:
Grids and Semantics
18
19
Inferred plot type
and return required
axes data
20
Semantic Web Services
21
Semantic Web Services
OWL document returned
using VSTO ontology - can be
used both syntactically or
semantically
Fox - APAC 2007, Driving e-research:
Grids and Semantics
22
Semantic Web Benefits
• Unified/ abstracted query workflow: Parameters, Instruments, Date-Time
• Decreased input requirements for query: in one case reducing the number of
selections from eight to three
• Generates only syntactically correct queries: which was not always insurable in
previous implementations without semantics
• Semantic query support: by using background ontologies and a reasoner, our
application has the opportunity to only expose coherent query (portal and
services)
• Semantic integration: in the past users had to remember (and maintain codes)
to account for numerous different ways to combine and plot the data whereas
now semantic mediation provides the level of sensible data integration
required, and exposed as smart web services
– understanding of coordinate systems, relationships, data synthesis, transformations.
– returns independent variables and related parameters
• A broader range of potential users (PhD scientists, students, professional
research associates and those from outside the fields)
23
http://escience.rpi.edu/schemas/vsto_all.owl
Semantic Web Methodology and
Technology Development Process
Open World:
Evolve, Iterate,
Redesign, Redeploy
Rapid
Prototype
Leverage
Technology
Infrastructure
Adopt
Technology
Approach
Science/Expert
Review & Iteration
Use Tools
Evaluation
Analysis
Use Case
Small Team,
mixed skills
Develop
model/
ontology
25
Developing ontologies
• Use cases and small team (7-8; 2-3 domain/ data experts,
2 knowledge experts, 1 software engineer, 1 facilitator, 1
scribe)
• Identify classes and minimal properties (leverage
controlled vocab.)
–
–
–
–
Start with narrower terms, generalize when needed or possible
Adopt a suitable conceptual decomposition (e.g. SWEET)
Import modules when concepts are orthogonal
Add service classes and properties where needed
• Review, vet, publish
• Only code them (in RDF or OWL) when needed (CMAP, …)
• Ontologies: small and modular
26
Species validation
Tetherless World Constellation
27
Expressivity VSTO 1.0
Tetherless World Constellation
28
Expressivity VSTO dev. version
Tetherless World Constellation
29
Yikes
Tetherless World Constellation
30
Ontologies and the software
• Protégé 2.x and then 3.x built from our
ontology on the web
• Java class generation
• Eclipse as a development environment
• Leveraged a portal code base (from the Earth
System Grid project)
Tetherless World Constellation
31
32
2
33
Implementation choices
• Our big challenge was time – in use cases and in
the representation
– Depending on the level of granularity there were >
200,000 day-time records, and > 70,000,000 sub-day
time intervals – no triple store could handle this**
• We descoped our effort to delay use cases such
as: find all neutral temperature data around the
summer solstice for the last decade
• We chose a minimal time encoding in the
ontology and delegated that to a relational DB
• Reasoning in finite time does not mean 3-4 secs!
Tetherless World Constellation
34
VSTO - semantics and ontologies in an
operational environment: www.vsto.org
Web Service
Fox - APAC 2007, Driving e-research:
Grids and Semantics
35
Implications and OWL 1.0
• Lack of numeric support meant that the the
rules and procedural logic were implemented
in java, i.e. in the code
• On several occasions the tools (not to be
named) pushed us into OWL-Full, introduced
inconsistencies, etc.
• Finally, they stabilized, and in 2005 (and again
in 2006 and twice in 2007) we had stable
releases
Tetherless World Constellation
36
Evaluation
•
Highlights:
– Less clicks to data
– Auto identification and retrieval of independent variables & plotting support
– Faster
– Support for finding instruments (without specifying the id includes finding
data from instruments that the user did not know to ask for)
•
Questions (potentially with 35 responses)
– What do you like about the new searching interface? (9)
– Are you finding the data you need? (35: Yes=34, No=1)
– What is the single biggest difference? (8)
– How do you like to search for data? Browse, type a query, visual? (10,
Browse=7, Type=0, Visual=3)
– What other concepts are you interested in using for search, e.g. time of high
solar activity, campaign, feature, phenomenon, others? (5, all of these)
– Does the interface and services deliver the functionality, speed, flexibility you
require? (30, Yes=30, No=0)
– How often do you use the interface in your normal work? (19, Daily=13,
Monthly=4, Longer=2)
– Are there places where the interface/ services fail to perform as desired? (5,
Yes=1, No=4)
Tetherless World Constellation
37
Iteration
• We need the ability to evolve the ontology and not
break the framework
• As we broaden re-use of these ontologies and creation
of new ones
– We needed visual tools like CMAP Ontology Editor
– We needed the visual tools to work with the editing/
plugin tools – they do not
– We needed to use natural language forms but this ended
up being sparse but that need will increase
– Need tools aimed at software engineers and domain
scientists: three-pronged approach and interoperable:
• OWL in editors (e.g. Protégé, SWOOP, etc.)
• Visual (e.g. CMAP/COE)
• Natural Language (e.g. Rabbit, CL, Peng)
Tetherless World Constellation
38
Maintenance
• Support for collaborative feedback, evolution
• Change management
• Support for ‘comments’ and ‘annotations’, i.e.
self-documentation
• Package management: creation, dependency,
consistency checking
Tetherless World Constellation
39
Semantics between
2004 and 2009
•
•
•
•
•
•
•
•
•
•
•
Ontologies were needed for data integration
and provenance
and mediation for data mining
Protégé 3.x and then 4.0 came out
SWOOP development was interrupted
Cmap added OWL predicate support*
SPARQL became a recommendation
Triple stores exploded in use and capability
Linked Open Data started to take off
Pellet 2.0 came out
We invaded OWLED 2006, 2007, and 2009
Tetherless World Constellation
40
Semantic Web Layers
41
Other projects – ontologies for
faceted search
Tetherless World Constellation
42
For data integration
Tetherless World Constellation
43
Ontology packaging
Tetherless World Constellation
44
Provenance
Tetherless World Constellation
45
Discussion of E versus I
• We had to expand the balance to now include
maintainability (/ evolvability)
• E-M-I briefly
– E.g. modularization has become essential to facilitate
ontology packaging -> need to take advantage of OWL 2
– Separation of class and instances
• Makes visual development possible
• Also facilitates SPARQL end-point approaches
• As tools and applications improve we reconsider
our past choices
– Adding time** back into VSTO and moving to OWL 2
Tetherless World Constellation
46
2010
• Recently funded to take our developments into a
configurable SDF, thus we will push ontology
languages and tools on new ways:
• OWL 2 – RL in particular
– Annotations
– Property chaining
• SPARQL (yawn)
• RIF – probably not for a while
• However, the tools still lag behind – especially for
visual and natural language development
Tetherless World Constellation
47
Modularization
•
•
•
•
•
One of the primary goals of VSTO 2.0 is to modularize the VSTO ontology, e.g., an
instrument module does not require any other classes besides the instrument and
maybe an instrument operating mode to substantiate what an instrument is.
The problem with modularization, however, is that although a subset may
substantiate a concept, that concept, especially in VSTO, has a number of relations
linking it with other concepts within the ontology, for instance the instrument
module may measure a number of parameters in the parameter module, or have a
time coverage that would be defined in the time module.
Each observatory that the VSTO integrates data for will import only the modules
that are appropriate for the observatory's domain.
There are also some modules that will always be required, regardless of the
domain, like the instrument, parameter, and time modules. Each observatory
ontology has its own way of linking these modular concepts, which will be called
link properties.
This presents a problem, as the VSTO portal may not know which link property to
use to associate an instrument with a set of parameters or a time coverage, as it
becomes the responsibility of the ontology for the respective observatory to
define the link properties. Tetherless World Constellation
48
‘Interfaces’ or ‘Extensions’
• This is where the VSTO interface ontology comes in. It doesn't have to be
called the VSTO interface, it could be VSTO link properties, or anything for
that matter.
• The purpose of this ontology is to define a few link properties that will be
required for navigation to data in the VSTO portal. For instance, the
guided workflows as they work now, would require a number of link
properties. E.g. the Start by Instrument Workflow, the VSTO interface
would require an instrument and time coverage link property to get from
step 1 to step 2 in the workflow.
• In the case that an instrument of the CEDAR observatory is selected in
step 1, this link property could be created in a rule-based logic as…
– ( Instrument_1 hasInstrumentOperatingMode IOM_1 ^ IOM_1 hasDataset Dataset_1 ^
Dataset_1 hasTimeCoverage TimeInterval_1 ) => Instrument_1 hasTimeCoverage
TimeInterval_1
• Of course, this would have to be done for all instrument operating modes
and all datasets associated
with those operating modes to determine the
Tetherless World Constellation
full time coverage of an instrument.
49
OWL 2 considerations
•
What's good?:
– new syntactic sugar to simplify ontology
– ability to compare numerics
•
OWL 2 QL Synopsis:
– focused on ontology interoperability with database systems where scalable reasoning
and query answering over large numbers of instances is most important task
•
Why is it a good match?:
– synopsis above, query answering over a large number of time instances will have to be
performed
•
Why isn't it a good match?:
– does not support enumerations, a feature required by some concepts in VSTO
– does not support functional properties, a feature required by some properties in VSTO
– does not support property inclusions involving property chains, a feature we hope to
utilize to define rules for VSTO
– does not support keys, a feature we hope to add when Protege 4.1 released (along with
support for creation of keys)
Tetherless World Constellation
50
OWL 2 considerations
• OWL 2 RL Synopsis:
– focused on ontology interoperability with rule extended DBMSs where
scalable reasoning over large datasets is the most important task
• Likely final choice:
– supports all OWL features currently required by VSTO, including
enumerations and functional properties
– supports property inclusions involving property chains, so potential for
rules can be addressed, namely for reasoning over time intervals
– supports keys
Tetherless World Constellation
51
Back to Semantic Data
Frameworks
• With the substantial adoption of semantics in
science data applications
– There is a need for a higher level of application/
tool infrastructure
– Others are experiencing the same lessons with
ontology and application development
• We are aggregating our efforts into a:
Semantic eScience Framework (SESF)*
– Configurable, i.e. ontology loadable and driven
Tetherless World Constellation
52
Inference vs. Query
• The real power of semantic web in science is
likely to lay in the ability to balance
implementation choices between inference
(RDFS and OWL) and query (even SPARQL)
• It is clear to us that the effect upon
expressivity and maintainability will be an
essential consideration
– Recall the OWL-QL – OWL RL findings
• Also depends on how dynamic the KB is…
Tetherless World Constellation
53
I.e. SDF vs LOD
• Linked open data – RDFS and SPARQL
– http://linkeddata.org
• Emergent ontology versus, well, an engineered
one
– Current chaos due to owls:sameas
– Dynamic content
• One of the present challenges for us is to
accommodate the web of data into emerging
needs for federated search and access as SDFs are
curated..
• And yes, there is RDFS 2.0 to consider
Tetherless World Constellation
54
Summary
• We set out to build a prototype and ended up with a
production semantic data framework
– Language and tools served us well
• Even with modest expressivity we challenged the tools
of the time and made many compromises
• All along the way, we evaluated our ontology
developments and implementations to gauge the
benefits of semantics
• Maintainability, esp. modularization is driving new
expressivity needs
• We continue to need to bridge the computer science
and application communities
Tetherless World Constellation
55
Further Information
• http://tw.rpi.edu/portal/SESF
• Contacts:
– pfox@cs.rpi.edu
Tetherless World Constellation
56
Download