CSE 291 - SDSC Staff Home Pages

Department of Computer Science & Engineering

University of California, San Diego

CSE-291: Ontologies in Data Integration

Spring 2003

Bertram Ludäscher

LUDAESCH@SDSC.EDU

Course Overview

• Introduction to ontologies:

– What are ontologies (and some related “beasts”)?

– How do we represent ontologies?

– What can we do with them/to them?

• Introduction to some specific formalisms

– Logic, Description Logics, OWL, FCA, TMs, ...

• Some guest lectures

• “Class Action”:

– Theoretical studies:

• surveying/comparing/analyzing approaches and concrete ontologies (based on research literature)

– Practical studies:

• applying an ontology/KR tool and methodology to a concrete domain

Acknowledgements and Credits

• National Science Foundation (NSF)

– www.nsf.gov

• GEOsciences Network (NSF)

– www.geongrid.org

• Biomedical Informatics Research Network (NIH)

– www.nbirn.net

• Science Environment for Ecological Knowledge (NSF)

– seek.ecoinformatics.org

• Scientific Data Management Center (DOE)

– sdm.lbl.gov/sdmcenter/

• Last but not least (background and foreground material ;-):

Carole Goble, Nigel Shadbolt [Ontologies and the Grid Tutorial], Robert Stevens, Ian Horrocks [Fact], Alexander Maedche, Steffen Staab [ISWC Tutorial], Stefan Decker, Nicola Guarino, John Sowa, ...

Information Integration, Ontologies, and Scientific Data

• Some “e-Science” / “cyberinfrastructure” projects: applying IT in different scientific domains

• Often: Share, interoperate, mediate, integrate data to...

– ... support scientific data and information management

– ... facilitate knowledge discovery

Data / Information Integration

An Online Shopper’s Information Integration Problem

El Cheapo: “Where can I get the cheapest copy (including shipping cost) of Wittgenstein’s Tractatus Logico-Philosophicus within a week?”

[Figure: information integration via “One-World” mediation over the sources addall.com, amazon.com, barnes&noble.com, half.com, A1books.com]

A Home Buyer’s Information Integration Problem

What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population?

[Figure: information integration via “Multiple-Worlds” mediation over the sources Realtor, Crime Stats, School Rankings, Demographics]

A Geoscientist’s Information Integration Problem

What are the distribution and U/Pb zircon ages of A-type plutons in VA?

How about their 3-D geometry?

How does it relate to host rock structures?

[Figure: information integration via “Complex Multiple-Worlds” mediation over the sources Geologic Map (Virginia), GeoChemical, GeoPhysical (gravity contours), GeoChronologic (Concordia), Foliation Map (structure DB)]

A Neuroscientist’s Information Integration Problem

Biomedical Informatics Research Network http://nbirn.net

What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity?

How about other rodents?

[Figure: information integration via “Complex Multiple-Worlds” mediation over the sources protein localization (NCMIR), sequence info (CaPROT), morphometry (SYNAPSE), neurotransmission (SENSELAB)]

Standard (XML-Based) Mediator Architecture

[Figure: layered mediator architecture]

USER/Client: poses a query $Q(G(S_1, \ldots, S_k))$ against the integrated global (XML) view $G$

MEDIATOR: holds the integrated view definition $G(..) \leftarrow S_1(..) \ldots S_k(..)$ and exchanges (XML) queries & results with the wrappers

Wrappers: one per source $S_1, S_2, \ldots, S_k$, each exporting an (XML) view; wrappers implemented as web services
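The layering above can be sketched in a few lines of Python. This is only a toy illustration, not the architecture's actual implementation: the `Wrapper` and `Mediator` classes, the two book shops, and all field names are hypothetical, and plain dictionaries stand in for XML views.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]

@dataclass
class Wrapper:
    """Exports one source S_i as a uniform view (hypothetical helper)."""
    name: str
    fetch: Callable[[], Iterable[Record]]   # source-specific access
    translate: Callable[[Record], Record]   # restructure into the shared vocabulary

    def view(self) -> List[Record]:
        return [self.translate(r) for r in self.fetch()]

class Mediator:
    """Holds the integrated view definition G(S_1, ..., S_k)."""
    def __init__(self, wrappers: List[Wrapper]):
        self.wrappers = wrappers

    def global_view(self) -> List[Record]:
        # In this sketch G is simply the union of the wrapped source views.
        return [{**r, "source": w.name} for w in self.wrappers for r in w.view()]

    def query(self, predicate: Callable[[Record], bool]) -> List[Record]:
        # A client query Q is evaluated against G, never against the raw sources.
        return [r for r in self.global_view() if predicate(r)]

# Two made-up sources with different native field names.
shop_a = Wrapper(
    "shop_a",
    fetch=lambda: [{"title": "Tractatus", "price_usd": 12.0, "ship_usd": 4.0}],
    translate=lambda r: {"title": r["title"], "total": r["price_usd"] + r["ship_usd"]},
)
shop_b = Wrapper(
    "shop_b",
    fetch=lambda: [{"name": "Tractatus", "cost": 14.5}],
    translate=lambda r: {"title": r["name"], "total": r["cost"]},
)

mediator = Mediator([shop_a, shop_b])
offers = mediator.query(lambda r: r["title"] == "Tractatus")
cheapest = min(offers, key=lambda r: r["total"])
print(cheapest["source"], cheapest["total"])
```

The wrappers hide each source's native structure; the mediator only sees the shared vocabulary (`title`, `total`).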

Some BIRNing Data Integration Questions

Biomedical Informatics Research Network http://nbirn.net

• Data Integration Approaches:

– Let’s just share data, e.g., link everything from a web page!

– ... or better put everything into a relational or XML database

– ... and do remote access using the Grid

– ... or just use Web services!

• Nice try. But:

– “Find the files where the amygdala was segmented.”

– “Which other structures were segmented in the same files?”

– “Did the volume of any of those structures differ much from normal?”

– “What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?”

Information Integration Challenges

[Figure: layers Semantics, Structure, Syntax, System aspects]

=> reconciling $S^4$ heterogeneities

=> “gluing” together multiple data sources

=> bridging information and knowledge gaps computationally

• System aspects: “Grid” Middleware

– distributed data & computing

– Web Services, WSDL/SOAP, …

– sources = functions, files, databases, …

• Syntax & Structure:

XML-Based Mediators

– wrapping, restructuring

– XML queries and views

– sources = XML databases

• Semantics:

Model-Based/Semantic Mediators

– conceptual models and declarative views

– SemanticWeb/KnowledgeGrid stuff: ontologies, description logics (RDF(S), DAML+OIL, OWL ...)

– sources = knowledge bases (DB+CMs+ICs)

Information Integration from a DB Perspective

• Information Integration Problem

Given: data sources $S_1, \ldots, S_k$ (DBMS, web sites, ...) and user questions $Q_1, \ldots, Q_n$ that can be answered using the $S_i$

Find: the answers to $Q_1, \ldots, Q_n$

• The Database Perspective: source = “database”

=> $S_i$ has a schema (relational, XML, OO, ...)

=> $S_i$ can be queried

=> define virtual (or materialized) integrated views $V$ over $S_1, \ldots, S_k$ using database query languages (SQL, XQuery, ...)

=> questions become queries $Q_i$ against $V(S_1, \ldots, S_k)$
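As a small illustration of "questions become queries against V", the sketch below defines a virtual view over two sources and queries it with SQL. It is a toy setup under stated assumptions: both "sources" live in one in-memory SQLite database, and the table names, columns, and rows are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Two hypothetical sources S_1 (realtor) and S_2 (crime statistics).
    CREATE TABLE s1_houses(addr TEXT, price INTEGER, zip TEXT);
    CREATE TABLE s2_crime(zip TEXT, crime_rate REAL);
    INSERT INTO s1_houses VALUES ('1 Elm St', 450000, '92037'),
                                 ('9 Oak Ave', 480000, '92101');
    INSERT INTO s2_crime VALUES ('92037', 1.2), ('92101', 3.4);

    -- Virtual integrated view V(S_1, S_2): nothing is copied or materialized.
    CREATE VIEW v AS
        SELECT h.addr, h.price, c.crime_rate
        FROM s1_houses h JOIN s2_crime c ON h.zip = c.zip;
""")

# The user's question is now an ordinary query Q against V.
rows = con.execute(
    "SELECT addr FROM v WHERE price < 500000 AND crime_rate < 2.0"
).fetchall()
print(rows)
```

In a real mediator the sources would be separate systems behind wrappers; the view definition plays the same role either way.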

What’s the Problem with XML & Complex Multiple-Worlds?

• XML is Syntax

– DTDs talk about element nesting

– XML Schema schemas give you data types

– need anything else? => write comments!

• Domain Semantics is complex:

– implicit assumptions, hidden semantics

=> sources seem unrelated to the non-expert

• Need Structure and Semantics beyond XML trees!

=> employ richer OO models

=> make domain semantics and “glue knowledge” explicit

=> use ontologies to fix terminology and conceptualization

=> avoid ambiguities by using formal semantics

Knowledge Representation:

Relating Theory to the World via Formal Models

Source: John F. Sowa, Knowledge Representation: Logical, Philosophical, and Computational Foundations

“All models are wrong, but some are useful!”

XML-Based vs. Model-Based Mediation

[Figure: side-by-side comparison of the two mediation styles]

XML-based mediation (XML models):

– IntegratedDTD := XML-QL(Src1DTD, ...)

– raw data: XML elements

– structural constraints (DTDs): parent, child, sibling, ..., e.g. A = (B*|C),D

– no domain constraints

Model-based mediation (conceptual models):

– IntegratedCM := CM-QL(Src1CM, ...)

– raw data: (XML) objects

– classes, relations, is-a, has-a, ...

– logical domain constraints (IF ... THEN ...)

– glue maps: DMs, PMs

CM ~ {Descr.Logic, ER, UML, RDF/XML(-Schema), …}

CM-QL ~ {F-Logic, DAML+OIL, …}

What is an ontology and what is it good for?

And the answer is ...

Glossary (wordreference.com)

• ontology noun

1 ( Philosophy ) the branch of metaphysics that deals with the nature of being

2 ( Logic ) the set of entities presupposed by a theory

• taxonomy noun

1 a the branch of biology concerned with the classification of organisms into groups based on similarities of structure, origin, etc.

b the practice of arranging organisms in this way

2 the science or practice of classification [ETYMOLOGY: 19th Century: from French taxonomie, from Greek taxis order + -nomy]

• thesaurus noun

(plural: -ruses, -ri [-raı])

1 a book containing systematized lists of synonyms and related words

2 a dictionary of selected words or topics

3 ( rare ) a treasury [ETYMOLOGY: 18th Century: from Latin, Greek: treasure]

Glossary (wordreference.com)

• concept noun

1 an idea, esp. an abstract idea example: the concepts of biology

2 ( Philosophy ) a general idea or notion that corresponds to some class of entities and that consists of the characteristic or essential features of the class

3 ( Philosophy ) a the conjunction of all the characteristic features of something b a theoretical construct within some theory c a directly intuited object of thought d the meaning of a predicate

4 [ modifier ] (of a product, esp. a car) created as an exercise to demonstrate the technical skills and imagination of the designers, and not intended for mass production or sale [ETYMOLOGY: 16th Century: from Latin conceptum something received or conceived, from concipere to take in, conceive]

• contingent adjective

1 [ when postpositive, often foll by on or upon ] dependent on events, conditions, etc., not yet known; conditional

2 ( Logic ) (of a proposition) true under certain conditions, false under others; not necessary

3 (in systemic grammar) denoting contingency (sense 4)

4 ( Metaphysics ) (of some being) existing only as a matter of fact; not necessarily existing

5 happening by chance or without known cause; accidental

6 that may or may not happen; uncertain

• glossary noun (plural: -ries); an alphabetical list of terms peculiar to a field of knowledge with definitions or explanations. Sometimes called: gloss

[ETYMOLOGY: 14th Century: from Late Latin glossarium; see gloss 2 ]

1st Attempt: Ontologies in CS

• An ontology is ...

– an explicit specification of a conceptualization [Gruber93]

– a shared understanding of some domain of interest [Uschold, Gruninger96]

• Some aspects and parameters:

– a formal specification (reasoning and “execution”)

– ... of a conceptualization of a domain (community)

– ... of some part of the world that is of interest (application)

• Provides:

– A common vocabulary of terms

– Some specification of the meaning of the terms (semantics)

– A shared understanding for people and machines

Ontology as a philosophical discipline

• Ontology as a philosophical discipline, which deals with the nature and the organization of reality:

– Ontology as such is usually contrasted with Epistemology, which deals with the nature and sources of our knowledge [a.k.a. Theory of Knowledge].

– Aristotle defined Ontology as the science of being as such: unlike the special sciences, each of which investigates a class of beings and their determinations, Ontology regards "all the species of being qua being and the attributes which belong to it qua being" (Aristotle, Metaphysics, IV, 1).

• In this sense Ontology tries to answer the question:

What is being?

Some different uses of the word “Ontology”

[Guarino’95]

1. Ontology as a philosophical discipline

2. Ontology as an informal conceptual system

3. Ontology as a formal semantic account

4. Ontology as a specification of a “conceptualization”

5. Ontology as a representation of a conceptual system via a logical theory

5.1 characterized by specific formal properties

5.2 characterized only by its specific purposes

6. Ontology as the vocabulary used by a logical theory

7. Ontology as a (meta-level) specification of a logical theory

http://ontology.ip.rm.cnr.it/Papers/KBKS95.pdf

Ontologies vs Conceptualizations

• Given a logical language L ...

– ... a conceptualization is a set of models of L which describes the admittable (intended) interpretations of its non-logical symbols (the vocabulary )

– ... an ontology is a (possibly incomplete) axiomatization of a conceptualization.

[Figure: within the set of all models M(L) of the logic, the conceptualization C(L) is a subset; an ontology is a theory whose models approximate C(L)]

[Guarino96] http://www-ksl.stanford.edu/KR96/Guarino-What/P003.html
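The distinction can be made concrete with a toy propositional vocabulary, where all models can be enumerated. This is a simplified sketch, not Guarino's first-order setting: the vocabulary (`cat`, `dog`, `animal`), the intended models, and the axioms are all made up for illustration.

```python
from itertools import product

VOCAB = ("cat", "dog", "animal")

def all_models():
    """Every truth assignment over the vocabulary: the set M(L)."""
    for values in product([False, True], repeat=len(VOCAB)):
        yield dict(zip(VOCAB, values))

def intended(m):
    """The conceptualization C(L): the interpretations we actually mean."""
    return ((not m["cat"] or m["animal"])
            and (not m["dog"] or m["animal"])
            and not (m["cat"] and m["dog"]))

# An incomplete ontology: it forgets to say that cats and dogs are disjoint.
ONTOLOGY = [
    lambda m: not m["cat"] or m["animal"],   # cat => animal
    lambda m: not m["dog"] or m["animal"],   # dog => animal
]

def freeze(m):
    return tuple(sorted(m.items()))

conceptualization = {freeze(m) for m in all_models() if intended(m)}
ontology_models = {freeze(m) for m in all_models() if all(ax(m) for ax in ONTOLOGY)}

# Models admitted by the axioms but not intended by the community.
unintended = ontology_models - conceptualization
print(len(conceptualization), len(ontology_models), len(unintended))
```

Every intended model satisfies the axioms, but the axioms also admit one unintended model (something that is both a cat and a dog); adding the disjointness axiom would close that gap.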

Ontologies vs Knowledge Bases

• An ontology is a particular KB, describing facts assumed to be always true by a community of users:

– in virtue of the agreed-upon meaning of the vocabulary used (analytical knowledge):

• black => not white

– ... whose truth does not descend from the meaning of the vocabulary used (non-analytical, common knowledge)

• Rome is the capital of Italy

• An arbitrary KB may describe facts which are contingently true , and relevant to a particular epistemic state:

– Mr Smith’s pathology is either cirrhosis or diabetes

Formal Ontology [Guarino’96]

• Theory of formal distinctions

– among things

– among relations

• Basic tools

– Theory of parthood

• What counts as a part of a given entity? What properties does the part relation have? Are there different kinds of parts?

– Theory of integrity

• What counts as a whole ? In which sense are its parts connected ?

– Theory of identity

• How can an entity change while keeping its identity? What are its essential properties? Under which conditions does an entity lose its identity? Does a change of “point of view” change the identity conditions?

– Theory of dependence

• Can a given entity exist alone, or does it depend on other entities?

Ontology: Definition and Scope [Sowa]

• The subject of ontology is the study of the categories of things that exist or may exist in some domain. The product of such a study, called an ontology , is a catalog of the types of things that are assumed to exist in a domain of interest D from the perspective of a person who uses a language L for the purpose of talking about D .

The types in the ontology represent the predicates , word senses , or concept and relation types of the language L when used to discuss topics in the domain D . An uninterpreted logic, such as predicate calculus, conceptual graphs, or KIF, is ontologically neutral . It imposes no constraints on the subject matter or the way the subject may be characterized. By itself, logic says nothing about anything, but the combination of logic with an ontology provides a language that can express relationships about the entities in the domain of interest.

http://users.bestweb.net/~sowa/ontology/index.htm

Ontology: Definition and Scope [Sowa]

• An informal ontology may be specified by a catalog of types that are either undefined or defined only by statements in a natural language. A formal ontology is specified by a collection of names for concept and relation types organized in a partial ordering by the type-subtype relation. Formal ontologies are further distinguished by the way the subtypes are distinguished from their supertypes: an axiomatized ontology distinguishes subtypes by axioms and definitions stated in a formal language, such as logic or some computer-oriented notation that can be translated to logic; a prototype-based ontology distinguishes subtypes by a comparison with a typical member or prototype for each subtype. Large ontologies often use a mixture of definitional methods: formal axioms and definitions are used for the terms in mathematics, physics, and engineering; and prototypes are used for plants, animals, and common household items.

http://users.bestweb.net/~sowa/ontology/index.htm

Why develop an ontology?

• To make domain assumptions explicit

– Easier to change domain assumptions

– Easier to understand, update, and integrate legacy data

=> data integration

• To separate domain knowledge from operational knowledge

– Re-use domain and operational knowledge separately

• A community reference for applications

• To share a consistent understanding of what information means.

[Carole Goble, Nigel Shadbolt, Ontologies and the Grid Tutorial]

What is being shared?

Metadata

• Data describing the content and meaning of resources and services.

• But everyone must speak the same language…

Terminologies

• Shared and common vocabularies

• For search engines, agents, curators, authors and users

• But everyone must mean the same thing…

Ontologies

• Shared and common understanding of a domain

• Essential for search, exchange and discovery

Ontologies aim at sharing meaning

[Carole Goble, Nigel Shadbolt, Ontologies and the Grid Tutorial]

Origin and History

• Humans require words (or at least symbols) to communicate efficiently. The mapping of words to things is indirect. We do it by creating concepts that refer to things.

• The relation between symbols and things has been described in the form of the meaning triangle :

[Figure: the meaning triangle, relating the symbol “Jaguar” to a concept and to the thing it stands for]

Ogden, C. K. & Richards, I. A. 1923. "The Meaning of Meaning." 8th Ed. New York, Harcourt, Brace & World, Inc. Before them: Frege, Peirce; see [Sowa 2000]

[Carole Goble, Nigel Shadbolt, Ontologies and the Grid Tutorial]

Human and machine communication [Maedche et al., 2002]

[Figure: two human agents (HA1, HA2) exchange a symbol such as “JAGUAR”, e.g. via natural language, each relying on internal models; two machine agents (MA1, MA2) exchange symbols, e.g. via protocols, each relying on formal models. All agents commit to an ontology (ontology description with formal semantics) of a specific domain, e.g. animals, which ties together the corners of the meaning triangle: symbol, concept, things.]

An explicit description of a domain

Concepts

(class, set, type, predicate)

– event, gene, gammaBurst, atrium, molecule, cat

Properties of concepts and relationships between them

(slot)

Taxonomy: generalisation ordering among concepts: isA, partOf, subProcess

Relationship, Role or Attribute: functionOf, hasActivity, location, eats, size

[Figure: example graph over the concepts vermin, rodent, mouse, animal, domestic, cat, dog, cow, with the relationship eats]

[Carole Goble, Nigel Shadbolt, Ontologies and the Grid Tutorial]

Concepts

• Primitive concepts:

– properties are necessary

– Globular protein must have hydrophobic core (but a protein with a hydrophobic core need not be a globular protein)

• Defined concepts:

– properties are necessary + sufficient

– Eukaryotic cells must have a nucleus.

– Every cell that contains a nucleus must be Eukaryotic.

[Robert Stevens]
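The practical difference between the two kinds of concept can be sketched as follows. This is a minimal illustration, assuming individuals are described by plain sets of property names; the names `NECESSARY`, `DEFINITIONS`, and the property strings are hypothetical.

```python
# Primitive concept: the listed properties are necessary only.
NECESSARY = {"GlobularProtein": {"protein", "hydrophobic_core"}}

# Defined concept: the listed properties are necessary and sufficient.
DEFINITIONS = {"EukaryoticCell": {"cell", "nucleus"}}

def violates_primitive(concept, props):
    """An asserted member of a primitive concept must have every necessary property."""
    return not NECESSARY[concept] <= props

def recognized_as(props):
    """Membership can be inferred from properties alone only for defined concepts."""
    return {c for c, needed in DEFINITIONS.items() if needed <= props}

# A cell with a nucleus is recognized as eukaryotic ...
print(recognized_as({"cell", "nucleus", "mitochondria"}))
# ... but a protein with a hydrophobic core is not inferred to be globular.
print(recognized_as({"protein", "hydrophobic_core"}))
# Asserting "globular" for something without the core is inconsistent.
print(violates_primitive("GlobularProtein", {"protein"}))
```

Primitive concepts therefore only support consistency checks on asserted members, while defined concepts also support automatic classification.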

What is a concept?

Different communities have different notions of what a concept means:

– Formal Concept Analysis (see http://www.math.tudresden.de/~ganter/fba.html ): they talk about formal concepts

– Description Logics (see http://dl.kr.org/ ): they talk about concept labels

– ISO-704:2000 – Terminology Work: (see http://www.iso.ch/ )

– Often the classical notion of a frame in AI or a class in OO modeling is seen as equivalent to a concept.

Formal Concept Analysis (FCA)

Formal Concept Analysis

[Sowa, http://users.bestweb.net/~sowa/misc/mathw.htm]

Concept Lattice
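In FCA, a formal context relates objects to attributes, and a formal concept is a pair (extent, intent) in which the extent is exactly the set of objects sharing all attributes of the intent, and vice versa. The brute-force sketch below enumerates all formal concepts of a tiny, made-up context; the objects and attributes are illustrative assumptions, and real FCA tools use far more efficient algorithms.

```python
from itertools import combinations

# Hypothetical formal context: object -> set of attributes it has.
CONTEXT = {
    "cat":   {"animal", "domestic", "eats_mice"},
    "dog":   {"animal", "domestic"},
    "mouse": {"animal", "rodent"},
}
ALL_ATTRIBUTES = set().union(*CONTEXT.values())

def common_attributes(objects):
    """A': the attributes shared by all given objects."""
    shared = set(ALL_ATTRIBUTES)
    for o in objects:
        shared &= CONTEXT[o]
    return shared

def objects_having(attributes):
    """B': the objects that have all given attributes."""
    return {o for o, attrs in CONTEXT.items() if attributes <= attrs}

def formal_concepts():
    """All pairs (A'', A') for object subsets A; these are exactly the formal concepts."""
    concepts = set()
    objs = sorted(CONTEXT)
    for r in range(len(objs) + 1):
        for subset in combinations(objs, r):
            intent = common_attributes(subset)
            extent = objects_having(intent)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts

concepts = formal_concepts()
# Ordering concepts by extent inclusion yields the concept lattice.
for extent, intent in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```

The concept with the largest extent (all objects, intent `{animal}`) is the top of the lattice; the one with the empty extent and all attributes is the bottom.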

An explicit description of a domain

• Constraints or axioms on properties and concepts:

– value: integer

– domain: cat

– cardinality: at most 1

– range: 0 <= X <= 100

– oligonucleotides < 20 base pairs

– cows are larger than dogs

– cats cannot eat only vegetation

– cats and dogs are disjoint

• Values or concrete domains

– integer, strings

– 20, tryptophan-synthetase

[Figure: example graph over the concepts vermin, rodent, mouse, animal, domestic, cat, dog, cow, with the relationship eats]

[Carole Goble, Nigel Shadbolt, Ontologies and the Grid Tutorial]
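Constraints of this kind can be checked mechanically. The sketch below assumes a single hypothetical property `age` carrying the value type, domain, cardinality, and range constraints listed above, plus the cat/dog disjointness axiom; the function and class names are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass(frozen=True)
class PropertyAxioms:
    domain: str           # domain: cat
    value_type: type      # value: integer
    max_cardinality: int  # cardinality: at most 1
    low: int              # range: low <= X <= high
    high: int

# Hypothetical property "age" with the constraints from the slide.
AGE = PropertyAxioms(domain="cat", value_type=int, max_cardinality=1, low=0, high=100)
DISJOINT = [("cat", "dog")]

def violations(concepts: Set[str], values: list, axioms: PropertyAxioms = AGE) -> List[str]:
    """List every axiom broken by an individual with these concepts and property values."""
    problems = []
    for a, b in DISJOINT:
        if a in concepts and b in concepts:
            problems.append(f"{a} and {b} are disjoint")
    if values and axioms.domain not in concepts:
        problems.append(f"domain must be {axioms.domain}")
    if len(values) > axioms.max_cardinality:
        problems.append("too many values")
    for v in values:
        if not isinstance(v, axioms.value_type):
            problems.append(f"{v!r} has the wrong type")
        elif not axioms.low <= v <= axioms.high:
            problems.append(f"{v!r} is out of range")
    return problems

print(violations({"cat"}, [7]))
print(violations({"cat", "dog"}, [7, 130]))
```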

An explicit description of a domain

Individuals or Instances

– sulphur, trpA Gene, felix

Nominals

– Concepts that cannot have instances

– Instances that are used in conceptual definitions

– ItalianDog = Dog bornIn Italy

• Instances

– An ontology = concepts+properties+axioms+values+nominals

– A knowledge base = ontology+instances

[Figure: the example graph over the concepts vermin, rodent, mouse, animal, domestic, cat, dog, cow and the relationship eats, extended with the instances felix, tom, mickey, jerry]

[Carole Goble, Nigel Shadbolt, Ontologies and the Grid Tutorial]
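The equation "knowledge base = ontology + instances" can be illustrated with two small dictionaries. This is a sketch under assumptions: the is-a edges and the concept each instance belongs to are guesses for illustration, since the original graph's edges are not recoverable from the text.

```python
# Ontology part (assumed is-a edges, child -> parent).
IS_A = {
    "mouse": "rodent",
    "rodent": "animal",
    "cat": "animal",
    "dog": "animal",
    "cow": "animal",
}

# Instance part (assumed assignments), which turns the ontology into a KB.
INSTANCE_OF = {"felix": "cat", "tom": "cat", "mickey": "mouse", "jerry": "mouse"}

def ancestors(concept):
    """The concept itself plus everything above it in the is-a hierarchy."""
    chain = [concept]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain

def instances_of(concept):
    """Instances asserted for the concept or for any of its subconcepts."""
    return sorted(i for i, c in INSTANCE_OF.items() if concept in ancestors(c))

print(instances_of("rodent"))
print(instances_of("animal"))
```

The ontology alone answers questions about concepts (is every mouse an animal?); only with the instances added can it answer questions about individuals.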

Light and Heavy expressivity

A matter of rigour and representational expressivity

• Lightweight

– Concepts, atomic types

– Is-a hierarchy

– Relationships between concepts

• Heavyweight

– Metaclasses

– Type constraints on relations

– Cardinality constraints

– Taxonomy of relations

– Reified statements

– Axioms

– Semantic entailments

– Expressiveness

– Inference systems

[Carole Goble, Nigel Shadbolt, Ontologies and the Grid Tutorial]

A semantic continuum [Mike Uschold, Boeing Corp]

From left to right:

– Implicit: shared human consensus

– Informal (explicit): text descriptions, e.g. Pump: “a device for moving a gas or liquid from one place or container to another”

– Formal (for humans): semantics hardwired; used at runtime

– Formal (for machines): semantics processed and used at runtime, e.g. (pump has (superclasses (…))

Further to the right means:

• Less ambiguity

• More likely to have correct functionality

• Better inter-operation (hopefully)

• Less hardwiring

• More robust to change

• More difficult

Some Ontologies (and Friends) in Action

(coming soon to a project near you)

GEON Architecture

[Figure labels: Midatlantic Region, Rocky Mountains]

SMART (Meta)data I: Logical Data Views

Adoption of a standard (meta)data model => wrap data sets into unified virtual views

Source: NADAM Team

(Boyan Brodaric et al.)

SMART Metadata II: Multihierarchical Rock Classification for “Thematic Queries” (GSC), or: Taxonomies are not only for biologists ...

[Figure: rock classification hierarchies along the dimensions Genesis, Fabric, Composition, Texture]

“smart discovery & querying” via multiple, independent concept hierarchies (controlled vocabularies)

• data at different description levels can be found and processed

SMART Metadata III: Source Contextualization & Ontology Refinement

Biomedical Informatics Research Network http://nbirn.net

Focused GEON ontology working meeting last week ... (GEON, SCEC/KR, GSC, ESRI)

EcoCyc

Gene Ontology

http://www.geneontology.org

“a dynamic controlled vocabulary that can be applied to all eukaryotes”

Built by the community for the community.

Three organising principles:

Molecular function, Biological process, Cellular component

Isa and Part of taxonomy – but not good!

~10,000 concepts

Lightweight ontology, Poor semantic rigour. Ok when small and used for annotation. Obstacle when large, evolving and used for mining.

Controlled vocabulary

• AGROVOC: Agricultural Vocabulary

Thesauri

• AAT: Art & Architecture Thesaurus

One Formalism: Description Logics

(aka terminological logics, member of concept languages)

Formalism for Ontologies: Description Logic

• DL definition of “Happy Father”

(Example from Ian Horrocks, U Manchester, UK)
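One way to write such a definition in DL syntax is given below. It is read off the rule form of the same example, so the concept and role names (Man, child, Blue, Green, Rich, Happy) are taken from there, and the exact formula on the original slide may differ:

```latex
\[
\mathit{HappyFather} \equiv \mathit{Man}
  \sqcap \exists \mathit{child}.\mathit{Blue}
  \sqcap \exists \mathit{child}.\mathit{Green}
  \sqcap \forall \mathit{child}.(\mathit{Rich} \sqcup \mathit{Happy})
\]
```

Read: a happy father is a man with at least one blue child and at least one green child, all of whose children are rich or happy.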

Description Logic Statements as Rules

• Another syntax: first-order logic in rule form:

happyFather(X) ← man(X), child(X,C1), child(X,C2), blue(C1), green(C2), not ( child(X,C3), poorunhappyChild(C3) ).

poorunhappyChild(C) ← not rich(C), not happy(C).

• Note:

– the direction “→” is implicit here (*sigh*)

– see, e.g., Clark’s completion in Logic Programming
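To see how the two rules behave on concrete data, they can be evaluated over a small database. The sketch below is a hand translation into Python, not a logic-programming engine; the individuals and facts are invented, and `not` is read as negation as failure over the listed facts.

```python
# Hypothetical database (extensional facts).
MAN = {"bob", "carl"}
CHILD = {("bob", "ann"), ("bob", "ben"), ("carl", "dan"), ("carl", "eve")}
BLUE = {"ann", "dan"}
GREEN = {"ben", "eve"}
RICH = {"ann"}
HAPPY = {"ben", "dan"}

def children(x):
    return {c for (p, c) in CHILD if p == x}

def poor_unhappy_child(c):
    # poorunhappyChild(C) <- not rich(C), not happy(C).
    return c not in RICH and c not in HAPPY

def happy_father(x):
    # happyFather(X) <- man(X), child(X,C1), child(X,C2), blue(C1), green(C2),
    #                   not (child(X,C3), poorunhappyChild(C3)).
    kids = children(x)
    return (x in MAN
            and any(c in BLUE for c in kids)
            and any(c in GREEN for c in kids)
            and not any(poor_unhappy_child(c) for c in kids))

happy_fathers = sorted(x for x in MAN if happy_father(x))
print(happy_fathers)
```

Here carl fails because his child eve is neither rich nor happy, even though he has a blue and a green child.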

Description Logics

• Terminological Knowledge (TBox)

– Concept Definition (naming of concepts):

– Axiom (constraining of concepts):

=> a mediator’s “glue knowledge source”

• Assertional Knowledge (ABox)

– the marked neuron in image 27

=> the concrete instances/individuals of the concepts/classes that your sources export

Querying vs. Reasoning

• Querying:

– given a DB instance $I$ (= logic interpretation), evaluate a query expression $\varphi$ (e.g. SQL, FO formula, Prolog program, ...)

– boolean query: check if $I \models \varphi$ (i.e., if $I$ is a model of $\varphi$)

– (ternary) query: $\{ (X, Y, Z) \mid I \models \varphi(X,Y,Z) \}$

=> check happyFathers in a given database

• Reasoning:

– check if $I \models \varphi$ implies $I \models \psi$ for all databases $I$,

– i.e., if $\varphi \Rightarrow \psi$

– undecidable for FO, F-logic, etc.

– Description Logics are decidable fragments

=> concept subsumption, concept hierarchy, classification

=> semantic tableaux, resolution, specialized algorithms
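The contrast can be shown on a toy propositional language, where the set of interpretations is finite and can be enumerated. This is only a sketch: the atoms and formulas are made up, and the brute-force check works precisely because the language is propositional; for first-order logic no such enumeration exists, and entailment is undecidable in general.

```python
from itertools import product

ATOMS = ("man", "has_child")

def interpretations():
    """All interpretations I over the (finite, propositional) vocabulary."""
    for values in product([False, True], repeat=len(ATOMS)):
        yield dict(zip(ATOMS, values))

def holds(formula, interp):
    """Querying: evaluate one formula in ONE given interpretation (I |= phi)."""
    return formula(interp)

def entails(phi, psi):
    """Reasoning: check that I |= phi implies I |= psi for ALL interpretations I."""
    return all(psi(i) for i in interpretations() if phi(i))

father = lambda i: i["man"] and i["has_child"]
man = lambda i: i["man"]

one_database = {"man": True, "has_child": False}
print(holds(father, one_database))   # a query against one database
print(entails(father, man))          # subsumption: every father is a man
print(entails(man, father))          # the converse does not hold
```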

Formalizing Glue Knowledge: Domain Map for SYNAPSE and NCMIR

Domain Map = labeled graph with concepts ("classes") and roles ("associations")

• additional semantics: expressed as logic rules

Domain Map (DM)

Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines.

Dendritic spines are ion (calcium) regulating components.

Spines have ion binding proteins. Neurotransmission involves ionic activity (release). Ion-binding proteins control ion activity (propagation) in a cell. Ion-regulating components of cells affect ionic activity (release).

Domain Expert Knowledge

DM in Description Logic
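The expert statements above can be encoded as a labeled graph and traversed. The sketch below is one possible encoding, not the actual domain map: the edge list is a hand transcription of the sentences, and the role labels (`has`, `contains`, `isa`, ...) are chosen for illustration.

```python
from collections import deque

# (concept, role, concept) edges transcribed from the expert statements.
DOMAIN_MAP = [
    ("Purkinje cell", "has", "dendrite"),
    ("Pyramidal cell", "has", "dendrite"),
    ("dendrite", "has", "higher-order branch"),
    ("higher-order branch", "contains", "spine"),
    ("spine", "isa", "ion-regulating component"),
    ("spine", "has", "ion-binding protein"),
    ("neurotransmission", "involves", "ionic activity"),
    ("ion-binding protein", "controls", "ionic activity"),
    ("ion-regulating component", "affects", "ionic activity"),
]

def reachable(start):
    """All concepts reachable from `start` by following role edges forward."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for src, _role, dst in DOMAIN_MAP:
            if src == node and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen

# What links Purkinje cells (morphometry data) to neurotransmission data?
shared = reachable("Purkinje cell") & reachable("neurotransmission")
print(shared)
```

The shared concept is the "glue": it lets a mediator relate sources that mention Purkinje cells to sources that mention neurotransmission, although neither source mentions the other's terms.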

Source Contextualization & DM Refinement

In addition to registering (“hanging off”) data relative to existing concepts, a source may also refine the mediator’s domain map...

=> sources can register new concepts at the mediator ...
