Semantic Enhancement, Data Models and Annotations
B. Smith, T. Malyuta
03/19/2013
Ontologies and Data Models
The Semantic Enhancement (SE) strategy uses ontologies to integrate and semantically enhance data
models; the enhancement is performed by annotating (tagging) the models with the terms of an
ontology [1-3]. It is therefore important to understand the difference between an ontology and a
data model.
We want to emphasize that while our comparison of ontologies and data models assumes the use of the
respective technologies, the comparison is not about the technologies but about the paradigms.
In particular, not every RDF or OWL system is an ontology in the sense we describe here.
Ontologies are in first approximation controlled vocabularies (structured lists of terms) together with
definitions, which specify the meanings (the ‘semantics’) of their terms. These definitions are in some
ways analogous to the definitions provided in a dictionary (ontologies do sometimes play the role of
computational dictionaries, for example in serving educational purposes). But dictionaries are
traditionally focused on the usage of terms within a natural language; hence they need to define all the
meanings associated with a given term by a given linguistic community. Ontologies, in contrast, are
focused on providing a controlled vocabulary for talking about the types of entities and relations within
a given domain – types, such as electron or cell – of the sort described in scientific texts. Hence they
must provide one term, and one definition, for each salient type of entity in each domain of interest.
(Here entities may include not only physical things and physical events, including information artifacts
such as databases and text files, but also immaterial entities, such as the ideas, beliefs and plans in
people’s heads, laws, epidemics, military operations, property rights, national borders, credit default
obligations, and so on.)
Types
Data-model designers, too, use types, but they see types not as entities in reality but rather as
abstractions embodying efficient ways of describing the data about reality that is needed by an
application (efficient both for reasoning and for storage). Figure 1 illustrates on the left an ontology
fragment, and on the right a sample of different possible data models drawing on the same repertoire of
types and using the same labels.
In both ontologies and data models, types are general or repeatable entities capable of being
instantiated by indefinitely many particulars. In ontologies, however, the types and instances are on the
side of reality; in data-models, the types are data abstractions on the side of particular representations
of reality. Thus the ontology term ‘person’, when it is used to annotate data about persons, is designed
to establish a link between these data and persons in reality. The data model term ‘person’, in contrast,
is used to define an efficient storage solution for data about persons needed by a particular application.
[Figure 1: diagram not reproduced. Labels appearing in the figure: Person; Person Name; First Name,
Middle Name, Last Name, Nick Name; Skill; Computer Skill; Network Skill; Programming Skill; Java Skill;
PersonSkill; Skills.]
Figure 1: Example SE ontologies, with their constrained hierarchies (LEFT), and data-models,
in which terms are combined in multiple ways in different data tables (RIGHT).
Organization
Figure 1 shows how, when ontologies are properly constructed, each term used to describe data
appears only once in the ontology hierarchy. The ontology view of reality is synoptic – it represents in
non-redundant fashion an entire hierarchy of types at different levels of generality. Each term is
associated in an intelligible way with its subsuming and subsumed terms (and thus with the ancestor
and descendant types) in the hierarchy of more and less general.
Each data-model, in contrast, is flat and represents arbitrary combinations of types suitable for
providing efficient data processing (in particular, consistency of data and good performance). For
example, suppose that the first data model of Figure 1 has to be changed to represent first and last
names separately, or that the second model of the same figure has to be changed to the model of
Figure 2 to accommodate multiple skills of a person. Such changes require significant effort because of
the relative rigidity of data representation languages and the need to rearrange the physical data store.
Person
PersonID   Name
1          John
2          Mary

Skill
SkillID    Name
91         Java
92         C++

Person-Skill
PersonID   SkillID
1          91
1          92
2          91
Figure 2. A traditional organization of data in a database.
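The organization of Figure 2 can be sketched with Python's built-in sqlite3 module. This is only an illustration: the in-memory database, the column types and the key constraints are assumptions that the figure does not state. Listing each person with his or her skills requires joining all three tables.

```python
import sqlite3

# In-memory database holding the three tables of Figure 2.
# Column types and key constraints are assumptions made for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Skill (SkillID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE "Person-Skill" (
    PersonID INTEGER REFERENCES Person(PersonID),
    SkillID INTEGER REFERENCES Skill(SkillID),
    PRIMARY KEY (PersonID, SkillID)
);
INSERT INTO Person VALUES (1, 'John'), (2, 'Mary');
INSERT INTO Skill VALUES (91, 'Java'), (92, 'C++');
INSERT INTO "Person-Skill" VALUES (1, 91), (1, 92), (2, 91);
""")

# The label Name occurs twice; its meaning comes from the table it belongs to.
rows = conn.execute("""
    SELECT p.Name, s.Name
    FROM Person p
    JOIN "Person-Skill" ps ON ps.PersonID = p.PersonID
    JOIN Skill s ON s.SkillID = ps.SkillID
    ORDER BY p.PersonID, s.SkillID
""").fetchall()

for person_name, skill_name in rows:
    print(person_name, skill_name)
```

Because a person's skills live in a separate linking table, a person can have any number of skills without a change to the Person table.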
Labels
Ontologies and models also differ in how they label their types. Ontologies, as mentioned above, use
nouns and noun phrases from natural language, and each type has a unique name that designates the
type unambiguously regardless of the context in which the type might be used. In databases, the labels
of types are not as important because, unlike ontologies, databases are not directly exposed to users –
they are presented via an application that exposes the database content using the specific vocabulary of
a narrow community of users. Also, the meaning of a data model label is often derived from its
context. For example, Figure 2 illustrates that the type for the name of a person and the type for the
name of a skill have the same label ‘Name’; the meaning of the label is derived from the table in which
it appears: Person and Skill respectively. In Figure 1, for the purpose of comparison, we tried to use
the same labels for the same types in both ontologies and data models, but in reality the labels of
data models can be anything, e.g. ‘PN’, ‘PName’, ‘PersName’, ‘PersonN’, etc. for the person name.
Goals
The main goal of data model construction is facility in the handling of specific data. The main goal of SE
ontology construction is objectivity of representation. If the same ontologies can be used to annotate
many different sorts of data, this promotes shareability of the data across many communities of
interest.
Data models are created in ad hoc ways to capture targeted selections of features that are selected
because they are needed for a specific task. The SE strategy, in contrast, requires that there be exactly
one authoritative ontology for each salient domain of reality. This is because, to create a consistently
cumulating body of annotations, the same ontologies must be reused over and over again in application
to different data resources and in support of different analyst communities. When such reuse is
achieved across heterogeneous and heterogeneously sourced bodies of data, these data can be pooled
as they grow over time.
The ontologies will grow and expand as new knowledge is gained over time. As we shall see below, data
models are, in contrast, inflexible.
Instances
Data stores and ontologies differ in how they perceive and represent instances or individuals (e.g. the
values of columns of a table in a relational database). In some cases, what is a type in an ontology can
be used as an instance in a database. Figure 3 illustrates that a type of practically any level of
generality in an ontology can be used as an instance in a database.
[Figure 3: diagram not reproduced. Labels and values appearing in the figure: Person; PersonSkill;
Person Name, First Name, Last Name; John Smith, Adam Bates, Nick Johns; Skill, ComputerSkill,
ProgrammingSkill, NetworkSkill, WritingSkill; Java, C++, Unix, CISCO.]
Figure 3. Databases with different granularity of data.
Furthermore, the database does not distinguish between separate occurrences of a particular type (an
attribute value, in database terms): Java is the same Java for John and for Nick in the table Person.
The Closed and Open World Assumptions
Each data model is ‘closed’ in the sense that it defines a limited world that we can describe completely
simply by examining the data representations that have been used in its specification. If, for example,
there is no field for gender in a given data model created for the recording of data about persons, then
there is no gender in the world defined by this data model.
Database reasoning is confined to search based on this closed world assumption. If we do not find
something in the database, then this means that this something does not exist in the world that is
defined by the database. If we do not find an assertion to the effect that F holds of a in the database,
then we must assume that F does not hold of a. If this leads to unwelcome consequences then we must
change the database representation.
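The closed world assumption can be made concrete with a small sketch using Python's built-in sqlite3 module. The table PersonSkill, its rows and the helper holds are hypothetical and serve only to illustrate the point: a fact counts as true only if a matching row is found, and a property with no column cannot be asked about at all.

```python
import sqlite3

# Hypothetical table; its contents define the whole "world" of this database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PersonSkill (Person TEXT, Skill TEXT)")
conn.executemany(
    "INSERT INTO PersonSkill VALUES (?, ?)",
    [("John", "Java"), ("John", "C++"), ("Mary", "Java")],
)


def holds(person, skill):
    """Closed-world check: the fact holds only if a matching row exists."""
    row = conn.execute(
        "SELECT 1 FROM PersonSkill WHERE Person = ? AND Skill = ?",
        (person, skill),
    ).fetchone()
    return row is not None


# No row says that Mary knows C++, so under the closed world assumption
# the database answers that she does not.
mary_knows_cpp = holds("Mary", "C++")

# There is no Gender column, so gender does not exist in this world:
# the question cannot even be posed without changing the representation.
try:
    conn.execute("SELECT Gender FROM PersonSkill")
    gender_is_representable = True
except sqlite3.OperationalError:
    gender_is_representable = False

print(mary_knows_cpp, gender_is_representable)
```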
SE ontologies, in contrast, are in the following sense ‘open’: they exploit a logical framework which is
based on the idea that we can never describe entities in the real world completely. This means that,
from the absence in an SE ontology of a particular term ‘A’, we cannot infer that As do not exist. It
means also that ontologies are constructed in a way which allows easy addition of new types and
relations. In the section on organization of ontologies and data models we gave examples of difficulties
of data model changes; analogous changes to ontologies can be done easily. If ontologies are created
within an appropriate governance framework to ensure consistency and non-redundancy of semantics,
then the use of OWL also allows flexible merging of ontologies for specific purposes, and flexible import
of portions of existing ontology terms into other ontologies – a task that is very difficult and often
impossible to perform on data models. Because ontologies are more flexible and easier to expand in
these ways, the SE approach is also more stable than many common approaches based on data models,
which typically involve the building of further data models, to which existing data models then somehow
have to be mapped. It is not necessary to build a new ontology every time your data changes. On the
other hand, we can develop and improve our ontology without any changes to the data. This, too, is a
feature which is exploited when we use ontologies for data integration.
For example, the changes to the data models of Figure 1 discussed above do not require any
disruptive changes to the ontologies: when we discover that we need the first and last names of a
person, we add these as subtypes of the type PersonName, as the figure illustrates.
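The following is a minimal sketch of this kind of extension, assuming a toy representation in which the taxonomy is a child-to-parent dictionary and annotations map (table, column) pairs to ontology terms. The dictionaries and the helper ancestors are hypothetical; a real SE ontology would be expressed in a language such as OWL.

```python
# Hypothetical fragment of the taxonomy: each term maps to its parent term.
parent = {
    "PersonName": None,
    "Skill": None,
    "ComputerSkill": "Skill",
    "ProgrammingSkill": "ComputerSkill",
    "NetworkSkill": "ComputerSkill",
}

# Hypothetical annotations: (table, column) of a source model -> ontology term.
annotations = {
    ("Person", "Name"): "PersonName",
    ("Skill", "Name"): "Skill",
}


def ancestors(term):
    """Return the chain of more general terms above the given term."""
    chain = []
    while parent[term] is not None:
        term = parent[term]
        chain.append(term)
    return chain


annotations_before = dict(annotations)

# Extending the ontology: two new subtypes are added under PersonName.
parent["FirstName"] = "PersonName"
parent["LastName"] = "PersonName"

# Existing annotations, and the data they point to, are left untouched.
print(annotations == annotations_before)
print(ancestors("FirstName"))
```

The extension is two added entries; no stored data and no existing annotation has to be rewritten.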
Dimension of Comparison | Traditional Data-Model | SE Ontologies
Focus | On types defined in a particular data representation | On types in reality
Closeness to reality | Variable, application-specific | Reality is always the prime focus
Conceptualization of the domain (see Fig. 1) | Plain and partial (always at the level of detail needed for a particular implementation) | Hierarchical, simultaneously describing the same domain at different levels of detail
Vocabulary | Application-specific, not intended for sharing | Application-independent, intended to support sharing and distributed reuse
Structures or organization of types | Groupings of types to accommodate data access patterns (e.g. relations in the RDB) | Taxonomies (type hierarchies) always used to describe/classify the domain
Combinability | Distinct data models can rarely be combined; even where combination is possible this will typically require significant manual effort | If the SE methodology is followed when ontologies are constructed, then the results will be combinable automatically
Flexibility | Changes in data-models, for example to incorporate new types of data, normally require significant effort | Changes can normally be effected very easily
However, ontologies are not a substitute for data models. If we need to store data using a particular
data storage solution (currently, the most widely used technologies are relational databases and Cloud
stores), we must do this with the help of some data model (i.e. a storage model). While a data model
can be built using the terms of an ontology, these terms will be organized in the data model in such a
way as to support an efficient and consistent data store, most probably differently from how they are
organized in the ontology (the data models of Figure 1 and Figure 2 illustrate this).
Semantic Enhancement, and Über Ontology or Über Data Model
The characteristics of data models discussed above imply that it will be impossible to build an
Über-model, that is, a model that would represent everything in some complex domain such as biology or
military operations.
Über Data Model
First, any data model will represent only a particular stratum of reality at a particular level of granularity
or detail, and it will not allow the representation of things that the corresponding ontology describes
at lower levels. For example, the second model of Figure 1 allows storing categories of skills such as
Programming Skill, Network Skill, etc. If we want to represent particular programming skills, we may
need to use the model of Figure 2 in order to represent a person’s multiple skills of the same
category, e.g. the Programming Skills Java and SQL.
Second, the organization of a model is defined by particular data usage patterns: it will not contain data
not used by these patterns, as explained above, and it will not allow existing data to be processed
efficiently through other access patterns. Therefore, even if we could create a more comprehensive and
more detailed model, we would not be able to use the data implemented in this model efficiently. For
example, the model of Figure 2 is more comprehensive than the second model of Figure 1; however, the
performance of queries about a person’s name and skills (which require joining the tables) may become
a problem. The more general the model, the less usable it will be, which defeats the purpose of the
Über-model.
Moreover, if there is a need to change an Über-model, e.g. because the reality it represents has changed,
such a change is often very difficult and sometimes practically impossible to implement (data is tightly
coupled with models, and changing data models entails disturbing or rebuilding the associated data, often
in large volumes), so that the model is no longer an Über-model. Additionally, an Über-model requires a
separate data store that will accommodate data from other stores (usually with some loss or
distortion of data or data semantics). Implementation of such a store is a complex and expensive
endeavor that, as the discussion above shows, is doomed to have a limited life. There is a broad
consensus that such a model is impossible to achieve and that it is impractical to attempt one.
Über Ontology
Ontologies are more flexible than data models. For example, we can introduce representations at new
levels of generality by adding branches to an existing taxonomy or we can add new relationships
between existing ontology terms. And often we can do all of this without disturbing the bodies of data
(i.e. representations of instances) associated with the ontology. It is easy to add new branches to an
ontology, or new sub-branches of existing branches.
Nevertheless, there is a broad consensus that building an Über-ontology is impossible.
While the SE strategy rests on the construction of a suite of ontologies, it does not mean that it rests on
the use of some Über-ontology. The SE suite of ontologies is a suite of ontological modules that can
evolve in a cumulative incremental fashion. There are two types of ontologies: reference and
application. Reference ontologies represent shared understanding of domains and are used in creation
of various application-specific ontologies. Thus, the suite of ontologies can be extended indefinitely to
cover any domain and any application-specific perception of the domain. We believe that with the help
of the suggested SE architecture, the methodology of the SE development, and the governance
principles that are described in a number of publications it is possible to build a suite of ontologies that
can be shared by a large community.
The Best of Two Worlds
Relationship Between Ontologies and Data Models
Ontologies and data models are relatively independent bodies of data and semantics. When working on
either one, however, it is beneficial to consider the other. Ontologies and data models can benefit from
each other in the following ways:

Data models provide a detailed application-specific representation of a particular domain and of
how given data is perceived/processed by its users. This domain knowledge will be one of the
domain knowledge sources for ontologies. It will be particularly useful for application-specific
ontologies.

The process of annotating data from different data sources using common ontologies is guided by
the data models used: ontology terms are applied at arm’s length to the terms of the data model,
and the organization of these terms (data structures) usually remains as it is in the data model.

Ontologies provide a comprehensive and formal representation of a domain that can encompass the
model development. In particular, ontologies provide and explain domain terms, relations and
dependencies between the terms, which will help to understand the domain and also will ensure
that different data models interpret the terms in the same way.

We can think of ontology-assisted model building, especially the building of data models in an agile
fashion, that will at the same time support horizontal integration, a capability that is
very important in today’s dynamic environments.

Ontologies provide a shared vocabulary that should be used in HCI applications (not
even necessarily data models as we can annotate them) and in data collection.
However:

An ontology cannot resolve data model problems (of design and/or implementation).

Having an ontology of a domain(s) does not eliminate the need to build data models when
necessary. Moreover, ontologies do not give answers (at least not all answers) as to what the data
model should be. Models (in contrast to ontologies that are objective representations of the
domain) are application-oriented and their content/structure can be different (and usually is) from
those of ontologies. A simple example: we have ontologies of Person, Person Name, Person
Identification, Skill, and Address. We also have relations between the terms of these ontologies. One
group needs to build a database of their employees’ skills, in particular, of the Person ID, Person
Name, Skill Code, and the Skill (only one) that is important to them. Another group wants to build a
database about Persons with their SSN, Name, Address and all known Skills with the Skill Code and
Skill. We have two distinctly different database models (primary keys are marked PK, foreign keys
are marked FK). In the first case we have two tables: Person (PersonID [PK], PersonName, SkillCode [FK])
and Skill (SkillCode [PK], Skill). In the second case we have three tables: Person (SSN [PK], PersonName,
Address), Skill (SkillCode [PK], Skill), and PersonSkill (SSN [PK, FK], SkillCode [PK, FK]). These models
represent the mentioned domains
in different ways, and both these representations would be different from the ontology
representation. While automated model building based on ontologies does not seem
feasible in general, we mentioned above that we can have ontology-assisted (semi-automated) model building.
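The two database models of this example can be sketched as schemas using Python's built-in sqlite3 module. The column types and the exact key constraints are assumptions made for the illustration; the helper table_names is hypothetical and only lists the tables each schema creates.

```python
import sqlite3

# First group: one skill per person, so Person carries the skill code itself.
# Column types and key constraints are assumptions made for this sketch.
FIRST_MODEL = """
CREATE TABLE Skill (SkillCode TEXT PRIMARY KEY, Skill TEXT);
CREATE TABLE Person (
    PersonID INTEGER PRIMARY KEY,
    PersonName TEXT,
    SkillCode TEXT REFERENCES Skill(SkillCode)
);
"""

# Second group: all known skills per person, so a linking table is needed.
SECOND_MODEL = """
CREATE TABLE Person (SSN TEXT PRIMARY KEY, PersonName TEXT, Address TEXT);
CREATE TABLE Skill (SkillCode TEXT PRIMARY KEY, Skill TEXT);
CREATE TABLE PersonSkill (
    SSN TEXT REFERENCES Person(SSN),
    SkillCode TEXT REFERENCES Skill(SkillCode),
    PRIMARY KEY (SSN, SkillCode)
);
"""


def table_names(schema):
    """Create the schema in a fresh in-memory database and list its tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema)
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


first_tables = table_names(FIRST_MODEL)
second_tables = table_names(SECOND_MODEL)
print(first_tables)
print(second_tables)
```

Both schemas draw on the same ontology terms (Person, Person Name, Skill and so on), yet they differ in their tables, keys and identifiers because the two applications differ.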
Leveraging Ontologies and Data Models
One of the important uses of ontologies is data integration (see footnote 1). This is the main goal of the
SE ontologies. The SE strategy provides a lightweight, flexible and inexpensive (as it does not require
rebuilding data stores) solution to the problem of data integration. It applies ontologies to the data
stores at arm’s length: the stores themselves continue to exist as they are and are thus able to serve
the purposes for which they were implemented. In particular, the SE strategy does not change the
organization of data; instead, it annotates the types of the source models or ontologies with the type
labels of the SE ontologies.
For example, if for the sources of Figure 1 we annotate Name by PersonName and Skill by SkillName, the
data from the database will be exposed using these labels while preserving the relationships from the
data model (the structure of queries and therefore the querying infrastructure, e.g. indexes, will remain
unchanged). We can imagine the query
SELECT p.PID, p.Name, s.Skill
FROM Person p, Skill s, "Person-Skill" ps
WHERE p.PID = ps.PID AND s.SkillID = ps.SkillID AND p.PID = 123
being presented as follows, with the SE labels PersonName and SkillName exposed as column aliases:
SELECT p.PID, p.Name AS PersonName, s.Skill AS SkillName
FROM Person p, Skill s, "Person-Skill" ps
WHERE p.PID = ps.PID AND s.SkillID = ps.SkillID AND p.PID = 123
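One way to picture annotation at arm's length is the sketch below, which uses Python's built-in sqlite3 module. The sample rows, the annotations dictionary and the helper select_list are hypothetical; the sketch only shows that SE labels can be applied to the result columns while the tables and join conditions stay as they are in the source.

```python
import sqlite3

# Hypothetical source database; the SE annotations never alter these tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (PID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Skill (SkillID INTEGER PRIMARY KEY, Skill TEXT);
CREATE TABLE "Person-Skill" (PID INTEGER, SkillID INTEGER);
INSERT INTO Person VALUES (123, 'John');
INSERT INTO Skill VALUES (91, 'Java'), (92, 'C++');
INSERT INTO "Person-Skill" VALUES (123, 91), (123, 92);
""")

# Annotations kept outside the database: source column -> SE label.
annotations = {"p.Name": "PersonName", "s.Skill": "SkillName"}


def select_list(columns):
    """Expose annotated columns under their SE labels, others unchanged."""
    parts = []
    for column in columns:
        if column in annotations:
            parts.append(f"{column} AS {annotations[column]}")
        else:
            parts.append(column)
    return ", ".join(parts)


select_clause = select_list(["p.PID", "p.Name", "s.Skill"])
query = (
    f"SELECT {select_clause} "
    'FROM Person p, Skill s, "Person-Skill" ps '
    "WHERE p.PID = ps.PID AND s.SkillID = ps.SkillID AND p.PID = 123 "
    "ORDER BY s.SkillID"
)

cursor = conn.execute(query)
labels = [description[0] for description in cursor.description]
rows = cursor.fetchall()
print(labels)
print(rows)
```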
Footnote 1: We understand integration as the ability to process (including querying and searching)
heterogeneous data as if it were homogeneous and understandable to the users.
Even in the simplest case of the SE, when we use only the SE vocabulary for annotations, these
annotations provide real enhancement and enrichment of the data. Figure 4 illustrates how, through a
single annotation and without any change to the original data store, we attach a whole knowledge
system to a database field. Not only can analysts analyze the data about computer skills vertically
along the Skill hierarchy (Figure 1), they can also analyze it horizontally, for example via relations
between Skill and Education, and beyond. For example, across different sources the analysts can ask for
data about skills, computer skills, or programming skills. They can also ask about a particular
education even if the data source does not contain anything about education.
Thus even though the data in the database does not change, its analysis can become richer and richer
over time, as our understanding of the corresponding domain of reality changes and thereby
brings improvements in the ontologies used for annotation.
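Vertical analysis along the Skill hierarchy can be sketched as follows. The two sources, their rows, the parent dictionary and the helpers is_a and find are all hypothetical; the sketch assumes that each source has one skill column annotated with a single ontology term.

```python
# Hypothetical fragment of the Skill hierarchy: term -> parent term.
parent = {
    "Skill": None,
    "ComputerSkill": "Skill",
    "ProgrammingSkill": "ComputerSkill",
    "NetworkSkill": "ComputerSkill",
}

# Hypothetical sources. Each skill column carries one annotation term;
# the rows themselves are never modified.
sources = {
    "source_a": {
        "term": "ProgrammingSkill",
        "rows": [("John", "Java"), ("Nick", "C++")],
    },
    "source_b": {
        "term": "NetworkSkill",
        "rows": [("Adam", "CISCO")],
    },
}


def is_a(term, ancestor):
    """True if term equals ancestor or lies below it in the hierarchy."""
    while term is not None:
        if term == ancestor:
            return True
        term = parent[term]
    return False


def find(ancestor):
    """Collect rows from every source annotated at or below the given term."""
    return [
        (name, person, value)
        for name, source in sorted(sources.items())
        if is_a(source["term"], ancestor)
        for person, value in source["rows"]
    ]


print(find("ComputerSkill"))
print(find("NetworkSkill"))
```

A question about computer skills reaches both sources even though neither of them uses that label, because the annotations connect each source to the shared hierarchy.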
SkillID    Skill
91         Java
92         C++
Figure 4. Annotation of a data model.
References
1. Smith B. Methodology for Semantic Enhancement of Intelligence Data (white paper).
http://ncor.buffalo.edu/SE/SE_Methodology_6_30_2012.pdf
2. Smith B., Malyuta T., Mandrick W., Fu C., Parent K., Patel M. Horizontal Integration of
Warfighter Intelligence Data: A Shared Semantic Resource for the Intelligence Community,
STIDS Conference, 2012.
http://stids.c4i.gmu.edu/papers/STIDSPapers/STIDS2012_T14_SmithEtAl_HorizontalIntegrationOfWarfighterIntel.pdf
3. Smith B., Malyuta T., Salmen D., Mandrick W., Parent K., Bardhan S., Johnson J. Ontology
for the Intelligence Analyst, Crosstalk: The Journal of Defense Software Engineering, 2012.
4. Salmen D., Malyuta T., Hansen A., Cronen S., Smith B. Integration of Intelligence Data
through Semantic Enhancement, STIDS Conference, 2011.
http://stids.c4i.gmu.edu/STIDS2011/papers/STIDS2011_CR_T1_SalmenEtAl.pdf