Semantic Enhancement, Data Models and Annotations
B. Smith, T. Malyuta
03/19/2013

Ontologies and Data Models

The Semantic Enhancement (SE) strategy is based on the use of ontologies to integrate and semantically enhance data models. The enhancement is performed by annotating (tagging) the models with the terms of an ontology [1-3]. It is therefore important to understand the difference between an ontology and a data model. We want to emphasize that, while our comparison of ontologies and data models assumes the use of the respective technologies, the comparison is not about the technologies but about the paradigms. Not every RDF or OWL system is therefore an ontology in the sense we describe here.

Ontologies are, in first approximation, controlled vocabularies (structured lists of terms) together with definitions, which specify the meanings (the ‘semantics’) of their terms. These definitions are in some ways analogous to the definitions provided in a dictionary (ontologies do sometimes play the role of computational dictionaries, for example in serving educational purposes). But dictionaries are traditionally focused on the usage of terms within a natural language; hence they need to define all the meanings associated with a given term by a given linguistic community. Ontologies, in contrast, are focused on providing a controlled vocabulary for talking about the types of entities and relations within a given domain: types, such as electron or cell, of the sort described in scientific texts. Hence they must provide one term, and one definition, for each salient type of entity in each domain of interest. (Here entities may include not only physical things and physical events, including information artifacts such as databases and text files, but also immaterial entities, such as the ideas, beliefs and plans in people’s heads, laws, epidemics, military operations, property rights, national borders, credit default obligations, and so on.)
Types

Data-model designers, too, use types, but they see types not as entities in reality but rather as abstractions embodying efficient ways of describing the data about reality that is needed by an application (efficient both for reasoning and for storage). Figure 1 illustrates on the left an ontology fragment, and on the right a sample of different possible data models drawing on the same repertoire of types and using the same labels. In both ontologies and data models, types are general or repeatable entities capable of being instantiated by indefinitely many particulars. In ontologies, however, the types and instances are on the side of reality; in data models, the types are data abstractions on the side of particular representations of reality. Thus the ontology term ‘person’, when it is used to annotate data about persons, is designed to establish a link between these data and persons in reality. The data model term ‘person’, in contrast, is used to define an efficient storage solution for data about persons needed by a particular application.

[Figure 1 image; type labels shown include Person, Person Name, First Name, Middle Name, Last Name, Nick Name, Skill, Computer Skill, Network Skill, Programming Skill, Java Skill and PersonSkill.]

Figure 1: Example SE ontologies, with their constrained hierarchies (LEFT), and data models, in which terms are combined in multiple ways in different data tables (RIGHT).

Organization

Figure 1 shows how, when ontologies are properly constructed, each term used to describe data appears only once in the ontology hierarchy. The ontology view of reality is synoptic: it represents in non-redundant fashion an entire hierarchy of types at different levels of generality. Each term is associated in an intelligible way with its subsuming and subsumed terms (and thus with the ancestor and descendant types) in the hierarchy of more and less general types.
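The synoptic organization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the type names are loosely modeled on the Skill branch of Figure 1, and the `PARENT` map and both helper functions are hypothetical names introduced here, not part of the SE methodology.

```python
# Hypothetical fragment loosely modeled on the Skill branch of Figure 1.
# Each type is a dictionary key, so it can appear only once in the hierarchy;
# the value is its single direct parent (None for a root type).
PARENT = {
    "Skill": None,
    "ComputerSkill": "Skill",
    "NetworkSkill": "ComputerSkill",
    "ProgrammingSkill": "ComputerSkill",
    "JavaSkill": "ProgrammingSkill",
}


def ancestors(term):
    """Return the subsuming types of `term`, from nearest to most general."""
    chain = []
    parent = PARENT[term]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def descendants(term):
    """Return every type subsumed by `term`, at any depth, sorted by name."""
    found = set()
    frontier = [term]
    while frontier:
        current = frontier.pop()
        children = [child for child, parent in PARENT.items() if parent == current]
        found.update(children)
        frontier.extend(children)
    return sorted(found)


print(ancestors("JavaSkill"))
print(descendants("ComputerSkill"))
```

Because every term occurs exactly once, its more general and more specific types can be read off the hierarchy without consulting any particular table layout.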
Each data model, in contrast, is flat and represents arbitrary combinations of types suitable for providing efficient data processing (in particular, consistency of data and good performance). Suppose, for example, that the first data model of Figure 1 has to be changed to represent the first and the second names separately, or that the second model of the same figure has to be changed to the model of Figure 2 to accommodate multiple skills of a person. Such changes can be performed only with significant effort, because of the relative rigidity of data representation languages and the need to rearrange the physical data store.

Person
PersonID   Name
1          John
2          Mary

Skill
SkillID    Name
91         Java
92         C++

Person-Skill
PersonID   SkillID
1          91
1          92
2          91

Figure 2. A traditional organization of data in a database.

Labels

Ontologies and models also differ in how they label their types. Ontologies, as mentioned above, use nouns and noun phrases from natural language, and each type has a unique name that designates the type unambiguously regardless of the context in which the type might be used. In databases, the labels of types are not as important because, unlike ontologies, databases are not directly exposed to users; they are presented via an application that exposes the database content using the specific vocabulary of a narrow community of users. Also, the meaning of a data model label is often derived from the context. For example, Figure 2 illustrates that the type for the name of a person and the type for the name of a skill have the same label ‘Name’; the meaning of the label is derived from the table in which it appears, Person or Skill respectively. In Figure 1, for the purpose of comparing ontologies and data models, we tried to use the same labels for the same types in both ontologies and models, but in reality the labels of data models can be anything, e.g.
‘PN’, ‘PName’, ‘PersName’, ‘PersonN’, etc. for the person name.

Goals

The main goal of data model construction is facility in the handling of specific data. The main goal of SE ontology construction is objectivity of representation. If the same ontologies can be used to annotate multiple different sorts of data, this promotes shareability of the data across many communities of interest. Data models are created in ad hoc ways to capture targeted selections of features that are needed for a specific task. The SE strategy, in contrast, requires that there be exactly one authoritative ontology for each salient domain of reality. This is because, to create a consistently cumulating body of annotations, the same ontologies must be reused over and over again in application to different data resources and in support of different analyst communities. When such reuse is achieved across heterogeneous and heterogeneously sourced bodies of data, these data can be pooled as they grow over time. The ontologies will grow and expand as new knowledge is gained over time. As we shall see below, data models are, in contrast, inflexible.

Instances

There is a difference in how instances or individuals (e.g. values of columns of a table in a relational database) are perceived and represented in data stores and in ontologies. In some cases, what is a type in an ontology can be used as an instance in a database. Figure 3 illustrates that a type of practically any level of generality in an ontology can be used as an instance in a database.

[Figure 3 tables; surviving cell values include the person names John Smith, Adam Bates and Nick Johns, and skill values at different levels of generality: ComputerSkill, NetworkSkill, ProgrammingSkill, WritingSkill, CISCO, Unix, Java and C++.]

Figure 3. Databases with different granularity of data.
Furthermore, the database does not distinguish between separate occurrences of a particular type (e.g. of an attribute value in a database): Java is simply Java, whether it is recorded for John or for Nick in the table Person.

The Closed and Open World Assumptions

Each data model is ‘closed’ in the sense that it defines a limited world that we can describe completely simply by examining the data representations that have been used in its specification. If, for example, there is no field for gender in a given data model created for the recording of data about persons, then there is no gender in the world defined by this data model. Database reasoning is confined to search based on this closed world assumption. If we do not find something in the database, then this means that this something does not exist in the world that is defined by the database. If we do not find an assertion to the effect that F holds of a in the database, then we must assume that F does not hold of a. If this leads to unwelcome consequences, then we must change the database representation.

SE ontologies, in contrast, are ‘open’ in the following sense: they exploit a logical framework which is based on the idea that we can never describe entities in the real world completely. This means that, from the absence in an SE ontology of a particular term ‘A’, we cannot infer that As do not exist. It means also that ontologies are constructed in a way which allows easy addition of new types and relations. In the section on the organization of ontologies and data models we gave examples of how difficult it is to change data models; analogous changes to ontologies can be made easily. If ontologies are created within an appropriate governance framework to ensure consistency and non-redundancy of semantics, then the use of OWL also allows flexible merging of ontologies for specific purposes, and flexible import of portions of existing ontologies into other ontologies, a task that is very difficult and often impossible to perform on data models.
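The difference between the two assumptions can be made concrete with a small Python sketch. It is an illustration only: the recorded facts, the explicit negative assertion and the function names are hypothetical, and a three-valued answer (True, False, None for unknown) is just one simple way to model open-world reasoning.

```python
# Hypothetical recorded assertions: (person, skill) pairs known to hold.
KNOWN_TRUE = {("John", "Java"), ("John", "C++"), ("Mary", "Java")}

# Hypothetical explicit negative assertions: pairs known NOT to hold.
KNOWN_FALSE = {("Mary", "Unix")}


def holds_closed_world(person, skill):
    """Closed world: an assertion that is not recorded is taken to be false."""
    return (person, skill) in KNOWN_TRUE


def holds_open_world(person, skill):
    """Open world: True or False only if asserted; otherwise None (unknown)."""
    if (person, skill) in KNOWN_TRUE:
        return True
    if (person, skill) in KNOWN_FALSE:
        return False
    return None


# Nothing is recorded about Mary and C++, so the two readings disagree.
print(holds_closed_world("Mary", "C++"))
print(holds_open_world("Mary", "C++"))
```

Under the closed-world reading the missing record counts as a denial, whereas under the open-world reading it only signals that nothing has been asserted yet.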
Because ontologies are more flexible and easier to expand in these ways, the SE approach is also more stable than many common approaches based on data models, which typically involve the building of further data models, to which existing data models then somehow have to be mapped. It is not necessary to build a new ontology every time the data changes. Conversely, we can develop and improve our ontology without any changes to the data. This, too, is a feature which is exploited when we use ontologies for data integration. For example, the changes to the data models of Figure 1 that were discussed above do not require any disruptive changes to the ontologies: when we discover that we need the first and the last names of a person, we add these as subtypes of the type PersonName, as the figure illustrates.

Dimension of comparison: Focus
Traditional data model: On types defined in a particular data representation
SE ontologies: On types in reality

Dimension of comparison: Closeness to reality
Traditional data model: Variable, application-specific
SE ontologies: Reality is always the prime focus

Dimension of comparison: Conceptualization of the domain (see Fig. 1)
Traditional data model: Plain and partial (always at the level of detail needed for a particular implementation)
SE ontologies: Hierarchical, simultaneously describing the same domain at different levels of detail

Dimension of comparison: Vocabulary
Traditional data model: Application-specific, not intended for sharing
SE ontologies: Application-independent, intended to support sharing and distributed reuse

Dimension of comparison: Structures or organization of types
Traditional data model: Groupings of types to accommodate data access patterns (e.g. relations in the RDB)
SE ontologies: Taxonomies (type hierarchies) always used to describe/classify the domain

Dimension of comparison: Combinability
Traditional data model: Distinct data models can rarely be combined; even where combination is possible this will typically require significant manual effort
SE ontologies: If the SE methodology is followed when ontologies are constructed, then the results will be combinable automatically

Dimension of comparison: Flexibility
Traditional data model: Changes in data models, for example to incorporate new types of data, normally require significant effort
SE ontologies: Changes can normally be effected very easily

However, ontologies are not a substitute for data models: if we need to store data using a particular data storage solution (currently, the most widely used technologies are relational databases and Cloud stores), we must do this with the help of some data model (i.e. a storage model). While a data model can be built using the terms of an ontology, these terms will be organized in the data model in such a way as to support an efficient and consistent data store, and thus most probably differently from how they are organized in the ontology (the data models of Figure 1 and Figure 2 illustrate this).

Semantic Enhancement, and Über Ontology or Über Data Model

The characteristics of data models discussed above imply that it will be impossible to build an Über-model, that is, a model that would represent everything in some complex domain such as biology or military operations.

Über Data Model

First, any data model will represent only a particular stratum of reality at a particular level of granularity or detail, and it will not allow the representation of things described by a corresponding ontology at lower levels. For example, the second model of Figure 1 allows storing categories of skills such as Programming Skill, Network Skill, etc. If we want to represent particular programming skills, we may need to use the model of Figure 2 to be able to represent a person’s multiple skills of the same category, e.g.
Programming Skills Java and SQL. Second, the organization of a model is defined by particular data usage patterns: it will not contain data not used by these patterns (as in the explanation above), and it will not allow existing data to be processed efficiently through other access patterns. Therefore, even if we could create a more comprehensive and more detailed model, we would not be able to use the data implemented in this model efficiently. For example, the model of Figure 2 is more comprehensive than the second model of Figure 1; however, the performance of queries about a person’s name and skills (which will require joining the tables) may become a problem. The more general the model, the less usable it will be, which defeats the purpose of the Über-model.

Moreover, if there is a need to change an Über-model, e.g. because the reality it represents has changed, such a change is often very difficult and sometimes practically impossible to implement (data is tightly coupled with models, and changing data models entails disturbing or rebuilding the associated data, often in big volumes); the model is then no longer an Über-model. Additionally, an Über-model requires a separate data store that will accommodate data from other stores (usually with some loss or distortion of data or data semantics). Implementation of such a store is a complex and expensive endeavor that, as the discussion above shows, is doomed to have a limited life. There is a broad consensus that it is impossible to have such a model, and that it is impractical to try to build one.

Über Ontology

Ontologies are more flexible than data models. For example, we can introduce representations at new levels of generality by adding branches to an existing taxonomy, or we can add new relationships between existing ontology terms. And often we can do all of this without disturbing the bodies of data (i.e. representations of instances) associated with the ontology.
It is easy to add new branches to an ontology, or new sub-branches of existing branches. Nevertheless, there is a broad consensus that building an Über-ontology is impossible. While the SE strategy rests on the construction of a suite of ontologies, this does not mean that it rests on the use of some Über-ontology. The SE suite of ontologies is a suite of ontological modules that can evolve in a cumulative, incremental fashion. There are two types of ontologies: reference and application. Reference ontologies represent a shared understanding of domains and are used in the creation of various application-specific ontologies. Thus, the suite of ontologies can be extended indefinitely to cover any domain and any application-specific perception of the domain. We believe that, with the help of the suggested SE architecture, the methodology of SE development, and the governance principles that are described in a number of publications, it is possible to build a suite of ontologies that can be shared by a large community.

The Best of Two Worlds

Relationship Between Ontologies and Data Models

Ontologies and data models are relatively independent bodies of data and semantics. When working on either one, however, it is beneficial to consider the other. Ontologies and data models can benefit from each other in the following ways:

- Data models provide a detailed application-specific representation of a particular domain and of how given data is perceived and processed by its users. This domain knowledge will be one of the sources of domain knowledge for ontologies. It will be particularly useful for application-specific ontologies.
- The process of annotating data from different data sources using common ontologies is guided by the data models used: ontology terms are applied at arm’s length to the terms of the data model, and the organization of these terms (data structures) usually remains as it is in the data model.
- Ontologies provide a comprehensive and formal representation of a domain that can frame the development of a model. In particular, ontologies provide and explain domain terms, and the relations and dependencies between the terms, which helps in understanding the domain and also ensures that different data models interpret the terms in the same way. We can think of ontology-assisted model building, especially the building of data models in an agile fashion that at the same time supports horizontal integration, a capability that is very important in today’s dynamic environments.
- Ontologies provide a shared vocabulary that should be used in HCI applications (not even necessarily in data models, as we can annotate them) and in data collection.

However:

- An ontology cannot resolve data model problems (of design or implementation).
- Having an ontology of a domain (or domains) does not eliminate the need to build data models when necessary.
- Moreover, ontologies do not give answers (at least not all answers) as to what the data model should be. Models (in contrast to ontologies, which are objective representations of the domain) are application-oriented, and their content and structure can be (and usually are) different from those of ontologies.

A simple example: we have ontologies of Person, Person Name, Person Identification, Skill, and Address. We also have relations between the terms of these ontologies. One group needs to build a database of their employees’ skills, recording in particular the Person ID, Person Name, Skill Code, and the one Skill that is important to them. Another group wants to build a database about Persons with their SSN, Name, Address and all known Skills, each with its Skill Code and Skill. We have two distinctly different database models. In the first case we have two tables: Person (PersonID, PersonName, SkillCode), with primary key PersonID and foreign key SkillCode, and Skill (SkillCode, Skill), with primary key SkillCode.
In the second case we have three tables: Person (SSN, PersonName, Address), with primary key SSN; Skill (SkillCode, Skill), with primary key SkillCode; and PersonSkill (SSN, SkillCode), in which both columns are foreign keys and together form the primary key. These models represent the mentioned domains in different ways, and both these representations would be different from the ontology representation. While automated model building based on ontologies does not seem feasible in general, we mentioned above that we can have ontology-assisted (semi-automated) model building.

Leveraging Ontologies and Data Models

One of the important uses of ontologies is data integration, by which we understand the ability to process (including query and search) heterogeneous data as if it were homogeneous and understandable to the users. This is the main goal of the SE ontologies. The SE strategy provides a lightweight, flexible and inexpensive (as it does not require rebuilding data stores) solution to the problem of data integration. It applies ontologies to the data stores at arm’s length: the stores themselves continue to exist as they are and thus are able to serve the purposes for which they were implemented. In particular, the SE strategy does not change the organization of data; instead, it annotates the types of the source models or ontologies with the type labels of the SE. For example, if for the sources of Figure 1 we annotate Name by PersonName and Skill by SkillName, the data from the database will be exposed using these labels while preserving the relationships from the data model (the structure of queries, and therefore the querying infrastructure, e.g. indexes, will remain unchanged). We can imagine the query

    SELECT p.PID, p.Name, s.Skill
    FROM Person p, Skill s, "Person-Skill" ps
    WHERE p.PID = ps.PID AND s.SkillID = ps.SkillID AND p.PID = 123

being presented as follows, where PersonName and SkillName are the SE labels:

    SELECT p.PID, PersonName, SkillName
    FROM Person p, Skill s, "Person-Skill" ps
    WHERE p.PID = ps.PID AND s.SkillID = ps.SkillID AND p.PID = 123
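The annotation step can be sketched with Python and its standard `sqlite3` module. This is a minimal illustration, not the SE implementation: the `ANNOTATIONS` mapping, the `labelled_select_list` helper and the sample rows are hypothetical, and the sketch simply exposes annotated columns under their SE labels through SQL column aliases while leaving the tables and the join structure of the query untouched.

```python
import sqlite3

# Hypothetical annotations: (table, column) of the source model -> SE label.
ANNOTATIONS = {("Person", "Name"): "PersonName", ("Skill", "Skill"): "SkillName"}

# Columns requested by the query: (table alias, table, column).
SELECTED = [("p", "Person", "PID"), ("p", "Person", "Name"), ("s", "Skill", "Skill")]


def labelled_select_list(selected, annotations):
    """Build a SELECT list that exposes annotated columns under their SE labels."""
    parts = []
    for alias, table, column in selected:
        label = annotations.get((table, column), column)
        parts.append(f'{alias}."{column}" AS "{label}"')
    return ", ".join(parts)


conn = sqlite3.connect(":memory:")
# Hypothetical sample data; the stored tables are never altered by the annotation.
conn.executescript("""
CREATE TABLE Person (PID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Skill (SkillID INTEGER PRIMARY KEY, Skill TEXT);
CREATE TABLE "Person-Skill" (PID INTEGER, SkillID INTEGER);
INSERT INTO Person VALUES (123, 'John'), (124, 'Mary');
INSERT INTO Skill VALUES (91, 'Java'), (92, 'C++');
INSERT INTO "Person-Skill" VALUES (123, 91), (123, 92), (124, 91);
""")

# Same FROM and WHERE clauses as the original query; only the labels differ.
query = (
    f"SELECT {labelled_select_list(SELECTED, ANNOTATIONS)} "
    'FROM Person p, Skill s, "Person-Skill" ps '
    "WHERE p.PID = ps.PID AND s.SkillID = ps.SkillID AND p.PID = ? "
    "ORDER BY s.SkillID"
)
cursor = conn.execute(query, (123,))
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()

print(columns)
print(rows)
```

Keeping the annotations in a separate mapping reflects the arm's-length idea: the labels can be revised or extended without touching the stored data or its indexes.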
Even in the simplest case of the SE, when we use only the SE vocabulary for annotations, these annotations provide real enhancement and enrichment of the data. Figure 4 illustrates how, by means of a single annotation and without any change to the original data store, we place a whole knowledge system on top of a database field. Not only can analysts analyze the data about computer skills vertically along the Skill hierarchy (Figure 1), they can also analyze it horizontally, for example via relations between Skill and Education, and beyond. For example, across different sources the analysts can ask for data about skills, computer skills, or programming skills. They can also ask about a particular education even if the data source does not contain anything about education. Thus, even though the data in the database does not change, its analysis can become richer and richer through time, as our understanding of the corresponding domain of reality changes and thereby brings improvements in the ontologies used for annotation.

Skill
SkillID    Skill
91         Java
92         C++

Figure 4. Annotation of a data model.

References

1. Smith B. Methodology for Semantic Enhancement of Intelligence Data (white paper). http://ncor.buffalo.edu/SE/SE_Methodology_6_30_2012.pdf
2. Smith B., Malyuta T., Mandrick W., Fu C., Parent K., Patel M. Horizontal Integration of Warfighter Intelligence Data: A Shared Semantic Resource for the Intelligence Community. STIDS Conference, 2012. http://stids.c4i.gmu.edu/papers/STIDSPapers/STIDS2012_T14_SmithEtAl_HorizontalIntegrationOfWarfighterIntel.pdf
3. Smith B., Malyuta T., Salmen D., Mandrick W., Parent K., Bardhan S., Johnson J. Ontology for the Intelligence Analyst. Crosstalk: The Journal of Defense Software Engineering, 2012.
4. Salmen D., Malyuta T., Hansen A., Cronen S., Smith B. Integration of Intelligence Data through Semantic Enhancement. STIDS Conference, 2011. http://stids.c4i.gmu.edu/STIDS2011/papers/STIDS2011_CR_T1_SalmenEtAl.pdf