Context Interchange: Representing and Reasoning about Data Semantics in Heterogeneous Systems*

Cheng Hian Goh†, Stephane Bressan, Stuart E. Madnick, Michael D. Siegel

Sloan WP# 3928, CISL WP# 96-06
Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA 02142
Email: {chgoh,sbressan,smadnick,msiegel}@mit.edu

December 5, 1996

Abstract

The Context Interchange (COIN) strategy [44; 48] presents a novel perspective for mediated data access in which semantic conflicts among heterogeneous systems are not identified a priori, but are detected and reconciled by a Context Mediator through comparison of contexts associated with any two systems engaged in data exchange. In this paper, we present a formal characterization and reconstruction of this strategy in a COIN framework, based on a deductive object-oriented data model and language called COIN. The COIN framework provides a logical formalism for representing data semantics in distinct contexts. We show that this presents a well-founded basis for reasoning about semantic disparities in heterogeneous systems. In addition, it combines the best features of loose- and tight-coupling approaches in defining an integration strategy that is scalable, extensible and accessible. These latter features are made possible by allowing complexity of the system to be harnessed in small chunks, by enabling sources and receivers to remain loosely-coupled to one another, and by sustaining an infrastructure for data integration.

Keywords: Context, heterogeneous databases, logic and databases, mediators, semantic interoperability.
* This work is supported in part by ARPA and USAF/Rome Laboratory under contract F30602-93-C-0160, the International Financial Services Research Center (IFSRC), and the PROductivity From Information Technology (PROFIT) project at MIT.
† Financial support from the National University of Singapore is gratefully acknowledged.

1 Introduction

The last few years have witnessed an unprecedented growth in the number of information sources (traditional database systems, data feeds, web-sites, and applications providing structured or semi-structured data on requests) and receivers (human users, data warehouses, and applications that make data requests). This is spurred on by a number of factors: for example, ease of access to the Internet, which is emerging as the de facto global information infrastructure, and the rise of new organizational forms (e.g., adhocracies and networked organizations [36]), which mandated new ways of sharing and managing information. The advances in networking and telecommunications technologies have led to increased physical connectivity (the ability to exchange bits and bytes), but not necessarily logical connectivity (the ability to exchange information meaningfully). This problem is traditionally referred to as the need for semantic interoperability [47] among autonomous and heterogeneous systems.

1.1 Context Interchange: Background

The goal of this paper is to describe a novel approach, called Context Interchange (COIN), for achieving semantic interoperability among heterogeneous sources and receivers. The COIN strategy described in this paper drew its inspiration from earlier work reported in [48; 44]. Specifically, we share the basic tenets that

• the detection and reconciliation of semantic conflicts are system services which are provided by a Context Mediator, and should be transparent to a user; and
• the provision of such a mediation service requires only that the user furnish a logical (declarative) specification of how data are interpreted in sources and receivers, and how conflicts, when detected, should be resolved, but not what conflicts exist a priori between any two systems.

These insights are novel because they depart from classical integration strategies which either require users to engage in the detection and reconciliation of conflicts (in the case of loosely-coupled systems; e.g., MRDSM [33], VIP-MDBMS [29]), or insist that conflicts should be identified and reconciled, a priori, by some system administrator, in one or more shared schemas (as in tightly-coupled systems; e.g., Multibase [30], Mermaid [49]).

Unfortunately, as interesting as these ideas may be, they could remain as vague musings in the absence of a formal conceptual foundation. One attempt at identifying a conceptual basis for Context Interchange is the semantic-value model [44], where each data element is augmented with a property-list which defines its context. This model, however, continues to be fraught with ambiguity. For example, it relied on implicit agreement on what the modifiers for different attributes are, as well as what conversion functions are applicable for different kinds of conflicts, and is silent on how different conversion definitions can be associated with distinct contexts. Defining the semantics of data through annotations attached to individual data elements also tends to be cumbersome, and there is no systematic way of promoting the sharing and reusing of the semantic representations. At the same time, the representational formalism remains somewhat detached from the underlying conflict detection algorithm (the subsumption algorithm [48]).
Among other problems, this algorithm requires conflict detection to be done on a pairwise basis (i.e., by comparing the context definitions for two systems at a time), and is non-committal on how a query plan for multiple sources can be constructed based on the sets of pair-wise conflicts. Furthermore, the algorithm limits meta-attributes to only a single level (i.e., property lists cannot be nested), and is not able to take advantage of known constraints for pruning off conflicts which are guaranteed never to occur.

1.2 Summary of Contributions

We have two (intertwined) objectives in this paper. First, we aim to provide a formal foundation for the Context Interchange strategy that will not only rectify the problems described earlier, but also provide for an integration of the underlying representational and reasoning formalisms. Second, the deconstruction and subsequent reconstruction¹ of the Context Interchange approach described in [48; 44] provides us with the opportunity to address the concern for integration strategies that are scalable, extensible and accessible.

Our formal characterization of the Context Interchange strategy takes the form of a COIN framework, based on the COIN data model, which is a customized subset of the deductive object-oriented model called Gulog² [15]. COIN is a "logical" data model in the sense that it uses logic as a formalism for representing knowledge and for expressing operations. The logical features of COIN provide us with a well-founded basis for making inferences about semantic disparities that exist among data in different contexts. In particular, a COIN framework can be translated to a normal program [34] (equivalently, a Datalog^neg program) for which the semantics is well-defined, and where sound computational procedures for query answering exist. Since there is no real distinction between factual statements (i.e., data in sources) and knowledge (i.e., statements encoding data semantics) in this logical framework, both queries on data sources (data-level queries) as well as queries on data semantics (knowledge-level queries) can be processed in an identical manner. As an alternative to the classic deductive framework, we investigate the adoption of an abductive framework [27] for query processing. Interestingly, although abduction and deduction are "mirror-images" of each other [14], the abductive answers, computed using a simple extension to classic SLD-resolution, lead to intensional answers as opposed to the extensional answers that would be obtained via deduction. Intensional answers are useful in our framework for a number of conceptual and practical reasons. In particular, if the query is issued by a "naive" user under the assumption that there are no conflicts whatsoever, the intensional answer obtained can be interpreted as the corresponding mediated query in which database accesses are interleaved with data transformations required for mediating potential conflicts. Finally, by checking the consistency of the abducted answers against known integrity constraints, we show that the abducted answer can be greatly simplified, demonstrating a clear connection to what is traditionally known as semantic query optimization [7].

¹ In this deconstruction, we tease apart different elements of the Context Interchange strategy with the goal of understanding their contributions individually and collectively. The reconstruction examines how the same features (and more) can be accomplished differently within the formalism we have invented.
² Gulog is itself a variant of F-logic [28].

As much as it is a logical data model, COIN is also an "object-oriented" data model because it adopts an "object-centric" view of the world and supports many of the features (e.g., object-identity, type-hierarchy, inheritance, and overriding) commonly associated with object-orientation. The standard use of abstraction, inheritance, as well as structural and behavioral inheritance [28] presents many opportunities for sharing and reuse of semantic encodings. Conversion functions (for transforming the representation of data between contexts) can be modeled as methods attached to types in a natural fashion. Unlike "general purpose" object-oriented formalisms, we make some adjustments to the structure of our model by distinguishing between different kinds of objects which have particular significance for our problem domain. In particular, we introduce the notion of context-objects, described in [37], as reified representations for collections of statements about particular contexts. This allows context knowledge to be defined with a common reference point and is instrumental in providing a structuring mechanism for making inferences across multiple theories which may be mutually inconsistent.

The reconstruction of the Context Interchange strategy allows us to go beyond the classical concern of "non-intrusion", and provides a formulation that is scalable, extensible and accessible [21]. By scalability, we require that the complexity of creating and administering (maintaining) the interoperation services should not increase exponentially with the number of participating sources and receivers. Extensibility refers to the ability to incorporate changes in a graceful manner; in particular, local changes should not have adverse effects on other parts of the larger system. Finally, accessibility refers to how the system is perceived by a user in terms of its ease-of-use and flexibility in supporting different kinds of queries.

The above concerns are addressed in two ways in the reconstructed Context Interchange strategy³.
Provisions for making sources more accessible to users are accomplished by shifting the burden for conflict detection and mediation to the system; by supporting multiple paradigms for data access, with queries formulated directly on sources as well as queries mediated by views; by making knowledge of disparate semantics accessible to users through knowledge-level queries answered with intensional answers; and by providing feedback in the form of mediated queries. Scalability and extensibility are addressed by maintaining the distinction between the representation of data semantics as is known in individual contexts, and the detection of potential conflicts that may arise when data are exchanged between two systems; by the structural arrangement that allows data semantics to be specified with reference to complex object types in the domain model as opposed to annotations tightly-coupled in the individual database schemas; by allowing multiple systems with distinct schemas to bind to the same contexts; by the judicious use of object-oriented features, in particular, inheritance and overriding in the type system present in the domain model; and by sustaining an infrastructure for data integration that combines these features. As a special effort in providing such an infrastructure, we introduce a meta-logical extension of the COIN framework which allows sets of context axioms to be "objectified" and placed in a hierarchy, such that new and more complex contexts can be derived through a hierarchical composition operator [5]. This mechanism, coupled with type inheritance, constitutes a powerful approach for incorporating changes (e.g., the addition of a new system, or changes to the domain model) in a graceful manner.

Finally, we remark that the feasibility and features of this approach have been demonstrated in a prototype implementation which provides mediated access to traditional database systems (e.g., Oracle databases) as well as semi-structured data (e.g., Web-sites)⁴.

1.3 Organization of this Paper

The rest of this paper is organized as follows. Following this introduction, we present a motivational example which is used to highlight selected features of the Context Interchange strategy. We demonstrate in Section 3 the novelty of our approach by comparing it with a variety of other classical and contemporary approaches. Section 4 describes the structure and language of the COIN data model, which forms the basis for the formal characterization of the Context Interchange strategy in a COIN framework. Section 5 introduces the abduction framework and illustrates how abductive inference is used as the basis for obtaining intensional answers, and in particular, mediated answers corresponding to naive queries (i.e., those formulated while assuming that sources are homogeneous). Section 6 describes a meta-logical extension to the COIN framework which allows changes to be incorporated in a graceful manner. Section 7 describes the prototype system which provides integrated access to a variety of heterogeneous sources while adopting the framework which we have described. We conclude in Section 8 with a number of suggestions for further work.

³ For the sake of brevity, further references made to the Context Interchange strategy from this point on refer to this reconstruction unless otherwise specified.
⁴ This prototype is accessible from any WWW-client (e.g., Netscape Browser) and can be demonstrated upon request.

2 Motivational Example

Consider the scenario shown in Figure 1, deliberately kept simple for didactical reasons. Data on "revenue" and "expenses" (respectively) for some companies are available in two autonomously-administered data sources, each comprised of a single relation⁵. Suppose a user is interested in knowing which companies have been "profitable" and their respective revenue: this query can be formulated directly on the (export) schemas of the two sources as follows⁶:

    Q1: SELECT r1.cname, r1.revenue FROM r1, r2
        WHERE r1.cname = r2.cname AND r1.revenue > r2.expenses;

In the absence of any mediation, this query will return the empty answer if it is executed over the extensional data set shown in Figure 1.

[Figure 1: Example scenario. Context c1 (Source 1): all "company financials" ("revenue" inclusive) are reported in the currency shown for each company, using a scale-factor of 1 except for items reported in JPY, where the scale-factor is 1000. Context c2 (Source 2 and the user): all "company financials" are reported in USD, using a scale-factor of 1. Relation r1 (cname, revenue, currency) contains the tuples (IBM, 1 000 000, USD) and (NTT, 1 000 000, JPY); relation r2 has the attributes (cname, expenses).]

The above query, however, does not take into account the fact that data sources are administered independently and have different contexts: i.e., they may embody different assumptions on how information contained therein should be interpreted. To simplify the ensuing discussion, we assume that the data reported in the two sources differ only in the currencies and scale-factors of "company financials" (i.e., financial figures pertaining to the companies, which include revenue and expenses). Specifically, in Source 1, all "company financials" are reported using the currency shown and a scale-factor of 1; the only exception is when they are reported in Japanese Yen (JPY), in which case the scale-factor is 1000. Source 2, on the other hand, reports all "company financials" in USD using a scale-factor of 1. In the light of these remarks, [...] without necessarily being precise at all times; formal definitions of the underlying representational and inference formalisms are postponed to Sections 4, 5, and 6 of this paper.

⁵ Throughout this paper, we make the assumption that the relational data model is adopted to be the canonical data model [47]: i.e., we assume that the database schemas exported by the sources are relational and that queries are formulated using SQL (or some extension thereof). This simplifies the discussion by allowing us to focus on semantic conflicts in disparate systems without being distracted by conflicts over data model constructs. The choice of the relational data model is one of convenience rather than necessity, and is not to be construed as a constraint of the integration strategy being proposed.
⁶ We assume, without loss of generality, that relation names are unique across all data sources. This can always be accomplished via some renaming scheme: say, by prefixing the relation name with the name of the data source (e.g., db1#r1).

2.1 Functional Features of Context Interchange: A User Perspective

Unlike classical and contemporary approaches, the Context Interchange approach provides users with a wide array of options on how and what queries can be asked and the kinds of answers which can be returned. These features work in tandem to allow greater flexibility and effectiveness in gaining access to information present in multiple, heterogeneous systems.

Query Mediation: Automatic Detection and Reconciliation of Conflicts

In a Context Interchange system, the same query (Q1) can be submitted to a specialized mediator [54], called a Context Mediator, which rewrites the query so that data exchange among sites having disparate contexts is interleaved with appropriate data transformations and access to ancillary data sources (when needed). We refer to this transformation as query mediation and the resulting query as the corresponding mediated query. For example, the mediated query MQ1 corresponding to Q1 is given by:

    MQ1: SELECT r1.cname, r1.revenue FROM r1, r2
         WHERE r1.currency = 'USD' AND r1.cname = r2.cname
         AND r1.revenue > r2.expenses
         UNION
         SELECT r1.cname, r1.revenue * 1000 * r3.rate FROM r1, r2, r3
         WHERE r1.currency = 'JPY' AND r1.cname = r2.cname
         AND r3.fromCur = r1.currency AND r3.toCur = 'USD'
         AND r1.revenue * 1000 * r3.rate > r2.expenses
         UNION
         SELECT r1.cname, r1.revenue * r3.rate FROM r1, r2, r3
         WHERE r1.currency <> 'USD' AND r1.currency <> 'JPY'
         AND r3.fromCur = r1.currency AND r3.toCur = 'USD'
         AND r1.cname = r2.cname AND r1.revenue * r3.rate > r2.expenses;

This mediated query considers all potential conflicts between relations r1 and r2 when comparing values of "revenue" and "expenses" as reported in the two different contexts. Moreover, the answers returned may be further transformed so that they conform to the context of the receiver. Thus in our example, the revenue of NTT will be reported as 9 600 000 as opposed to 1 000 000. More specifically, the three-part query shown above can be understood as follows. The first subquery takes care of tuples for which revenue is reported in USD using scale-factor 1; in this case, there is no conflict. The second subquery handles tuples for which revenue is reported in JPY, implying a scale-factor of 1000. Finally, the last subquery considers the case where the currency is neither JPY nor USD, in which case only currency conversion is needed. Conversion among different currencies is aided by the ancillary data source r3 which provides currency conversion rates. This second query, when executed, returns the "correct" answer consisting only of the tuple <'NTT', 9 600 000>.

Support for Views

In the preceding example, the query Q1 is formulated directly on the export schema for the various sources.
While this provides a great deal of flexibility, it also requires users to know what data are present where and be sufficiently familiar with the attributes in different schemas (so as to construct a query). A simple and yet effective solution to these problems is to allow views to be defined on the source schemas and have users formulate queries based on the view instead. For example, we might define a view on relations r1 and r2, given by

    CREATE VIEW v1 (cname, profit) AS
    SELECT r1.cname, r1.revenue - r2.expenses
    FROM r1, r2 WHERE r1.cname = r2.cname;

In which case, query Q1 can be equivalently formulated on the view v1 as

    VQ1: SELECT cname, profit FROM v1 WHERE profit > 0;

While achieving essentially the same functionalities as tightly-coupled systems, notice that view definitions in our case are no longer concerned with semantic heterogeneity and make no attempts at identifying or resolving conflicts. In fact, any query on a view (say, VQ1 on v1) can be trivially rewritten to a query on the source schema (e.g., Q1). This means that query mediation can be undertaken by the Context Mediator as before.

Knowledge-Level versus Data-Level Queries

Instead of inquiring about stored data, it is sometimes useful to be able to query the semantics of data which are implicit in different systems. Consider, for instance, the query based on a superset of SQL⁷:

    Q2: SELECT r1.cname, r1.revenue.scaleFactor IN c1,
               r1.revenue.scaleFactor IN c2 FROM r1
        WHERE r1.revenue.scaleFactor IN c1 <> r1.revenue.scaleFactor IN c2;

Intuitively, this query asks for companies for which scale-factors for reporting "revenue" in r1 (in context c1) differ from that which the user assumes (in context c2). We refer to queries such as Q2 as knowledge-level queries, as opposed to data-level queries which are enquiries on factual data present in data sources. Knowledge-level queries have received little attention in the database literature and certainly have not been addressed by the data integration community. This, in our opinion, is a significant gap since heterogeneity in disparate data sources arises primarily from incompatible assumptions about how data are interpreted. Our ability to integrate access to both data and semantics can be exploited by users to gain insights into differences among particular systems ("Do sources A and B report a piece of data differently? If so, how?"), or by a query optimizer which may want to identify sites with minimal conflicting interpretations (to minimize costs associated with data transformations).

Interestingly, knowledge-level queries can be answered using the exact same inference mechanism for mediating data-level queries. Hence, submitting query Q2 to the Context Mediator will yield the result:

    MQ2: SELECT r1.cname, 1000, 1 FROM r1
         WHERE r1.currency = 'JPY';

which indicates that the answer consists of companies for which the currency attribute has value 'JPY', in which case the scale-factors in context c1 and c2 are 1000 and 1, respectively. If desired, the mediated query MQ2 can be evaluated on the extensional data set to return an answer grounded in actual data elements. Hence, if MQ2 is evaluated on the data set shown in Figure 1, we would obtain the singleton answer <'NTT', 1000, 1>.

⁷ Sciore et al. [43] have described a similar (but not identical) extension of SQL in which context is treated as a "first-class object". We are not concerned with the exact syntax of such a language here; the issue at hand is how we might support the underlying inferences needed to answer such queries.

Extensional versus Intensional Answers

Yet another feature of Context Interchange is that answers to queries can be both intensional and extensional. Extensional answers correspond to fact-sets which one normally expects of a database retrieval. Intensional answers, on the other hand, provide only a characterization of the extensional answers without actually retrieving data from the data sources. In the preceding example, MQ2 can in fact be understood as an intensional answer for Q2, while the tuple obtained by the evaluation of MQ2 constitutes the extensional answer for Q2.

In the COIN framework, intensional answers are grounded in extensional predicates (i.e., names of relations), evaluable predicates (e.g., arithmetic operators or "relational" operators), and external functions which can be directly evaluated through system calls. The intensional answer is thus no different from a query which can normally be evaluated on a conventional query subsystem of a DBMS. Query answering in a Context Interchange system is thus a two-step process: an intensional answer is first returned in response to a user query; this can then be executed on a conventional query subsystem to obtain the extensional answer.

The intermediary intensional answer serves a number of purposes [24]. Conceptually, it constitutes the mediated query corresponding to the user query and can be used to confirm the user's understanding of what the query actually entails. More often than not, the intensional answer can be more informative and easier to comprehend compared to the extensional answer it derives. (For example, the intensional answer MQ2 actually conveys more information than merely returning the single tuple satisfying the query.) From an operational standpoint, the computation of extensional answers is likely to be many orders of magnitude more expensive compared to the evaluation of the corresponding intensional answer. It therefore makes good sense not to continue with query evaluation if the intensional answer satisfies the user. From a practical standpoint, this two-stage process allows us to separate query mediation from query optimization and execution.
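The second step of this two-step process (executing an intensional answer on a conventional query subsystem) can be sketched in Python with the standard sqlite3 module. This is only an illustration under stated assumptions: SQLite stands in for the query subsystem, the mediated queries are typed in by hand rather than produced by a Context Mediator, and the expenses in r2 and the JPY-to-USD rate in r3 are hypothetical values chosen so that the revenue of NTT converts to roughly 9 600 000.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE r1 (cname TEXT, revenue REAL, currency TEXT);
CREATE TABLE r2 (cname TEXT, expenses REAL);
CREATE TABLE r3 (fromCur TEXT, toCur TEXT, rate REAL);
-- r1 follows the running example; r2 and r3 hold assumed values.
INSERT INTO r1 VALUES ('IBM', 1000000, 'USD'), ('NTT', 1000000, 'JPY');
INSERT INTO r2 VALUES ('IBM', 1500000), ('NTT', 5000000);
INSERT INTO r3 VALUES ('JPY', 'USD', 0.0096);
""")

# Naive data-level query: ignores the differing contexts of r1 and r2.
Q1 = """
SELECT r1.cname, r1.revenue FROM r1, r2
WHERE r1.cname = r2.cname AND r1.revenue > r2.expenses
"""

# Intensional answer (mediated query) for Q1, one subquery per case.
MQ1 = """
SELECT r1.cname, r1.revenue FROM r1, r2
WHERE r1.currency = 'USD' AND r1.cname = r2.cname
  AND r1.revenue > r2.expenses
UNION
SELECT r1.cname, r1.revenue * 1000 * r3.rate FROM r1, r2, r3
WHERE r1.currency = 'JPY' AND r1.cname = r2.cname
  AND r3.fromCur = r1.currency AND r3.toCur = 'USD'
  AND r1.revenue * 1000 * r3.rate > r2.expenses
UNION
SELECT r1.cname, r1.revenue * r3.rate FROM r1, r2, r3
WHERE r1.currency <> 'USD' AND r1.currency <> 'JPY'
  AND r3.fromCur = r1.currency AND r3.toCur = 'USD'
  AND r1.cname = r2.cname AND r1.revenue * r3.rate > r2.expenses
"""

# Intensional answer for the knowledge-level query Q2.
MQ2 = "SELECT r1.cname, 1000, 1 FROM r1 WHERE r1.currency = 'JPY'"

naive = conn.execute(Q1).fetchall()        # extensional answer without mediation
mediated = conn.execute(MQ1).fetchall()    # extensional answer for Q1 after mediation
knowledge = conn.execute(MQ2).fetchall()   # extensional answer for Q2

print("Q1 :", naive)
print("MQ1:", mediated)
print("MQ2:", knowledge)
```

With these assumed values, Q1 returns no rows, MQ1 returns a single row for NTT whose converted revenue is approximately 9 600 000 (subject to floating-point rounding), and MQ2 returns the tuple for NTT with scale-factors 1000 and 1.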
As we will illustrate later in this paper, query mediation is driven by logical inferences which do not bond well with the (predominantly cost-based) optimization techniques that have been developed [40; 45]. The advantage of keeping the two tasks apart is thus not merely a conceptual convenience, but allows us to take advantage of mature techniques for query optimization in determining how best a query can be evaluated.

Query Pruning

Finally, observe that consistency checking is performed as an integral activity of the mediation process, allowing intensional answers to be pruned (in some cases, significantly) to arrive at answers which are more comprehensible and more efficient. For example, if Q1 had been modified to include the additional condition "r1.currency = 'JPY'", the intensional answer returned (MQ1) would have only the second SELECT statement (but not the first and the third) since the other two would have been inconsistent with the newly imposed condition. This pruning of the intensional answer, accomplished by taking into consideration integrity constraints (present as part of a query, or those defined on sources) and knowledge of data semantics in distinct systems, constitutes a form of semantic query optimization [7]. Consistency checking however can be an expensive operation and the gains from a more efficient execution must be balanced against the cost of performing the consistency check during query mediation. In our case, however, the benefits are amplified since spurious conflicts that remain undetected could result in an additional conjunctive query involving multiple sources.

2.2 Functional Features of Context Interchange: A System Perspective

It is natural to assume the internal complexity of any system will increase in commensuration with the external functionalities it offers. The Context Interchange system is no exception. We make no claim that our approach is "simple"; however, we submit that this complexity is decomposable and well-founded. Decomposability has obvious benefits from a system engineering perspective, allowing complexity to be harnessed into small chunks, thus making our integration approach more endurable, even when the number of sources and receivers is exponentially large and when changes are rampant. The complexity is said to be well-founded because it is possible to characterize the behavior of the system in an abstract mathematical framework. This allows us to understand the potential (or limits) of the strategy apart from the idiosyncrasies of the implementation, and is useful for providing insights into where and how improvements can be made. We describe below some of the ways in which complexity is decoupled in a Context Interchange system, as well as tangible benefits which result from formalization of the integration strategy.

Representation of 'Meaning' as opposed to Conflicts

As mentioned earlier, a key insight of the Context Interchange strategy is that we can represent the meaning of data in the underlying sources and receivers without identifying and reconciling all potential conflicts which exist between any two systems. Thus, query mediation can be performed dynamically (as when a query is submitted) or it can be used to produce a query plan (the mediated query) that constitutes a locally-defined view. In the latter case, this view is similar to the shared schemas in tightly-coupled systems with one important exception: whenever changes do occur (say, when the semantics of data encoded in some context is changed⁸), the Context Mediator can be triggered to reconstruct the local view automatically. Unlike the case in tightly-coupled systems, this reconstruction requires no manual intervention from the system administrators. In both of the above scenarios, changes in local systems are well-contained and do not mandate human intervention in other parts of the larger system. This represents a significant gain over tightly-coupled systems where maintenance of shared schemas constitutes a major system bottleneck.

⁸ For an account of why this seemingly strange phenomenon may be more common than is widely believed to be, see [52].

Contexts versus Schemas

Unlike the semantic-value model, the COIN data model adopts a very different conceptualization of contexts: instead of a property-list associated with individual data elements, we view context as consisting of a collection of axioms which describes a particular "situation" a source or receiver is in.

[Figure: domain model showing the semantic-types semanticNumber, moneyAmt, currencyType and companyFinancials, the modifiers currency and scaleFactor, and the semantic-relation r2'.]

[...] object for which the object-id is a Skolem function [8] defined on some key values. For each relation r present in a source, there is an isomorphic relation r' defined on the corresponding semantic-objects. Among other features, this structure allows us to encode assumptions of the underlying context independently of the structure imposed on data in underlying sources (i.e., the schemas). For example, the constraint: all "company financials" are reported in US Dollars within context c2 can be described using the clause⁹

    X' : companyFinancials, currency(c2, X') : currencyType |- currency(c2, X')[value(c2) -> 'USD'].

The method currency is said to be a modifier because it modifies how the value of a semantic-object is reported. A detailed presentation of this framework is given later in Section 4.

The dichotomy between schemas and contexts presents a number of opportunities for systematic sharing and reuse of semantic encodings. For example, different sources and receivers in the same context may now bind to the same set of context axioms; and distinct attributes which correlate with one another (e.g., revenue, expenses) may be mapped to instances of the same semantic-type (e.g., companyFinancials).
These circumvent the need to define a new property-list for all attributes in each schema. Unlike the semantic-value model, there is no ambiguity on what modifiers can be introduced as "meta-attributes", since all properties of semantic-types are explicitly defined in the domain model. As pointed out in the previous section, views can be (more simply) defined on the extensional relation independently of the context descriptions. This constitutes yet another benefit of teasing apart the structure of a source and the semantics which are implicit in its context.

(Footnote on the clause syntax used above: The syntax of this language is defined in Section 4 and corresponds to that of Gulog [15].)

Inheritance and Overriding in Semantic-Types

Not surprisingly, the various features frequently associated with "object-orientation" are useful in our representation scheme as well. Semantic-types fall naturally into a generalization hierarchy, which allows us to take advantage of structural and behavioral inheritance in achieving economy of expression. Structural inheritance allows a semantic-type to inherit the declarations defined for its supertypes. For example, the semantic-type companyFinancials inherits from moneyAmt the declarations concerning the existence of the "methods" currency and scaleFactor. Behavioral inheritance allows the definitions of these methods to be inherited as well. Hence, if we had defined earlier that instances of moneyAmt have a scale-factor of 1, all instances of companyFinancials would inherit the same scale-factor, since every instance of companyFinancials is an instance of moneyAmt.

Inheritance need not be monotonic. Non-monotonic inheritance means that the declaration or definition of a method can be overridden in a subtype. Thus, inherited definitions can be viewed as defaults which can always be changed to reflect the specificities at hand.

Value Conversion Among Different Contexts

Yet another benefit of adopting an object-oriented model in our framework is that it allows conversion functions on values to be explicitly defined in the form of methods defined on various semantic types. For example, the conversion function for converting an instance of moneyAmt from one scale-factor (F in context C) to another (F1 in context c1) can be defined as follows:

    X' : moneyAmt ⊢
        X'[cvt(c1)@scaleFactor, C, U → V] ←
            X'[scaleFactor(c1) → _[value(c1) → F1]],
            X'[scaleFactor(C) → _[value(c1) → F]],
            V = U * F / F1.

This conversion function, unless explicitly overridden, will be invoked whenever there is a request for scale-factor conversion on an object which is an instance of moneyAmt and when the conversion is to be performed with reference to context c1. Overriding can take place along the generalization hierarchy: as before, we may introduce a different conversion function for a subtype of moneyAmt. Notice that this conversion function is defined with reference to context c1 only: in order for scale-factor conversion to take place in a different context, the conversion function (which could be identical to the one in c1 or not) will have to be defined explicitly. This allows different conversion functions to be associated with different contexts, and is a powerful mechanism for different users to introduce their own interpretations of disparate data in a localized fashion. The apparent redundancy (in having multiple instances of the same definition in different contexts) is addressed through the adoption of a context hierarchy, which is described next.

Hierarchical Composition of Contexts

By "objectifying" sets of axioms associated with contexts, we can introduce a hierarchical relationship among contexts. If c is a subcontext of c', then all the axioms defined in c' are said to apply in c unless they are "overridden". An immediate application of this concept is to make all "functional" contexts subcontexts of a default context c0, which contains the default declarations and method definitions. Under this scheme, a newly introduced context need only identify how it differs from the default context and introduce the declarations and method definitions which need to be changed (overridden). This is formulated as a meta-logical extension of the COIN framework and is described in further detail in Section 6.

Another advantage of having this hierarchy of contexts is the ability to introduce changes to the domain model in an incremental fashion without having adverse effects on existing systems. For example, suppose we need to add a new source for which currency units take on a different representation (e.g., 'Japanese Yen' versus 'JPY'). This distinction has not been previously captured in our domain model, which has hitherto assumed that currency units have a homogeneous representation. To accommodate the new data source, it is necessary to add a new modifier for currencyType, say format, in the domain model:

    currencyType[format(ctx) ⇒ semanticString].

Rather than making changes to all existing contexts, we can assign a default value to this modifier in c0 and, at the same time, introduce a conversion function for mapping between currency representations of different formats (e.g., 'Japanese Yen' and 'JPY'):

    X : currencyType, format(c0, X) : semanticString ⊢ format(c0, X)[value(c0) → 'abbreviated'].
    X : currencyType ⊢ X[cvt(c0)@C, U → V] ← ... (body) ...

The last step in this process is to add to the new context (c') the following context axiom:

    X : currencyType, format(c', X) : semanticString ⊢ format(c', X)[value(c') → 'full'].

which distinguishes it as having a different format.

Context Interchange vis-a-vis Traditional and Contemporary Integration Approaches

In the preceding section, we have made detailed comments on the many features that the Context Interchange approach has over traditional loose- and tight-coupling approaches. In summary, although tightly-coupled systems may provide better support for data access to heterogeneous systems (compared to loosely-coupled systems), they do not scale up effectively given the complexity involved in constructing a shared schema for a large number of systems, and are generally unresponsive to changes for the same reason. Loosely-coupled systems, on the other hand, require little central administration but are equally non-viable since they require users to have intimate knowledge of the data sources being accessed; this assumption is generally untenable when the number of systems involved is large and when changes are frequent. The Context Interchange approach provides a novel middle ground between the two: it allows queries to sources to be mediated in a transparent manner, provides systematic support for elucidating the semantics of data in disparate sources and receivers, and at the same time does not succumb to the complexities inherent in the maintenance of shared schemas.

At a cursory level, the Context Interchange approach may appear similar to many contemporary integration approaches.
Examples of these commonalities include:

• framing the problem in McCarthy's theory of contexts [37] (as in Carnot [11], and more recently, [18]);
• encapsulation [3] of semantic knowledge in a hierarchy of rich data types which are refined via sub-typing (as in several object-oriented multidatabase systems, the archetype of which is Pegasus [2]);
• adoption of a deductive or object-oriented formalism [28; 15] (as in the ECRC Multidatabase System [26] and DISCO [50]);
• provision of value-added services through the use of mediators [54] (as in TSIMMIS [20]).

We posit that despite these superficial similarities, our approach represents a radical departure from these contemporary integration strategies.

(Footnote on loose versus tight coupling: We have drawn a sharp distinction between the two here to provide a contrast of their relative features. In practice, one is most likely to encounter a hybrid of the two strategies. It should however be noted that the two strategies are incongruent in their outlook and are not able to easily take advantage of each other's resources. For instance, data semantics encapsulated in a shared schema cannot be easily extracted by a user to assist in formulating a query which seeks to reference the source schemas directly.)

To begin with, a number of contemporary integration approaches are in fact attempts aimed at rejuvenating the loose- or tight-coupling approach. These are often characterized by the adoption of an object-oriented formalism which provides support for more effective data transformation (e.g., O*SQL [32]) or which mitigates the effects of complexity in schema creation and change management through the use of abstraction and encapsulation mechanisms. To some extent, contemporary approaches such as Pegasus [2], the ECRC Multidatabase Project [26], and DISCO [50] can be seen as examples of the latter strategy. These differ from the Context Interchange strategy since they continue to rely on human intervention in reconciling conflicts a priori and in the maintenance of shared schemas. Yet another difference is that although a deductive object-oriented formalism is also used in the Context Interchange approach, "semantic-objects" in our case exist only conceptually and are never actually materialized. One implication of this is that mediated queries obtained from the Context Mediator can be further optimized using traditional query optimizers, or be executed by the query subsystem of classical (relational) query subsystems without changes.

In the Carnot system [11], semantic interoperability is accomplished by writing articulation axioms which translate "statements" which are true in individual sources to statements which are meaningful in the Cyc knowledge base [31]. A similar approach is adopted in [18], where it is suggested that domain-specific ontologies [22], which may provide additional leverage by allowing the ontologies to be shared and reused, can be used in place of Cyc. While we like the explicit treatment of contexts in these efforts and share their concern for sustaining an infrastructure for data integration, our realization of these differs significantly. First, lifting axioms [23] in our case operate at a finer level of granularity: rather than writing axioms which map "statements" present in a data source to a common knowledge base, they are used for describing "properties" of individual "data objects". Second, instead of having an "ontology" which captures all structural relationships among data objects (much like a "global schema"), we have a domain model which is a much less elaborate collection of complex semantic-types. These differences account largely for the scalability and extensibility of our approach.

Finally, we remark that the TSIMMIS [41; 42] approach stems from the premise that information integration could not, and should not, be fully automated. With this in mind, TSIMMIS opted in favor of providing both a framework and a collection of tools to assist humans in their information processing and integration activities. This motivated the invention of a "light-weight" object model which is intended to be self-describing. For practical purposes, this translates to the strategy of making sure that attribute labels are as descriptive as possible, and opting for free-text descriptions ("man-pages") which provide elaborations on the semantics of information encapsulated in each object. We concur that this approach may be effective when the data sources are ill-structured and when consensus on a shared vocabulary cannot be achieved. However, there are also many other situations (e.g., where data sources are relatively well-structured and where some consensus can be reached) where human intervention is not appropriate or necessary: this distinction is primarily responsible for the different approaches taken in TSIMMIS and our strategy.

4 The COIN Data Model: Structure, Language, and Framework

In [37], McCarthy pointed out that statements about the world are never always true or false: the truth or falsity of a statement can only be understood with reference to a given context. This is formalized using assertions of the form:

    c̄ : ist(c, σ)

which suggests that the statement σ is true ("ist") in the context c, this statement itself being asserted in an outer context c̄. Lifting axioms are used to describe the relationship between statements in different contexts. These statements are of the form

    c̄ : ist(c, σ) ⇔ ist(c', σ')

which suggests that "σ in c states the same thing as σ' in c'".

(Footnote on contexts: In the words of Guha [23], contexts represent "the reification of the context dependencies of the sentences associated with the context." They are said to be "rich-objects" in that "they cannot be defined or completely described" [38]. Consider, for instance, the context associated with the statement: "There are four billion people living on Earth". To fully qualify the sentence, we might add that it assumes that the time is 1991. However, this is certainly not the only relevant assumption in the underlying context, since there are implicit assumptions about who is considered a "live person" (are fetuses in the womb alive?), or what it means to be "on earth" (does it include people who are in orbit around the earth?).)

(Footnote on lifting axioms: These are also called articulation axioms in Cyc/Carnot [11].)

McCarthy's notions of "contexts" and "lifting axioms" provide a useful framework for modeling statements in heterogeneous databases which are seemingly in conflict with one another. From this perspective, factual statements present in a data source are no longer "universal" facts about the world, but are true relative to the context associated with the source and not necessarily so in a different context. Thus, if we assign the labels c1 and c2 to contexts associated with sources 1 and 2 in Figure 1, we may now write:

    c̄ : ist(c1, r1('NTT', 1 000 000, 'JPY')).
    c̄ : ist(c2, r2('NTT', 5 000 000)).

The context c̄ above refers to the ubiquitous context in which our discourse is conducted (i.e., the integration context) and may be omitted in the ensuing discussion whenever there is no ambiguity.

In the Context Interchange approach, the semantics of data are captured explicitly in a collection of statements asserted in the context associated with each source, while allowing conflict detection and reconciliation to be deferred to the time when a query is submitted. Building on the ideas developed in [48; 44], we would like to be able to represent the semantics of data at the level of individual data elements (as opposed to the predicate or sentential level), which allows us to identify and deal with conflicts at a finer level of granularity. Unfortunately, individual data elements may be present in a relation without a unique denotation. For instance, the value 1 000 000 in relation r1 (as shown in Figure 1) simultaneously describes the revenue of IBM and NTT while being reported in different currencies and scale-factors. Thus, the statements

    ist(c1, currency(R, Y) ← r1(N, R, Y)).
    ist(c1, scaleFactor(R, 1000) ← currency(R, Y), Y = 'JPY').
    ist(c1, scaleFactor(R, 1) ← currency(R, Y), Y ≠ 'JPY').

intending to represent the currencies and scale-factors of revenue amounts will result in multiple inconsistent values. To circumvent this problem, we introduce semantic-objects, which can be referenced unambiguously through their object-ids. Semantic-objects are complex terms constructed from the corresponding data values (also called primitive-objects) and are used as a basis for inferring about conflicts, but are never materialized in an object-store. This is described in further detail in the next section.

The data model underlying our integration approach, called COIN (for COntext INterchange), consists of both a structural component describing how data objects are organized, and a language which provides the basis for making formal assertions and inferences about a universe of discourse. In the remainder of this section, we provide a description of both of these components, followed by a formal characterization of the Context Interchange strategy in the form of a COIN framework. The latter will be illustrated with reference to the motivational example introduced in Section 2.

4.1 The Structural Elements of COIN

The COIN data model is a deductive object-oriented data model designed to provide explicit support for Context Interchange.
Consistent with object-orientation [3], information units are modeled as objects, having unique and immutable object-ids (oids), and corresponding to types in a generalization hierarchy with provision for non-monotonic inheritance. We distinguish between two kinds of data objects in COIN: primitive-objects, which are instances of primitive-types, and semantic-objects, which are instances of semantic-types. Objects in COIN have both an oid and a value: these are identical in the case of primitive-objects, but different for semantic-objects. This is an important distinction which will become apparent shortly.

Primitive-types correspond to data types (e.g., strings, integers, and reals) which are native to sources and receivers. Semantic-types, on the other hand, are complex types introduced to support the underlying integration strategy. Specifically, semantic-objects may have properties, which are either attributes or modifiers. Attributes represent structural properties of the semantic-object under investigation: for instance, an object of the semantic-type companyFinancials must, by definition, describe some company; we capture this structural dependency by defining the attribute company for the semantic-type companyFinancials. Modifiers, on the other hand, are used as the basis for capturing "orthogonal" sources of variations concerning how the value of a semantic-object may be interpreted. Consider the semantic-type moneyAmt: the modifiers currency and scaleFactor defined for moneyAmt suggest two sources of variations in how the value corresponding to an instance of moneyAmt may be interpreted. "Orthogonality" here refers to the fact that the value which can be assigned to one modifier is independent of other modifiers, as is the case with scaleFactor and currency. This is not a limitation on the expressiveness of the model, since two sources of variations which are correlated can always be modeled as a single modifier.
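The distinction between oid and value can be pictured with a small Python sketch. It is purely illustrative: COIN never materializes semantic-objects, and the class names, the `f_r1#...` Skolem function names, and the dictionary layout are assumptions made for this example. The two values are the ones the paper reports for the revenue of NTT in contexts c1 and c2.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SkolemOid:
    """Oid of a semantic-object: a Skolem function applied to key values."""
    function: str   # illustrative name, e.g. "f_r1#revenue"
    key: tuple      # key values of the tuple the object was elevated from


@dataclass
class SemanticObject:
    oid: SkolemOid
    semantic_type: str
    attributes: dict = field(default_factory=dict)  # structural, context-free
    values: dict = field(default_factory=dict)      # context label -> value

    def value(self, context):
        """The value of a semantic-object depends on the context asked for."""
        return self.values[context]


def primitive_value(constant, context):
    """A primitive-object's value is its oid, whatever the context."""
    return constant


# Revenue of NTT: one immutable oid, different values in c1 and c2.
ntt_revenue = SemanticObject(
    oid=SkolemOid("f_r1#revenue", ("NTT",)),
    semantic_type="companyFinancials",
    attributes={"company": SkolemOid("f_r1#cname", ("NTT",))},
    values={"c1": 1_000_000, "c2": 9_600_000},
)
```

Here the attribute `company` is stored once, independently of context, whereas the value is looked up per context; in the actual framework the c2 value would be derived through conversion functions rather than stored.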
As we shall see later, this simplification allows greater flexibility in dealing with conversions of values across different contexts.

Unlike primitive-objects, the value of a semantic-object may be different in different contexts. For example, if the (Skolem) term sk0 is the oid for the object representing the revenue of NTT, it is perfectly legitimate for both

    (1) ist(c1, value(sk0, 1 000 000)); and
    (2) ist(c2, value(sk0, 9 600 000)),

to be true, since contexts c1 and c2 embody different assumptions on what currencies and scale-factors are used to report the value of a revenue amount. For our problem domain, it is often the case that the value of a semantic-object is known in some context, but not others. This is the case in the example above, where (1) is known, but not (2). The derivation of (2) is aided by a special lifting axiom defined below.

(Footnote on notation: A predicate-calculus language is used in the discussion here since it provides better intuition for most readers. The COIN language, for which properties are modeled as "methods" (allowing us to write sk0[value → 1 000 000] as opposed to value(sk0, 1 000 000)), will be formally defined in Section 4.2.)

Definition 1 Let t be an oid-term corresponding to a semantic-object of the semantic-type τ, and suppose the value of t is given in context cs. For any context represented by C, we have

    ist(C, value(t, X) ← fcvt(t, cs, X') = X) ⇐ ist(cs, value(t, X')).

We refer to fcvt as the conversion function for t in context C, and say that X is the value of t in context C, and that it is derived from context cs. □

As we shall see later, the conversion function referenced above is polymorphically defined, being dependent on the type of the object to which it is applied, and may be different in distinct contexts.

Since modifiers of a semantic-type are orthogonal by definition, the conversion function referenced in the preceding definition can in fact be composed from other simpler conversion methods defined with reference to each modifier. To distinguish between the two, we refer to the first as a composite conversion function, and the latter as atomic conversion functions. Suppose the modifiers of a semantic-type τ are m1, ..., mk, and fcvt is a composite conversion function for τ. It follows that if t is an object of type τ, then

    fcvt(t, cs, X') = X  if  ∃ X1, ..., Xk-1 such that
        (fcvt^m1(t, cs, X') = X1) ∧ ... ∧ (fcvt^mk(t, cs, Xk-1) = X)

where fcvt^mj corresponds to the atomic conversion function with respect to modifier mj. Notice that the order in which the conversions are eventually effected need not correspond to the ordering of the atomic conversions imposed here, since the actual conversions are carried out in a lazy fashion and depend on the propagation of variable bindings.

Finally, we note that value-based comparisons in the relational model require some adjustments here. We say that two semantic-objects are distinct if their oids are different. However, distinct semantic-objects may be semantically-equivalent as defined below.

Definition 2 Let ⊕ be a relational operator from the set {=, <, >, ≤, ≥, ≠, ...}. If t and t' are oid-terms corresponding to semantic-objects, then

    (t ⊕ t') ⇔ (value(t, X) ∧ value(t', X') ∧ X ⊕ X')

In particular, we say that t and t' are semantically-equivalent in context c if ist(c, t = t'). □

We sometimes abuse the notation slightly by allowing primitive-objects to participate in semantic-comparisons. Recall that we do not distinguish between the oid and the value of a primitive object; thus, ist(C, value(1 000 000, 1 000 000)) is true regardless of what C may be. Suppose we know that ist(c1, value(sk0, 1 000 000)), where sk0 refers to the revenue of NTT as before. The expression sk0 < 5 000 000 will therefore evaluate to "true" in context c1 but not in context c2, since ist(c2, value(sk0, 9 600 000)). This latter fact can be derived from the value of sk0 in c1 (which is reported a priori in r1) and the conversion function associated with the type companyFinancials (see Section 5.3).

4.2 The Language of COIN

We describe in this section the syntax and informal semantics of the language of COIN, which is inspired largely by Gulog [15]. Rather than making inferences using a context logic (see, for example, [6]), we introduce "contexts" as first-class objects and capture variations in different contexts through the use of parameterized methods. For example, the context-formula ist(c1, value(sk0, 1 000 000)) can be equivalently written as sk0[value(c1) → 1 000 000], where value(c1) represents a (single-valued) method. This simplification is possible because of our commitment to a common "vocabulary" (i.e., what types exist and what methods are applicable) and the fact that object-ids remain immutable across different contexts. By writing statements which are fully decontextualized (i.e., "lifted" from the individual source and receiver contexts into the integration context), we are able to leverage semantics and proof procedures developed without provision for contexts.

(Footnote on Gulog: Gulog differs from F-logic [28] in that method rules are bound to the underlying types, which leads to different approaches for dealing with non-monotonic inheritance. Specifically, in the case of F-logic, it is not rules but ground expressions that are handed down the generalization hierarchy. Since we are interested in reasoning at the intensional level, the former model is more appropriate for us.)
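The derivation of a value in another context (Definition 1), the composition of atomic conversions, and the context-dependent comparison sk0 < 5 000 000 from Section 4.1 can be traced with a short Python sketch. This is a simplification under stated assumptions, not the COIN proof procedure: conversions are applied eagerly in a fixed order (the paper notes they are actually carried out lazily), the modifier values are hard-coded for the NTT revenue object, and c2 is assumed to use a scale-factor of 1, which is consistent with the 9 600 000 figure. All function and variable names are hypothetical. Exchange rates are the ones listed in relation r3 of the running example.

```python
from fractions import Fraction

# Modifier values that apply to the NTT revenue object in each context.
# c1: reported in JPY with scale-factor 1000; c2: USD, scale-factor 1 (assumed).
CONTEXTS = {
    "c1": {"scaleFactor": 1000, "currency": "JPY"},
    "c2": {"scaleFactor": 1, "currency": "USD"},
}

# Exchange rates as listed in relation r3 (exact rationals avoid float noise).
RATES = {
    ("USD", "JPY"): Fraction("104.0"),
    ("JPY", "USD"): Fraction("0.0096"),
}


def cvt_scale_factor(value, source, target):
    """Atomic conversion for scaleFactor: V = U * F / F1."""
    f = CONTEXTS[source]["scaleFactor"]
    f1 = CONTEXTS[target]["scaleFactor"]
    return value * Fraction(f, f1)


def cvt_currency(value, source, target):
    """Atomic conversion for currency, using the r3 exchange rates."""
    src = CONTEXTS[source]["currency"]
    dst = CONTEXTS[target]["currency"]
    return value if src == dst else value * RATES[(src, dst)]


def cvt(value, source, target):
    """Composite conversion: chain one atomic conversion per modifier."""
    for atomic in (cvt_scale_factor, cvt_currency):
        value = atomic(value, source, target)
    return value


def value_in(known_value, known_context, context):
    """Value of the object in `context`, derived from where it is known."""
    return cvt(Fraction(known_value), known_context, context)


def less_than(known_value, known_context, constant, context):
    """Semantic comparison against a primitive constant in `context`."""
    return value_in(known_value, known_context, context) < constant
```

With the value 1 000 000 known in c1, `value_in(1_000_000, "c1", "c2")` yields 9 600 000, so the comparison against 5 000 000 holds in c1 and fails in c2, as described in the text. Because the two modifiers are orthogonal, swapping the order of the two atomic conversions in `cvt` gives the same result.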
Following [34], we define an alphabet as consisting of (1) a set of type symbols which are partitioned into symbols representing semantic-types and primitive-types, each of which has a distinguished type symbol, denoted by T_S and T_P respectively; (2) an infinite set of constant symbols which represent the oids (or identically, values) of primitive-objects; (3) a set of function symbols and predicate symbols; (4) a set of method symbols corresponding to attributes, modifiers, and built-in methods (e.g., value and cvt); (5) an infinite set of variables; (6) the usual logical connectives and quantifiers ∧, ∨, ∀, ∃, etc.; (7) auxiliary symbols such as (, ), [, ], :, ::, →, ⇒, and so forth; and finally, (8) a set of context symbols, of the distinguished object-type called ctx, denoting contexts.

A term is either a constant, a variable, or the token f(t1, ..., tn) where f is a function symbol and t1, ..., tn are terms. Since terms in our model refer to (logical) oids, they are called oid-terms. Finally, a predicate, function, or method symbol is said to be n-ary if it expects n arguments.

Definition 3 A declaration is defined as follows:

• if τ and τ' are type symbols, then τ :: τ' is a type declaration. We say that τ is a subtype of τ', and conversely, that τ' is a supertype of τ. For any type symbol τ'' such that τ' :: τ'', τ is also a subtype of τ''.
• if t is a term and τ is a type symbol, then t : τ is an object declaration. We say that t is an instance of type τ. If τ' is a supertype of τ, then t is said to be of inherited type τ'.
• if p is an n-ary predicate symbol, and τ1, ..., τn are type symbols, then p(τ1, ..., τn) is a predicate declaration. We say that the signature of predicate p is τ1 × ... × τn.
• if m is an attribute symbol and τ, τ' are symbols denoting semantic-types, then τ[m ⇒ τ'] is an attribute declaration. We say that the signature of the attribute m is τ → τ', and that the semantic-type τ has attribute m.
• if m is a modifier symbol, and τ, τ' are symbols denoting semantic-types, then τ[m(ctx) ⇒ τ'] is a modifier declaration. We say that m is a modifier of the semantic-type τ, which has signature τ → τ'. Without any loss of generality, we assume that m is unique across all semantic-types.
• if τ is a semantic-type, and τ1, τ2 are primitive-types, then τ[cvt(ctx)@ctx, τ1 ⇒ τ2] is a compound conversion declaration. We say that the signature of the compound conversion for τ is τ × τ1 → τ2.
• if τ is a semantic-type, m is a modifier defined on τ, and τ1, τ2 are primitive-types, then τ[cvt(ctx)@m, ctx, τ1 ⇒ τ2] is an atomic conversion declaration. We say that the signature of the atomic conversion of m for τ is τ × τ1 → τ2.
• if τ is a semantic-type, τ1 is a primitive-type and c is a context symbol, then τ[value(ctx) ⇒ τ1] is a value declaration. We say that the signature of the value for τ is given by τ → τ1.

Declarations for attributes, modifiers, conversions, and the built-in method value are collectively referred to as method declarations. □

Definition 4 An atom is defined as follows:

• if p is an n-ary predicate symbol with signature τ1 × ... × τn and t1, ..., tn are of (inherited) types τ1, ..., τn respectively, then p(t1, ..., tn) is a predicate atom.
• if m is an attribute symbol with signature τ → τ' and t, t' are of (inherited) types τ, τ' respectively, then t[m → t'] is an attribute atom.
• if m is a modifier symbol with signature τ → τ', c is a context symbol, and t, t' are of (inherited) types τ, τ' respectively, then t[m(c) → t'] is a modifier atom.
• if the compound conversion function for τ has signature τ × τ1 → τ2, t, t1, t2 are of (inherited) types τ, τ1, τ2 respectively, c is a context symbol, and tc is a context term, then t[cvt(c)@tc, t1 → t2] is a compound conversion atom.
• if the atomic conversion of the modifier m has signature τ × τ1 → τ2, c is a context symbol, tc is a context term, and t, t1, t2 are of (inherited) types τ, τ1, τ2 respectively, then t[cvt(c)@m, tc, t1 → t2] is an atomic conversion atom for m.
• if the value signature is given by τ → τ', c is a context symbol, and t, t' are of (inherited) types τ, τ' respectively, then t[value(c) → t'] is a value atom.

As before, the atoms corresponding to attributes, modifiers, conversions, and the built-in method value are referred to collectively as method atoms. □

Atoms can be combined to form molecules (or compound atoms): these are "syntactic sugar" which are notationally convenient, but by themselves do not increase the expressive power of the language. For example, we may write

• t[m1 → t1; ...; mk → tk] as a shorthand for the conjunct t[m1 → t1] ∧ ... ∧ t[mk → tk];
• t[m → t1[m1 → t2]] as a shorthand for t[m → t1] ∧ t1[m1 → t2]; and
• t : τ[m → t'] as a shorthand for t : τ ∧ t[m → t'].

Well-formed formulas can be defined inductively in the same manner as in first-order languages [34]; specifically,

• an atom is a formula;
• if φ and ψ are formulas, then ¬φ, φ ∧ ψ and φ ∨ ψ are all formulas;
• if φ is a formula and X is a variable occurring in φ, then both (∀X φ) and (∃X φ) are formulas.

Instead of dealing with the complexity of full-blown first-order logic, it is customary to restrict well-formed formulas to only clauses.

Definition 5 A Horn clause in the COIN language is a statement of the form

    Γ ⊢ A ← B1, ..., Bn

where A is called the head, and B1, ..., Bn is called the body of the clause. A can either be an atom or a declaration, and B1, ..., Bn is a conjunction of atoms. If A is a method atom of the form t[m@... → t'] where t is a term denoting a semantic-object, then the predeclaration Γ must contain the object declarations for all oid-terms in the head. Otherwise, Γ may be omitted altogether. □
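The notions of subtype, inherited type (Definition 3), and the typing condition on method atoms (Definition 4) can be made concrete with a small Python sketch. It is only an illustration under assumptions: the hierarchy and signatures are copied from the declarations of the motivational example (Figure 3), while the dictionary encoding and the function names are hypothetical and not part of COIN.

```python
# Direct supertype links (tau :: tau'), from the motivational example.
SUPERTYPE = {
    "semanticNumber": "T_S",
    "semanticString": "T_S",
    "moneyAmt": "semanticNumber",
    "companyFinancials": "moneyAmt",
    "currencyType": "semanticString",
    "companyName": "semanticString",
}

# Method signatures tau -> tau', from the same declarations.
SIGNATURES = {
    "company": ("companyFinancials", "companyName"),  # attribute
    "currency": ("moneyAmt", "currencyType"),         # modifier
    "scaleFactor": ("moneyAmt", "semanticNumber"),    # modifier
}


def supertypes(type_symbol):
    """All supertypes of a type: the transitive closure of '::'."""
    result = []
    while type_symbol in SUPERTYPE:
        type_symbol = SUPERTYPE[type_symbol]
        result.append(type_symbol)
    return result


def has_inherited_type(declared, required):
    """True if an instance of `declared` is of (inherited) type `required`."""
    return declared == required or required in supertypes(declared)


def well_typed(method, subject_type, result_type):
    """Check the typing condition of an attribute or modifier atom."""
    domain, codomain = SIGNATURES[method]
    return (has_inherited_type(subject_type, domain)
            and has_inherited_type(result_type, codomain))
```

For instance, a modifier atom for currency on an instance of companyFinancials is well-typed because companyFinancials has inherited type moneyAmt, whereas an attribute atom for company on a plain moneyAmt instance is not, since the attribute is declared on the subtype only.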
4.3 The COIN Framework

The COIN framework builds on the COIN data model to provide a formal characterization of the Context Interchange strategy for the integration of heterogeneous data sources.

Definition 6 A COIN framework F is a quintuple <S, μ, E, D, C> where

• S, the source set, is a labeled multi-set {s1 := S1, ..., sm := Sm}. The label si is the name of a source, and Si consists of ground predicate atoms rij(a1, ...) as well as the integrity constraints which are known to hold on those predicates. The ground atoms of a predicate rij constitute a relation rij in Si.
• μ, the source-to-context mapping, defines a (total) function from S to C. If μ(si) = cj, we say that the source si is in context cj.
• D, the domain model, is a set consisting of declarations. Intuitively, declarations in the domain model identify the types, methods, and predicates which are known.
• E, the elevation set, is a multi-set {E1, ..., Em} where Ei is the set of elevation axioms corresponding to Si in S. Ei consists of three parts:
  - for each relation rij ∈ Si, there is a clause which defines a corresponding semantic-relation r'ij in which every primitive object in rij is replaced by a Skolem term in r'ij;
  - for every oid-term in r'ij, we identify its type via the introduction of an object declaration, and define the values which are assigned to structural properties (i.e., attributes);
  - for every oid-term in r'ij, we define its value in context c (= μ(si)) with reference to rij.
• C, the context multi-set, is a labeled multi-set {c1 := C1, ..., cn := Cn} where ci is a context symbol, and Ci, called the context set for ci, is a set of axioms which provides a description of the relevant data semantics in context ci. □

We provide the intuition for the above definition by demonstrating how the integration scenario shown in Figure 1 can be represented in a COIN framework F = <S, μ, E, D, C>. Figures 3 and 4 present a partial codification which we will elaborate briefly below:

• The contents of the source set S are simply the set of ground atoms present in the data sources. We place no limitation on the number of relations which may be present in each source; in the current example, it happens that each source has only one relation. The rules following the ground atoms are functional dependencies which are known to be true in the respective relation. For instance, the two rules in S1 define the functional dependency cname → {revenue, currency} on the attributes in r1.
• The function μ is defined as a relation on S × C: thus, source s1 is mapped to context c1, whereas s2 and s3 are both mapped to context c2.

    Source set S

    s1 := { r1('IBM', 1 000 000, 'USD').
            r1('NTT', 1 000 000, 'JPY').
            R1 = R2 ← r1(N, R1, _), r1(N, R2, _).
            Y1 = Y2 ← r1(N, _, Y1), r1(N, _, Y2). }

    s2 := { r2('IBM', 1 500 000).
            r2('NTT', 5 000 000).
            E1 = E2 ← r2(N, E1), r2(N, E2). }

    s3 := { r3('USD', 'JPY', 104.0).
            r3('JPY', 'USD', 0.0096).
            T1 = T2 ← r3(X, Y, T1), r3(X, Y, T2). }

    Source-to-Context Mapping μ

    { μ(s1, c1), μ(s2, c2), μ(s3, c2) }

    Domain model D

    /* type declarations */
    semanticNumber :: T_S.                   number :: T_P.
    semanticString :: T_S.                   varchar :: T_P.
    moneyAmt :: semanticNumber.              integer :: number.
    companyFinancials :: moneyAmt.           real :: number.
    currencyType :: semanticString.
    companyName :: semanticString.

    /* attribute declaration */
    companyFinancials[company ⇒ companyName].

    /* modifier declarations */
    moneyAmt[currency(ctx) ⇒ currencyType; scaleFactor(ctx) ⇒ semanticNumber].

    /* value declarations */
    semanticString[value(ctx) ⇒ varchar].
    semanticNumber[value(ctx) ⇒ number].

    /* conversion declarations */
    semanticString[cvt(ctx)@ctx, varchar ⇒ varchar].
    semanticNumber[cvt(ctx)@ctx, number ⇒ number].
    moneyAmt[cvt(ctx)@ctx, number ⇒ number].
    moneyAmt[cvt(ctx)@ctx, scaleFactor, number ⇒ number].
    moneyAmt[cvt(ctx)@ctx, currency, number ⇒ number].

    /* predicate declarations */
    r1'(companyName, companyFinancials, currencyType).     r1(varchar, integer, varchar).
    r2'(companyName, companyFinancials).                   r2(varchar, integer).
    r3'(currencyType, currencyType, semanticNumber).       r3(varchar, varchar, real).

Figure 3: The source set, source-to-context mapping, and domain model for the COIN framework corresponding to the motivational example.

• The domain model D consists of two parts. The left-half (as seen in Figure 3) identifies (1) the semantic-types which are known and the generalization hierarchy; (2) the declarations for the methods which are applicable to the semantic-types; and (3) the signatures for the predicates corresponding to the semantic relations (r'i). The right-half does the same for primitive-types and predicates for the extensional relations.
• The first clause in each Ei of the elevation set E defines the semantic relation r'i corresponding to the relation ri; the semantic relations are defined on semantic-objects (as opposed to primitive-objects), which are instantiated as Skolem terms. The Skolem functions (e.g., f_r2#expenses) are chosen in such a way that when applied to the key-value of a tuple in the corresponding relation (e.g., 'NTT'), the resulting Skolem term (i.e., f_r2#expenses('NTT')) would in fact identify a unique "cell" in the relation as shown in Figure 2 (in this case, the expenses of NTT as reported in relation r2).
• Object declarations and attribute atoms in the elevation set provide a way of specifying the types of corresponding Skolem terms introduced in the semantic relation. For instance, any Skolem term f_r1#revenue(_) is asserted to be an instance of the semantic-type companyFinancials. The attribute atom following this declaration defines the object that is assigned to the company attribute for this semantic-object.
• The values of the Skolem terms introduced in the semantic relation are defined through the clauses shown last. The primitive-objects assigned are obtained directly from the extensional relation. Clearly, the value assignment is valid only within the context of the source as identified by μ; the values of the Skolem terms in a different context can be derived through the use of conversion functions, which we will define later.

The context multi-set C is given by {c1 := C1, c2 := C2} and is defined by the axioms shown in Figure 5. There are two kinds of axioms: modifier value definitions and conversion definitions. Consistent with our data model, modifiers can be assigned different values in distinct contexts: this constitutes the principal mechanism for describing the meaning of data in disparate contexts. For example, the fact that in context c1, companyFinancials are reported using a scale-factor of 1 000 whenever they are reported in JPY, and 1 otherwise, can be represented by the formula:

    ∀X' : companyFinancials ∃F' : number ⊢
        ( X'[scaleFactor(c1) → F'] ) ∧
        ( F'[value(c1) → 1000] ← X'[currency(c1) → Y'] ∧ Y' = 'JPY' ) ∧

    Elevation Axioms E1 of E

    r1'(f_r1#cname(X1), f_r1#revenue(X1), f_r1#currency(X1)) ← r1(X1, _, _).
    f_r1#cname(_) : companyName.
    f_r1#revenue(_) : companyFinancials.
    f_r1#revenue(X1)[company → f_r1#cname(X1)].
    f_r1#currency(_) : currencyType.
    f_r1#cname(X1)[value(C) → X1] ← r1(X1, _, _), μ(s1, C).
    f_r1#revenue(X1)[value(C) → X2] ← r1(X1, X2, _), μ(s1, C).
    f_r1#currency(X1)[value(C) → X3] ← r1(X1, _, X3), μ(s1, C).

    Elevation Axioms E2 of E

    r2'(f_r2#cname(X1), f_r2#expenses(X1)) ← r2(X1, _).
    f_r2#cname(_) : companyName.
    f_r2#expenses(_) : companyFinancials.
    f_r2#expenses(X1)[company → f_r2#cname(X1)].
    f_r2#cname(X1)[value(C) → X1] ← r2(X1, _), μ(s2, C).
    f_r2#expenses(X1)[value(C) → X2] ← r2(X1, X2), μ(s2, C).

    Elevation Axioms E3 of E

    r3'(f_r3#fromCur(X1, X2), f_r3#toCur(X1, X2), f_r3#exchangeRate(X1, X2)) ← r3(X1, X2, _).
    f_r3#fromCur(_, _) : currencyType.
    f_r3#toCur(_, _) : currencyType.
    semanticNumber.
r3(Xi X2, .), M«3 C) /r3#fron,Cur(Ai,X2)Mue(C)-^A,] r3(Xi,X2,-),/i(s3,C). /r3#toCur(A,,X,)[vaJue(C)^A2] /r3#exchai,geRate(Ai A2)[vailie(C)^ A'3] <r- r3(Xi X2, X3) ^l{s3,C). /r3#exchai.geRate(-, -) : ^ , , e , , Figure 4: , Elevation set corresponding to the motivational example 29 , Context Ci: /* modifier value assignments */ X' companyFinajicials \- X'[scaJeFactor(ci)->- scaJeFactor{ci,X')]. X' companyFinanciaJs, scaleFactor{ci,X') number h : : : X' : sca/eFactor(ci,X')[vaJue{ci)^ l] <- X'[currency(ci)^Y'],Y' ^'JPY'. companyFinanciaJs, scaJeFactor(c\,X') number h : scaleFactorici, X')[value{ci)^ 1000] <- X'[currency{ci)^Y'],Y' ='JPY'. X' X' : companyFinanciaJs h X' [currency (ci)^ currency {ci,X')]. companyFinanciaJs, currency {ci,X') currencyType h : : currency(cuX')[value{ci)^Y] ^ X'[company -^NI,],r\(N[,R',Y'),NI, i N[, Y'[value{ci)^Y]. /* conversion function definitions */ X' moneyAmt \X'[cvt{ci)@C,U^V] <- X'[cvt{ci)mscaJeFactor,C,U^W], X'[cvt{ci)@currency,C, W-¥V]. X' moneyAmt \: : X'[cvt{ci)@scaJeFactor,C, U^V] <- A"[scaJeFactor(ci)->-[va/ije(ci)->Fi]], X'[scaleFactor{C)^.[value{ci)^F]], X' : moneyAmt h .Y'[cvt(ci) ©currency, C, X' : moneyAmt U-^V] <- X'[currency{ci)-^Y{], X'[currency{C)-^Y'], I- X'[cvt{ci)®currency,C,U^V] <- X'[currency{ci)^Y(],X'[currency{C)^Y'], y; % y, r'.iY}, f/, R'[value{ci)^R],V Context V = U* R), y} =U * i r/, y; i Y', R. C2: /* modifier value assignments */ X' companyFinancials h X'[scaJeFactor(c2)->- scaJeFactor(co,X')]. : X' X' X' : : : moneyAmt, sca}eFactor{c2,X') : number h scayeFactor(c2, A")[vaiue(c2)^ l). companyFinanciaJs h X'[currency(c2)^ currency (c2,-X'')]. moneyAmt, currency {c2,X') currencyType h : currency(c2,A")[vaJue(c2)-> 'USD']. /* conversion definitions are similar to ci Figure 5: Context sets for C and omitted for brevity */ for the motivational 30 example at hand. F/F^ ^ (F'[va7ue(ci)^ The above formula is l] ^ X'[ciirrency(ci)^Y']AY' ^'JPY' ). 
not in clausal form, but can be transformed to definite Horn clauses by Skolemizing the existentially quantified variable F'. For example, the above formulas can be reduced to the following clauses: X' : company Financials h X'[scaleFactor{ci )^fscaleFactor{ci}(^')]- X' : company Financials, Aca/eFacMc,)(^')[va/»e(ci)^ A'' : companyFinancials, fscaieFactorici) fscaieFactor{c-2)(-^') 's a 1 OOO] ^ l] X'[currency(ci)->r'],y'^'JPY'. number h ' fscaleFactor{ci)(^') fscaieFactoH.cM^')[y^iiie{cj)^ where number h JscaieFactor(c^){^') ^ X [currency (c,)^Y'],Y' ' unique Skolem function; notational simplicity, for with the term scaleFactor(c2, X'). ^MPY'. we Currency values corresponding to stances of companyFinancials are obtained directly from the extensional relation r\ as in Figure 5. In this instance, it is as context C2), the modifier values for ciu-rency sive to capture intensional It is shown In a "better-behaved" situation (such and scaleFactor can be defined independently worthwhile to note that our framework both types of scenario, although the and extensional knowledge more first tends to is sufliciently expres- make the boundary between fuzzy. Conversion functions define how the value of a given semantic-object can be derived the current context, given that shown in Figure moneyAmt The 5, the first its value is known with respect to a different context. is The currency conversion which is defined by identifying the respective scale-factors in the source is ratio of the obtained by multiplying the source value by a conversion rate obtained via a lookup on yet another data source are defined with respect to As and currency. and target contexts and multiplying the value of the moneyAmt object by the two. 
in clause in the gi'oup (for context ci) defines the conversion for via the composition of atomic conversion functions for scaleFactor scaleFactor conversion in- necessary to reference an extensional relation because "meta- data" are represented along with "data" in a source. of the underlying schema. replace moneyAmt but (r:^). Notice that these conversions are applicable to companyFinancials via behavioral inheritance of the methods. In general, the repertoire of conversion functions can be extended arbitrarily built-in by defining the conversion externally and invoking the external functions using the system predicate which serves as an escape hatch to the operating system. However, encapsulating the conversion in external functions makes it harder to reason about the properties of the conversion; for examj^le, the explicit treatment of arithmetic operators 31 and table-lookups conversion functions) allow us to exploit opportunities for optimization, say, by rewriting (ill the arithmetic expression to reduce the size of intermediary tables during query execution^^. Query Answering 5 Abductive Inferences as Following the same algorithm outlined in [1], any collection of COIN clauses can be translated to Datalog with negation (Datalog"'^) (or equivalently, normal Horn program semantics as well as computation procedures have been widely studied we explore an alternative approach based [34]), for which the In this section, [51]^^. on abductive reasoning. The abductive framework We provides us with intensional (as opposed to extensional) answers to a query^'^. describe this abductive framework below and the relationship between query mediation in a coin framework and query answering familiarity with logic part, shall remain in an abductive framework. programming we assume some In the interest of space, ensuing discussion, and for most at the level of [o4] in the faithful to the notations therein. 
The Abductive Framework 5.1 Abduction refers to a particular kind of hypothetical reasoning which, in the simplest case, takes the form: From observing A and the axiom Infer 5 abductive reasoning. is -^ A as a possible "explanation" of A. Abductive logic programming (ALP) T B a theory, I is Specifically, an extension of logic an abductive framework [17] [27] is a set of integrity constraints, and abducible predicates. Given an abductive framework observation) and a , is is a triple [34] to support <T,A,I> where a set of predicate symbols, called <T,A,I> and a sentence 3Xq{X) (the the abductive task can be characterized as the problem of finding a substitution 6 set of abducibles (1) TuA\=q{X)9, (2) TUA ^^ ^ programming satisfies I; A, called the abductive explanation for the given observation, and Details of query optimization strategies that taJce into eiccount conversion functions are the work reported here. such that A more detailed discussion can be found in beyond the scope of [13]. '^The fact that "object-based logics" can be encoded in classicfil predicate logic has been known for a long time (see for example, [9]). This however should not cause us to "lose faith" in our data model, since the syntax of the language plays a pivotaJ role in shaping our conceptualization of the problem amd in finding solutions at the appropriate levels of abstraction. '^This change in perspective is beneficial for a variety of reasons (see Section 2), 32 and will not be repeated here. A (3) has some properties that make Requirement (1 ) A, together with T, must be capable of providing an explanation states that The the observation q{X)9. we ambiguity, for consistency requirement in (2) distinguishes abductive explanations in the characterization of from inductive generalizations. Finally, primarily that literals in "interesting". it A A are atoms formed from abducible predicates: refer to these atoms also as abducibles. 
means in (3), "interesting" where there In most instances, we would like is no A to be minimal or non-redundant. also Semantics and proof procedures ALP for have been active research topics recently (see [27] We describe in this section an abduction procedure based on extensions called SLD+ Abduction. The underlying idea is first reported in [12], and and references therein). to SLD resolution, The account we has inspired various different extensions. We consider first SLD resolution. Given a theory give here follows that in T consisting of (definite) and a goal clause <— q{X), and SLD-refutation of <— q{X) q(X)); <— Gi;-;<— Gn where <— from <— Gt by resolving one of may be many Since there of possible refutations is in a which terminates in clauses A, such that g. () is some <- can be resolved with the selected of ways G^, it an empty TU A |= will not clause). (e.g., in is Go(= obtained The literal, search space defined by an literal g will not resolve with any is not worth contain any branch that leads to a refutation Given however that we are searching <— Gj+i, which a space a depth-first manner). whose selected G, then clearly by letting we can continue the search with clauses (the selected literal) with one of the clauses in T. T which number and each <— Gi+i This means that the part of the subtree with <— G, at the root exploring any further, since g, the empty clause Horn a sequence of goal clauses <— defined (in the form of an SLD-tree). Suppose now that there with is its literals clauses in SLD-tree may be searched clause in T. G„ is [46]. A is (i.e., one for a set of unit include a unit clause which resolves obtained from <— Gj minus the literal This observation forms the basis for the SLD+Abduction procedure which we proceed to describe below. 
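The observation above can be sketched in a few lines of Python. This is only a propositional illustration under stated assumptions, not the procedure used in the paper: atoms are ground strings, so there is no unification, Skolemization, or answer substitution; the leftmost literal is always selected; negation is not handled; and the names `sld_abduce`, `THEORY`, `ABDUCIBLES`, and `no_contradiction` are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Iterator, List, Set, Tuple

# A ground (propositional) theory: each head atom maps to its alternative bodies.
Theory = Dict[str, List[List[str]]]


def sld_abduce(
    goals: Tuple[str, ...],
    theory: Theory,
    abducibles: Set[str],
    consistent: Callable[[FrozenSet[str]], bool] = lambda residue: True,
    residue: FrozenSet[str] = frozenset(),
) -> Iterator[FrozenSet[str]]:
    """Yield every residue under which the goal list can be refuted."""
    if not goals:  # empty goal clause: a refutation has been found
        yield residue
        return
    selected, rest = goals[0], goals[1:]
    # Resolution step: replace the selected literal by a clause body from T.
    for body in theory.get(selected, []):
        yield from sld_abduce(tuple(body) + rest, theory, abducibles, consistent, residue)
    # Abduction step: assume the literal, provided the constraints still hold.
    if selected in abducibles:
        extended = residue | {selected}
        if consistent(extended):
            yield from sld_abduce(rest, theory, abducibles, consistent, extended)


# Hypothetical toy theory (not taken from the paper's figures).
THEORY: Theory = {
    "scale_is_1": [["r1_tuple", "currency_not_jpy"]],
    "scale_is_1000": [["r1_tuple", "currency_is_jpy"]],
}
ABDUCIBLES = {"r1_tuple", "currency_not_jpy", "currency_is_jpy"}


def no_contradiction(residue: FrozenSet[str]) -> bool:
    """Integrity constraint: a currency cannot both be and not be JPY."""
    return not {"currency_is_jpy", "currency_not_jpy"}.issubset(residue)
```

Calling `list(sld_abduce(("scale_is_1",), THEORY, ABDUCIBLES, no_contradiction))` returns a single residue containing `r1_tuple` and `currency_not_jpy`, whereas asking for both `scale_is_1` and `scale_is_1000` at once returns no residue because the integrity constraint rejects the second abduction.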
Given an abductive framework $\langle T, A, I \rangle$ and the abductive query $q(X)$, consider the sequence given by

$$\leftarrow G_0, \Delta_0;\ \ \ldots;\ \ \leftarrow G_n, \Delta_n$$

where $G_0 = q(X)$ and $\Delta_0$ is the empty set $\{\}$, such that $G_{i+1}, \Delta_{i+1}$ is derived from $G_i, \Delta_i$ as follows:

• if $g$, the selected literal of $\leftarrow G_i$, can be resolved with a clause in $T$, then a single resolution step is taken to yield $G_{i+1}$, and $\Delta_{i+1} = \Delta_i$;

• if $g$ is abducible, $\{g' \leftarrow\}$ is a unit clause resolvable with $g$ ($g'$ is $g$ with all its variables replaced by Skolem constants), and $T \cup \Delta_i \cup \{g' \leftarrow\}$ is consistent with $I$, then $G_{i+1}$ is $G_i$ less $g$, and $\Delta_{i+1} = \Delta_i \cup \{g' \leftarrow\}$.

The sequence just defined is said to be a derivation of $q$ with respect to the abductive framework $\langle T, A, I \rangle$. A derivation is said to be a refutation if $\leftarrow G_n$ is the empty clause. $\Delta_n$ is said to be the residue corresponding to this refutation, and constitutes the abductive answer to $\forall(q(X)\theta)$, where $\theta$ is the substitution obtained from the composition of all substitutions leading to the refutation, restricted to the variables $X$.

In the abduction step above, we require that the selected literal $g$ be Skolemized. This is because variables in the unit clause need only be existentially quantified for $g'$ (where $g' = g$) to be resolvable with $g$. If the Skolemization is not done, the abducted fact "$g' \leftarrow$" would have been unnecessarily strong. This Skolemization, however, introduces additional complexity since it becomes necessary to deal with equality constraints on Skolem constants. This is due to the fact that a Skolem constant ($sk$) introduced earlier in the SLD-derivation (say in $\leftarrow G_i$) may later be unified with a term $t$ (in $\leftarrow G_j$, where $j > i$). In [16], it is suggested that this can be dealt with by introducing the equality predicate as an abducible predicate and adding the theory of Free Equality (FEQ) [10] as integrity constraints. Thus, when a Skolem constant $sk$ is to be unified with a specific term $t$, the equality fact $sk = t$ is abduced and the consistency of $sk = t$ with other abduced facts and FEQ is checked.

The procedure which we have just described can be extended to cope with negation through the use of negation-as-failure [17]. Suppose that the selected literal of the current goal clause is not $g$. The usual negation-as-failure mechanism is used: i.e., if $g$ cannot be proven from the theory (augmented with the current residue), then not $g$ is assumed to be true. There are two sources of complications in this scheme. First, it may happen that $g$ becomes provable later in the refutation when additional facts are abducted. To avoid this, not $g$ needs to be recorded so that the clauses which are subsequently added do not violate this implicit assumption. Second, negation may be nested. Suppose there is a clause given by $g \leftarrow$ not $h$, and that $h$ is not provable from the current residue. Then an attempt to prove not $g$ using SLD-resolution with negation-as-failure (SLDNF) will fail because it is not possible to prove $h$. However, $h$ might be rendered provable by adding further clauses to the residue. So rather than using SLD-resolution to try to show $h$, abduction is used instead and is allowed to add to the residue. This procedure can be generalized to any level of nesting, with SLD being used at even levels, and abduction at odd levels.

5.2 Query Answering in the COIN Framework

Figure 6 illustrates how queries are evaluated in a Context Interchange system. From a user perspective, queries and answers are couched in the relational data model: a (data-level or knowledge-level) query is formulated using a relational query language (SQL or some extension thereof), and answers can either be intensional (a mediated query) or extensional (actual tuples satisfying the query). Examples of these queries and answers have been presented earlier in Section 2.1.
[Figure 6 (diagram). User perspective: data/knowledge-level query, intensional answer, extensional answer. COIN framework: COIN query over the domain model, context set, elevation set, source-to-context mapping, and source set. Abductive framework: abductive query over the theory T, the integrity constraints (including Clark's FEQ), and the abducibles (evaluable predicates and extensional predicates), yielding an abductive answer. System perspective: the mediated query is passed to the Query Optimizer/Executioner.]

Figure 6: A summary of how queries are processed within the Context Interchange strategy: (1) transforms an (extended) SQL query to a well-formed COIN query; (2) performs the COIN to Datalog$^{neg}$ translation; (3) is the abduction computation which generates an abductive answer corresponding to the given query; and (4) transforms the answer from clausal form back to SQL.

Transformation to the COIN Framework

Within the COIN framework, the SQL-like queries originating from users are translated to a clausal representation in the COIN language. For example, queries Q1 and Q2 in Section 2 can be mapped to the following clausal representations:

CQ1: $\leftarrow answer(N, R).$
     $answer(N, R) \leftarrow r_1(N, R, \_),\ r_2(N, E),\ R > E.$

and correspondingly,

CQ2: $\leftarrow answer(N, F_1, F_2).$
     $answer(N, F_1, F_2) \leftarrow r_1(N, R, \_),\ R[\mathit{scaleFactor}(c_1) \to F_1],\ R[\mathit{scaleFactor}(c_2) \to F_2],\ F_1 \neq F_2.$

The above queries however do not capture the real intent of the user. For example, there is no recognition that "revenue" and "expenses" have different currencies and scale-factors associated with them and should not be compared "as is", that $R$ in CQ2 is a primitive-object for which the method scaleFactor is not defined, or the fact that both queries originate from context $c_2$ and may be interpreted differently in a different context. We say that these queries are "naive", and thus must be translated to corresponding "well-formed" queries.

Definition 7 Let $\langle Q, c \rangle$ be a naive query in a COIN framework $\mathcal{F}$, where $c$ denotes the context from which the query originates.
The well-formed query $Q'$ corresponding to $\langle Q, c \rangle$ is obtained by the following transformations:

• replace all relational operators with their "semantic" counterpart; for example, $X > Y$ is replaced with $X \overset{c}{>} Y$.

• make all relational "joins" explicit by replacing shared variables with explicit equality using the semantic-operator $\overset{c}{=}$; for example, $r_1(X, Y), r_2(X, Z)$ would be replaced with $r_1(X_1, Y), r_2(X_2, Z), X_1 \overset{c}{=} X_2$.

• similarly, make relational "selections" explicit; thus, $r_1(X, a)$ will be replaced by $r_1(X, Y), Y \overset{c}{=} a$.

• replace all references to extensional relations with the corresponding semantic-relations; for example, $r_1(X, Y)$ will be replaced with $r'_1(X, Y)$.

• append to the query constructed so far, value atoms that return the value of the data elements that are requested in the query.

Based on the above transformation, the well-formed queries corresponding to the naive queries $\langle CQ1, c_2 \rangle$ and $\langle CQ2, c_2 \rangle$ are given by

CQ1': $\leftarrow answer(N, R).$
      $answer(N, R) \leftarrow r'_1(N'_1, R', \_),\ r'_2(N'_2, E'),\ N'_1 \overset{c_2}{=} N'_2,\ R' \overset{c_2}{>} E',\ N'_1[\mathit{value}(c_2) \to N],\ R'[\mathit{value}(c_2) \to R].$

and

CQ2': $\leftarrow answer(N, F_1, F_2).$
      $answer(N, F_1, F_2) \leftarrow r'_1(N', R', \_),\ R'[\mathit{scaleFactor}(c_1) \to F'_1],\ R'[\mathit{scaleFactor}(c_2) \to F'_2],\ F'_1 \overset{c_2}{\neq} F'_2,\ N'[\mathit{value}(c_2) \to N],\ F'_1[\mathit{value}(c_2) \to F_1],\ F'_2[\mathit{value}(c_2) \to F_2].$

respectively.

Transformations to the Abductive Framework

The relationship between a COIN framework and an abduction framework can now be stated.

Definition 8 Given the COIN framework $\mathcal{F}_C = \langle \mathcal{S}, \mu, \mathcal{E}, \mathcal{D}, \mathcal{C} \rangle$, this can be mapped to a corresponding abductive framework $\mathcal{F}_A$ given by $\langle T, A, I \rangle$ where

• $T$ is the Datalog$^{neg}$ translation of the set of clauses given by $\mathcal{E} \cup \mathcal{D} \cup \mathcal{C} \cup \mu$;

• $I$ consists of the integrity constraints defined in $\mathcal{S}$, augmented with Clark's Free Equality Axioms [10]; and

• $A$ consists of the extensional predicates defined in $\mathcal{S}$, the built-in predicates corresponding to arithmetic and relational (comparison) operators, and the system predicate which provides the interface for system calls.

Suppose $\leftarrow q(X)$ is a well-formed query in the COIN framework $\mathcal{F}_C$, the corresponding abductive framework of which is denoted by $\mathcal{F}_A = \langle T, A, I \rangle$. Without any loss of generality, we assume that $\leftarrow q(X)$ is identical in both $\mathcal{F}_C$ and $\mathcal{F}_A$. This is because Datalog$^{neg}$ is a sublanguage of COIN, and any COIN query $\leftarrow Q(X)$ can always be transformed to a Datalog$^{neg}$ query $\leftarrow q(X)$ by adding the Datalog$^{neg}$-translation of the clause $q(X) \leftarrow Q(X)$ into the theory $T$.

Given an abductive framework $\langle T, A, I \rangle$ and the query $\exists X\, q(X)$, suppose $\Delta = \{p_1, \ldots, p_m\}$ is an abductive answer for $q(X)\theta$; then it follows that

$$T \models (q(X)\theta \leftarrow p_1, \ldots, p_m).$$

This result follows from the fact that the $p_j$'s are in fact ground (for $j = 1, \ldots, m$), so a set of ground atoms represents their conjunction. The conjunct $p_1 \wedge \cdots \wedge p_m$ constitutes a precondition for $q(X)\theta$. Suppose $K = \{sk_0, \ldots\}$ is the set of Skolem constants introduced by the abduction step, and $\varphi$ is a "reverse" substitution $\{sk_i / Y_i\}$ where $sk_i \in K$ and $Y_i$ is a distinct variable not in $X$. Then, we say that the tuple

$$(\exists Y\, (p_1, \ldots, p_m)\varphi,\ \theta\varphi)$$

is an intensional answer for the query $\exists X\, q(X)$. This fact is not surprising given that Motro and Yuan [39] suggested that intensional answers can be obtained from the "dead-ends" of "derivation trees" corresponding to a query. Although it was not recognized as such, the procedure described in [39] is in fact a naive implementation of SLD+Abduction (without any consistency checking). From the perspective of the user issuing a naive query, the intensional answer can also be interpreted as the corresponding mediated answer.

As an illustration of the preceding comments, the evaluation of CQ2' in the abductive framework yields the following abductive answer:

$$\Delta = \{r_1(sk_0, sk_1, sk_2),\ sk_2 = \text{'JPY'}\},\quad \theta = \{N/sk_0,\ F_1/1000,\ F_2/1\}$$

The reverse substitution $\varphi$ is given by $\{sk_0/Y_0, sk_1/Y_1, sk_2/Y_2\}$, and thus the intensional answer (equivalently, the mediated query) is:

$$(\exists Y_0, Y_1, Y_2\, (r_1(Y_0, Y_1, Y_2),\ Y_2 = \text{'JPY'}),\ \{N/Y_0,\ F_1/1000,\ F_2/1\})$$

which translates to MQ2 shown in Section 2. If $\{Y_0/\text{'NTT'},\ Y_1/1000000,\ Y_2/\text{'JPY'}\}$ is an answer for the above mediated query, then the answer for the original user query is given by $\{N/\text{'NTT'},\ F_1/1000,\ F_2/1\}$.

5.3 Illustrative Example

In this section, we provide an example illustrative of the computation involved in query mediation (equivalently, obtaining the intensional answer to a query). Consider the query Q3 (a simplified variant of Q2), issued from context $c_1$, which queries relation $r_1$ for the scale-factors of revenues in context $c_1$:

Q3: SELECT r1.cname, r1.revenue.scaleFactor IN c1 FROM r1;

The (well-formed) clausal representation for this query is given by

CQ3: $\leftarrow answer(N, F).$
     $answer(N, F) \leftarrow r'_1(N', R', \_),\ N'[\mathit{value}(c_1) \to N],\ R'[\mathit{scaleFactor}(c_1) \to F'],\ F'[\mathit{value}(c_1) \to F].$
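As a sanity check on what mediation should ultimately return for Q3, the following Python sketch evaluates the query directly over the two sample tuples of r1 from Figure 3, hard-coding the context c1 rule (scale-factor 1000 when the currency is JPY, 1 otherwise). It bypasses the COIN language and abduction altogether, so it shows only the expected extensional result; the names `R1`, `scale_factor_c1`, and `answer_q3` are illustrative and do not come from the paper.

```python
from typing import List, Tuple

# Sample extension of r1(cname, revenue, currency), copied from Figure 3.
R1: List[Tuple[str, int, str]] = [
    ("IBM", 1_000_000, "USD"),
    ("NTT", 1_000_000, "JPY"),
]


def scale_factor_c1(currency: str) -> int:
    """Context c1 rule: amounts reported in JPY use a scale-factor of 1000."""
    return 1000 if currency == "JPY" else 1


def answer_q3(r1: List[Tuple[str, int, str]]) -> List[Tuple[str, int]]:
    """Return (cname, scale-factor in c1) for every tuple of r1."""
    return [(cname, scale_factor_c1(currency)) for cname, _revenue, currency in r1]


print(answer_q3(R1))  # [('IBM', 1), ('NTT', 1000)]
```

An intensional answer to Q3 should therefore distinguish exactly these two cases: tuples whose currency is JPY and tuples whose currency is not.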
literal scaleFactor(ci two different clauses (where F Since removed from the goal clause and is it empty initialized to the the literal ri{N,.,.) cannot be further resolved. predicate (and hence abducible), • A F =1 , be resolved with /ri^revenue(sfco))[vaiue(ci)—>F] can F and =1000). One chosen arbitrarily is (in this case, =1); the other will be selected on backtracking and will eventually lead to another refutation. • To not be case, ' JPY' when evaluated is it in context c\ To determine (see step (8)). the expansion of the goal clause as shown • In step (12), the extensional relation tion, we are not allowed we will • In step (15), to is the (see r\ This eventually leads to referenced again. In the absence of other informa- assume that it is the fact, same "fact" which hcis been abducted: r\(sk3,sk4,sk5) to A. the equality constraint on the objects /ri#cname(sfco) and /ri#cname(sfc3) leads added to A. At = sk^. Since '=' merge the two is abducible this point, the functional = generates further the constraints ski facts ri(sA;o,sA;i,.sA;2) Finally, in step (17), the literal sk2 = and The -"^k^ and sk2 is an evaluable predicate), = sk^, which in it is turn allow us to abducted, which leads to a refutation. substitution, restricted to variables is is ri (5^3, 5^4,5^5). 'JPY' This intensional answer, translated to SQL, (it dependency cname -^ {revenue, currency} abductive answer corresponding to this refutation 'JPY'}. 5). is in step (10). need to add a new Skolemized to the constraint skg • this if necessary to identify the currency value from the extensional relation corresponding axiom for assigning currency values in Figure i.e., hand must arrive at a successful refutation, the currency for the revenue-object at is {N,F}, given by: 39 given by is A= {r\ (sko, sk\ , .sA;2), The sk2 given by {N/skQ, F/l}. — (1) ^as{N,F). 1 Ao = {} SELECT rl.cname, On 1 FROM rl WHERE rl. 
currency <> 'JPY'; F =1 000 backtracking, the other solution corresponding to answer returned to the user MQ3: will be obtained. The complete thus given by: is SELECT rl.cname, 1 FROM rl WHERE rl. currency <> 'JPY' UNION SELECT rl.cname, 1000 FROM rl WHERE rl. currency = 'JPY'; The correspondences between clearly seen in the constraint (sko = above example. At step ^fca) to and semantic query optimization can be integrity checking (15), the functional be propagated and eventually allow from the abductive answer. SELECT rell.cname, were not r^ allows ri(sA;3, sA;4,5/j5) to the initial be eliminated the intensional answer obtained would instead be: If it 1 FROM rl rell, rl rel2 so, dependencies WHERE rel 1 cname = rel2 cname . . which would include a redundant second reference obviously would lead to suboptimal performance This second answer to r\. if unintuitive, is and executed without further optimization. In the more general scenario, constraints can be useful in pruning an entire refutation altogether. For instance, Q3 had if been: SELECT rl.cname IN cl, rl .revenue. scaleFactor IN cl Q3': FROM rl WHERE rl. currency = 'JPY'; we in will eventually end up trying to abduct A, thus resulting in sk-^ an unsuccessful refutation. In consist of only the second select-statement in 6 A (i.e., C = {c\ Two 7^ 'JPY' is already present mediated query MQ3' will MQ3. COIN Framework COIN framework We := Ci,...,c„ :— Cn})- framework which allows new contexts to be defined fashion. this case, the Meta-Logical Extension to the In Section 4.3, context knowledge in a theories =' JPY' where sk2 basic mechanisms underly this is represented by a set of separate describe here an extension to this basic terms of existing ones in move to such an extension: in an incremental the treatment of context as a set of parameterized statements and the introduction of the hierarchical operator which defines a subcontext relation on the set {ci , . . , . -<, c^i }. 
Recall that the relative truth or falsity of a statement can be represented using McCarthy's ist, so that

$$\mathit{ist}(c_i, \sigma)$$

is taken to mean that the statement $\sigma$ is true in context $c_i$. The subcontext relation $\prec$ allows us to make incremental refinements to statements which describe what is already known about an enclosing context. Thus, if $c_i$ is a subcontext of $c_j$, denoted by $c_i \prec c_j$, this allows us to introduce a differential context denoted by $\delta_{c_i}$ such that:

$$\mathit{ist}(c_i, \sigma) \leftarrow \sigma \in \delta_{c_i}.$$
$$\mathit{ist}(c_i, \sigma) \leftarrow c_i \prec c_j,\ \mathit{ist}(c_j, \sigma),\ \textit{not-overridden}(\delta_{c_i}, \sigma).$$

The predicate not-overridden indicates that the statement $\sigma$ obtained from the more general context $c_j$ is not explicitly overridden by the differential context. The composition of a new context theory of $c_i$ from $c_j$ and $\delta_{c_i}$ is similar to that accomplished by the isa operator defined in [4].

In the COIN data model, statements in a context are "decontextualized" by making explicit references to its reification in the form of a context-object. For example, the statement

$$\mathit{ist}(c_j,\ t[m_1 \to t'] \leftarrow t[m_2 \to t']).$$

can be equivalently stated as

$$t[m_1(c_j) \to t'] \leftarrow t[m_2(c_j) \to t'].$$

This second form simplifies the inferences which are undertaken to support context mediation, but requires some adjustment to allow statements to be inherited. Specifically, if the above statement is inherited by context $c_i$ $(\prec c_j)$, we will need to replace the references to $c_j$ with $c_i$. This is accomplished by requiring all statements in $\delta_{c_i}$ to be parameterized; i.e.,

$$\delta_{c_i}(X) = \{\sigma_1(X), \ldots, \sigma_l(X)\}$$

For instance, the earlier statement would have been asserted as

$$\sigma(X) = t[m_1(X) \to t'] \leftarrow t[m_2(X) \to t'].$$

The statement $\sigma(X)$ in the set $\delta_c$ is said to be uninstantiated. The collection of uninstantiated axioms forms an uninstantiated context set.

Definition 9 Let $\delta\mathcal{C} = \{c_0 := C_0(X),\ \delta_{c_1}(X), \ldots, \delta_{c_n}(X)\}$, for which $\delta_{c_i}(X)$ $(i = 1, \ldots, n)$ is said to be the differential for context $c_i$ with respect to $\prec$, which defines a partial order on the contexts $\{c_1, \ldots, c_n\}$. Let $\{c_{i_1}, \ldots, c_{i_k}\}$ be the predecessors of $c_i$ with respect to the subcontext relation $\prec$, i.e., $c_i \prec c_{i_j}$. Then the uninstantiated context set for $c_i$, denoted by $C_i(X)$, can be obtained from $C_{i_j}(X)$ as before:

• $\sigma(X) \in C_i(X) \leftarrow \sigma(X) \in \delta_{c_i}(X)$

• $\sigma(X) \in C_i(X) \leftarrow \sigma(X) \in C_{i_j}(X),\ \textit{not-overridden}(\delta_{c_i}(X), \sigma(X))$.

The context $c_0$ is said to be the default context and forms the basis for the other differentials.

Definition 10 Given $\delta\mathcal{C} = \{c_0 := C_0(X),\ \delta_{c_1}(X), \ldots, \delta_{c_n}(X)\}$ and the subcontext relation $\prec$. Suppose $C_i(X)$ is the uninstantiated context set for $c_i$ obtained inductively using Definition 9. The context set for $c_i$ is given by the set $C_i(c_i)$.

Notice that we have not described how one is to determine whether or not a given statement is being overridden in a specific context. The simplest approach to this is to assume that whenever a method atom appears in the head of any of the rules in a differential context set, none of the other rules (pertaining to this method) defined in its supercontext applies. For example, if the scaleFactor for the type companyFinancials is given in two distinct context differentials along a given path in the hierarchy, then the statement in the more specific context is said to take precedence and will be used in the corresponding context.

The above scheme leads to the following extended formulation of a COIN framework.

Definition 11 The extended COIN framework is a sextuple given by $\langle \mathcal{S}, \mu, \mathcal{E}, \mathcal{D}, \delta\mathcal{C}, \prec \rangle$, where $\mathcal{S}$, $\mu$, $\mathcal{E}$, and $\mathcal{D}$ are defined as before in Definition 6, $\delta\mathcal{C}$ is as defined in Definition 9, and $\prec$ is the subcontext relation defined on the set of contexts $\{c_1, \ldots, c_n\}$ induced by $\delta\mathcal{C}$.

7 The Context Interchange Prototype

The feasibility and features of the proposed strategy have been demonstrated in a prototype which provides mediated access to both traditional structured databases and semi-structured data sources (web-sites).
Our implementation leverages the world-wide-web (WWW) in a number of ways: in providing physical connectivity across different networks and platforms, in adopting a universal addressing scheme for different types of geographically-distributed sources, and in providing us with a wealth of heterogeneous data sources.

As shown in Figure 8, queries submitted to the system are intercepted by a Context Mediator, which rewrites the user query to a mediated query. The Optimizer transforms this to an optimized query plan, which takes into account a variety of cost information. The optimized query plan is executed by an Executioner which dispatches subqueries to individual systems, collates the results, undertakes conversions which may be necessary when data are exchanged between two systems, and returns the answers to the receiver.

[Figure 8 (diagram). Labels: Context Mediation Services; local DBMS supporting intermediate processing; subqueries and subquery answers; local DBMS; non-traditional data sources (e.g., web-pages).]

Figure 8: Architectural overview of the Context Interchange Prototype

The Context Mediator is implemented in ECLiPSe, which is an efficient and robust Prolog implementation distributed by the ECRC. (Footnote: ECLiPSe: The ECRC Constraint Logic Parallel System. More information can be obtained at http://www.ecrc.de/eclipse/.) At the heart of the Context Mediator is a meta-interpreter which implements the extended SLD+Abduction algorithm described in Section 5.1. Since computation of the abductive answer is performed within a Horn-clause (HC) framework, we need to translate both the user-query as well as COIN clauses to statements in Datalog$^{neg}$ and, on obtaining the answer, perform the reverse translation to SQL. In the absence of aggregation operators, the SQL-to-HC and HC-to-SQL compilers are relatively straight-forward since both of these languages share a common grounding in predicate calculus.

8 Conclusion

We have presented a tightly-woven tapestry of ideas derived from different threads in the literature in artificial intelligence (on "contexts"), databases (on "heterogeneous databases" and "semantic query optimization"), logic programming (on "abductive logic programming" and "meta-logic"), and others which are already present at the confluence of different scholarly traditions (e.g., "deductive object-oriented data models" and "intensional answers"). The various results and insights are integrated together in a formal framework for the Context Interchange strategy, and provide a well-founded basis for representing and reasoning about data semantics in disparate sources and receivers. Specifically, we have described how data semantics in disparate systems can be articulated using an "object-logic", and how logical inferences (in particular, abduction) can be used to provide mediated access to both data and data-semantics. At the same time, we showed that the COIN framework presents a viable alternative to classical and contemporary integration approaches by allowing different kinds of information to be more easily accessed, by making possible the sustenance of an infrastructure that mitigates the complexity in the creation and maintenance of large-scale systems, and by isolating changes in different components which are only loosely-coupled together.

This paper is by no means the last word on Context Interchange. On the contrary, there are many interesting issues which we are only beginning to explore. We mention below two of these undertakings.

As noted in [35], the autonomy and heterogeneity of sources present new challenges for query processing and optimization which are not the same as those in distributed database systems.
These differences stem from constraints which are characteristic of the underlying environment; for example, sources may differ in their query-handling ability, cost models may not be known, and data conversions may incur large hidden costs which were not previously accounted for. As we have shown earlier, the detection of unsatisfiable answers in the abductive framework constitutes a form of semantic query optimization which presents huge payoffs. We have found much synergy between the abduction framework, semantic query optimization, and constraint logic programming [25], and have recently embarked on a re-implementation of the abduction procedure using Constraint Handling Rules [19], on the premise of similar observations which motivated the work recently presented in [53]. To this end, we have been able to make use of the existing prototype as a testbed on which theoretical insights can be rapidly implemented and experimented with.

The richness of the representational formalism is a two-edged sword since it also presents greater scope for abuse. While it is unlikely that there will ever be a "definitive guide" to context modeling, case studies, evaluation criteria, prescriptive guidelines, and tools are in dire need. At this moment, we are working with several industry information-providers in applying this mediation technology to the "real world" problems encountered by them. We are hopeful that these experiences will be instrumental in developing and validating integration methodologies that are grounded in practice.

References

[1] Abiteboul, S., Lausen, G., Uphoff, H., and Waller, E. Methods and rules. In Proceedings of the ACM SIGMOD Conference (Washington, DC, May 1993), pp. 32-41.

[2] Ahmed, R., Smedt, P. D., Du, W., Kent, W., Ketabchi, M. A., Litwin, W. A., Rafii, A., and Shan, M.-C. The Pegasus heterogeneous multidatabase system. IEEE Computer 24, 12 (1991), 19-27.

[3] Atkinson, M. P., Bancilhon, F., DeWitt, D., Dittrich, K., Maier, D., and Zdonik, S. The object-oriented database system manifesto (invited paper). In Proc. First Int'l. Conf. on Deductive and Object-Oriented Databases (Kyoto, Japan, Dec. 4-6 1989).

[4] Brogi, A., Mancarella, P., Pedreschi, D., and Turini, F. Compositional operators for logic theories. In Computational Logic: Symposium Proceedings, Brussels, Nov, J. W. Lloyd, Ed. Springer-Verlag, 1990, pp. 117-134.

[5] Brogi, A., and Turini, F. Metalogic for knowledge representation. In Proceedings of the 2nd International Conference on Principles of Knowledge Representation and Reasoning (KR-91) (Cambridge, MA, Apr 22-25 1991), pp. 61-69.

[6] Buvac, S. Quantificational logic of context. In Proceedings of the Workshop on Modeling Context in Knowledge Representation and Reasoning (held at the Thirteenth International Joint Conference on Artificial Intelligence) (1995).

[7] Chakravarthy, U. S., Grant, J., and Minker, J. Logic-based approach to semantic query optimization. ACM Transactions on Database Systems 15, 2 (1990), 162-207.

[8] Chang, C.-L., and Lee, R. C.-T. Symbolic Logic and Mechanical Theorem Proving. Academic Press, 1973.

[9] Chen, W., and Warren, D. S. C-logic for complex objects. In ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS) (March 1989), pp. 369-378.

[10] Clark, K. Negation as failure. In Logic and Data Bases, H. Gallaire and J. Minker, Eds. Plenum Press, 1978, pp. 292-322.

[11] Collet, C., Huhns, M. N., and Shen, W.-M. Resource integration using a large knowledge base in Carnot. IEEE Computer 24, 12 (Dec 1991), 55-63.

[12] Cox, P. T., and Pietrzykowski, T. Causes for events: their computation and applications. In Proceedings CADE-86 (1986), J. H. Siekmann, Ed., Springer-Verlag, pp. 608-621.

[13] Daruwala, A., Goh, C. H., Hofmeister, S., Hussein, K., Madnick, S., and Siegel, M. The context interchange network prototype. In Proc of the IFIP WG2.6 Sixth Working Conference on Database Semantics (DS-6) (Atlanta, GA, May 30 - Jun 2 1995). To appear in LNCS (Springer-Verlag).

[14] Denecker, M., and Schreye, D. D. On the duality of abduction and model generation. In Proc Int Conf on Fifth Generation Computer Systems (1992), pp. 650-657.

[15] Dobbie, G., and Topor, R. On the declarative and procedural semantics of deductive object-oriented systems. Journal of Intelligent Information Systems 4 (1995), 193-219.

[16] Eshghi, K. Abductive planning with event calculus. In Proc. 5th International Conference on Logic Programming (1988), pp. 562-579.

[17] Eshghi, K., and Kowalski, R. A. Abduction compared with negation by failure. In Proc. 6th International Conference on Logic Programming (Lisbon, 1989), pp. 234-255.

[18] Farquhar, A., Dappert, A., Fikes, R., and Pratt, W. Integrating information sources using context logic. In AAAI-95 Spring Symposium on Information Gathering from Distributed Heterogeneous Environments (1995). To appear.

[19] Fruhwirth, T., and Hanschke, P. Terminological reasoning with constraint handling rules. In Proc. Workshop on Principles and Practice of Constraint Programming (1993).

[20] Garcia-Molina, H., Papakonstantinou, Y., Quass, D., Rajaraman, A., Sagiv, Y., Ullman, J., and Widom, J. The TSIMMIS approach to mediation: data models and languages. In Proc NGITS (Next Generation Information Technologies and Systems) (Naharia, Israel, June 27-29 1995). To appear.

[21] Goh, C. H., Madnick, S. E., and Siegel, M. D. Context interchange: overcoming the challenges of large-scale interoperable database systems in a dynamic environment. In Proceedings of the Third International Conference on Information and Knowledge Management (Gaithersburg, MD, Nov 29 - Dec 1 1994), pp. 337-346.

[22] Gruber, T. R. The role of common ontology in achieving sharable, reusable knowledge bases. In Principles of Knowledge Representation and Reasoning: Proceedings of the 2nd International Conference (Cambridge, MA, 1991), J. A. Allen, R. Fikes, and E. Sandewall, Eds., Morgan Kaufmann, pp. 601-602.

[23] Guha, R. V. Contexts: a formalization and some applications. Tech. Rep. STAN-CS-91-1399-Thesis, Department of Computer Science, Stanford University, 1991.

[24] Imielinski, T. Intelligent query answering in rule based systems. Journal of Logic Programming 4, 3 (Sep 1987), 229-257.

[25] Jaffar, J., and Maher, M. Constraint logic programming: A survey. Journal of Logic Programming, to appear (1996).

[26] Jonker, W., and Schütz, H. The ECRC multidatabase system. In Proc ACM SIGMOD (1995), p. 490.

[27] Kakas, A. C., Kowalski, R. A., and Toni, F. Abductive logic programming. Journal of Logic and Computation 2, 6 (1993), 719-770.

[28] Kifer, M., Lausen, G., and Wu, J. Logical foundations of object-oriented and frame-based languages. JACM 4 (1995), 741-843.

[29] Kuhn, E., and Ludwig, T. VIP-MDBMS: A logic multidatabase system. In Proc Int'l Symp. on Databases in Parallel and Distributed Systems (1988).

[30] Landers, T., and Rosenberg, R. An overview of Multibase. In Proceedings 2nd International Symposium for Distributed Databases (1982), pp. 153-183.

[31] Lenat, D. B., and Guha, R. Building large knowledge-based systems: representation and inference in the Cyc project. Addison-Wesley Publishing Co., Inc., 1989.

[32] Litwin, W. O*SQL: A language for object oriented multidatabase interoperability. In Proceedings of the IFIP WG2.6 Database Semantics Conference on Interoperable Database Systems (DS-5) (Lorne, Victoria, Australia, Nov 1992), D. K. Hsiao, E. J. Neuhold, and R. Sacks-Davis, Eds., North-Holland, pp. 119-138.

[33] Litwin, W., and Abdellatif, A. An overview of the multi-database manipulation language MDSL. Proceedings of the IEEE 75, 5 (1987), 621-632.

[34] Lloyd, J. W. Foundations of logic programming, 2nd, extended ed. Springer-Verlag, 1987.

[35] Lu, H., Ooi, B.-C., and Goh, C. H. On global multidatabase query optimization. ACM SIGMOD Record 21, 4 (1992), 6-11.

[36] Malone, T. W., and Rockart, J. F. Computers, networks, and the corporation. Scientific American 265, 3 (September 1991), 128-136.

[37] McCarthy, J. Generality in artificial intelligence. Communications of the ACM 30, 12 (1987), 1030-1035.

[38] McCarthy, J., and Hayes, P. J. Some philosophical problems from the standpoint of AI. In Readings in Nonmonotonic Reasoning, M. Ginsberg, Ed. Morgan Kaufmann, 1987.

[39] Motro, A., and Yuan, Q. Querying database knowledge. In Proceedings of the ACM SIGMOD Conference (1990), pp. 173-183.

[40] Mumick, I. S., and Pirahesh, H. Implementation of magic-sets in a relational database system. In Proc. ACM SIGMOD (Minneapolis, MN, May 1994), pp. 103-114.

[41] Papakonstantinou, Y., Garcia-Molina, H., and Widom, J. Object exchange across heterogeneous information sources. In Proc IEEE International Conference on Data Engineering (March 1995).

[42] Quass, D., Rajaraman, A., Sagiv, Y., Ullman, J., and Widom, J. Querying semistructured heterogeneous information. In Proc International Conference on Deductive and Object-Oriented Databases (1995).

[43] Sciore, E., Siegel, M., and Rosenthal, A. Context interchange using meta-attributes. In Proceedings of the 1st International Conference on Information and Knowledge Management (1992), pp. 377-386.

[44] Sciore, E., Siegel, M., and Rosenthal, A. Using semantic values to facilitate interoperability among heterogeneous information systems. ACM Transactions on Database Systems 19, 2 (June 1994), 254-290.

[45] Seshadri, P., Hellerstein, J. M., Pirahesh, H., Leung, T. C., Ramakrishnan, R., Srivastava, D., Stuckey, P. J., and Sudarshan, S. Cost-based optimization for magic: algebra and implementation. In Proc. ACM SIGMOD (Montreal, Quebec, Canada, June 1996), pp. 435-446.

[46] Shanahan, M. Prediction is deduction but explanation is abduction. In Proc. of the IJCAI-89 (1989), pp. 1055-1060.

[47] Sheth, A. P., and Larson, J. A. Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Computing Surveys 22, 3 (1990), 183-236.

[48] Siegel, M., and Madnick, S. A metadata approach to solving semantic conflicts. In Proceedings of the 17th International Conference on Very Large Data Bases (1991), pp. 133-145.

[49] Templeton, M., Brill, D., Dao, S. K., Lund, E., Ward, P., Chen, A. L. P., and MacGregor, R. Mermaid - a front end to distributed heterogeneous databases. Proceedings of the IEEE 75, 5 (1987), 695-708.

[50] Tomasic, A., Raschid, L., and Valduriez, P. Scaling heterogeneous databases and the design of DISCO. In Proceedings of the 16th International Conference on Distributed Computing Systems (Hong Kong, 1995).

[51] Ullman, J. Principles of database and knowledge-base systems, Vol II. Computer Science Press, 1991.

[52] Ventrone, V., and Heiler, S. Semantic heterogeneity as a result of domain evolution. ACM SIGMOD Record 20, 4 (Dec 1991), 16-20.

[53] Wetzel, G., Kowalski, R., and Toni, F. Procalog: Programming with constraints and abducibles in logic. JICSLP Poster session, Sept. 1996.

[54] Wiederhold, G. Mediators in the architecture of future information systems. IEEE Computer 25, 3 (March 1992), 38-49.