WHERE DOES THE DATA COME FROM: MANAGING DATA INTEGRATION WITH SOURCE TAGGING CAPABILITIES

Y. Richard Wang
Stuart E. Madnick

August 1990

CISR WP No. 214
Sloan WP No. 3191-90

©1990 Y.R. Wang, S.E. Madnick

An earlier version of this paper entitled, "A Source Tagging Theory for Heterogeneous Database Systems," has been accepted for presentation and publication at the 11th International Conference on Information Systems, 1990.

Center for Information Systems Research
Sloan School of Management
Massachusetts Institute of Technology
Massachusetts Avenue
Cambridge, Massachusetts, 02139

1. Introduction
1.1 Source Tagging Example
1.2 Research Issues and Goals
1.3 Research Background and Assumptions
2. The Polygen Model
2.1 The Polygen Algebra
3. Polygen Query Translation
4. Example Source Tagging in the PQP
5. The Necessary and Sufficient Condition of Source Tagging
6. Summary and Conclusions
Appendix: The Operations that Generate Table 5
References

ACKNOWLEDGEMENTS

Work reported herein has been supported, in part, by MIT's International Financial Service Research Center, MIT's Center for Information Systems Research, and MIT's IXS Digital Library Project.

ABSTRACT

Many important Management Support Systems require access to and seamless integration of multiple heterogeneous database systems. This paper studies heterogeneous database systems from the source perspective. It aims at addressing issues such as the following: (1) Where is the data from? (2) Which intermediate sources were used to arrive at that data? Specifically, it presents a polygen model for resolving these data source and intermediate source problems.
The polygen model provides a precise characterization of the source tagging problem and a solution including a polygen algebra, a data-driven query translation mechanism, and the necessary and sufficient condition for source tagging. This model has been developed as a direct extension of the relational model to the multiple database setting with source tagging capabilities, thus it enjoys all of the strengths of the traditional relational model.

Source knowledge is important for many reasons. It enables users to apply their own judgment to the credibility of the information. It enables users to rationalize and reconcile data inconsistencies. It enables system designers to develop access charge systems. It enables an application user to adjust data. And it enables a system to interpret data semantics more accurately. In sum, source tagging capabilities should be a required functionality for future heterogeneous database systems.

1. Introduction

The rapidly increasing complexity, interdependence, and competition in the global market have profoundly changed how corporations operate and how they align their information technology for competitive advantage in the marketplace. It has been argued (Madnick, 1989) that improved communications capability and data accessibility will lead to integration of systems both within and across organizational boundaries in the 1990s. This will lead to vastly improved group communications and, more importantly, the integration of business processes across traditional functional, product, and geographic lines. The integration of business processes, in turn, will accelerate demands for more effective Management Support Systems for product development, product delivery, and customer service and management (Rockart & Short, 1989). Increasingly, many important Management Support Systems require access to and seamless integration of multiple heterogeneous database systems.
These types of heterogeneous database systems have been referred to as Federated Database Systems (Czejdo, Rusinkiewicz, & Embley, 1987; Elmasri, Larson, & Navathe, 1987; Heimbigner & McLeod, 1985; Lyngbaek & McLeod, 1983), Multidatabases (Ferrier & Strangret, 1982; Litwin & Abdellatif, 1986; Litwin, et al., 1982), or Composite Information Systems (Madnick, Siegel, & Wang, 1990; Wang & Madnick, 1988).

In this paper, we study heterogeneous database systems from the multiple source perspective. In particular, we address the following two issues: (1) Where is the data from? (2) Which intermediate data sources were used to arrive at that data?

It is interesting to note that these issues have not been directly addressed to date. Contemporary heterogeneous database systems strive to encapsulate the heterogeneity of the underlying databases in order to produce an illusion that all information originates from a single anonymous source. This illusion has been referred to as location transparency or location independence (Date, 1990).

In our field studies of actual needs, we have found that although users want the simplicity of location transparency for query formulation, they also want to know the source of each piece of data retrieved (e.g., Source: Corporate Customer Database). Most managers, be they responsible for marketing, production, or finance, would not be much concerned about how independent the data is from physical storage or how distributed and heterogeneous the physical database systems are. Their primary concern about data is whether it could facilitate their decision making processes. As such, knowing the source of each piece of data may be important to them for many reasons, for example:

• Source knowledge enables managers to apply their own judgment to the credibility of the information. In our discussions with managers, several exclaimed that data would be totally useless to them unless they know its source.

• Source knowledge enables managers to rationalize and reconcile data inconsistencies. For example, the attribute "Return on Equity" for Reuters Holdings PLC retrieved from I.P. Sharp's Disclosure database (based in Toronto) has different values compared with Finsbury's Dataline database (based in London). It is likely that different accounting practices in Canada and the United Kingdom would explain the difference in values, thus knowing the countries from which the financial data was compiled would help. Furthermore, since Reuters is a UK-based company, the Dataline database may be more appropriate. In short, knowing the data sources helps managers to rationalize and reconcile the data inconsistencies as well as make their own judgment.

• Source knowledge enables a production manager to adjust data. In a manufacturing firm that we interviewed, production data was extracted from plants across the country in order to produce production reports. On the days when standard time is switched to daylight saving time and vice versa, the production volume could be less than the regular volume because the aggregates were based on 23 hours instead of the regular 24 hours. Note that this does not necessarily apply to all plants; Arizona, for instance, does not participate in the daylight saving time program. With source knowledge, the production data could be adjusted appropriately for time zone differences.

• Source knowledge enables system designers to develop access charge systems. In a major financial institution, analysts have access to multiple external commercial databases. With data source knowledge, system designers could develop systems needed for internal charge back schemes.
For example, different charges could be associated with data actually returned to the user versus intermediate data used in the query process.

• Source knowledge enables a system to interpret data semantics more accurately and completely. For example, it is typical for country-specific databases to omit explicit indication of the currency used. Thus, a product may have a price of 2416.95. Is it in U.S. dollars? Japanese yen? Or U.K. pounds? Knowing the data source can often provide the necessary clarification. In this case, if the data source is Japan, then the system can assume that the currency is in yen unless it is an exception. This information can be used in conjunction with other rules to determine the data semantics correctly.

Indeed, it has been suggested that knowing the data source is so important that it should be a required feature for heterogeneous database systems. (Footnote: One of our technical colleagues has suggested that data source tagging capabilities be added as Date's thirteenth rule for distributed database systems (Date, 1990).)

Providing source tagging capabilities for heterogeneous database systems requires an understanding of the constraints involved both organizationally and technically. Most organizations must deal with pre-existing information systems which have been developed and administered independently, and are likely to remain so. Many of these systems are controlled by autonomous subsidiaries or even separate corporations that are reluctant or unwilling to change their systems. This implies that one should not require data files to be augmented in a pre-existing information system (e.g., the Dow Jones financial services) in order to allow for data integration. We resolve this problem by tagging the data source after the data has been retrieved from a local database.

Technically, it is important to develop source tagging capabilities based on previous database research results. It would enable us to enjoy all of the strengths of the conventional database systems, such as the capabilities to allow for data sharing by multiple users concurrently and to remove many file handling details from the concern of application programmers. In order to accomplish this, we have to understand the trade-offs of different data models and the mechanisms used in these models to perform data definition and data manipulation, and develop a new algebra and a query processing mechanism for facilitating source tagging capabilities.

Our research contributions can be summarized as follows:

(1) We have developed a polygen model to study heterogeneous database systems from the multiple (poly) source (gen) perspective. The polygen model provides a precise characterization of the source tagging problem and a solution including a polygen algebra, a data-driven query translation mechanism, and the necessary and sufficient condition for source tagging. A concrete example is also provided to illustrate the basic mechanism. (Footnote: To highlight the source tagging problems, the phrase "polygen model" will be used in the paper instead of the conventional "global model." By the same token, "polygen query" will be used instead of "global query," and so on, and so forth.)

(2) We have developed the polygen model as a direct extension of the relational model to the multiple database setting with source tagging capabilities, thus the polygen model enjoys all of the strengths of the traditional relational model.

(3) We have established a theoretical foundation for resolving many other critical research issues. For example, the polygen algebra can be extended to address other basic attributes associated with data, such as the temporal aspect of data. Users normally want to know not only where the data is from but also when the data was collected and how it was collected.
Furthermore, as we motivated earlier, knowing the data source will enable a user or a query processor to interpret the data semantics more accurately; knowing data source credibility will enable the user or the query processor to further resolve potential conflicts amongst the data retrieved from different sources; and knowing data access cost will enable system designers to develop access charge systems.

1.1 SOURCE TAGGING EXAMPLE

In preparing a special report on the top ten graduate programs in Information Systems, a member of the ComputerWorld staff called one of the schools to get the names of CEOs who graduated from the school with an MBA degree. Suppose that the following SQL polygen query

    SELECT ONAME, CEO
    FROM PORGANIZATION, PALUMNUS
    WHERE CEO = ANAME AND DEGREE = "MBA"

was created given a polygen schema derived from the Alumni Database and Company Database below. For expository purposes, the prefix "P" is used to denote a polygen scheme in the polygen schema.

[Tables: Polygen Schema, Alumni Database, Company Database. Contents not recoverable from the source.]

... in order to select those CEOs who received an MBA degree. Moreover, the query processor needs to "know" that it has to merge the BUSINESS and the FIRM relations first before joining the CEO attribute with the ANAME attribute. As such, the challenge is to develop not only a polygen model but also a polygen algebra and the algorithms for a polygen query processor capable of resolving the data and intermediate source tagging problems for any arbitrary polygen query. Tagging the name Company Database accurately to the result is referred to as the Data Source Tagging problem. Tagging the intermediate use of the Alumni Database accurately is referred to as the Intermediate Source Tagging problem.

1.2 RESEARCH ISSUES AND GOALS

We have reviewed a broad range of literature and examined various research prototypes of heterogeneous distributed database systems. The systems that we studied (Gupta, 1989; Wang & Madnick, 1988) included MULTIBASE in the United States (Smith, et al., 1981), PRECI* in England (Deen, Amin, & Taylor, 1987), and MRDSM in France (Litwin & Abdellatif, 1986; Litwin, et al., 1982). In addition, we have surveyed more than forty U.S. commercial systems offering partial solutions to the heterogeneous distributed database problem, including Data Integration's MERMAID, Cincom's SUPRA, Metaphor's DIS, and TRW's Data Integration Engine (Gupta, et al., 1989). To the best of our knowledge, none of these systems have dealt with these source tagging problems.

(Footnote: In MRDSM, an administrator may define for any collection of databases a collective name called a multidatabase name. For instance, the databases Michelin, Kleber, and Gault_M may collectively get the name Rest_guides. However, the focus of such names is to simplify the expression of some commands; otherwise, these commands may require an enumeration of the corresponding databases.)

Two related issues, among others, need to be addressed in source tagging: (1) What kind of polygen model should be created in order to tag multiple sources explicitly? (2) What is the relationship between the polygen model and the polygen query processing facility?

Most heterogeneous distributed database systems adopt one of the following four data models (Hull & King, 1987; Peckham & Maryanski, 1988): the Relational Model, the Functional Data Model, the Semantic Database Model, or the Entity Relationship Model. Each data model has merits for its intended purposes. For example, both the Functional Data Model and the Semantic Database Model are rich in semantics and implemented in operational systems. The Entity Relationship Model is also rich in semantics and is widely accepted as the leading database design tool. The relational model lends itself to a simple structure and an elegant theoretical foundation. In addition, Relational Data Base Management Systems dominate the database market today. Moreover, the relational model has been extended (Codd, 1979) to capture semantics such as generalization and aggregation. In order to consider both the rigorous and the pragmatic aspects, we selected the relational model. Based on the relational model, we define, in this paper, a polygen model for resolving the data and intermediate source tagging problems.

One of the key activities in formulating composite information is to translate a polygen query into a set of local queries, which in turn are routed to the corresponding local databases. Query translation has been approached through view definition in most heterogeneous distributed database systems (Breitbart, Olson, & Thompson, 1986; Brill, Templeton, & Yu, 1984; Dayal & Hwang, 1984; Deen, Amin, & Taylor, 1987; Ferrier & Strangret, 1982; Katz & Goodman, 1981; Litwin & Abdellatif, 1986; Litwin, et al., 1982; Templeton, et al., 1983). A symbolic query transformation technique has also been proposed (Rusinkiewicz & Czejdo, 1985; Rusinkiewicz, et al., 1988) in which a syntax-directed parser converts a polygen query and transformation rules into multiway trees. Through subtree matching, these multiway query trees are further translated into local queries, given the specific source and target language syntax descriptions.

(Footnote: Each transformation rule contains a source part and a target part. For example,
    Source: SELECT attribute-1 FROM relation-1 WHERE condition;
    Target: Projection((attribute-1), Selection(condition, (relation-1)));)

As we will discuss later, our query translation mechanism differs from the above mentioned techniques in two important aspects:

(1) Instead of the view definition approach which encodes the procedure for translating a polygen query into the corresponding local queries, our mechanism separates the mapping algorithm from the mapping data. As a result, adding a new database system does not require modifying the existing procedural view definitions.

(2) Instead of the symbolic query transformation technique which tackles a broad range of nodal query languages at a higher level, our mechanism focuses on the mapping between a polygen algebraic expression and the corresponding local operations, permitting entities (and attributes) in local databases to overlap one another.
1.3 RESEARCH BACKGROUND AND ASSUMPTIONS

We have developed a heterogeneous database system which currently has access to three internal databases (the Alumni Database, the Placement Database, and the Student Database) and three external commercial databases (Finsbury's Dataline and I.P. Sharp's Disclosure and Currency). The query processor architecture is depicted in Figure 1. Briefly, the Application Query Processor translates an end-user query into a polygen query based on the user's application schema. The Polygen Query Processor (PQP) in turn translates the polygen query into a set of local queries based on the corresponding polygen schema, and routes them to the Local Query Processors (LQP). The word "polygen" is used here to signify that the query processor is equipped with source tagging capabilities. The details of the mapping and communication mechanisms between an LQP and its local databases are encapsulated in the LQP. To the PQP, each LQP behaves as a local relational system. Upon return from the LQPs, the retrieved data are further processed by the PQP in order to produce the desired composite information.

Many critical problems need to be resolved in order to provide a seamless solution to the end-user. These problems include source tagging, query translation, schema integration (Batini, Lenzerini, & Navathe, 1986; Elmasri, Larson, & Navathe, 1987), inter-database instance matching (Wang & Madnick, 1989b), domain mapping (DeMichiel, 1989; Shin, 1988), and semantic reconciliation (Wang & Madnick, 1989a). We focus on the first two problems and make the following assumptions in this paper:

• The data source is tagged after the data has been retrieved from a local database.

• The local schemata and the polygen schema are all based on the relational model.

• Schema integration has been performed, and the attribute mapping information is stored in the polygen schema.

• The inter-database instance identifier mismatching problem (e.g., IBM vs. I.B.M or social security number vs. employee identification number) has been resolved and the information is available for the PQP to use. A discussion of this issue has been presented elsewhere (Wang & Madnick, 1989b).

• The domain mismatch problem such as unit ($ vs. ¥), scale (in billions vs. in millions), and description interpretation ("expensive" vs. "$$$", "Chinese Cuisine" vs. "Hunan or Cantonese") has been resolved during schema integration and the information is also available to the PQP.

[Figure 1: The Query Processor Architecture. Recoverable labels: Application Query, Composite Answer, Application Schema, Metadata Dictionary, Query Result, DBMS.]

Section 2 defines the polygen model. Polygen query translation is presented in Section 3. Section 4 provides a detailed example of the basic polygen query processing mechanism. The necessary and sufficient condition of source tagging is presented in Section 5. Finally, concluding remarks are made in Section 6.

2. The Polygen Model

To present the polygen model more concretely, we first exemplify the attribute mapping relationships between the polygen schema and their corresponding local schemata in the form {(database, relation, attribute),...} for the source tagging example described in Section 1.
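The PORGANIZATION Polygen Scheme

    ONAME          {(AD, BUSINESS, BNAME), (CD, FIRM, FNAME)}
    INDUSTRY       {(AD, BUSINESS, IND)}
    CEO            {(CD, FIRM, CEO)}
    HEADQUARTERS   {(CD, FIRM, HQ)}

Polygen schema: {PORGANIZATION, PFINANCE, PALUMNUS, PCAREER}

A polygen domain is defined as a set of ordered triplets. Each triplet consists of three elements: the first is a datum drawn from a simple domain in an LQP. The second is a set of LDs denoting the local databases from which the datum originates. The third is a set of LDs denoting the intermediate local databases whose data led to the selection of the datum.

A polygen relation p of degree n is a finite set of time-varying n-tuples, each n-tuple having the same set of attributes drawing values from the corresponding polygen domains. A cell in a polygen relation is an ordered triplet $c = (c\langle d\rangle, c\langle o\rangle, c\langle i\rangle)$ where $c\langle d\rangle$ denotes the datum portion, $c\langle o\rangle$ the originating source portion, and $c\langle i\rangle$ the intermediate source portion. Two polygen relations are union-compatible if their corresponding attributes are defined on the same polygen domain.

Note that P contains the mapping information between a polygen scheme and the corresponding local relational schemes. In contrast, p contains the actual time-varying data and their originating sources. A polygen scheme P and a polygen relation p may be used synonymously without confusion. The data and intermediate source tags for p are updated along the way as polygen algebraic operations are performed.

2.1 THE POLYGEN ALGEBRA

Let attrs(p) denote the set of attributes in p. For each tuple t in a polygen relation p, let $t\langle d\rangle$ denote the data portion, $t\langle o\rangle$ the originating source portion, and $t\langle i\rangle$ the intermediate source portion. If $x \in attrs(p)$ and $X = \{x_1, \ldots, x_i, \ldots, x_j\}$ is a sublist of attrs(p), then let p(x) be the column in p corresponding to attribute x, let p(X) be the columns in p corresponding to the sublist of attributes X, let t(x) be the cell in t corresponding to attribute x, and let t(X) be the cells in t corresponding to the sublist of attributes X. As such, $p(x)\langle o\rangle$ denotes the originating source portion of the column corresponding to attribute x in polygen relation p while $t(X)\langle i\rangle$ denotes the intermediate source portion of the cells corresponding to the sublist of attributes X in tuple t. On the other hand, p(x) denotes the column corresponding to attribute x in polygen relation p inclusive of the data, originating source, and intermediate source portions while t(X) denotes the cells corresponding to the sublist of attributes X in tuple t inclusive of the data, originating source, and intermediate source portions. Note that the "( )" notation in p(X) should not be confused with the "[ ]" notation of operations such as project, p[X], and restrict, p[x = y].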
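The cell triplet and the attribute mapping described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Cell`, `tag_on_retrieve`, and `candidate_sources` are hypothetical, and the sketch assumes that a datum retrieved from a local database starts with that database as its only originating source and with no intermediate sources.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    """One polygen cell: (datum, originating sources, intermediate sources)."""
    datum: object
    origin: frozenset = frozenset()   # c<o>: databases the datum came from
    inter: frozenset = frozenset()    # c<i>: databases that led to its selection

# Attribute mapping of the PORGANIZATION polygen scheme, in the form
# {(database, relation, attribute), ...} used in the text.
PORGANIZATION = {
    "ONAME": {("AD", "BUSINESS", "BNAME"), ("CD", "FIRM", "FNAME")},
    "INDUSTRY": {("AD", "BUSINESS", "IND")},
    "CEO": {("CD", "FIRM", "CEO")},
    "HEADQUARTERS": {("CD", "FIRM", "HQ")},
}

def tag_on_retrieve(rows, database):
    """Wrap plain local rows (dicts) into polygen tuples.

    Assumption: every retrieved cell is tagged with the database it was
    retrieved from as its originating source; no intermediate sources yet.
    """
    return [
        {attr: Cell(value, frozenset({database})) for attr, value in row.items()}
        for row in rows
    ]

def candidate_sources(scheme, attribute):
    """Databases that may contribute data to a polygen attribute."""
    return {db for db, _relation, _attribute in scheme[attribute]}

# Hypothetical local row retrieved from the Company Database (CD).
firm_rows = tag_on_retrieve([{"FNAME": "Genentech"}], "CD")
```

Because the tag is attached at retrieval time inside the query processor, nothing in the local databases has to change, which matches the first assumption listed in Section 1.3.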
It has been shown (Maier, 1983) that in a conventional relational system, a relational algebra can be defined through five orthogonal algebraic operators. Here we define the five orthogonal algebraic operators in the context of our polygen model:

Project. If p is a polygen relation, and $X = \{x_1, \ldots, x_i, \ldots, x_j\}$ is a sublist of attrs(p), then

\[
p[X] = \left\{ t' \;\middle|\;
\begin{array}{l}
t' = t(X) \quad \text{if } t \in p \,\wedge\, t(X)\langle d\rangle \text{ is unique}; \\[2pt]
t'\langle d\rangle = t_i(X)\langle d\rangle, \\
t'(x_j)\langle o\rangle = t_i(x_j)\langle o\rangle \cup \cdots \cup t_k(x_j)\langle o\rangle \;\; \forall x_j \in X, \\
t'(x_j)\langle i\rangle = t_i(x_j)\langle i\rangle \cup \cdots \cup t_k(x_j)\langle i\rangle \;\; \forall x_j \in X, \\
\quad \text{if } t_i, \ldots, t_k \in p \,\wedge\, t_i(X)\langle d\rangle = \cdots = t_k(X)\langle d\rangle
\end{array}
\right\}
\]

The above expression specifies that if the data portion of a projected tuple is unique, then the originating source portion and the intermediate source portion are identical to those before the projection. This is correct because in this case, only one piece of data is involved. On the other hand, if k tuples have the same projected data portion, then the project operator will take any one of the data portions of these tuples to be the projected data portion (since they are the same), and take the union of the originating source portions as the new originating source portion for each of the cells in the projected relation. By the same token, the project operator will take the union of the intermediate source portions as the new intermediate source portion for each of the cells in the projected relation. This is also correct because the projected data have been drawn from all the originating sources and have been derived with the involvement of all the intermediate sources.

Cartesian product. If $p_1$ and $p_2$ are two polygen relations, then

\[
(p_1 \times p_2) = \{\, t_1 \circ t_2 \mid t_1 \in p_1 \text{ and } t_2 \in p_2 \,\}
\]

where $\circ$ denotes concatenation.

The above expression specifies that each tuple in $p_1$ is concatenated with every tuple in $p_2$ following the definition of the Cartesian product. Since no data items are merged in this case, the originating source and intermediate source portions remain the same.

Restrict. If p is a polygen relation, $x \in attrs(p)$, $y \in attrs(p)$, and $\theta$ is a binary relation, then

\[
p[x \,\theta\, y] = \left\{ t' \;\middle|\;
\begin{array}{l}
t'\langle d\rangle = t\langle d\rangle,\;\; t'\langle o\rangle = t\langle o\rangle, \\
t'(w)\langle i\rangle = t(w)\langle i\rangle \cup t(x)\langle o\rangle \cup t(y)\langle o\rangle \;\; \forall w \in attrs(p), \\
\quad \text{if } t \in p \,\wedge\, t(x)\langle d\rangle \;\theta\; t(y)\langle d\rangle
\end{array}
\right\}
\]

The above expression specifies that for each of the tuples in p that satisfies the $\theta$ relation, the restrict operator will update the intermediate source portion to include the originating sources of t(x) and t(y) because they are used to produce the new polygen relation. The originating source portion is not affected because the data portion is unique (we assume that duplicate tuples are treated as separate tuples). Since Select and Join are defined through Restrict, they also update $t\langle i\rangle$.

Union. If $p_1$ and $p_2$ are two polygen relations and both have degree n, $t_1 \in p_1$, $t_2 \in p_2$, then

\[
(p_1 \cup p_2) = \left\{ t' \;\middle|\;
\begin{array}{l}
t' = t_1 \quad \text{if } t_1\langle d\rangle \in p_1 \,\wedge\, t_1\langle d\rangle \notin p_2; \\
t' = t_2 \quad \text{if } t_2\langle d\rangle \notin p_1 \,\wedge\, t_2\langle d\rangle \in p_2; \\
t'\langle d\rangle = t_1\langle d\rangle,\;\; t'\langle o\rangle = t_1\langle o\rangle \cup t_2\langle o\rangle,\;\; t'\langle i\rangle = t_1\langle i\rangle \cup t_2\langle i\rangle \quad \text{if } t_1\langle d\rangle = t_2\langle d\rangle
\end{array}
\right\}
\]

The above expression specifies that for the tuples that exist in only one polygen relation, the union operator will copy them over to be the new tuples. On the other hand, if the data portions of two tuples, $t_1\langle d\rangle$ and $t_2\langle d\rangle$, are identical, then the operator will copy the data portion over to the data portion of the new tuple, and take the union of the originating sources to be the originating sources for each of the cells in the new tuple. By the same token, the operator will take the union of the intermediate source portions as the new intermediate source portion for each of the cells in the new tuple.

Difference. Let $p\langle o\rangle$ denote the union of all the $t\langle o\rangle$ sets in p, and $p\langle i\rangle$ denote the union of all the $t\langle i\rangle$ sets in p. If $p_1$ and $p_2$ are two polygen relations and both have degree n, then

\[
(p_1 - p_2) = \left\{ t' \;\middle|\;
\begin{array}{l}
t'\langle d\rangle = t\langle d\rangle,\;\; t'\langle o\rangle = t\langle o\rangle, \\
t'(w)\langle i\rangle = t(w)\langle i\rangle \cup p_2\langle o\rangle \cup p_2\langle i\rangle \;\; \forall w \in attrs(p_1), \\
\quad \text{if } t \in p_1 \,\wedge\, t\langle d\rangle \notin p_2
\end{array}
\right\}
\]

Difference selects a tuple in $p_1$ to be a tuple in $(p_1 - p_2)$ if the data portion of the tuple in $p_1$ is not identical to those of the tuples in $p_2$. Since each tuple in $p_1$ needs to be compared with all the tuples in $p_2$, it follows that all the originating sources of the data in $p_2$ should be included in the intermediate source set of $(p_1 - p_2)$, as $t'\langle i\rangle = t\langle i\rangle \cup p_2\langle o\rangle \cup p_2\langle i\rangle$ denotes.

Other traditional operators can be defined in terms of the above five operators. The most common are Join, Select, and Intersection. Join and Select are defined as the restriction of a Cartesian product. Intersection is defined as the project of a join over all the attributes in each of the relations involved in the Intersection.

In order to process a polygen query, we also need to introduce the following new operators to the polygen model: Retrieve, Coalesce, Outer Natural Primary Join, Outer Natural Total Join, and Merge.

A local database relation needs to be retrieved from a local database to the PQP first before it is considered as a PQP base relation. This is required in the polygen model because a polygen operation may require data from multiple local databases. Although a PQP base relation can be materialized dynamically like a view in the conventional database system, for conceptual purposes, we define it to reside physically in the PQP. (Footnote: This approach simplifies the Polygen Operation Interpreter, to be presented in Section 3.) The Retrieve operation is defined as an LQP Restrict operation without any restricting condition.

Coalesce and Outer Natural Join have been informally introduced by Date to handle a surprising number of practical applications (Date, 1983). Coalesce takes two columns as input, and coalesces them into one column. An Outer Natural Join is an outer join with the join attributes coalesced. We define an Outer Natural Primary Join as an Outer Natural Join on the primary key of a polygen relation. For example, the Outer Natural Primary Join for PORGANIZATION would perform an Outer Natural Join on ONAME. An Outer Natural Total Join is an Outer Natural Primary Join with all the other polygen attributes in the polygen relation coalesced as well. In the PORGANIZATION example, an Outer Natural Total Join is an Outer Natural Primary Join on ONAME followed by a number of Coalesce operations on INDUSTRY, CEO, and HEADQUARTERS. A Merge extends Outer Natural Total Join to include more than two polygen relations. It can be shown that the order in which Outer Natural Total Joins are performed over a set of polygen relations in a Merge is immaterial.

Since Coalesce can be used in conjunction with the other polygen algebraic operators to define the Outer Natural Primary Join, Outer Natural Total Join, and Merge, we define Coalesce as the sixth orthogonal primitive of the polygen model.

Coalesce. Let $\circledcirc$ denote the coalesce operator. If p is a polygen relation, $x \in attrs(p)$, $y \in attrs(p)$, $z = attrs(p) - \{x, y\}$, and w is the coalesced attribute of x and y, then

\[
p[x \circledcirc y : w] = \left\{ t' \;\middle|\;
\begin{array}{l}
t'(z) = t(z),\; t'(w)\langle d\rangle = t(x)\langle d\rangle,\; t'(w)\langle o\rangle = t(x)\langle o\rangle \cup t(y)\langle o\rangle,\; t'(w)\langle i\rangle = t(x)\langle i\rangle \cup t(y)\langle i\rangle \quad \text{if } t(x)\langle d\rangle = t(y)\langle d\rangle; \\
t'(z) = t(z),\; t'(w)\langle d\rangle = t(x)\langle d\rangle,\; t'(w)\langle o\rangle = t(x)\langle o\rangle,\; t'(w)\langle i\rangle = t(x)\langle i\rangle \quad \text{if } t(y)\langle d\rangle = nil; \\
t'(z) = t(z),\; t'(w)\langle d\rangle = t(y)\langle d\rangle,\; t'(w)\langle o\rangle = t(y)\langle o\rangle,\; t'(w)\langle i\rangle = t(y)\langle i\rangle \quad \text{if } t(x)\langle d\rangle = nil
\end{array}
\right\}
\]

The above expression specifies that attribute x and attribute y will be coalesced into a new attribute called w. If both the data portion of attribute x and that of attribute y exist (and by the definition of coalesce, they must have the same value), then the coalesce operator will copy the data portion of the cell over to attribute w, and take the union of the originating sources to be the originating sources of the cell in the new tuple. By the same token, the operator will take the union of the intermediate source portions as the new intermediate source portion of the cell in the new tuple. For those tuples in which either the data portion of attribute x or that of attribute y does not exist, the operator will copy the cell with data over to the new tuple. Note that in a heterogeneous distributed environment, the data values to be coalesced may be inconsistent. It is assumed that inter-database instance mismatching problems (Wang & Madnick, 1989b) will be resolved before the coalesce operation is performed.
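The six primitives defined in Section 2.1 can be sketched in Python to make the tag propagation rules concrete. This is an illustrative sketch under stated assumptions, not the authors' implementation: a polygen relation is assumed to be a list of dicts mapping attribute names to `Cell` triplets, `None` stands for nil, the function names are hypothetical, and the sample data (including the name "J. Doe") is invented for illustration.

```python
from typing import Dict, FrozenSet, List, NamedTuple

class Cell(NamedTuple):
    d: object                          # datum portion
    o: FrozenSet[str] = frozenset()    # originating sources
    i: FrozenSet[str] = frozenset()    # intermediate sources

Row = Dict[str, Cell]

def data(t: Row, attrs=None):
    """Data portion of a tuple, as a hashable key."""
    return tuple(t[a].d for a in (sorted(t) if attrs is None else attrs))

def merge_tags(t1: Row, t2: Row) -> Row:
    """Union the source tags cell by cell; data portions must be identical."""
    return {a: Cell(t1[a].d, t1[a].o | t2[a].o, t1[a].i | t2[a].i) for a in t1}

def project(p: List[Row], X: List[str]) -> List[Row]:
    out = {}
    for t in p:
        s = {a: t[a] for a in X}
        k = data(s, X)
        out[k] = merge_tags(out[k], s) if k in out else s
    return list(out.values())

def product(p1: List[Row], p2: List[Row]) -> List[Row]:
    # Assumes the two relations use disjoint attribute names.
    return [{**t1, **t2} for t1 in p1 for t2 in p2]

def restrict(p: List[Row], x: str, theta, y: str) -> List[Row]:
    out = []
    for t in p:
        if theta(t[x].d, t[y].d):
            used = t[x].o | t[y].o   # sources consulted to select this tuple
            out.append({a: Cell(c.d, c.o, c.i | used) for a, c in t.items()})
    return out

def union(p1: List[Row], p2: List[Row]) -> List[Row]:
    # Simplification: duplicates inside one input are merged as well.
    out = {}
    for t in p1 + p2:
        k = data(t)
        out[k] = merge_tags(out[k], t) if k in out else t
    return list(out.values())

def difference(p1: List[Row], p2: List[Row]) -> List[Row]:
    seen = {data(t) for t in p2}
    used = frozenset().union(*(c.o | c.i for t in p2 for c in t.values()))
    return [{a: Cell(c.d, c.o, c.i | used) for a, c in t.items()}
            for t in p1 if data(t) not in seen]

def coalesce(p: List[Row], x: str, y: str, w: str) -> List[Row]:
    out = []
    for t in p:
        cx, cy = t[x], t[y]
        if cx.d is None:
            cw = cy
        elif cy.d is None:
            cw = cx
        else:  # assumes cx.d == cy.d (instance mismatches already resolved)
            cw = Cell(cx.d, cx.o | cy.o, cx.i | cy.i)
        new = {a: c for a, c in t.items() if a not in (x, y)}
        new[w] = cw
        out.append(new)
    return out

# Hypothetical data: a join (restrict over a product) followed by a project.
AD, CD = frozenset({"AD"}), frozenset({"CD"})
alumni = [{"ANAME": Cell("J. Doe", AD), "DEGREE": Cell("MBA", AD)}]
org = [{"ONAME": Cell("Genentech", AD | CD), "CEO": Cell("J. Doe", CD)}]
joined = restrict(product(alumni, org), "ANAME", lambda a, b: a == b, "CEO")
result = project(joined, ["ONAME", "CEO"])
```

In this hypothetical run the organization name keeps its originating sources (AD and CD), while every cell of the result gains AD and CD as intermediate sources, because the ANAME and CEO cells were compared in order to select the tuple.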
We have presented the polygen model and the polygen algebra. The algebra will be used in Section 4 to compose information with data source tags and intermediate source tags. In order to do that, it is necessary to know the process of translating a polygen query into a query execution plan. This is presented below.

3. Polygen Query Translation

For the SQL polygen query presented in Section 1, a corresponding polygen algebraic expression is as follows:

    ((PALUMNUS [DEGREE = "MBA"]) [ANAME = CEO] PORGANIZATION) [ONAME, CEO]

In this expression, those alumni with an MBA degree are selected from the PALUMNUS relation. The result is joined with the PORGANIZATION relation where an alumnus is also a CEO, followed by a projection on ONAME and CEO.

In general, the PQP takes a polygen algebraic expression as an input and produces a query execution plan for retrieving data from the local databases and formulating composite information. Three components are involved in this process: the Algebraic Analyzer, the Polygen Operation Interpreter, and the Query Optimizer, as shown in Figure 2.

[Figure 2: The Polygen Query Translation Process. Recoverable labels: Polygen Algebraic Expression, Polygen Operation Matrix, Intermediate Operation Matrix, Query Execution Plan.]

The Algebraic Analyzer parses a polygen algebraic expression and generates a Polygen Operation Matrix. For example, the Polygen Operation Matrix for the example polygen algebraic expression is presented in Table 1 below. The first row indicates that a Select operation should be performed on the Left-Hand Relation (LHR) PALUMNUS using the θ relation "=" between the Left-Hand Attribute (LHA) DEGREE and the Right-Hand Attribute (RHA) "MBA." In this case, there is no need for a Right-Hand Relation (RHR). The result is denoted by R(1), a Polygen Relation (PR). Details of the Algebraic Analyzer are beyond the scope of this paper.

[Table 1: Polygen Operation Matrix for the Example Polygen Algebraic Expression. Contents not recoverable from the source.]

The input to pass one is a Polygen Operation Matrix, as Table 1 exemplifies, and an empty Intermediate Operation Matrix.
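A Polygen Operation Matrix for the example expression might be represented as follows. This is a hedged sketch: only the first row is described in the text, the second and third rows are assumptions inferred from the algebraic expression, and the column order, the class name `PomRow`, and the helper `base_relations` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class PomRow:
    """One row of a Polygen Operation Matrix (assumed column layout)."""
    pr: str                      # resulting Polygen Relation, e.g. R(1)
    op: str                      # polygen operation
    lhr: str                     # Left-Hand Relation
    lha: Optional[str] = None    # Left-Hand Attribute(s)
    theta: Optional[str] = None  # comparison relation
    rha: Optional[str] = None    # Right-Hand Attribute or constant
    rhr: Optional[str] = None    # Right-Hand Relation

POM: List[PomRow] = [
    # Row 1 follows the description in the text.
    PomRow("R(1)", "Select", "PALUMNUS", "DEGREE", "=", '"MBA"'),
    # Rows 2 and 3 are assumptions derived from the algebraic expression.
    PomRow("R(2)", "Join", "R(1)", "ANAME", "=", "CEO", "PORGANIZATION"),
    PomRow("R(3)", "Project", "R(2)", "ONAME, CEO"),
]

def base_relations(pom: List[PomRow]) -> List[str]:
    """Polygen schemes referenced by the matrix, excluding intermediate results."""
    results = {row.pr for row in pom}
    found: List[str] = []
    for row in pom:
        for rel in (row.lhr, row.rhr):
            if rel is not None and rel not in results and rel not in found:
                found.append(rel)
    return found
```

Under these assumptions, `base_relations(POM)` lists the polygen schemes that still have to be resolved against the polygen schema before local queries can be issued.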
The Polygen Operation Interpreter (POI) processes the Polygen Operation Matrix in two passes. The input to pass one is a Polygen Operation Matrix, as exemplified in Table 1, and an empty Intermediate Operation Matrix. The output from pass one (and input to pass two) is a half-processed Intermediate Operation Matrix (IOM), as shown in Table 2. The execution location (EL) of an operation depends on where the data resides. Note that when the execution location is an LQP (e.g., AD in the first row of Table 2), it is also used as the originating source tag for each cell, c(o), of the polygen base relation (R(1) in this case).

Table 2: A Half-Processed IOM Generated by Pass One of the POI Algorithm

Table 3: An Intermediate Operation Matrix

The Query Optimizer weighs the alternative execution plans, factors the differences in speeds into its cost evaluation, and ensures that all queries sent to a local DBMS can be processed there (Dayal, 1983). In this example, the first two rows of Table 3 are routed to the Alumni Database (AD) LQP simultaneously and the third row to the Company Database (CD) LQP. The returned relations are further processed by the PQP in order to produce a composite answer. Note also that the local database systems will most likely have their own high-level query languages, such as SQL, with their own optimization methods. As such, the algebraic expressions could be synthesized before being sent to the corresponding local database systems.

4. Example Source Tagging in the PQP

We now illustrate the processing of the example polygen query, assuming the following local relations and using Table 3 as a query execution plan.

The Alumnus Relation (AD)
The Career Relation (AD)
The Business Relation (AD)

(3) The polygen query processor can derive, from the polygen schema for ONAME, the information that Genentech is from the Alumni Database's BNAME and the Company Database's FNAME. This information can be shown to the user upon request with a simple mapping.

In this simple example, the data source information can be obtained by inspection from the polygen schema.
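How the ONAME cell for Genentech comes to carry both source tags can be sketched with the coalesce operator of Section 2.1. The code below is a minimal illustration under stated assumptions, not the PQP implementation: the `Cell` class and the `coalesce` function are hypothetical names, `None` stands in for nil, and the intermediate source sets of the two sample cells are left empty for simplicity.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Cell:
    """A polygen cell: data portion plus originating and intermediate source sets."""
    data: Optional[str]                      # None plays the role of nil
    origin: FrozenSet[str] = frozenset()     # originating sources, c<o>
    inter: FrozenSet[str] = frozenset()      # intermediate sources, c<i>


def coalesce(x: Cell, y: Cell) -> Cell:
    """Coalesce two cells of one tuple into the cell of the new attribute w."""
    if x.data is None:
        # t(x)<d> = nil: copy the data and tags of y.
        return y
    if y.data is None:
        # t(y)<d> = nil: copy the data and tags of x.
        return x
    if x.data == y.data:
        # Same data: keep it once and take the union of both tag sets.
        return Cell(x.data, x.origin | y.origin, x.inter | y.inter)
    # The paper assumes instance mismatches are resolved before coalescing.
    raise ValueError("data values differ; resolve the mismatch before coalescing")


bname = Cell("Genentech", frozenset({"AD"}))   # from the Alumni Database
fname = Cell("Genentech", frozenset({"CD"}))   # from the Company Database
oname = coalesce(bname, fname)
print(oname.data, sorted(oname.origin))        # Genentech ['AD', 'CD']
```

Raising an error on unequal data values is a design choice of this sketch; it simply makes the paper's assumption (that mismatches are resolved beforehand) explicit instead of silently picking one value.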
The intermediate source information is not observable from the polygen schema. In a federated database system with hundreds of databases, in which a polygen query is optimized to select only the relevant databases for information retrieval, the data source information observed from the polygen schema is a superset of the result obtained by the PQP. We now turn our attention to other theoretical issues of source tagging.

5. The Necessary and Sufficient Condition of Source Tagging

The polygen model presented in Section 2 is based on the assumption that the source is tagged at the cell level after the data has been retrieved from a local database. Two fundamental issues are addressed in this section: (1) How many other potential approaches exist for source tagging? (2) Does the closure property hold for the polygen algebra? (That is, does a polygen operation over a set of polygen relations always produce a polygen relation?) We address these two issues through the following lemma and theorem. Specifically, we show that although there are four conceivable ways to tag sources, the closure property holds if and only if sources are tagged by cell.

(Lemma) In extending the Relational Model to a polygen model, there exist four ways of source tagging: by cell, by tuple, by attribute, and by relation.

Since the polygen model is based on the Relational Model, the granularity of a data object to be tagged cannot be coarser than a relation, because a relation is the basic unit of an algebraic operation. On the other hand, the granularity cannot be finer than a cell, because a cell is the smallest unit of a relation. In addition, source tags are deleted or updated by algebraic operators, and all of them perform operations either by tuple (Cartesian product, union, difference, and restrict) or by attribute (project, coalesce). It follows that sources may be tagged by cell, by tuple, by attribute, or by relation.
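The difference between these granularities can be made concrete with a small sketch. It assumes, as in Section 2, that a union merges the originating source tags of matching tuples cell by cell. The relation contents, the source names S1 and S2, and all function names are hypothetical. After the union, the cells of a single attribute carry different tag sets, which a single attribute-level or relation-level tag could not record.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Row = Tuple[str, ...]                 # data portion of a tuple
Tags = Tuple[FrozenSet[str], ...]     # one originating-source set per cell
Relation = Dict[Row, Tags]


def base_relation(rows: Iterable[Row], source: str) -> Relation:
    """Tag every cell of a freshly retrieved relation with its local source."""
    tag = frozenset({source})
    return {row: tuple(tag for _ in row) for row in rows}


def union(r1: Relation, r2: Relation) -> Relation:
    """Union by data portion; matching tuples merge their tags cell by cell."""
    result = dict(r1)
    for row, tags in r2.items():
        if row in result:
            result[row] = tuple(a | b for a, b in zip(result[row], tags))
        else:
            result[row] = tags
    return result


def tags_per_attribute(rel: Relation) -> List[Set[FrozenSet[str]]]:
    """Collect, for each attribute, the distinct tag sets found in its cells."""
    return [set(column) for column in zip(*rel.values())]


r1 = base_relation([("Genentech",), ("Acme",)], "S1")
r2 = base_relation([("Genentech",), ("Globex",)], "S2")
u = union(r1, r2)

distinct = tags_per_attribute(u)
print(len(distinct[0]))   # 3
```

The single attribute of the result holds three distinct tag sets ({S1, S2}, {S1}, and {S2}), so only a per-cell tag describes it faithfully.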
(Theorem) The closure property holds if and only if source tagging is by cell.

For consistency, we use the notations developed in Section 2. Let E denote the set of results obtained from all possible combinations of algebraic operations defined in a polygen model, and let f denote an algebraic operation. Let $e_1 = f(e_i)$ if f is project, restrict, or coalesce, and $e_1 = f(e_i, e_i')$ if f is Cartesian product, union, or difference, where $e_i$ and $e_i'$ denote two base polygen relations. Similarly, let $e_{k+1} = f(e_k)$ for some $e_k \in E$ if f is project, restrict, or coalesce, and $e_{k+1} = f(e_k, e_k')$ for some $e_k, e_k' \in E$ if f is Cartesian product, union, or difference. If e is a polygen relation defined by the polygen model $\forall\, e \in E$, then the closure property holds. Only the originating source portion is shown below; the intermediate source portion can be shown by the same token.

(Proof) Part 1: The closure property holds => Source tagging is by cell.

Suppose that the closure property holds and that there exists a polygen model in which source tagging is not by cell. It follows, by the Lemma, that source tagging is by relation, by attribute, or by tuple.

If source tagging is by relation, then a relation in this polygen model can be expressed as $(e, e\langle o\rangle)$. Consider the Cartesian product of $e_1$ and $e_2$, where $(e_1, e_1\langle o\rangle)$ and $(e_2, e_2\langle o\rangle)$ denote two base polygen relations. By definition, the operation yields $\{t_1 \circ t_2 : t_1 \in e_1 \text{ and } t_2 \in e_2\}$ ($\circ$ denotes concatenation). However, the result cannot be expressed in the form of $(e, e\langle o\rangle)$ because the originating source tags from $e_1$ may be different from those of $e_2$. It follows that source tagging by relation is not feasible.

If source tagging is by attribute, then an attribute in this polygen model can be expressed as $(e(x), e(x)\langle o\rangle)$. Consider union. By definition, $t'\langle d\rangle = t_1\langle d\rangle = t_2\langle d\rangle$ and $t'(x)\langle o\rangle = t_1(x)\langle o\rangle \cup t_2(x)\langle o\rangle$. However, the result cannot be expressed in the form of $(e(x), e(x)\langle o\rangle)$ because $t_1(x)\langle o\rangle$ may be different from $t_2(x)\langle o\rangle$. It follows that source tagging by attribute is not feasible.

If source tagging is by tuple, then a tuple in this polygen model can be expressed as $(t, t\langle o\rangle)$. Consider the Cartesian product. By the similar argument, the result cannot be expressed in the form of $(t, t\langle o\rangle)$. It follows that source tagging by tuple is not feasible. By contradiction, we conclude that the proposition is true.

Part 2: Source tagging is by cell => The closure property holds.

The premise that source tagging is by cell justifies the usage of the polygen model presented in Section 2 in the following proof, which proceeds by induction. By the model's definition, $t\langle o\rangle$ is the set of the originating sources, and the closure property holds for $e_1$, $\forall\, e_1 \in E$. Assuming that the closure property holds $\forall\, e_k \in E$, we show that the closure property also holds $\forall\, e_{k+1} \in E$.

For projection, $e_{k+1} = e_k(X)$. Two cases need to be considered: (1) $t(X)\langle d\rangle$ is unique, and (2) $t_1(X)\langle d\rangle = \dots = t_k(X)\langle d\rangle$. In the first case, $t'(x_j)\langle o\rangle = t(x_j)\langle o\rangle$ $\forall\, x_j \in X$. It follows that the closure property holds for $e_{k+1} = e_k(X)$. In the second case, $t'(x_j)\langle o\rangle = t_1(x_j)\langle o\rangle \cup \dots \cup t_k(x_j)\langle o\rangle$ $\forall\, x_j \in X$. Since the closure property holds for $t_1(X)\langle o\rangle, \dots, t_k(X)\langle o\rangle$, the closure property also holds for $t'(x_j)\langle o\rangle$ $\forall\, x_j \in X$.

For Cartesian product, $e_{k+1} = (e_k \times e_k') = \{t_1 \circ t_2 : t_1 \in e_k \text{ and } t_2 \in e_k'\}$, where $\circ$ denotes concatenation. For difference, $e_{k+1} = (e_k - e_k') = \{t : t \in e_k \text{ and } t\langle d\rangle \notin e_k'\}$. For restrict, $e_{k+1} = e_k(x \,\theta\, y) = \{t : t \in e_k \wedge (t(x)\langle d\rangle \,\theta\, t(y)\langle d\rangle)\}$. Since $t\langle o\rangle$ remains the same in Cartesian product, difference, and restrict, it follows that the closure property holds for $e_{k+1} = (e_k \times e_k')$, $e_{k+1} = (e_k - e_k')$, and $e_{k+1} = e_k(x \,\theta\, y)$. The closure property holds for union and coalesce following similar arguments. From the Principle of Mathematical Induction, we conclude that the proposition is true.

6. Summary and Conclusions

We have presented a polygen model for resolving the Data Source Tagging and Intermediate Source Tagging problems. Furthermore, we have presented a data-driven query translation mechanism for dynamically mapping a polygen algebraic expression into a set of intermediate polygen operations. The polygen model research addresses issues in data integration from the "where" perspective, a perspective that, to the best of our knowledge, has not been studied to date.

A prototype, called System P, has been implemented (Yuan, 1990) to demonstrate the feasibility of the polygen model and the polygen query processing capability presented in this paper.

This research has also provided us with a theoretical foundation for further investigation of many other critical research issues in heterogeneous distributed systems, for example the cardinality inconsistency problem, which is inherent in heterogeneous database systems.* It also enables us to interpret information from different sources more accurately. By storing the metadata about each of the data sources in the PQP, many domain mismatch, semantic reconciliation, and data conflict problems could be resolved systematically using the data and intermediate source tags. Furthermore, other polygen models can be developed for heterogeneous distributed database systems based on the Entity Relationship Model, the Functional Data Model, and the more recent object-oriented models (Manola & Dayal, 1986; Shaw & Zdonik, 1990).

* Under the relational assumption, the cardinality inconsistency problem exists in heterogeneous database systems because referential integrity is not enforceable over multiple pre-existing databases which have been developed and administered independently and are likely to remain so.
The data source and intermediate source information can be very valuable to the user as well as to the polygen query processor in formulating cost-effective, customized, and credible composite information in a federated database environment. As more and more important applications require seamless access to and integration of data from multiple heterogeneous database systems, both within and across organizational boundaries, these capabilities will also become increasingly critical.

Appendix: The Operations that Generate Table 5

The second and third rows of Table 3 indicate that the BUSINESS and FIRM relations should be retrieved from the Alumni Database and the Company Database respectively. As such, the corresponding data source cells are the sets (AD) and (CD) respectively, as shown in Table A1 and Table A2 below. The intermediate source set is an empty set because no other data sources have been involved in obtaining these relations.

Table A1: The Business Relation

Table A3: The Outer Join of Table A1 and Table A2

References

[1] Batini, C., Lenzerini, M., & Navathe, S. (1986). A comparative analysis of methodologies for database schema integration. ACM Computing Surveys, 18(4), pp. 323-364.
[2] Bernstein, P. A., et al. (1981). Query Processing in a System for Distributed Databases (SDD-1). ACM Transactions on Database Systems, 6(4), pp. 602-623.
[3] Breitbart, Y., Olson, P. L., & Thompson, G. R. (1986). Database integration in a distributed heterogeneous database system. Los Angeles, CA. February 1986. pp. 301-310.
[4] Brill, D., Templeton, M., & Yu, C. (1984). Distributed query processing strategies in MERMAID, a frontend to data management systems. First International Conference on Data Engineering. Los Angeles, CA. February 1984. pp. 301-310.
[5] Ceri, S. & Pelagatti, G. (1984). Distributed Databases Principles & Systems (1st ed.). McGraw-Hill.
[6] Codd, E. F.
(1979). Extending the relational database model to capture more meaning. ACM Transactions on Database Systems, 4(4), pp. 397-434.
[7] Czejdo, B., Rusinkiewicz, M., & Embley, D. (1987). An approach to schema integration and query formulation in federated database systems. The 3rd International Conference on Data Engineering. Los Angeles, CA. 1987. pp. 477-484.
[8] Date, C. J. (1983). The outer join. The 2nd International Conference on Databases. Cambridge, England. 1983. pp. 76-106.
[9] Date, C. J. (1990). An Introduction to Database Systems (5th ed.). Addison Wesley.
[10] Dayal, U. (1983). Processing Queries Over Generalization Hierarchies in a Multidatabase System. The 9th International Conference on Very Large Data Bases. August 1983. pp. 342-353.
[11] Dayal, U. & Hwang, K. (1984). View definition and generalization for database integration in multidatabase system. IEEE Transactions on Software Engineering, SE-10. pp. 628-644.
[12] Deen, S. M., Amin, R. R., & Taylor, M. C. (1987). Data integration in distributed databases. IEEE Transactions on Software Engineering, SE-13. pp. 860-864.
[13] Deen, S. M., Amin, R. R., & Taylor, M. C. (1987). Implementation of a prototype for PRECI*. Computer Journal, 30(2), pp. 157-162.
[14] DeMichiel, L. G. (1989). Performing operations over mismatched domains. The Fifth International Conference on Data Engineering. Los Angeles, CA. February 1989. pp. 36-45.
[15] Elmasri, R., Larson, J., & Navathe, S. (1987). Schema integration algorithms for federated databases and logical database design. Honeywell Inc., Submitted for Publication. 1987.
[16] Epstein, R., Stonebraker, M., & Wong, E. (1978). Distributed Query Processing in a Relational Database System. ACM-SIGMOD Conference. May 1978.
[17] Ferrier, A. & Stangret, C. (1982). Heterogeneity in the distributed database management system Sirius-Delta. The 8th International Conference on Very Large Data Bases. Mexico City, Mexico. 1982.
[18] Gupta, A. (1989). Integration of Information Systems: Bridging Heterogeneous Databases (pp. 334). New York, N.Y.: IEEE Press.
[19] Gupta, A., et al. (1989). An architecture comparison of contemporary approaches and products for integrating heterogeneous information systems. Sloan School of Management, MIT, Cambridge, MA 02139. 1989.
[20] Heimbigner, D. & McLeod, D. (1985). A Federated architecture for information management. ACM Transactions on Office Information Systems, 3, pp. 253-278.
[21] Hevner, A. R. & Yao, S. B. (1979). Query Processing in Distributed Database Systems. IEEE Transactions on Software Engineering, SE-5(3). pp. 177-187.
[22] Hull, R. & King, R. (1987). Semantic database modeling: survey, applications, and research issues. ACM Computing Surveys, 19(3), pp. 201-260.
[23] Katz, R. H. & Goodman, N. (1981). View processing in MULTIBASE, a heterogeneous database system. In P. P. Chen (Ed.), View processing in MULTIBASE, a heterogeneous database system (pp. 259-279). ER Institute.
[24] Litwin, W. & Abdellatif, A. (1986). Multidatabase interoperability. IEEE Computer, pp. 10-18.
[25] Litwin, W., et al. (1982). SIRIUS system for distributed data management. Symposium on Distributed Databases. Berlin, West Germany. 1982. pp. 311-366.
[26] Lyngbaek, P. & McLeod, D. (1983). An approach to object sharing in distributed database systems. The 9th International Conference on Very Large Data Bases. October 1983. pp. 364-374.
[27] Madnick, S. E. (1989). Information Technology Platform for the 1990s. In M. S. S. Morton (Ed.), Information Technology Platform for the 1990s (pp. 19-48). Cambridge, MA: Sloan School of Management, MIT.
[28] Madnick, S. E., Siegel, M., & Wang, Y. R. (1990). The Composite Information Systems Laboratory (CISL) project at MIT. IEEE Data Engineering, 13(2), pp. 10-15.
[29] Maier, D. (1983). The Theory of Relational Databases (1st ed.). Computer Science Press.
[30] Manola, F. & Dayal, U. (1986). PDM: An object-oriented data model. The International Workshop on Object-Oriented Database Systems. Pacific Grove, CA. September 1986. pp. 18-25.
[31] Peckham, J. & Maryanski, F. (1988). Semantic data models. ACM Computing Surveys, 20(3), pp. 153-189.
[32] Rockart, J. F. & Short, J. E. (1989). IT in the 1990s: Managing Organizational Interdependence. Sloan Management Review, Sloan School of Management, MIT, 30(2), pp. 7-17.
[33] Rusinkiewicz, M. & Czejdo, B. (1985). Query transformation in heterogeneous distributed database systems. The 5th International Conference on Distributed Computing Systems. Denver, CO. 1985. pp. 300-307.
[34] Rusinkiewicz, M., et al. (1988). Query processing in omnibase - a loosely coupled multi-database system. University of Houston. 1988.
[35] Shaw, G. & Zdonik, S. B. (1990). A query algebra for object-oriented databases. The Sixth International Conference on Data Engineering. Los Angeles, CA. February 1990.
[36] Shin, D. G. (1988). Semantics for handling queries with missing information. The Ninth International Conference on Information Systems. 1988. pp. 161-167.
[37] Smith, J. M., et al. (1981). Multibase - Integrating Heterogeneous Distributed Database Systems. 1981 National Computer Conference. 1981. pp. 487-499.
[38] Templeton, M., et al. (1983). An Overview of the MERMAID system - a frontend to heterogeneous databases. IEEE EASCON. Washington, DC. 1983.
[39] Wang, Y. R. & Madnick, S. E. (1988). Connectivity among information systems. Composite Information Systems (CIS) Project (pp. 141). MIT Sloan School of Management.
[40] Wang, Y. R. & Madnick, S. E. (1989a). Facilitating connectivity in composite information systems. ACM Data Base, 20(3), pp. 38-46.
[41] Wang, Y. R. & Madnick, S. E. (1989b). The inter-database instance identification problem in integrating autonomous systems. The Fifth International Conference on Data Engineering. Los Angeles, CA. February 1989. pp. 46-55.
[42] Wang, Y. R. & Madnick, S. E. (1990). A Polygen Model for Heterogeneous Database Systems: The Source Tagging Perspective. To Appear in the 16th International Conference on Very Large Data Bases. Brisbane, Australia. August 1990.
[43] Yuan, Y. (1990). The design and implementation of system P: A polygen database management system. (CIS-90-07) Composite Information Systems Laboratory, Sloan School of Management, MIT, Cambridge, MA. May 1990.