Metadata in a Grid Construction

Krzysztof Karczmarski †, Piotr Habela #, Kazimierz Subieta *#
† Warsaw University of Technology, # Polish-Japanese Institute of Information Technology, * Institute of Computer Science, Polish Academy of Sciences
kaczmars@mini.pw.edu.pl, habela@pjwstk.edu.pl, subieta@ipipan.waw.pl

Abstract

An approach to the construction of a data-oriented Grid, based on the integration of active objects by means of updateable views, is proposed. In this paper, we focus on the necessary metadata definitions, which play the key role in the Grid construction. We briefly describe the assumed architecture, a realistic Grid development scenario and the necessary programming abstractions. The core constructs of the assumed data model are presented in the form of a metamodel, and the schema definitions based on it are suggested. While the proposed solution does not remove the complexity of the integration task, we consider it the only feasible approach if data updating is to be supported.

1. Introduction

Implementing the idea of a Grid becomes very difficult in the case of data-intensive applications. The need to move large amounts of data between the system’s nodes implies costs and difficulties that motivate a different approach when a data Grid is considered. In this case, many issues known from the research on distributed and federated databases [11] need to be taken into account. In our research, we follow some patterns of federated database construction and extend them towards active objects and flexible (possibly multi-level) data source composition.

Due to its expressiveness and popularity, the object model was chosen as the canonical data model for Grid integration. In comparison with the mainstream object models known from popular programming languages, we suggest a higher level of object relativism, which allows primitive and complex objects to be treated uniformly and arbitrarily complex object compositions to be constructed.
Such objects, exposing their data and behavior, can be considered compliant with the Web Service notion, but offer greater flexibility for data-oriented tasks. A data model respecting object relativism allows a clean representation of Grid building blocks, such as particular nodes or even higher-level items.

To assure reasonable productivity of the data integration task, we suggest resorting to a fully-fledged object query language as the data manipulation language for both the global and the member node environments. Adaptation of data to the format assumed by the Grid will be supported by updatable object views.

The paper is organized as follows. Section 2 introduces our assumptions concerning the Grid development process and the technologies required for its realization. Section 3, containing the main contribution of this paper, presents the role of metadata, describes the core notions and discusses the form of metadata representation. Section 4 comments on the related research. Section 5 concludes.

2. Architecture assumed and business process

2.1. Grid development process

From a practical point of view, we are skeptical about the feasibility of ad-hoc (or dynamic) integration solutions (except perhaps for very simple data structures). Thus, for any serious project, a full development cycle and precise rules binding every participant are required. We assume the following Grid construction scenario:

1. Strategic phase, when the decision on creating a Grid is made and an initial analysis of the required content and potential participants is performed.
2. Analysis phase, when the existing resources are examined in detail and confronted with the information requirements of the integrated service. Issues such as data heterogeneity, redundancy or incompleteness need to be identified.
3. Design phase, which results in the precise definition of the global schema and of the contributory schemas for each participant.
The task of transforming local resources to the agreed contributory schema is delegated to the particular node. On the other hand, the mapping of the contributions into the global schema needs to be specified as an integral part of the Grid design.
4. Finalization phase, when participants sign the final agreement, thus formalizing the obligations resulting from the contributory schema specifications.
5. Implementation phase, when the necessary data adaptations described by the contributory schemas and the global schema are implemented. Depending on the heterogeneity between a given node’s data and the global schema, such a node may require either an appropriate wrapper or an in-depth restructuring of its original design.

The resulting specifications determine the required form of data provided both by the global service (the Grid itself) and by each of the participants. The task of adjusting local data to the form required by the Grid can be distributed between the participants (that is, the administrators and developers of local systems) and the integrator (that is, the developer of the system serving the global, integrated data view). Although different ways of balancing this responsibility are possible, we assume that, in order to simplify the integration itself, local sources should be obliged to adapt their contribution as far as possible, based on the knowledge of their local resources. The updatable object views mechanism developed for our framework is well suited for this purpose [7]:

- The virtual objects provided by a view can be processed in the same way as regular objects.
- The view definition allows the behavior of its objects’ read, update, insertion and removal to be specified independently, including arbitrary side effects.
- Any of the abovementioned generic operations can remain undefined, thus limiting the access to objects covered by a view.
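
The three properties of updatable views listed above can be illustrated with a minimal sketch. It is written in Python rather than SBQL and is not the view syntax of [7]; the names ViewProperty, OperationNotSupported, full_name and the record layout are hypothetical and serve only to show independently defined generic operations, a side effect on update, and operations left undefined.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class OperationNotSupported(Exception):
    """Raised when a view leaves the requested generic operation undefined."""


@dataclass
class ViewProperty:
    """Hypothetical virtual property; each generic operation is optional."""
    name: str
    on_read: Optional[Callable[[dict], object]] = None
    on_update: Optional[Callable[[dict, object], None]] = None
    on_insert: Optional[Callable[[dict, object], None]] = None
    on_remove: Optional[Callable[[dict], None]] = None

    def _require(self, handler, operation):
        # An undefined handler means the view does not grant this kind of access.
        if handler is None:
            raise OperationNotSupported(
                f"{operation} is not defined for '{self.name}'")
        return handler

    def read(self, source):
        return self._require(self.on_read, "read")(source)

    def update(self, source, value):
        self._require(self.on_update, "update")(source, value)

    def insert(self, source, value):
        self._require(self.on_insert, "insert")(source, value)

    def remove(self, source):
        self._require(self.on_remove, "remove")(source)


def read_full_name(record):
    # The virtual value is computed from the stored data on every read.
    return f"{record['first']} {record['last']}"


def update_full_name(record, value):
    # The update is mapped back onto the stored data ...
    record["first"], record["last"] = value.split(" ", 1)
    # ... and may carry an arbitrary side effect, here an audit entry.
    record["audit"].append("full_name updated")


# Only read and update are defined; insertion and removal stay unavailable.
full_name = ViewProperty("full_name",
                         on_read=read_full_name,
                         on_update=update_full_name)
```

A client of such a view reads and updates full_name as if it were stored, while an attempt to remove it raises OperationNotSupported, which corresponds to limiting access by leaving an operation undefined.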
Another factor of high importance for the productivity of the proposed approach is the choice of the programming language used to create local data adapters and to integrate contributed data. To allow high-level programming and to avoid the so-called “impedance mismatch” among different programming paradigms, we choose SBQL (Stack-Based Query Language) [12]. The language offers powerful query operators as well as the imperative constructs of typical programming languages. It is fully orthogonal in the sense that its queries can serve as method building blocks, input parameters or functional method output.

The object model assumed allows for direct association links between objects, based on their globally unique identifiers. Such links can reference non-local sources. However, especially in the case of a bottom-up integration scenario, objects from various sources may also require another identification method, using traditional (domain-specific or artificial) uniqueness keys. For example, if several independently built systems store different data concerning the same people, the Grid may need to collect those data using a join condition based, e.g., on a person’s social security number.

2.2. Technology requirements

As suggested in the previous subsection, both the participants’ contributions and the final form of Grid-served data need to be published in a form compliant with the canonical data model. This requires implementing or wrapping local data sources using the same programming abstractions as those served to Grid clients. The primary motivation behind it is to simplify the data composition at the Grid level.

This arrangement requires the traditional database functionality, with special emphasis on virtual (that is, not materialized) object views. The views are necessary at the Grid level (we call them global views) for mapping multiple contributing data sources into the globally served data. They are also needed at the contributors’ side (contributory views), to limit access and adapt local data according to the contract established by their contributory schemas.

2.3. Possible architectures

Another potential benefit of making the structure of contributory data uniform with the global data is the flexibility that allows multi-level Grids and arbitrarily combined structures to be created. The structure (see Fig. 1) is similar to a snowflake, where some nodes contribute to a global view, which in turn becomes a participant in another level of integration.

Fig. 1. Some possible architectures of multi-level Grids: snowflake, ring, random

Fig. 2. Illustration of the layered Grid idea. CV stands for contributory view, GV stands for global view and NLD stands for node local data.

The term “global” is used here relative to the lower level. An integration point does not necessarily have to be unique. Using exactly the same notions we may also create ring-like Grids or even randomly connected databases (Fig. 1). Only the available resources, market tradeoffs and strategic business decisions limit such integration. Thanks to the uniform object model, contributory views and global views can be combined without structural restrictions.

Fig. 2 shows the idea of data integration performed by a global view (on the left), which is the connection point for the Grid’s clients. Contributory views transform data into the form required by a global view (described in contributory metadata) according to the limitations of the concrete node. The right part of Fig. 2 shows that a global view may sometimes serve as a contributory view if it needs to participate as a node in another Grid. This is possible because all the metadata used are compliant (see the next section).

3. The role of metadata

Although the integration scenario described above is rather static in the sense that the considered data structures are determined at design time, the proper handling of metadata remains critical for successful integration. In this section, we summarize the requirements concerning schema definitions, present the core concepts of the relevant metamodel and discuss the possible ways of representing Grid metadata.

3.1. Schema definitions

The process of integration deals with several levels of data and the metadata describing them. In this subsection, we explain the already mentioned terms for the different schemas used in the Grid design. Starting from the description of local resources, these are the following:

- Local schema describes the original form of data as available to the node’s local applications. We do not deal directly with this kind of schema, since it is the participant’s responsibility to transform the data to the form described by the appropriate contributory schema. The local schema would specify which local data are to be published (contributed) and which data remain private (local).
- Contributory schema describes the required outcome of the participant’s work on preparing its contribution. As already described, the shape of each contributory schema is subordinate to the requirements of the Grid’s global schema, in the sense that it should make the merging of the described contribution into the global schema as straightforward as possible.
- Global schema describes the integrated data as visible to the Grid clients. An important assumption made here is that a global schema may serve as a contributory schema for higher-level integration.
- Integration schema contains the missing metadata elements that describe the integration of contributory schemas into the global schema. The mapping is realized by a set of virtual view definitions, called here the global view.
We do not assume any specialized data definition language constructs for this metadata.

3.2. Metamodel

In this section, we sketch the core elements of a metamodel realizing the abovementioned features. Its constructs serve as the building blocks of contributory schemas and the global schema. In our proposal we attempt to follow, as far as possible, the notions of popular object models (UML [9], Java, IDL [10]/ODL [8]), although we avoid introducing secondary notions and, as already mentioned, try to achieve a higher level of object relativism. An important extension compared to traditional object models is the introduction of the dynamic object role notion [6] and the dynamic inheritance among objects that it assumes.

To illustrate the discussed notions we follow the UML style of metamodel definition. Those constructs are necessary to describe a contributing node’s database. Note that we are only interested here in the externally accessible features that a programmer needs to know in order to create a contributory view, which transforms local data structures to the required form. In particular, this document does not deal with implementation details (e.g. the mechanisms providing operations’ implementation are not discussed). Thus, we do not use the term “class” and instead use the notion of interface to provide the necessary description of objects.

Fig. 3 shows the core constructs describing database objects. All significant metadata notions that need to be named are derived from the abstract Metaobject [conceptual model] class. Data items (objects) are described through their Type (for primitive values) or by an Interface (in the case of complex objects). An Interface allows the behavior (supported operations) and the static properties (Subobjects and Association Links) of its objects to be defined. The term “subobject” covers the notion of attribute and is used here to emphasize the ability to create compositions of complex objects. For each Structural Property its multiplicity is specified. Other attributes describing a Structural Property indicate whether the following generic operations are supported for this property: insertion, modification, removal, read. Associations can be declared as unidirectional (although even in such a case hidden opposite-direction links would exist to help protect database integrity).

Fig. 3. The core concepts of the contributed data metamodel (elements shown: Metaobject, Key, StructProperty, Property, Operation, Type, Interface, SubobjectLink, AssocLink, RoleInterface)

An important assumption of the presented model is the separation of object definitions from their storage structures. That is, in order to allow storing, say, Employee objects in a database, one needs to provide the Employee interface definition as well as to define a global static property to store Employee instances. The second task is realized by properties distinguished by the lack of an interface assigned to them. As a side effect, it also allows global (that is, database-scope) procedures to be defined.

Fig. 4. The details of the operation’s signature in the contributed data metamodel (elements shown: Operation, SignatureElement, ConstructedValue, TypedValue, Binder, Type)

For interfaces, traditional static generalization-specialization declarations are allowed.
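
The structural part of the metamodel described above can be sketched as plain data classes. This is a simplified illustration in Python and not the schema definition language of the paper. The class names Interface and StructProperty and the attributes multiplicity, insertable, readable, mutable and removable follow Fig. 3, whereas the target field, the is_global property, the allowed_operations helper and the Employee example are assumptions introduced here. In particular, a property is treated as global when it has no owning interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Interface:
    """Describes complex objects: their properties and static supertypes."""
    name: str
    properties: List["StructProperty"] = field(default_factory=list)
    supers: List["Interface"] = field(default_factory=list)


@dataclass
class StructProperty:
    """A structural property with its multiplicity and access flags."""
    name: str
    target: str                  # assumed: name of a primitive type or interface
    multiplicity: str = "1"
    insertable: bool = True
    readable: bool = True
    mutable: bool = True
    removable: bool = True
    # Assumed convention: no owner marks a global (database-scope) property.
    owner: Optional[Interface] = field(default=None, repr=False, compare=False)

    @property
    def is_global(self):
        return self.owner is None


def allowed_operations(prop):
    """List the generic operations that the flags of a property permit."""
    flags = [("insert", prop.insertable), ("read", prop.readable),
             ("update", prop.mutable), ("remove", prop.removable)]
    return [name for name, enabled in flags if enabled]


# Object definition: the interface of Employee objects.
employee = Interface("EmployeeI")
employee.properties.append(
    StructProperty("name", "string", owner=employee))
employee.properties.append(
    StructProperty("salary", "integer", mutable=False, owner=employee))

# Storage structure: a global property that holds the Employee instances.
employees = StructProperty("Employee", "EmployeeI", multiplicity="0..*")
```

The interface definition and the global property are deliberately two separate declarations, which mirrors the separation of object definitions from their storage structures.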
Although not further discussed in this document, the dynamic object role notion [6] was introduced here as one of the fundamental concepts of the assumed object model. A dynamic object role provides a dynamic inheritance mechanism among objects. From the data definition point of view, it is treated similarly to an object’s structural properties (e.g. the multiplicity of roles connected to a single base object can be defined).

Uniqueness constraints allow single and composite uniqueness keys to be specified. Note that if a given interface is used in different places of the database structure, a different uniqueness constraint can be used for each of those places [designated by appropriate StructProperty declarations]. In other words, each uniqueness constraint concerning complex objects is scoped with respect to a particular property declaration rather than to all instances (the extent) of a given interface. This definition is further constrained by the concrete syntax of the schema definition language.

Fig. 4 presents the operation’s signature. This is another place where our object model differs significantly from traditional object models, due to the flexibility of queries that may be used as operations’ parameters. A procedure’s input and output data structures are treated uniformly and can take the form of:

- Typed value: a value of a predefined primitive type or an instance of a schema-defined interface. Such a result can be optionally specified as immutable. Moreover, the value can be referenced directly or through a reference link (pointer).
- Binder: named contents (which may be a structure of any of the kinds described in this list) accessible through this name.
- Constructed value: a structure containing one or more binders.

In each case the items may be specified as multiple and, if so, as ordered or unordered. To follow a structural type matching paradigm, some adjustments would be necessary.

Fig. 5 shows additional declarations that we find important in the context of databases’ cooperation, namely the replication paths. It is possible to identify the foreign databases involved. For each structural property (global or interface-hosted) it is possible to indicate:

- The foreign databases to which the request of updating a copy of a given property is forwarded.
- The foreign databases from which requests of updating a given property can be received.

Fig. 5. Data replication description in the contributed data metamodel (elements shown: Metaobject, StructProperty, PeerDatabase)

3.3. Global views as an integration specification

At the beginning of this section, we enumerated four kinds of schemas used within the Grid. However, only three of them are going to be described using traditional data definition language declarations. The integration schema is realized implicitly through the global view definitions that bridge the set of contributory schemas with the global schema. This raises the question: should the integration specification not also be described by some higher-level declarative means instead of just view declarations with their method bodies determining the accessible data?

However, there are two reasons that justify keeping the suggested solution:

- It is necessary to remember that the view definitions are formulated using a query language that provides very high-level expressive constructs.
- While some typical mappings could be further simplified by providing dedicated constructs, the benefit seems to be minimal, especially when considering the virtual objects’ update behavior.

Thus, we conclude that the view definitions are an optimal means of representing the mappings between contributory data and the global schema.
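
To make the role of a global view as an integration specification more concrete, the following sketch merges two contributions on a uniqueness key and routes an update back to the contributing nodes. It is a Python illustration under stated assumptions and not an SBQL view definition: the class GlobalPersonView, the node names, the ssn key and the rule that the first contributor wins on conflicting attributes are all hypothetical.

```python
class GlobalPersonView:
    """Hypothetical global view joining contributed records on a key."""

    def __init__(self, contributions, key="ssn"):
        # contributions: mapping of node name to a list of record dicts,
        # each already shaped according to its contributory schema.
        self.contributions = contributions
        self.key = key

    def read(self):
        """Return virtual person objects, one per distinct key value."""
        merged = {}
        for records in self.contributions.values():
            for record in records:
                person = merged.setdefault(record[self.key], {})
                for attr, value in record.items():
                    # Assumed conflict rule: the first contributor wins.
                    person.setdefault(attr, value)
        return merged

    def update(self, key_value, attr, value):
        """Forward an update to every node holding the attribute.

        Returns the names of the nodes that were updated.
        """
        touched = []
        for node, records in self.contributions.items():
            for record in records:
                if record[self.key] == key_value and attr in record:
                    record[attr] = value
                    touched.append(node)
        if not touched:
            raise KeyError(
                f"no contributor holds {attr!r} for {key_value!r}")
        return touched


# Two independently built sources describing partly the same people.
hospital = [{"ssn": "123", "name": "Anna Nowak", "blood_type": "A"}]
insurer = [{"ssn": "123", "name": "Anna Nowak", "policy": "P-7"},
           {"ssn": "456", "name": "Jan Kowal", "policy": "P-9"}]

view = GlobalPersonView({"hospital": hospital, "insurer": insurer})
people = view.read()
```

Even in this small form the update behavior takes as much code as the read mapping, which is consistent with the observation above that dedicated declarative mapping constructs would bring little benefit once updates of virtual objects are considered.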
3.4. Other metadata and its representation

The suggested scope of metadata provided by the contributory schemas and the global schema is relatively narrow: it is limited to the technical description of the provided data. This is possible thanks to the assumption that the member data sources are subject to design-time analysis and thus the semantic details of the data can be handled implicitly. If some more dynamic (or more “automatic”) integration scenario were considered, the amount of metadata would need to be radically extended. Although the metamodel described above is not open to such extensions, the metadata structure we have proposed for the implementation level can flexibly handle any necessary metadata “annotations” [5]. The generic structure shown in Fig. 6 is capable of expressing the predefined (core) constructs, as well as any necessary extensions in the form of relationships among metadata or [meta]attributes describing them. However, we would like to stress that, although appropriately precise metadata can effectively assist the integration of new nodes, this process seems to require human assistance and insight into the possible kinds of heterogeneity (especially concerning the domain knowledge level).

Fig. 6. Generic extensible metadata structure proposed for schema implementation (elements shown: MetaObject, MetaAttribute, MetaValue, MetaRelationship)

Another issue is documenting the metadata created during design. Since the contributory schemas and the global schema are described by data definition languages based on traditional object-oriented notions, they can be mapped into UML class diagrams in a straightforward way. On the other hand, the view definitions describing the mapping between those schemas constitute a new quality as very powerful programming notions. The development of an expressive and universal diagrammatic method of representing view definitions remains an open issue.

4. Related Research

Data integration becomes increasingly important as institutions, companies and administrations want to cooperate on new levels and to store and share enormous amounts of data. There are many initiatives defining metadata schemas specific to their domains as well as new architectures for database integration. From this point of view, our proposal is rather general, being able to describe arbitrary object-oriented data and services.

Currently, one of the most important approaches to general shared data description is the WSDL language [1]. It is a platform-, service- and programming-language-independent standard, used by the Open Grid Services Architecture [4]. Its data description capabilities are expressed by XML and XML Schema [2]. Our data model is rather a higher-level solution, including all object-oriented database features such as inheritance, references, multi-valued properties, roles, updatable views and a powerful query language. Thus, we are more focused on the well-known database model, that is, objects composed into complex structures, rather than on abstract services. This allows high-level analysis of distributed data and additional node-level and global-level optimizations. Our Grid clients may query the global schema and obtain information much more sophisticated than that stored in a Web Service catalogue.

We also emphasize the role of a Virtual Organization [3] as the initiator and source of a global integrated schema. Our Grid building strategy splits the description of the data integration process into two separate aspects: global (the result of integration, used by Grid clients) and local (the contribution required for integration). In terms of our proposal, OGSA using Web Services defines only local contributory metadata.

Compared to other metadata definition initiatives for object-oriented databases, such as ODL/IDL [10, 8], we avoid introducing secondary notions and achieve a higher level of object relativism.
In addition, the assumed query language and object model are among the most powerful of those proposed for object database management systems, allowing full retrieve, create, update and delete interfaces plus message calling.

5. Conclusions and Future Work

In this paper, we have outlined our approach to data Grid construction, focusing on the metadata required by this process. Based on the required properties of the canonical data model and the programming constructs needed for data integration, a metamodel describing the core constructs was suggested. The notions of this metamodel form the definitions of contributory schemas and the global schema, which are established during the Grid development process and contracted through an agreement among participants.

The approach presented here is less idealistic than some proposals assuming automatic or semi-automatic discovery and use of data sources. An important reason for this attitude is that we take into account potentially very complex data to be integrated (e.g. the integration of healthcare databases) and intend to support data updating at the global level. However, in the longer term we consider investigating ways to make the structure more open to integrating new participant nodes.

In our future research, we plan primarily to implement the data integration features using our object-oriented DBMS prototype. Further directions include extending the metadata towards more dynamic participant integration scenarios.

6. References

[1] E. Christensen, F. Curbera, G. Meredith, S. Weerawarana. Web Services Description Language (WSDL) 1.1. W3C, Note 15, 2001 [www.w3.org/TR/wsdl].
[2] D.C. Fallside. XML Schema Part 0: Primer. W3C, Recomm. 2001 [www.w3.org/TR/xmlschema-0].
[3] I. Foster, C. Kesselman, S. Tuecke. “The Anatomy of the Grid: Enabling Scalable Virtual Organizations.” International J. Supercomputer Applications, 15(3), 2001.
[4] I. Foster, C. Kesselman, J. Nick, S. Tuecke. “The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration.” Open Grid Service Infrastructure WG, Global Grid Forum, June 22, 2002 [www.globus.org/ogsa].
[5] P. Habela, K. Subieta. Overcoming the Complexity of Object-Oriented DBMS Metadata Management. OOIS 2003: 214-225.
[6] A. Jodlowski, P. Habela, J. Plodzien, K. Subieta. Objects and Roles in the Stack-Based Approach. Proc. of DEXA Conf., Springer LNCS 2453, pp. 514-523, Aix-en-Provence, France, 2002.
[7] H. Kozankiewicz, J. Leszczyłowski, K. Subieta. New Approach to View Updates. Proc. of the VLDB Workshop Emerging Database Research in Eastern Europe, Berlin, Germany, 2003.
[8] Object Data Management Group. The Object Database Standard ODMG, Release 3.0. R.G.G. Cattell, D.K. Barry, Ed., Morgan Kaufmann, 2000.
[9] Object Management Group. Unified Modeling Language (UML) Specification. Version 1.4, September 2001 [http://www.omg.org].
[10] Object Management Group. The Common Object Request Broker: Architecture and Specification. Ver. 3.0, July 2002 [http://www.omg.org].
[11] M. Roantree, J. Murphy, W. Hasselbring. The OASIS Multidatabase Prototype. ACM Sigmod Record, 28:1, March 1999.
[12] K. Subieta, C. Beeri, F. Matthes, J.W. Schmidt. A Stack-Based Approach to Query Languages. East/West Database Workshop 1994: 159-180.