Metadata in a Grid Construction
Krzysztof Kaczmarski †, Piotr Habela #, Kazimierz Subieta *#
† Warsaw University of Technology,
# Polish-Japanese Institute of Information Technology,
* Institute of Computer Science, Polish Academy of Sciences
kaczmars@mini.pw.edu.pl, habela@pjwstk.edu.pl, subieta@ipipan.waw.pl
Abstract
We propose an approach to the construction of a
data-oriented Grid that integrates active objects by
means of updateable views. In this paper, we focus
on the necessary metadata definitions, which play the
key role in the Grid construction. We briefly describe
the assumed architecture, a realistic Grid development
scenario and the necessary programming abstractions.
The core constructs of the assumed data model are
presented in the form of a metamodel, and schema
definitions based on it are suggested. While the
proposed solution does not remove the complexity of
the integration task, we consider it the only
feasible approach if data updating is to be
supported.
1. Introduction
Implementing the idea of a Grid becomes very
difficult in the case of data-intensive applications. The
need to move large amounts of data between the
system’s nodes entails costs and difficulties that
motivate a different approach when a data Grid is
considered. In this case, many issues known from the
research on distributed and federated databases [11]
need to be taken into account.
In our research, we follow some patterns of
federated database construction and extend them
towards active objects and flexible (possibly
multi-level) data source composition. Due to its
expressiveness and popularity, the object model was
chosen as the canonical data model for Grid integration.
In comparison with mainstream object models known
from popular programming languages, we suggest a
higher level of object relativism, which makes it
possible to treat primitive and complex objects
uniformly and to construct arbitrarily complex object
compositions. Such objects, exposing their data and
behavior, can be considered compliant with the Web
Service notion, but offer greater flexibility for
data-oriented tasks. A data model respecting object
relativism allows a clean representation of Grid
building blocks, such as particular nodes or even
higher-level items.
To assure a reasonable productivity of the data
integration task, we suggest resorting to a fully-fledged
object query language as a data manipulation language
for both global and member node environments. Data
adaptation to the format assumed by the Grid will be
supported by updatable object views.
The paper is organized as follows. Section 2
introduces our assumptions concerning the Grid
development process and the technologies required for
its realization. Section 3, containing the main
contribution of this paper, presents the role of
metadata, describes the core notions and discusses the
form of metadata representation. Section 4 comments
on the related research. Section 5 concludes.
2. Architecture assumed and business process
2.1. Grid development process
From a practical point of view, we are skeptical
about the feasibility of ad-hoc (or dynamic) integration
solutions (except perhaps for very simple data
structures). Thus, for any serious project, a full
development cycle and precise rules binding every
participant are required. We assume the following Grid
construction scenario:
1. Strategic phase, when the decision on creating a
Grid is made and an initial analysis of the required
content and potential participants is performed.
2. Analysis phase, when the existing resources are
examined and confronted with the information
requirements for the integrated service. Issues
such as data heterogeneity, redundancy or
incompleteness need to be identified.
3. Design phase, which results in the precise
definition of the global schema and contributory
schemas for each participant. The task of
transforming local resources to the agreed
contributory schema is delegated to the particular node.
On the other hand, the mapping of the
contributions into the global schema needs to be
specified as an integral part of the Grid design.
4. Finalization phase, when participants sign the
final agreement, thus formalizing the obligations
arising from the contributory schema specifications.
5. Implementation phase, when the necessary data
adaptations described by the contributory schemas and
the global schema are implemented. Depending on the
heterogeneity between a given node’s data and
the global schema, such a node may require either an
appropriate wrapper or an in-depth
restructuring of its original design.
The resulting specifications determine the required
form of data provided both by the global service (the
Grid itself), as well as each of the participants. The
task of adjusting local data into the form required by
the Grid can be distributed between participants (that
is, the administrators and developers of local systems)
and the integrator (that is, the developer of the system
serving global, integrated data view). Although
different ways of balancing this responsibility are
possible, we assume that in order to simplify the
integration itself, local sources should be obliged to
adapt their contribution as far as possible based on the
knowledge of their local resources.
2.2. Technology requirements
As suggested in the previous subsection, both the
participants’ contributions and the final form of
Grid-served data need to be published in a form
compliant with the canonical data model. This requires
implementing or wrapping local data sources using the
same programming abstractions as those served to Grid
clients. The primary motivation behind it is simplifying
the data composition at the Grid level.
This arrangement requires the traditional database
functionality, with special emphasis on virtual (that is,
not materialized) object views. The views are
necessary at the Grid level (we call them global views)
for mapping multiple contributing data sources into the
globally served data. They are also needed at the
contributors’ side (contributory views), to limit access
and adapt local data according to the contract
established by their contributory schemas. The
updateable object views mechanism developed for our
framework is well suited for this purpose [7]:
• The virtual objects provided by a view can be
processed in the same way as regular objects.
• The view definition allows the behavior for its
objects’ read, update, insertion and removal to be
specified independently, including arbitrary side effects.
• Any of the abovementioned generic operations can
remain undefined, thus limiting the access to objects
covered by a view.
Another factor of high importance for the
productivity of the proposed approach is the choice of
the programming language used to create local data
adapters and to integrate contributed data. To allow
high-level programming and to avoid the so-called
“impedance mismatch” among different programming
paradigms, we choose SBQL (Stack-Based Query
Language) [12]. The language offers powerful query
operators as well as the imperative constructs of typical
programming languages. It is fully orthogonal in the
sense that its queries can serve as method building blocks,
input parameters or functional method output.
The object model assumed allows for direct
association links between objects, based on their
globally unique identifiers. Such links can reference
non-local sources. However, especially in the case of a
bottom-up integration scenario, the objects from
various sources may also require another identification
method, using traditional (domain-specific or artificial)
uniqueness keys. For example, if several independently
built systems store different data concerning the
same people, the Grid may require collecting those
data using a join condition based e.g. on a person’s social
security number.
2.3. Possible architectures
Another potential benefit of making the structure of
contributory data uniform with the global data is the
flexibility to create multi-level Grids and
arbitrarily combined structures. The structure (see Fig.
1) is similar to a snowflake, where some nodes
contribute to a global view, which in turn becomes a
participant in another level of integration.
Fig. 1. Some of possible architectures of multi-level
Grids: snowflake, ring, random
[Fig. 2 diagram; recoverable labels: to clients, GV, CV, NLD,
to data source, Layer 1, Contributory Schemas,
Integrated (global) Schema]
Fig. 2. Illustration of the layered Grid idea. CV stands for
contributory view, GV stands for global view and NLD
stands for node local data. The term “global” is used here
relative to the lower level.
An integration point does not necessarily have to be
unique. Using exactly the same notions, we may also
create ring-like Grids or even randomly connected
databases (Fig. 1). Only the available resources, market
tradeoffs and strategic business decisions limit such
integration. Thanks to the uniform object model,
contributory views and global views can be combined
without structural restrictions. Fig. 2 shows the idea of
data integration done by a global view (on the left),
which is a connection point for the Grid’s clients.
Contributory views transform data into the form required
by a global view (described in contributory metadata)
according to the limitations of the concrete node. The
right part of Fig. 2 shows that a global view may
sometimes serve as a contributory view if it needs to
participate as a node in another Grid. That is possible
because all the metadata used follows the same
metamodel (see the next section).
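The layered composition described above can be sketched as follows. This is a minimal illustration in Python rather than SBQL; the class names ContributoryView and GlobalView, the read() method and the sample data are assumptions made for the example, not part of the described framework.

```python
from typing import Callable, Iterable, List, Protocol


class DataSource(Protocol):
    """Anything that serves objects in the canonical data model."""

    def read(self) -> List[dict]: ...


class ContributoryView:
    """Adapts node local data (NLD) to the agreed contributory form."""

    def __init__(self, local_data: Iterable[dict], adapt: Callable[[dict], dict]):
        self._local_data = list(local_data)
        self._adapt = adapt

    def read(self) -> List[dict]:
        # Virtual view: result objects are computed on each call.
        return [self._adapt(row) for row in self._local_data]


class GlobalView:
    """Integrates several contributions; usable as a contribution itself."""

    def __init__(self, sources: List[DataSource]):
        self._sources = sources

    def read(self) -> List[dict]:
        merged: List[dict] = []
        for source in self._sources:
            merged.extend(source.read())
        return merged


# Two nodes with differently shaped local data (hypothetical).
node_a = ContributoryView([{"nm": "Ann"}], lambda r: {"name": r["nm"]})
node_b = ContributoryView([{"full_name": "Bob"}], lambda r: {"name": r["full_name"]})
lower_grid = GlobalView([node_a, node_b])

# The lower-level global view participates as a node in a higher-level Grid.
node_c = ContributoryView([{"name": "Eve"}], lambda r: dict(r))
upper_grid = GlobalView([lower_grid, node_c])
```

Because both kinds of view expose the same read() operation, the upper-level view cannot tell whether a source is a single node or a whole lower-level Grid, which is the property the snowflake structure relies on.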
3. The role of metadata
Although the integration scenario described above
is rather static, in the sense that the considered data
structures are determined at design time, the proper
handling of metadata remains critical for successful
integration. In this section, we summarize the
requirements concerning schema definitions, present
the core concepts of the relevant metamodel and
discuss the possible ways of representing Grid
metadata.
3.1. Schema definitions
The process of integration deals with several levels
of data and the metadata describing them. In this
subsection, we explain the already mentioned kinds of
schemas used in the Grid design. Starting
from the description of local resources, these are the
following:
• Local schema describes the original form of data as
available to the node’s local applications. We do not
deal directly with this kind of schema, since it is the
participant’s responsibility to transform the data to
the form described by the appropriate contributory
schema. The local schema would specify which local
data are to be published (contributed) and which
remain private (local).
• Contributory schema describes the required
outcome of the participant’s work on preparing its
contribution. As already described, the shape of each
contributory schema is subordinate to the requirements
of the Grid’s global schema, in the sense that it should
make the merging of the described contribution into
the global schema as straightforward as possible.
• Global schema describes the integrated data as
visible to the Grid clients. An important assumption
made here is that a global schema may serve as a
contributory schema for higher-level integration.
• Integration schema contains the missing metadata
elements that describe the integration of contributory
schemas into the global schema. The mapping is realized
by a set of virtual view definitions, called here the
global view. We do not assume any specialized data
definition language constructs for this metadata.
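How a contributory view can publish only part of the local data, and only some of the generic operations, can be sketched as follows. This is a minimal Python illustration (the views of the described framework are defined in SBQL [7, 12]); UpdatableView, its handler parameters and the sample employee records are hypothetical names introduced for the example.

```python
from typing import Callable, Dict, List, Optional


class OperationNotSupported(Exception):
    """Raised when a view leaves a generic operation undefined."""


class UpdatableView:
    """A virtual view whose four generic operations are all optional."""

    def __init__(
        self,
        on_read: Optional[Callable[[], List[dict]]] = None,
        on_update: Optional[Callable[[str, dict], None]] = None,
        on_insert: Optional[Callable[[dict], None]] = None,
        on_remove: Optional[Callable[[str], None]] = None,
    ) -> None:
        self._handlers = {
            "read": on_read,
            "update": on_update,
            "insert": on_insert,
            "remove": on_remove,
        }

    def _call(self, operation: str, *args):
        handler = self._handlers[operation]
        if handler is None:
            # Undefined operation: access through the view is denied.
            raise OperationNotSupported(operation)
        return handler(*args)

    def read(self) -> List[dict]:
        return self._call("read")

    def update(self, key: str, changes: dict) -> None:
        self._call("update", key, changes)

    def insert(self, obj: dict) -> None:
        self._call("insert", obj)

    def remove(self, key: str) -> None:
        self._call("remove", key)


# Local data in its original form (sample data); salary is private.
local_employees: Dict[str, dict] = {
    "e1": {"name": "Ann", "salary": 5000},
    "e2": {"name": "Bob", "salary": 4200},
}


def read_published() -> List[dict]:
    # Publish identifier and name only, as a contributory schema might require.
    return [{"id": key, "name": row["name"]} for key, row in local_employees.items()]


def update_name(key: str, changes: dict) -> None:
    # Only the published attribute can be changed through the view.
    local_employees[key]["name"] = changes["name"]


# Insertion and removal are left undefined, so they are not available.
employees_view = UpdatableView(on_read=read_published, on_update=update_name)
```

In this sketch the split between published and private data is encoded in the handler bodies, which mirrors the idea that the mapping is expressed by view definitions rather than by dedicated declarations.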
3.2. Metamodel
In this section, we sketch core elements of a
metamodel realizing the abovementioned features. Its
constructs serve as the building blocks of contributory
schemas and the global schema.
In our proposal we attempt to follow as far as
possible the popular object models’ notions (UML [9],
Java, IDL [10]/ODL [8]), although we avoid
introducing secondary notions and, as already
mentioned, try to achieve a higher level of object
relativism. An important extension compared to
traditional object models is the introduction of
dynamic object role notion [6] and dynamic inheritance
among objects that is assumed by it.
To illustrate the discussed notions, we follow the
UML style of metamodel definition. These constructs
are necessary to describe a contributing node’s database.
Note that we are only interested here in the externally
accessible features that a programmer needs to know in
order to create a contributory view, which transforms
local data structures to the required form. In particular,
this document does not deal with implementation details
(e.g. the mechanisms providing operations’
implementation are not discussed). Thus, we do not use
the term “class” and instead use the notion of interface
to provide the necessary description of objects.
[Fig. 3 diagram (UML class diagram); recoverable class
names: Metaobject, Property, StructProperty, Operation,
Type, Interface, SubobjectLink, AssocLink, RoleInterface,
Key]
Fig. 3. The core concepts of the contributed data
metamodel
Fig. 3 shows the core constructs describing database
objects. All significant metadata notions that need to be
named are derived from the abstract Metaobject
[conceptual model] class.
Data items (objects) are described through their
Type (for primitive values) or by an Interface (in the
case of complex objects). An Interface defines the
behavior (supported operations) and static properties
(Subobjects and Association Links) of its objects. The
term “subobject” covers the notion of attribute and is
used here to emphasize the ability to create compositions
of complex objects. For each Structural Property its
multiplicity is specified. Other attributes of a
Structural Property indicate whether the
generic operations (insertion, modification, removal
and read) are supported for this property. Associations
can be declared as unidirectional (although even in such
a case hidden opposite-direction links would exist to
help protect database integrity).
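A fragment of this metamodel can be written down as plain data classes. The sketch below is an illustration in Python only: the class and attribute names follow Fig. 3 where they are legible, while the string encoding of multiplicity, the properties list and the helper allows() are assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Metaobject:
    """Abstract root: every named metadata notion has a name."""
    name: str


@dataclass
class StructProperty(Metaobject):
    multiplicity: str = "1"   # e.g. "1", "0..1", "*" (encoding assumed)
    insertable: bool = True
    readable: bool = True
    mutable: bool = True
    removable: bool = True

    def allows(self, operation: str) -> bool:
        """Tell whether a generic operation is supported for this property."""
        flags = {
            "insert": self.insertable,
            "read": self.readable,
            "update": self.mutable,
            "remove": self.removable,
        }
        return flags[operation]


@dataclass
class Interface(Metaobject):
    properties: List[StructProperty] = field(default_factory=list)
    is_abstract: bool = False


# A hypothetical Person interface with one mandatory and one multi-valued property.
person = Interface("Person", [
    StructProperty("name", multiplicity="1", removable=False),
    StructProperty("phone", multiplicity="*"),
])
```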
An important assumption of the presented model is
the separation of object definitions from their storage
structures. That is, in order to store, say,
Employee objects in a database, one needs to provide
an Employee interface definition as well as to define
a global static property to store Employee instances. The
second task is realized by properties distinguished by
the lack of an interface assigned to them. As a side
effect, it also allows global (that is, database-scope)
procedures to be defined.
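The separation of an interface definition from the place where its instances are stored can be sketched as follows. The Database class, its declare_global and insert methods and the property name Employees are hypothetical and serve only to illustrate the idea in Python.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Interface:
    """Describes objects; says nothing about where they are stored."""
    name: str
    attributes: List[str]


@dataclass
class Database:
    # Global (database-scope) properties, not owned by any interface:
    # property name -> (interface of the stored objects, stored instances)
    global_properties: Dict[str, Tuple[Interface, List[dict]]] = field(default_factory=dict)

    def declare_global(self, prop_name: str, interface: Interface) -> None:
        self.global_properties[prop_name] = (interface, [])

    def insert(self, prop_name: str, obj: dict) -> None:
        interface, instances = self.global_properties[prop_name]
        missing = [a for a in interface.attributes if a not in obj]
        if missing:
            raise ValueError(f"{interface.name} object lacks {missing}")
        instances.append(obj)


employee = Interface("Employee", ["name", "salary"])
db = Database()
# Defining the interface alone creates no place to store instances;
# a global property has to be declared as a separate step.
db.declare_global("Employees", employee)
db.insert("Employees", {"name": "Ann", "salary": 5000})
```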
[Fig. 4 diagram (UML class diagram); recoverable class
names: Operation, SignatureElement, TypedValue,
ConstructedValue, Binder, Type]
Fig. 4. The details of the operation’s signature in the
contributed data metamodel
For interfaces, traditional static
generalization-specialization declarations are allowed.
Although not further discussed in this document, the
dynamic object role notion [6] was introduced here as
one of the fundamental concepts of the assumed object
model. A dynamic object role provides a dynamic
inheritance mechanism among objects. From the data
definition point of view, it is treated similarly to an
object’s structural properties (e.g. a multiplicity of
roles connected to a single base object can be defined).
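The idea of dynamic inheritance through roles can be illustrated with a small Python sketch. It is not the mechanism of [6]; the classes RoleObject and Person and the attribute delegation through __getattr__ are assumptions chosen only to show a role gaining access to its base object's properties at run time.

```python
class RoleObject:
    """A role attached to a base object at run time."""

    def __init__(self, base, role_name: str, **attributes):
        self._base = base
        self.role_name = role_name
        self.__dict__.update(attributes)

    def __getattr__(self, item):
        # Invoked only when the role itself lacks the attribute:
        # fall back to the base object (dynamic inheritance).
        return getattr(self._base, item)


class Person:
    def __init__(self, name: str):
        self.name = name
        self.roles = []

    def add_role(self, role_name: str, **attributes) -> RoleObject:
        role = RoleObject(self, role_name, **attributes)
        self.roles.append(role)
        return role

    def drop_role(self, role: RoleObject) -> None:
        self.roles.remove(role)


ann = Person("Ann")
as_employee = ann.add_role("Employee", salary=5000)
as_student = ann.add_role("Student", year=2)
```

A role sees later changes of its base object, and roles can be added or dropped without recreating the base object.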
Uniqueness constraints allow specifying single and
composite uniqueness keys. Note that if a given
interface is used in different places of database
structure, for each of those places [designated by
appropriate StructProperty declarations] a different
uniqueness constraint can be used. In other words, each
uniqueness constraint concerning complex objects is
scoped with respect to particular property declaration
rather than to all instances (extent) of a given interface.
This definition is further constrained by the concrete
syntax of schema definition language.
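The scoping of uniqueness constraints to a property declaration, rather than to the whole extent of an interface, can be sketched as follows. KeyedCollection and the sample attribute names are hypothetical; each collection stands for the instances stored under one StructProperty declaration.

```python
from typing import Dict, Tuple


class KeyedCollection:
    """Instances stored under one property declaration, with its own key."""

    def __init__(self, key_attributes: Tuple[str, ...]):
        self.key_attributes = key_attributes
        self._by_key: Dict[tuple, dict] = {}

    def insert(self, obj: dict) -> None:
        key = tuple(obj[a] for a in self.key_attributes)
        if key in self._by_key:
            raise ValueError(f"duplicate key {key}")
        self._by_key[key] = obj

    def __len__(self) -> int:
        return len(self._by_key)


# The same Person interface used in two places, each with a different key.
staff = KeyedCollection(("ssn",))                    # single key
visitors = KeyedCollection(("name", "visit_date"))   # composite key

ann = {"ssn": "111", "name": "Ann", "visit_date": "2004-01-05"}
ann_again = {"ssn": "111", "name": "Ann", "visit_date": "2004-02-10"}

staff.insert(ann)
visitors.insert(ann)
visitors.insert(ann_again)   # accepted: the composite key differs
```

Inserting ann_again into staff would be rejected, because there the key consists of ssn alone.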
Fig. 4 presents the operation’s signature. This is another
place where our object model differs significantly from
traditional object models, due to the flexibility of
queries that may be used as operations’ parameters.
A procedure’s input and output data structures are
treated uniformly and can take the form of:
• Typed value: a value of a predefined primitive type
or an instance of a schema-defined interface. Such a
result can optionally be specified as immutable.
Moreover, the value can be referenced directly or
through a reference link (pointer).
• Binder: named contents (which may be a structure
of any of the kinds described in this list) accessible
through this name.
• Constructed value: a structure containing one or
more binders.
In each case the items may be specified as
multiple and, if so, as ordered or unordered. To
follow a structural type matching paradigm, some
adjustments would be necessary.
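The three kinds of signature element listed above can be sketched as a small recursive data structure. The class names follow Fig. 4; the Python representation, the Parameter wrapper, the describe() helper and the sample signature are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TypedValue:
    type_name: str             # primitive type or schema-defined interface
    is_mutable: bool = True
    is_reference: bool = False


@dataclass
class Binder:
    name: str
    contents: "SignatureElement"   # any of the three kinds


@dataclass
class ConstructedValue:
    binders: List[Binder]      # one or more binders


SignatureElement = Union[TypedValue, Binder, ConstructedValue]


@dataclass
class Parameter:
    element: SignatureElement
    multiplicity: str = "1"    # encoding assumed, e.g. "1" or "*"
    is_ordered: bool = False


def describe(element: SignatureElement) -> str:
    """Render a signature element as a short readable string."""
    if isinstance(element, TypedValue):
        return ("ref " if element.is_reference else "") + element.type_name
    if isinstance(element, Binder):
        return f"{element.name}({describe(element.contents)})"
    return "struct{" + ", ".join(describe(b) for b in element.binders) + "}"


# Hypothetical result of a procedure: an ordered collection of structures,
# each pairing a reference to an Employee with an immutable number.
result = Parameter(
    ConstructedValue([
        Binder("emp", TypedValue("Employee", is_reference=True)),
        Binder("avgSal", TypedValue("real", is_mutable=False)),
    ]),
    multiplicity="*",
    is_ordered=True,
)
```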
Fig. 5 shows additional declarations that we find
important in the context of cooperation between
databases, namely the replication paths. It is possible
to identify the foreign databases involved. For each
structural property (global or interface-hosted) it is
possible to indicate:
• The foreign databases to which a request to
update a copy of the given property is forwarded.
• The foreign databases from which requests to
update the given property can be received.
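The two replication declarations can be sketched as follows. PeerDatabase and its location attribute appear in Fig. 5; the ReplicatedProperty class, its field and method names and the example locations are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class PeerDatabase:
    location: str


@dataclass
class ReplicatedProperty:
    name: str
    # Peers that receive update requests for copies of this property.
    forward_to: Set[PeerDatabase] = field(default_factory=set)
    # Peers whose update requests for this property are accepted.
    accept_from: Set[PeerDatabase] = field(default_factory=set)

    def peers_to_notify(self) -> List[str]:
        return sorted(peer.location for peer in self.forward_to)

    def accepts_update_from(self, peer: PeerDatabase) -> bool:
        return peer in self.accept_from


warsaw = PeerDatabase("db.warsaw.example")
krakow = PeerDatabase("db.krakow.example")
patients = ReplicatedProperty("Patients", forward_to={krakow}, accept_from={warsaw})
```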
3.3. Global views as an integration specification
At the beginning of this section, we
enumerated four kinds of schemas used within the
Grid. However, we are going to describe only three of
them using traditional data definition language
declarations. The integration schema is realized
implicitly through the global view definitions that
bridge the set of contributory schemas with the global
schema.
This raises the question: should the integration
specification not also be described by some higher-level
declarative means instead of just view declarations with
their method bodies determining the accessible data?
[Fig. 5 diagram (UML class diagram); recoverable class
names: Metaobject, StructProperty, PeerDatabase;
association ends: replicationPeer, nonReplicationPeer,
replicatedItem, excludedItem]
Fig. 5. Data replication description in the contributed data
metamodel
However, there are two reasons that justify keeping the
suggested solution:
• The view definitions are formulated using a query
language that provides very high-level, expressive
constructs.
• While some typical mappings could be further
simplified by providing dedicated constructs, the
benefit seems to be minimal, especially when
considering the update behavior of virtual objects.
Thus, we consider view definitions to be the
most suitable means of representing the mappings
between contributory data and the global schema.
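A global view acting as the integration specification might look as follows in outline. The sketch is in Python rather than SBQL and reuses the social security number join mentioned in Section 2; PersonGlobalView, the two sample contributions and the rule that routes updates by attribute name are assumptions made for the example.

```python
from typing import Dict, List

# Contributions of two nodes, already in their contributory form (sample data).
registry: Dict[str, dict] = {
    "111": {"ssn": "111", "name": "Ann"},
    "222": {"ssn": "222", "name": "Bob"},
}
clinic: Dict[str, dict] = {
    "111": {"ssn": "111", "blood_type": "A"},
}


class PersonGlobalView:
    """Virtual Person objects joined on the social security number."""

    def read(self) -> List[dict]:
        persons = []
        for ssn, person in registry.items():
            merged = dict(person)
            # Attach the clinic's data when the same person is known there.
            merged.update(clinic.get(ssn, {}))
            persons.append(merged)
        return persons

    def update(self, ssn: str, attribute: str, value) -> None:
        # Route the update to the contribution that owns the attribute.
        owner = registry if attribute == "name" else clinic
        if ssn not in owner:
            raise KeyError(ssn)
        owner[ssn][attribute] = value


view = PersonGlobalView()
view.update("111", "blood_type", "AB")
```

The update method is where a dedicated declarative mapping construct would have to stop and hand over to code: which contribution receives the change is a decision encoded in the view body.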
3.4. Other metadata and its representation
The suggested scope of metadata provided by
contributory schemas and a global schema is relatively
narrow: it is limited to the technical description of the
provided data. This is possible thanks to the assumption
that the member data sources are subject to design-time
analysis and thus the semantic details of the data
can be handled implicitly. If some more dynamic (or
more “automatic”) integration scenario were
considered, the amount of metadata would need to be
radically extended.
Although the metamodel described above is not
open for such extensions, the metadata structure we
have proposed for the implementation level can
flexibly handle any necessary metadata “annotations”
[5]. The generic structure shown in Fig. 6 is capable of
expressing the predefined (core) constructs, as well as
any necessary extensions in the form of relationships
among metadata or [meta]attributes describing it.
However, we would like to stress that although
appropriately precise metadata can effectively assist
the integration of new nodes, this process seems to
require human assistance and insight into possible kinds
of heterogeneity (especially concerning the domain
knowledge level).
[Fig. 6 diagram (UML class diagram); recoverable class
names: MetaObject, MetaAttribute, MetaValue,
MetaRelationship]
Fig. 6. Generic extensible metadata structure proposed for
schema implementation
Another issue is documenting the metadata created
during design. Since the contributory schemas and
global schema are described by data definition
languages based on traditional object-oriented notions,
they can be mapped into UML class diagrams in a
straightforward way. The view definitions describing
the mapping between those schemas, on the other hand,
are powerful programming constructs of a new kind.
Development of an expressive and universal
diagrammatic method of representing view definitions
remains an open issue.
4. Related Research
Data integration becomes increasingly important as
institutions, companies and administrations want to
cooperate on new levels and to store and share enormous
amounts of data. There are many initiatives defining
metadata schemas specific to their domains as well as
new architectures for database integration. From this
point of view, our proposal is rather general, being able
to describe arbitrary object-oriented data and
services.
Currently, one of the most important approaches to
general shared data description is the WSDL language [1].
It is a platform-, service- and programming-language-
independent standard, used by the Open Grid Services
Architecture [4]. Its data description capabilities are
expressed by XML and XML Schema [2]. Our data model
is rather a higher-level solution including
object-oriented database features such as inheritance,
references, multi-valued properties, roles, updateable
views and a powerful query language. Thus, we are more
focused on the well-known database model, that is,
objects composed into complex structures, rather than
on abstract services. This allows high-level analysis of
distributed data and additional node-level and
global-level optimizations. Our Grid clients may query
the global schema and obtain information much more
sophisticated than that stored in a Web Service
catalogue.
We also emphasize the role of a Virtual Organization [3]
as the initiator and source of a global integrated
schema. Our Grid building strategy splits the
description of the data integration process into two
separate aspects: global (the integrated form used by
Grid clients) and local (the contribution required for
integration). In terms of our proposal, OGSA using Web
Services defines only the local contributory metadata.
Compared to other metadata definition initiatives for
object-oriented databases, such as
ODL/IDL [10, 8], we avoid introducing secondary
notions and achieve a higher level of object relativism.
In addition, the assumed query language and object
model are among the most powerful of those proposed for
ODBMSs, allowing full retrieve, create, update
and delete interfaces plus message calling.
5. Conclusions and Future Work
In this paper, we have outlined our approach to data
Grid construction, focusing on the metadata required
by this process. Based on the required properties of
canonical data model and the programming constructs
needed for data integration, the metamodel describing
the core constructs was suggested. The notions of this
metamodel form the definitions of contributory
schemas and the global schema, which are established
during the Grid development process and contracted
through an agreement among participants.
The approach presented here is less idealistic than
some proposals assuming automatic or semi-automatic
discovery and use of data sources. An important reason
for this attitude is that we take into account
potentially very complex data to be integrated (e.g. the
integration of healthcare databases) and intend to
support data updating at the global level. However, in
the longer term we consider investigating ways to make
the structure more open to integrating new participant
nodes.
In our future research, we plan primarily to
implement the data integration features using our
Object-oriented DBMS prototype implementation.
Further directions include extending the metadata
towards more dynamic participants’ integration
scenarios.
6. References
[1] E. Christensen, F. Curbera, G. Meredith and S.
Weerawarana. Web Services Description Language (WSDL)
1.1. W3C, Note 15, 2001, [www.w3.org/TR/wsdl].
[2] D.C. Fallside, XML Schema Part 0: Primer. W3C,
Recomm. 2001, [www.w3.org/TR/xmlschema-0]
[3] I. Foster, C. Kesselman, S. Tuecke. “The Anatomy of
the Grid: Enabling Scalable Virtual Organizations.”
International J. Supercomputer Applications, 15(3), 2001.
[4] I. Foster, C. Kesselman, J. Nick, S. Tuecke, “The
Physiology of the Grid: An Open Grid Services Architecture
for Distributed Systems Integration.” Open Grid Service
Infrastructure WG, Global Grid Forum, June 22, 2002.
[www.globus.org/ogsa]
[5] P. Habela, K. Subieta: Overcoming the Complexity of
Object-Oriented DBMS Metadata Management. OOIS 2003:
214-225
[6] A. Jodlowski, P. Habela, J. Plodzien, K. Subieta.
Objects and Roles in the Stack-Based Approach. Proc. of
DEXA Conf., Springer LNCS 2453, pp. 514-523,
Aix-en-Provence, France, 2002
[7] H. Kozankiewicz, J. Leszczyłowski, K. Subieta. New
Approach to View Updates. Proc. of the VLDB Workshop
Emerging Database Research in Eastern Europe, Berlin,
Germany, 2003
[8] Object Data Management Group: The Object
Database Standard ODMG, Release 3.0. R.G.G. Cattell,
D.K. Barry, Eds., Morgan Kaufmann, 2000
[9] Object Management Group: Unified Modeling
Language (UML) Specification. Version 1.4, September
2001 [http://www.omg.org].
[10] Object Management Group: The Common Object
Request Broker: Architecture and Specification. Ver. 3.0,
July 2002 [http://www.omg.org].
[11] M. Roantree, J. Murphy and W. Hasselbring. The
OASIS Multidatabase Prototype. ACM Sigmod Record,
28:1, March 1999.
[12] K. Subieta, C. Beeri, F. Matthes, J.W. Schmidt: A
Stack-Based Approach to Query Languages. East/West
Database Workshop 1994: 159-180