Chapter 6: General Schema Manipulation Operators
PRINCIPLES OF DATA INTEGRATION
ANHAI DOAN, ALON HALEVY, ZACHARY IVES

Outline
- Introduction to model management and motivation
- The Merge operator
- The ModelGen operator
- The Invert operator

Model Management Operators
- We saw operators for creating mappings between pairs of schemas.
- Other operators on schemas and mappings are possible: merge schemas, compose and invert mappings, translate schemas from one data model to another.
- In fact, imagine an entire algebra of operators that apply to schemas and to mappings: many common workflows can be formulated as a sequence of such operators [Bernstein, 2000].
- Note: “model” = “schema”. More terminology coming soon.

Example of Model Management (1)
In a data integration scenario, you may proceed as follows, beginning with sources S1 and S2:
- Use a match operator to create a mapping between S1 and S2.
- Use merge to create a merged (mediated) schema G of S1 and S2, together with mappings from S1 and S2 to G. Merge creates the minimal schema that includes both S1 and S2.

Example of Model Management (2)
Suppose we have another source S3, which is very similar to S1.
- We could first use match to create a mapping from S1 to S3.
- Then use compose, together with the mapping from S1 to G, to create a mapping from S3 to the mediated schema G.

Operators
- Match: see previous chapters.
- Merge: create a merged schema of S1 and S2 w.r.t. a mapping M12.
- ModelGen: create an equivalent model in a different data model (e.g., relational → XML).
- Invert: given M12, create M21.
- Diff: find the difference between two models (see bibliography).

Some Terminology
- Model: a specific description of a set of data in a given data model.
- Meta-model: a data model, such as relational schema, XML DTD, Java class definitions, …
- Meta-meta-model: a generic language that is independent of any particular meta-model. Usually some graph-based formalism.
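The second integration example above can be sketched as a short sequence of operator calls. This is a minimal illustration, not the book's implementation: it assumes a mapping is just a set of attribute correspondences, and the `Mapping` class, the `compose` function, and all attribute names are hypothetical. The outputs of match and merge are written by hand, and the match result is assumed to be oriented from S3 to S1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """Hypothetical mapping: a set of (source attribute, target attribute) pairs."""
    source: str
    target: str
    pairs: frozenset


def compose(m_ab: Mapping, m_bc: Mapping) -> Mapping:
    """Chain a mapping a -> b with a mapping b -> c to obtain a -> c."""
    if m_ab.target != m_bc.source:
        raise ValueError("mappings do not share a middle schema")
    pairs = frozenset(
        (a, c)
        for (a, b1) in m_ab.pairs
        for (b2, c) in m_bc.pairs
        if b1 == b2
    )
    return Mapping(m_ab.source, m_bc.target, pairs)


# Assumed output of match between S3 and S1 (hand-written for illustration).
m_s3_s1 = Mapping("S3", "S1", frozenset({("S3.name", "S1.title"), ("S3.yr", "S1.year")}))

# Assumed mapping from S1 into the mediated schema G, as produced by merge.
m_s1_g = Mapping("S1", "G", frozenset({("S1.title", "G.title"), ("S1.year", "G.year")}))

# Compose the two to connect the new source S3 to the mediated schema.
m_s3_g = compose(m_s3_s1, m_s1_g)
print(sorted(m_s3_g.pairs))
```

Real mappings are logical formulas (e.g., GLAV), so composing them is considerably harder than joining attribute pairs; the sketch only shows how the operators chain.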

The Merge Operator
Given:
- Two models, M1 and M2
- A mapping from M1 to M2
Create:
- A merged model M12 that contains only the information in M1 and M2, but does not repeat information that is in both
- Mappings from M1 and M2 to M12
Challenge to many model management operators: can you develop algorithms that are generic, i.e., not specific to particular data models?

Merge Challenges: Example
Challenge 1: different attribute representations. The resolution should be part of the input mappings.

Merge Challenges: Example
Challenge 2: merging models expressed in different data models. (What if one data model supports subattributes and another doesn’t?) See ModelGen.

Merge Challenges: Example
Challenge 3: “fundamental conflicts”. Zipcode is an integer in one model and a string in another. The merged model cannot have both. Solutions depend on the particular conflict and the data models involved.

The ModelGen Operator
- Transform a schema from one meta-model (e.g., Java object model, relational, XML) to another meta-model.
- Main challenge: features that exist in the source meta-model may not exist in the target (e.g., subclasses and inheritance).
- ModelGen is needed very often in practice, and several of the other operators use it.

ModelGen Example
Java classes → relational tables
No classes or inheritance in the relational model.

ModelGen Strategy
- It is possible to design specific transformations from one meta-model to another, but we want a generic approach.
- Design a super meta-model that has (almost) all features that exist in the meta-models.
- The super meta-model knows which features are present in each meta-model.
- The algorithm translates a given model into the super meta-model and from there to the target meta-model.
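The strategy above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the feature names, the per-meta-model feature sets, and the `remove_inheritance` transformation are hypothetical and not taken from a real ModelGen system. It shows how a super meta-model that records supported features can tell which features must be eliminated, and one way to eliminate inheritance when targeting the relational model.

```python
# Hypothetical record of which super meta-model features each meta-model supports.
META_MODELS = {
    "java": {"entities", "attributes", "inheritance"},
    "relational": {"entities", "attributes", "foreign_keys"},
    "xml": {"entities", "attributes", "nesting"},
}


def features_to_eliminate(used: set, target: str) -> set:
    """Features used by the model that the target meta-model does not support."""
    return used - META_MODELS[target]


def remove_inheritance(classes: dict) -> dict:
    """One table per class, holding only that class's own attributes.

    A subclass table's id column references the parent's table, which
    replaces the inheritance link with a foreign key.
    """
    tables = {}
    for name, spec in classes.items():
        foreign_keys = {}
        if spec["parent"] is not None:
            foreign_keys["id"] = spec["parent"]
        tables[name] = {
            "columns": ["id"] + list(spec["attrs"]),
            "foreign_keys": foreign_keys,
        }
    return tables


# A tiny Java-like model, already expressed in the (hypothetical) super meta-model.
classes = {
    "Person": {"parent": None, "attrs": ["name"]},
    "Student": {"parent": "Person", "attrs": ["gpa"]},
}
used_features = {"entities", "attributes", "inheritance"}

print(features_to_eliminate(used_features, "relational"))
tables = remove_inheritance(classes)
print(tables["Student"])
```

Other ways of removing inheritance exist (for example, a single table for the whole hierarchy); the choice here is only one option.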

ModelGen Algorithm
Input: a model M1 in meta-model MM1.
Output: a model M2 in meta-model MM2 that is equivalent to M1.
1. Transform M1 to the super meta-model, yielding M’.
2. While M’ includes features that are not present in MM2, apply transformations to remove these features (e.g., remove a class hierarchy by translating it to multiple vertically partitioned tables).
3. Transform M’ into M2.

The Invert Operator
- Schema mappings are often directional: they map data in a source schema into a target schema.
- Natural question: can we find an inverse mapping?
- But what is the right definition of an inverse? We’ll see a couple of failed attempts before we see a good one.
- Note: the algorithms here are not generic. They are highly dependent on the meta-model.

Invert Definition: Attempt 1
- Given a mapping M between a source S and a target T, M defines a relation between pairs of instances (I, J) that are consistent with each other, where I is an instance of S and J is an instance of T.
- Hence, a natural definition is: M⁻¹ should define the relation consisting of the pairs (J, I) such that (I, J) is in M.
- However, inverses defined this way will not be expressible with tuple-generating dependencies (TGDs) or GLAV mappings. Why? See the next slide.

Attempt #1 Problem Explained
- Any relation defined by TGDs is closed up on the right and closed down on the left.
- Formally: if (I, J) is in M, I’ is a subset of I, and J is a subset of J’, then (I’, J’) is also in M.
- However, by definition, M⁻¹ would have to be closed up on the left and closed down on the right.
- Hence, M⁻¹ cannot be defined with TGDs or GLAV.

Invert Definition: Attempt 2
- Definition by composition: M composed with M’ should be the identity mapping!
- However, it can be shown that under this definition a mapping has an inverse only if the following holds: if I1 and I2 are two distinct instances of S, then their targets under M are distinct instances of T.
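A quick way to see how restrictive this condition is: take a mapping that drops a column, such as P(x, y) → Q(x), and feed it two distinct source instances. The sketch below is a simplification made for illustration. It models "the target under M" as the smallest target instance satisfying the mapping, and the function name `project_first` is hypothetical.

```python
def project_first(p_instance: frozenset) -> frozenset:
    """Smallest Q instance satisfying P(x, y) -> Q(x): keep the first column of P."""
    return frozenset(x for (x, _y) in p_instance)


# Two distinct source instances that differ only in the dropped column.
i1 = frozenset({(1, 2)})
i2 = frozenset({(1, 3)})

print(i1 != i2)                                # the sources are distinct
print(project_first(i1) == project_first(i2))  # yet their targets coincide
```

Because the two sources share one target, no mapping back from the target can recover which source it came from, so the composition cannot be the identity.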
- The condition above (distinct source instances must have distinct targets) considerably limits the mappings that have inverses. For example, m1 and m2 have no inverses:
  m1 : P(x, y) → Q(x)
  m2 : P(x, y, z) → Q(x, y) ∧ R(y, z)

Third Time’s a Charm: Quasi-Inverses
- Define equivalence between two instances w.r.t. M as: I1 ≅ I2 if, for every J, (I1, J) ∈ M iff (I2, J) ∈ M.
- Define M’ to be a quasi-inverse of M if the composition of M and M’ always maps I to an instance I' such that I ≅ I'.
- Example:
  m : P(x, y) → Q(x)
  m' : Q(x) → ∃y P(x, y)
  Applying m and then m': {P(1, 2)} → {Q(1)} → {P(1, A)}
  {P(1, 2)} ≅ {P(1, A)}
  So m' is a quasi-inverse of m.

Summary of Chapter 6
- Generic model management operators save a lot of repetitive code and can result in several forms of efficiency gains.
- Employing such operators also makes application builders think carefully about the meaning of what they are doing.
- Two main open challenges:
  - Can the implementation of these operators be described in a meta-model-independent fashion?
  - Is model management a system in itself that should be built, or should operator implementations be individual services?