A ToolBox for Conservative XML Schema Evolution and Document Adaptation Joshua Amavi, Jacques Chabin, Mirian Halfeld Ferrari, and Pierre Réty Univ. Orléans, INSA Centre Val de Loire, LIFO EA 4022, FR-45067 Orléans, France {joshua.amavi,jacques.chabin,mirian,pierre.rety}@univ-orleans.fr Abstract. We propose an algorithm that computes a mapping to obtain a conservative extension of original local schemas. This mapping ensures schema evolution and guides the construction of a document translator. 1 Introduction Our goal is to establish a multi-system environment composed by a global central system which is a conservative evolution of local ones, capable of processing changes that can then be transmitted to local systems. The communication should be possible in both directions: local-to-global and global-to-local. We allow independent local services to continue working on their own data, with their own tools while permitting diagnosis and changes based on a general and complete view of all services. This scenario requires tools for dealing with type evolution and document adaptation. It can be useful as a temporary configuration, deferring complete integration until local systems are ready, or as a flexible architecture adopted by the enterprise. In this context, we suppose that S1 , . . . , Sn are local systems which deal with sets of XML documents X1 , . . . , Xn , respectively, and that inter-operate with a global, integrated system S. Each set Xi conforms to schema or type constraints Di , while D is an extended type (of S) that accepts any local document from D i . We assume that the global system S may evolve to S , accepting more documents or rejecting some original ones. Our goal is to propose tools allowing automatic type transformation accompanied by automatic document translation. We implement a platform1 where all our proposed tools will be available (refer to [3] for details): • ExtSchemaGenerator ([5]) extends a given schema G, seen as a regular tree grammar, into a new grammar G respecting the following property: the language generated by G is the smallest set of unranked trees that contains the language generated by G and the grammar G is a Local Tree Grammar (LTG) or a Single-Type Tree Grammar (STTG). 1 From previous work: ExtSchemaGenerator , available on http://www.univ-orleans.fr/ lifo/Members/rety/logiciels/RTGalgorithms.html XMLCorrector , available on http://www.info.univ-tours.fr/~ savary/English/xmlcorrector.html H. Decker et al. (Eds.): DEXA 2014, Part I, LNCS 8644, pp. 299–307, 2014. c Springer International Publishing Switzerland 2014 300 J. Amavi et al. • XMLCorrector ([2]) corrects an XML document w.r.t. schema constraints expressed as a DTD (or an LTG). The corrector reads the entire XML tree t to propose solutions. XMLCorrector finds all solutions within a given threshold th. • MappingGen: We propose an algorithm that applies the ideas of [5] to generate a mapping from one schema G, seen as a regular tree grammar, to an extended schema G which will be an LTG. The resulting schema mapping m is a sequence of operations on grammar rules that indicates, step by step, how to transform G into G following the approach in [5]. Given a mapping m we can easily compute its inverse m−1 or compose it to other mappings; allowing schemas to evolve. • XTraM : Based on a given mapping m (from schema S to T ), we propose a method to translate an XML document (or tree) t, valid w.r.t. S into a document t valid w.r.t. T . The edit distance between t and t is no higher than a given positive threshold th. Moreover, t is the closest tree to t, obtained by changing t according to the schema modifications imposed by m. For each edit operation on S, to obtain T , we analyse what should be the corresponding update on document t. When this update violates validity, we use XMLCorrector to propose corrections to the subtree involved in the update. 2 Schema Evolution An XML document is an unranked tree, defined in the usual way as a mapping t from a set of positions P os(t) to an alphabet Σ. We deal with XML schema represented as RTG (regular tree grammar) G = (N, Σ, S, P ), where: N is a finite set of non-terminals; Σ is a finite set of terminals; S is a set of start symbols, where S ⊆ N and P is a finite set of production rules of the form X → a [R], where X ∈ N , a ∈ Σ, and R is a regular expression over N . As usual, in this paper, our algorithms start from grammars in reduced and normal form. Among RTG we are interested in local tree grammars (LTG) which have the same expressive power as DTD. We refer to [5] for details. Conservative XML Type Extension (ExtSchemaGenerator). In order to compute an LTG that extends minimally a given RTG, we follow the idea of ExtSchemaGenerator, presented in [5]. This method is very simple when dealing with the generation of an LTG from an RTG: replace each pair of competing non-terminals by a new non-terminal, until there are no more competing non-terminals. The regular expression of a new non-terminal rule is the disjunction of the regular expressions associated to competing non-terminals. Consider three hospital services (patient and treatment service, insurance service and bill service), each one having its own LTG (or DTD) as schema. Figure 1(lines 1-5) shows the RTG obtained by the union of the production rules of all these three grammars while Figure 1(lines 3-6) shows the resulting LTG. The obtained LTG is an extension of the original RTG since it generates all trees generated by the original RTG and possibly others as well (refer to example of Figure 3). Clearly, the obtained grammar is also an extension of each hospital A ToolBox for Conservative XML Schema 1 2 3 4 5 6 H1 → hospital[I1∗ ] H2 → hospital[I2∗ ] H3 → hospital[I3∗ ] I1 → inf o[P | T ] I2 → inf o[C | P ol] I3 → inf o[B] P → patient[S · N · V ∗ ] C → cover[S · P N ] B → bill[S · It∗ · D] ∗ V → visitInf o[Id · D] P ol → policy[P N · Id ] T → treatment[Id · T N · P R] P R → procedure[T ∗ ] H1 → hospital[I1∗ | I1∗ | I1∗ ] 301 It → item[Id · P Z] I1 → inf o[(P | T ) | (C | P ol) | B] Fig. 1. Lines 1-5 : RTG obtained from the union of production rules of grammars. Lines 3-6 : LTG obtained by algorithm in [5] from the RTG (lines 1-5). service grammar. In Figure 3(a) we find a document valid w.r.t. the billing local schema. It is also valid w.r.t. the global schema of Figure 1(lines 3-6). Schema Mappings. We say that a source schema (or grammar) evolves to a target schema. A schema mapping is specified by an operation list, denoted as an edit script, that should be performed on source schema in order to obtain the target schema. We propose an algorithm that generates a mapping to translate an RTG G into an LTG G , following the lines of [5] . Our mapping is composed by a sequence of edit operations that should be applied on the rules of grammar G in order to obtain G . In the following definition, let ed be an edit operation defined on RTG G. We denote by ed(G) the RTG obtained by applying ed on G. Each edit operation is associated with a cost that can be fixed according to the user’s priority. The cost of an edit script is the sum of the costs of the edit operations composing it. Definition 1 (Edit Script). An edit script m = ed1 , ed2 , . . . edn is a sequence of edit operations edk where 1 ≤ k ≤ n. Let G be an RTG, an edit script m = ed1 , ed2 , . . . edn is defined on G iff there exists a sequence of RTG G0 , G1 , . . . , Gn such that: (i) G0 = G and (ii) ∀ 1 ≤ k ≤ n, edk is defined on Gk−1 and edk (Gk−1 ) = Gk . Hence, m(G) = Gn . The empty edit script is denoted n (cost(edi )). 2 . The cost of an edit script m is defined as cost(m) = Σi=1 Definition 2 (Schema Mapping). A schema mapping is a triple M = (S, T, m), where S is the source schema, T is the target schema, and m is an edit script that transforms S into T (i.e., m(S) = T ). We say that M is syntactically specified by, or, expressed by m. 2 Edit Operations. In this section, the problem of changing one RTG into another is treated as a tree editing problem. Tree Representation for Production Rules. We represent the right-hand side of a production rule X → a [R] as a tree denoted trX such that trX = a(tR ). The root of trX is the terminal a which has only one subtree tR . For example, in Figure 2, the tree on the top left corner is trI with I → inf o[T.(Y.Co)]. Definition 3 (Well Formed Tree). A tree t representing the right-hand side of a production rule is well formed iff the following conditions are verified: (i) the root is a terminal symbol with only one child; (ii) the leaves nodes are in 302 J. Amavi et al. N ∪ {} and (iii) the internal nodes are in the set {|, ., ∗} having exactly one child (if it is in {∗}) or at least one child, in the other cases. 2 Elementary Edit Operations. In [3], we define elementary edit operations by using rewriting. Here we give an intuition of some of these operations. Given an RTG G = (N, Σ, S, P ) in normal form, an elementary edit operation ed is a partial function that transforms G into a new RTG G . The elementary edit operation ed can be applied on G only if ed is defined on G. Below we distinguish four types of elementary edit operations on RTG, where A ∈ N : 1. Edit operations to modify the set of start symbols S: set startelm(A) and unset startelm(A) to add/delete the non-terminal A to/from S. 2. Edit operations to modify non-terminal or terminal symbols in a content model. For example, ins elm(X, A, u.i): (cf. Figure 2(ed1 )) applies the −→ op op y y to insert A in the position u.i of tree repx A rewrite rule x resenting the production rule associate to X; del elm(X, A, u.i): (cf. Figure 2(ed2 )) is the inverse operation. The other operations are: rel root(X, a, b): (cf. Figure 2(ed3 )) to rename the root from a to b; rel elm(X, A, B, u): (cf. Figure 2(ed4 )) to rename non-terminal A at position u into B. 3. Edit operations to modify operator symbols in a content model are similar to those described in the previous item, but apply on operator nodes. See for instance Figure 2(ed5 ), (ed6 ) and (ed7 ). 4. Edit operations to modify the set of production rules P are: ins rule(A, a) that adds the new production rule A → a [] to P and the non-terminal A to S, where A ∈ N and del rule(A, a) its inverse. After each edit operation, the sets Σ and N are automatically updated to contain all and only the terminal (resp. non-terminal) symbols appearing in P . Proposition 1. Let G and G be two RTG. There exist an edit script, composed 2 only by elementary edit operation, that transforms G into G . Non-elementary Edit Operations. For readability and cost estimation, we define short-cut operations, i.e., operations seen as a one-block operation but equivalent to a sequence of elementary edit operations, such as ins tree(X, R, u.i), ins treerule(A, a, R) and inversely del tree(X, R, u.i), del treerule(A, a, R). Operation Cost. For each edit operation ed, we define a non-negative and application-dependent cost. On the one hand, we assume that operations that do not change the language generated by the RTG G on which they were applied, are 0-cost. Their goal is just to simplify a given regular expression. For instance, del opr(X, opr, u.i) where trX (u) = trX (u.i) = opr and del opr(X, opr, u.i) where trX (u.i) ∈ {|, .} and trX (u.i) has exactly one child, are 0-cost operations. On the other hand, we suppose that an elementary edit operation costs 1, while a non-elementary edit operation costs 5. A ToolBox for Conservative XML Schema info I: info 0 .0 ins elm(I,Nb,0.1.1) . 0.0 .0.1 0.0 ed1 .0.1 T T Y Co Y Nb Co 0.1.0 0.1.1 0.1.0 0.1.1 0.1.2 I: info I : information rel root(I,info,information) .0 .0 .0.1 0.0 ed3 .0.1 0.0 T T Y Nb Co Y Nb Co 0.1.0 0.1.1 0.1.2 0.1.0 0.1.1 0.1.2 I: Af : .0 0.0 0.1 0.2 Ct Zp Afn ins opr(Au,|,0.2,2) ed5 . 0.1 0.2 0.3 Em Tel 0.0 N 0.1.0 Afn 0.1.1 Ct I : information .0 del opr(I,.,0.1,3) 0.0 .0.1 ed6 T Y Nb Co 0.1.0 0.1.1 0.1.2 I : information 0 . 0.0 T Y Nb Co 0.1 0.2 0.3 affiliation del elm(Af,Zp,0.1) ed2 . 0.0 Ct 0.1 Afn 0.0.0 Au 0.0.0 Au Au : .0 Af : Pu : publication Pu : publication 0 0 . . 0.1 0.1 rel elm(Pu,Pa,Paper,0.1) 0.0 0.0 * Paper ed4 * Pa author Au : affiliation 303 author .0 0.2 0.0 . 0.1 | N 0.1.0 0.2.0 0.2.1 0.1.1 Afn Ct Em Tel Bib : bibliography 0 * 0.0 Pu Bib: bibliography rel opr(Bib,*,.,0) 0 . ed7 0.0 Pu Fig. 2. Example of elementary edit operations Generating a Schema Mapping (MappingGen). Algorithm 1 generates a mapping that converts an RTG in an LTG by following the ideas in [5], explained in Section 2. This algorithm starts by determining a set of competing non-terminals ECa (lines 2-3). Then we can take arbitrarily in ECa , one of these non-terminals (say X0 ) to represent all others, i.e., when merging rules of competing terminals, one non-terminal name is chosen to represent the result of the merge (line 4). Recall that edit operations always deal with a production rule in its tree-like format. The new production rule of X0 is built in two steps. We add an OR operation as the parent of its original regular expression reg(X0 ) (line 5) and then we insert all regular expressions associated with its competing non-terminals as siblings of reg(X0 ) (line 7). In line 8 we just replace, in all production rules, non-terminals in ECa by X0 . Original rules of non-terminals in ECa are deleted (line 10) after, possibly, adjusting start symbols (line 9). Proposition 2. Let m be the mapping obtained by Algorithm 1 from an RTG G. The language L(m(G)) is the least LTL that contains L(G). Moreover, the grammar m(G) equals the one obtained by ExtSchemaGenerator . 2 Going Further with Mappings to Support Schema Evolution. In [6], it was shown how two fundamental operators on schema mappings, namely composition and inversion, can be used to address the mapping adaptation problem in the context of schema evolution. Let M1 be a mapping between XML schemas S and T . When S or T evolve, M1 shall be adapted. By using composition and inversion operators, one can avoid mapping re-computation. We now precise the notions of composition and inversion in our context. 304 J. Amavi et al. Algorithm 1. A mapping for transforming an RTG into an LTG Input: A Regular Tree Grammar G = (N T, Σ, S, P ) Output: An edit script m between G and the LTG G such that L(G) ⊆ L(G ) 1: m := 2: for each terminal symbol a ∈ Σ do 3: ECa = {X0 , . . . , Xk } is a set of competing non-terminals where term(Xi ) = a 4: Non-terminal X0 is choosed to represent X0 , . . . , Xk 5: Add ins opr(X0 , |, 0, 1) to m 6: for each non-terminal Xi ∈ {X1 , . . . , Xk } do 7: Add ins tree(X0 , reg(Xi ), 0.i) to m 8: Add rel elm(Y, Xi , X0 , u) to m, for all u where u is the position of Xi in the rule Y → b [R] ∈ P 9: Add set startelm(X0 ) to m where X0 ∈ S and Xi ∈ S 10: Add del treerule(Xi , a, reg(Xi )) to m 11: end for 12: end for 13: return m Definition 4 (Mapping composition and inversion). Given two mappings M1 = (S, T, m1 ) and M2 = (T, V, m2 ), the composition of M1 and M2 is the mapping M1 ◦ M2 = (S, V, m1 . m2 ). If m1 = ed1 , · · · , edn , then the inverse of −1 −1 −1 −1 mapping M1 is the mapping M−1 1 = (T, S, m1 ) where m1 = edn , · · · , ed1 −1 2 and edk (1 ≤ k ≤ n) is defined in [3] . 3 Adapting XML Documents to a New Type Correcting XML Documents (XMLCorrector) In [2], given a well-formed XML tree t, a schema G and a non negative threshold th, XMLCorrector finds every tree t valid w.r.t. G such that the edit distance between t and t is no higher than th. Contrary to most other approaches, [2] considers the correction as an enumeration problem rather than a decision problem and computes all the possible corrections on t. The algorithm, proved to be correct and complete in [2], consists in fulfilling an edit distance matrix which stores the relevant edit operation sequences allowing to obtain the corrected trees. The theoretical exponential complexity of XMLCorrector is related to the fact that edit sequences and the corresponding corrections are generated and that the correction set is complete. In this paper, contrary to [2], we do not consider all the possible corrections on t. The correction of XML documents is guided by a given mapping. For each edit operation on S, to obtain T , we analyse what should be the corresponding update on document t. When this update violates validity, we use XMLCorrector to propose corrections to the subtree involved in the update. 2 Notice that inverses are defined in the intuitive way as, for example, ins elm(X, A, u) del elm(X, A, u) or rel opr(X, p, q, u) rel opr(X, q, p, u). A ToolBox for Conservative XML Schema 305 Document Translation Guided by Mapping (XTraM). Our method consists in performing a list of changes on XML documents, in accordance with the edit operations found in the mapping. For example, adding or deleting a regular expression in a rule under the operator ’.’ is a mapping operation that provokes, respectively, the insertion or the deletion of a subtree in an originally valid XML tree (to maintain its validity). Similarly, renaming a non-terminal A by B, provokes the substitution of the subtree generated by A into the subtree generated by B. When local correction on XML subtrees are needed, XMLCorrector is used to ensure document validity. (, H1 ) (0, I10.0.0 ) info SSN price (a) SSN SSN trId date (b) (1, I10.0.0 ) info info (1.0, P 0.0.0 ) patient bill name visitInfo trId (0, I10.0.0 ) (0.0, P 0.0.0 ) (1.0, B 0.2 ) patient item date trId (1, I10.0.0 ) info info (0.0, P 0.0.0 ) bill (, H1 ) hospital hospital hospital item date price SSN patient name visitInfo SSN name visitInfo trId trId date date (c) Fig. 3. (a) XML tree valid w.r.t. the billing local schema. (b) XML tree valid w.r.t. the global schema of Figure 1(lines 3-6). (c) Tree resulting from the translation of (b) into the patient local schema (cf. Figure 1(lines 1-5)). Trees (b) and (c) are annotated. Consider an XML tree t valid w.r.t. schema S and a mapping m from S to T . Our method can be summarized in two steps: 1. Since t belongs to the language L(S), it is possible to associate a non-terminal A with each tree node position p generated by this non-terminal. We analyse t, detect each non-terminal and annotate it with its corresponding position u in the used production rule. This annotation respects the format (p, Au ). For example, in Figure 3(b), we notice that the tree node bill is generated by the non-terminal B whose position in trI1 is 0.2, noted as (1.0, B 0.2 ). 2. Each edit operation ed in m activates a set of modifications on t. When ed transforms a grammar into a new grammar containing the previous one, the set of modifications is empty. Otherwise, our method consists in traversing t (marked as in step 1) in order to find the tree positions which may be affected due to ed. Modifications on t are defined according to each edit operation and are not detailed here due to the lack of space. Obviously, if no position is affected, t does not change. Figure 3(b) shows an XML document concerning patients and bills. This document is valid w.r.t. the global schema but not valid w.r.t. to any local schemas. Translating the document of Figure 3(b) into a document respecting the patient schema we obtain the document of Figure 3(c). 306 4 J. Amavi et al. Related Work and Concluding Remarks Much other work deals with schema evolution. In [6] second order logic is needed to express some mapping compositions. This approach is the basis of [1,8,12]. We believe that the use of edit operations makes our approach simpler than theirs and gets on well with our previous work concerning XML document correction. Proposals, such as those in [7,9,10,4,11], use edit operations. ELaX [10] and Exup [4] are a domain-specific language that proposes to handle modifications on XSD and to express such modifications formally. An important originality of our approach is the automatic generation of a conservative extension of an RTG into an LTG and the fact that it may propose different solutions to be chosen by the user. XTraM is guided by a mapping and produces documents with corrections that do not exceed a threshold. Our ToolBox offers schema evolution mechanisms accompanied by an automatic adaptation of XML documents. Its conservative aspect guarantees great flexibility when a global integrated system co-exists with local ones. A prototype, implemented in Java, is been tested. As a first experiment, we have produced an LTG, in 24ms, by merging the grammars obtained from dblp DTD3 and HAL XSD 4 . MappingGen returned a 19-operation mapping. Then XTraM was used to adapt a 52-node document valid w.r.t. the computed LTG toward the HAL grammar, giving, in this case, 36 solutions in 22.6 s. As in this test, all possible translations can be considered, but the user may also interfere in an intermediate step, making choices before the end of the complete computation - guiding and, thus, restricting the number of solutions. References 1. Amano, S., Libkin, L., Murlak, F.: XML schema mappings. In: Proceedings of the 28th ACM Symposium on Principles of Database Systems, PODS 2009, pp. 33–42. ACM, New York (2009) 2. Amavi, J., Bouchou, B., Savary, A.: On correcting XML documents with respect to a schema. The Computer Journal 56(4) (2013) 3. Amavi, J., Chabin, J., Halfeld Ferrari, M., Réty, P.: A toolbox for conservative XML schema evolution and document adaptation. CoRR abs/1406.1423 (2014) 4. Cavalieri, F., Guerrini, G., Mesiti, M.: Updating XML schemas and associated documents through Exup. In: Proceedings of the IEEE 27th International Conference on Data Engineering, ICDE 2011, pp. 1320–1323. IEEE Computer Society, Washington, DC (2011) 5. Chabin, J., Halfeld Ferrari, M., Musicante, M.A., Réty, P.: Conservative Type Extensions for XML Data. In: Hameurlain, A., Küng, J., Wagner, R. (eds.) TLDKS IX. LNCS, vol. 7980, pp. 65–94. Springer, Heidelberg (2013) 6. Fagin, R., Kolaitis, P.G., Popa, L., Tan, W.C.: Schema mapping evolution through composition and inversion. In: Bellahsene, Z., Bonifati, A., Rahm, E. (eds.) Schema Matching and Mapping, pp. 191–222. Springer (2011) 3 4 http://dblp.uni-trier.de/xml/dblp.dtd http://import.ccsd.cnrs.fr/xsd/generationAuto.php?instance=hal A ToolBox for Conservative XML Schema 307 7. Horie, K., Suzuki, N.: Extracting differences between regular tree grammars. In: Proceedings of the 28th Annual ACM Symposium on Applied Computing, SAC 2013, pp. 859–864. ACM, New York (2013) 8. Jiang, H., Ho, H., Popa, L., Han, W.S.: Mapping-driven XML transformation. In: Proceedings of the 16th International Conference on World Wide Web, WWW 2007, pp. 1063–1072. ACM, New York (2007) 9. Leonardi, E., Hoai, T.T., Bhowmick, S.S., Madria, S.K.: Dtd-diff: A change detection algorithm for dtds. Data Knowledge Engineering 61(2), 384–402 (2007) 10. Nösinger, T., Klettke, M., Heuer, A.: XML schema transformations. In: Decker, H., Lhotská, L., Link, S., Basl, J., Tjoa, A.M. (eds.) DEXA 2013, Part I. LNCS, vol. 8055, pp. 293–302. Springer, Heidelberg (2013) 11. Suzuki, N., Fukushima, Y.: An XML document transformation algorithm inferred from an edit script between DTDs. In: Proceedings of the 19th Conference on Australasian Database, ADC 2008, vol. 75, pp. 175–184. Australian Computer Society, Inc., Darlinghurst (2007) 12. Yu, C., Popa, L.: Semantic adaptation of schema mappings when schemas evolve. In: Proceedings of the 31st International Conference on Very Large Data Bases, VLDB 2005, pp. 1006–1017. VLDB Endowment (2005)