XML Schema Change Detection, Versioning and Merging

Submitted by: Abdullah Mohammad Baqasah
M.S. (Information technology)
B.S. (Computer Science)

A thesis submitted in total fulfilment of the requirements for the degree of Doctor of Philosophy

School of Engineering and Mathematical Science
College of Science, Health and Engineering
La Trobe University
Bundoora, Victoria, 3086
Australia

January 2015

Content

Content
List of Figures
List of Tables
Abstract
Statement of Authorship
Contributions and Thesis Outcomes
Acknowledgements
Chapter 1: Introduction
1.1. Background
1.1.1. XML Schema Popularity
1.1.2. XML Schema and its Components
1.2. Motivation
Chapter 2: Monitoring the Design of XML Schema Versions: A Survey
Chapter 3: Conclusion
References

List of Figures

Figure 1.1 XML Schema component diagram
Figure 2.1 XML Schema trees T1 and T2

List of Tables

Table 1.1 Schema changes

Abstract

The eXtensible Markup Language (XML) is a meta-language that is widely used to provide a non-proprietary universal format for sharing hierarchical data across different software systems and application domains. Specifications and standards used in these domains are defined by the XML Schema Definition Language (XSD). XML Schema standards tend to change over time for a multitude of reasons, such as the introduction of new requirements and the correction of errors in the initial design. Moreover, complex structures (e.g., complexType in XML Schema) are created but not fully configured, leaving space for future extension or restriction. As a result, each standard ends up with different versions of the same schema. In this context, managing the different versions of a schema and the changes between them can be critical for the optimal functioning of XML-based applications.
The work in this thesis is devoted to addressing the three topics of XML Schema change detection, versioning and merging, along with related issues such as the appropriate design of the change model (delta) used to store XSD changes, the storage technique for XSD versions, and the methodology of inserting, retrieving, comparing, and merging XSD versions.

Statement of Authorship

Except where reference is made in the text of the thesis, this thesis contains no material published elsewhere or extracted in whole or in part from a thesis submitted for the award of any other degree or diploma. No other person’s work has been used without due acknowledgment in the main text of the thesis. This thesis has not been submitted for the award of any degree or diploma in any other tertiary institution. I declare that the research in this thesis is my own original work during my PhD candidature under the supervision of members of the advisory panel, i.e., Dr. Eric Pardede (main supervisor) and Prof. Wenny Rahayu (co-supervisor), except where otherwise acknowledged in the text.

Abdullah Baqasah

Contributions and Thesis Outcomes

Journals
1. Baqasah, A., Pardede, E., Holubova, I., and Rahayu, W. (2015) XS-Diff: XML Schema Change Detection Algorithm, International Journal of Web and Grid Services (IJWGS)
2. Baqasah, A., Pardede, E., and Rahayu, W. (2015) Maintaining Schema Versions Compatibility in Cloud Applications Collaborative Framework, World Wide Web Journal (WWWJ)

Conferences
1. Baqasah, A., Pardede, E., Holubova, I., and Rahayu, W. (2013) On Change Detection of XML Schemas, In 2013 IEEE 12th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom2013), pp. 974-982
2. Baqasah, A., Pardede, E., and Rahayu, W. (2014) XSM - A Tracking System for XML Schema Versions, In 2014 IEEE 28th International Conference on Advanced Information Networking and Applications (AINA2014), pp. 1081-1088
3. Baqasah, A., Pardede, E., and Rahayu, W. (2014) A New Approach for Meaningful XML Schema Merging, In the 16th International Conference on Information Integration and Web-based Applications & Services (iiWAS2014)

Acknowledgements

First of all, praise and glory be to Allah the Almighty God who gave me the strength to complete this research. Next, my sincere thanks go to my first supervisor, Dr. Eric Pardede, for his unwavering support throughout my PhD candidature, and for his insightful suggestions in shaping this thesis. My special thanks also go to my second supervisor, Prof. Wenny Rahayu, for her expert guidance and extraordinary effort in ensuring that the thesis met appropriate scholarly standards.

Chapter 1: Introduction

Section 1.1. of this chapter gives a general background. With the number of applications, users, and corporations in various industries rising every day, the amount of data received, stored, processed, and interchanged grows as well. In addition to being stored in relational database systems, data can be represented in the eXtensible Markup Language (XML) format [W3C, 2008]. The structure of the exchanged XML data can differ not only between corporations, but even among departments of a particular corporation. For instance, if one corporation receives data from its partner, it has to convert the data to its own format. This conversion usually involves overheads in terms of time and resources. These overheads can be minimised by the creation of unified standards for that exchange.

1.1. Background

1.1.1. XML Schema Popularity

With the growing popularity of XML technology, combined with certain shortcomings of DTDs, a large number of alternative schema languages have been proposed, such as RELAX NG [Clark & Makoto, 2001], Schematron [Jelliffe, 2000], and XML Schema (XSD) [W3C, 2004a].
A study conducted by [Grijzenhout & Marx, 2013] to analyse the quality of the XML web shows that 13,950 valid XML documents reference XSDs, compared to only 2,046 valid XML documents that reference DTDs. The study indicates that documents referencing an XSD are more reliable than those referencing a DTD because 86.3% of the downloadable XSDs are compiled (i.e., all referenced schemas can be retrieved and are syntactically correct), compared to 29.8% of the downloadable DTDs. Given this comparison, XML Schema seems to be the most widely accepted schema language.

1.1.2. XML Schema and its Components

The XML Schema 1.0 specification was developed as an official recommendation by the World Wide Web Consortium (W3C) in 2001. The specification was then revised in a second edition in 2004, and in 2012, version 1.1 of the specification became official. This new version contains several significant improvements as well as many small changes. The schema specification consists of three parts. For more information, see Figure 1.1.

Figure 1.1 XML Schema component diagram

1.2. Motivation

We justify the objective of the thesis by looking at different use cases where XML Schema change control is required. These use cases, discussed by the W3C working group¹, are based on real examples submitted by users of XML Schema. The analyses of the use cases below describe how XML Schema validators behave when they receive different XML Schemas and the instances corresponding to them. In this thesis, we aim to provide a new methodology for better control of XML Schemas and their development process. Therefore, the use cases are devised so as to examine different issues that one may come across when versioning, differencing, and merging XML Schemas. Each use case corresponds to one or more of these XML Schema issues (i.e., differencing, versioning, or merging) and explains its requirements with examples. The schema changes are listed in Table 1.1.
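To make the kind of change recorded in Table 1.1 more concrete, the following Python sketch compares two small XSD fragments and reports update and insert operations. It is only an illustration under stated assumptions: the two fragments are hypothetical reconstructions of changes 1 to 3 for a single unit element, the names V_BASE, V_NEXT, flatten and naive_diff are invented here, and the path-based comparison is a deliberate simplification rather than the change detection algorithm proposed in this thesis.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical base version: 'unit' uses the built-in string type.
V_BASE = f"""
<xs:schema xmlns:xs="{XS}">
  <xs:element name="value1">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="unit" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

# Hypothetical next version: 'unit' gets a local simple type restricted to 'mmHg'.
V_NEXT = f"""
<xs:schema xmlns:xs="{XS}">
  <xs:element name="value1">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="unit">
          <xs:simpleType>
            <xs:restriction base="xs:string">
              <xs:enumeration value="mmHg"/>
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def flatten(node, prefix=""):
    """Map each node to a path such as '/schema/element[value1]' and its attributes.

    Simplification: siblings that share the same path overwrite each other.
    """
    label = local(node.tag)
    key = node.get("name") or node.get("value")
    if key:
        label += f"[{key}]"
    path = f"{prefix}/{label}"
    nodes = {path: dict(node.attrib)}
    for child in node:
        nodes.update(flatten(child, path))
    return nodes


def naive_diff(old_xml, new_xml):
    """Return (operation, path) pairs describing how new_xml differs from old_xml."""
    old = flatten(ET.fromstring(old_xml))
    new = flatten(ET.fromstring(new_xml))
    changes = []
    for path, attrs in old.items():
        if path not in new:
            changes.append(("delete", path))
        elif attrs != new[path]:
            changes.append(("update", path))  # same node, different attributes
    for path in new:
        if path not in old:
            changes.append(("insert", path))
    return changes


for op, path in naive_diff(V_BASE, V_NEXT):
    print(op, path)
```

In this sketch, removing the type attribute shows up as an update of the unit element, while the new simple type, its restriction, and the enumeration each appear as separate inserts. Table 1.1 groups the simple type and its restriction into one operation, so the granularity of a real delta model may well differ from what this comparison reports.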
¹ Available at: http://www.w3.org/XML/2005/xsd-versioning-use-cases.html

Table 1.1 Schema changes

No. | Change type | Affected node/s | Change description
Changes from Vb to V1
1 | update | unit elements under value1 and value2 elements | Remove the built-in string data type from both unit elements.
2 | insert | Two local simple types of unit elements | Insert two local simple types with restrictions under the declaration of the two unit elements.
3 | insert | Two enumerations with value ‘mmHg’ | Insert one enumeration as a restriction facet to the two inserted simple types in the previous operation.
Changes from V1 to V2
4 | update | magnitude elements under value1 and value2 elements | Remove the built-in decimal data type from both magnitude elements.
5 | insert | Two local simple types of magnitude elements | Insert two local simple types with restrictions under the declaration of the two magnitude elements.
6 | insert | minInclusive and maxInclusive facets of the first magnitude element | Insert minInclusive and maxInclusive facets with values ‘90’ and ‘140’, respectively, to the inserted simple type of the first magnitude element.
7 | insert | minInclusive and maxInclusive facets of the second magnitude element | Insert minInclusive and maxInclusive facets with values ‘60’ and ‘90’, respectively, to the inserted simple type of the second magnitude element.

Chapter 2: Monitoring the Design of XML Schema Versions: A Survey

In this chapter, we conduct a thorough literature review discussing prior research on topics related to this thesis. The first section presents existing work on detecting XML Schema changes. It investigates existing efforts on change detection for hierarchically structured data, XML documents, and XML Schemas as a basis to control XML Schema versions and to store deltas.
It then discusses work related to XML Schema evolution to identify changes that need to be maintained in the schema. The next section discusses existing work in XML Schema versioning. First, it studies existing DBMS tools that support versioning. Then, it presents past research efforts to version both XML documents and XML schemas, followed by a discussion of related work on XML Schema quality analysis as a subtask in the versioning process. We then study the versioning functionality in terms of the existing approaches to merging XML versions, and related issues such as conflict resolution and patching. The aim of this survey is to investigate the current state of research on XML Schema versioning and merging, and to highlight the issues that remain outstanding. Finally, we conclude the chapter.

Figure 2.1 XML Schema trees T1 and T2 (legend: migrated/moved node, deleted node, inserted node, updated node, and migration type, i.e., local-to-global or global-to-local)

Chapter 3: Conclusion

This final chapter concludes the thesis with a summary of the research conducted and a reflection on avenues for future research. The first part of the chapter outlines the work detailed in the previous chapters of the thesis. The second part turns to suggestions on how research in the area of XML Schema version management, differencing, and merging can be extended.

References

1.
2.
3.