SCHEMALESS APPROACH OF MAPPING XML DOCUMENTS INTO RELATIONAL DATABASE
Ibrahim Dweib, Ayman Awadi, Seif Elduola Fath Elrhman, Joan Lu
CIT 2008, Sydney, Australia, 8-11 July 2008

Why schema-less
Many applications deal with highly flexible XML documents from different sources, which makes it difficult to define their structure with a fixed schema or a DTD.
Schema-less approaches are therefore needed to handle such XML documents.

The method aims to overcome the challenges of fixed shredding
No loss of information while shredding.
Reconstruction of the original XML documents is easier and much faster.
The XML document structure is maintained.
The ordering of XML data is preserved.

Theory guidance
The main mathematical concepts used in this method are:
Definition 1: An XML tree is composed of many sub-trees at different levels. It is defined in terms of the following components, where i = 1, 2, …, n denotes the levels of the XML tree and 0 denotes the root:
E_i is a finite set of elements in level i.
A_i is a finite set of attributes in level i.
X_i is a finite set of texts in level i.
r_{i-1} is the root of the sub-tree of level i.

Theory guidance (Cont'd)
Definition 2: A dynamic fragment (shred) df(i) is defined as the attributes and texts (leaf children) of sub-tree i of the XML tree, plus its root r_{i-1}:
df(i) = (A_i, X_i, r_{i-1})
where:
A_i is a finite set of attributes in level i.
X_i is a finite set of texts in level i.
r_{i-1} is the root of the sub-tree of level i.

Design framework
A master table for documents, called the "documents" table, keeps information about the documents themselves:
documents(doc_id, doc_structure, …..)
Additional fields may be added to hold other information about the document, such as dates, statistics, and types.
The doc_id is a unique id generated per document to identify it.
The doc_structure is a large text field containing a coded string that describes the document's structure. Any change to the document structure must be reflected in this field, such as adding a new tag or property, deleting an existing tag or property, or moving a tag or property to a different location in the same document.

Design framework (Cont'd)
A second table stores the actual contents of all documents.
Documents are shredded into pieces of data called tokens; each document element, tag, or property is considered a token.
The tokens table has at minimum this structure: tokens(doc_id, token_id, token_name, token_value).
The token_id is the generated primary key for each token.
The doc_id is the foreign key linking the tokens table to the documents table.
token_name is the tag name or property name as found in the original XML document.
token_value is the text value of the XML tag or property.

Design framework (Cont'd), "doc_structure" field construction rules
The doc_structure field is where the document structure is maintained. It consists of a long series of related keys.
Each key starts with a letter: 'T' for an element (child) and 'A' for an attribute. These letters are needed to delimit the keys in the sequence.
The letter is followed by a number giving the token_id that the key refers to.
Example: T120 is a key referring to the token in the tokens table whose token_id = 120.

Design framework (Cont'd), "doc_structure" field construction rules
If a token has properties, the key representing it in the doc_structure is followed by a set of keys defining these properties.
Example: T120A12A17A2 is a valid key string for token number 120, which has three properties defined by tokens number 12, 17, and 2. These properties appear in the original document in this order.
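The two tables and the key rules described so far can be sketched in Python with the standard sqlite3 module. This is only an illustration under assumptions: the column types, the in-memory database, and the helper names make_key and split_keys are hypothetical and do not come from the slides.

```python
import re
import sqlite3

# Assumed schema: only the columns named in the slides, with guessed types.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (
    doc_id        INTEGER PRIMARY KEY,
    doc_structure TEXT
);
CREATE TABLE tokens (
    token_id    INTEGER PRIMARY KEY,
    doc_id      INTEGER REFERENCES documents(doc_id),
    token_name  TEXT,
    token_value TEXT
);
""")


def make_key(tag_id, attr_ids=()):
    """Build the key for one tag: 'T<id>' followed by one 'A<id>' per property."""
    return f"T{tag_id}" + "".join(f"A{attr_id}" for attr_id in attr_ids)


# A key is one letter ('T' or 'A') followed by the digits of a token_id.
KEY_PATTERN = re.compile(r"([TA])(\d+)")


def split_keys(key_string):
    """Split a key string into (letter, token_id) pairs, in document order."""
    return [(kind, int(num)) for kind, num in KEY_PATTERN.findall(key_string)]
```

For the example on the slide, make_key(120, [12, 17, 2]) produces the string T120A12A17A2, and split_keys recovers the four token ids in their original order, which is why the leading letters are enough to delimit the keys.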
Design framework (Cont'd), "doc_structure" field construction rules
If a token has child tags, these children are represented as a key string surrounded by angle brackets.
Example: T120<T12T7<T2T1>T77> is a valid string that reads as follows: token 120 has three sub-tags, in this order: token 12, then token 7, then token 77; token 7 in turn has two sub-tags, 2 and 1, in the given order.

Theory implementation on simple case study
<books>
  <book id="11210" category="fiction">
    <author id="a1" sex="m">M. John</author>
    <name>Computer Science 101</name>
  </book>
  <book id="11211">
    <author>A. Mark</author>
    <name>Applied Math 101</name>
    <subject>Math</subject>
  </book>
</books>
Figure 1: XML document

Theory implementation on simple case study
[Figure 2: A tree representation for the XML document in Figure 1, with nodes numbered 99 (books) to 111 (subject)]

Theory implementation on simple case study
doc_id  doc_structure
10      T99<T100A101A102<T103A104A105T106>T107A108<T109T110T111>>
Figure 5: Documents table

Theory implementation on simple case study
doc_id  token_id  token_name  token_value
10      99        books       Null
10      100       book        Null
10      101       id          11210
10      102       category    fiction
10      103       author      M. John
10      104       id          a1
10      105       sex         m
10      106       name        Computer Science 101
10      107       book        Null
10      108       id          11211
10      109       author      A. Mark
10      110       name        Applied Math 101
10      111       subject     Math
Figure 6: Tokens table

Experimental environment
An Intel Core 2 Duo computer with a 2 GHz CPU, 1 GB RAM, and 256 MB shared cache.
OS: Windows Vista Home edition.
Visual Basic 6 is used as the software development kit, with Microsoft Access 2003 as the target relational database.
Five XML documents of different sizes are used in the experiment.
The data is taken from the XML data repository available on the web site of the School of Computer Science and Engineering, University of Washington.
The performance metrics are the time spent mapping XML documents to the relational database and the time spent reconstructing these documents from it.
The experiment is repeated five times and the mean of those times is reported to obtain realistic and accurate results.

Experimental results
Document size               4 KB         28 KB       64 KB      602 KB     1 MB
Mapping time (secs)         0.01988238   0.14977736  0.3551445  3.574335   5.85278136
Reconstructing time (secs)  0.018990234  0.44980958  1.926836   18.305544  32.06255104
Table 1: The time spent for mapping XML documents to RDBMS, and the time for reconstructing them

Experimental results
[Chart: The time spent for mapping XML documents to RDBMS and the time spent for reconstructing them. X-axis: document size (4 KB, 28 KB, 64 KB, 602 KB, 1 MB). Y-axis: time spent in seconds (0 to 35). Series: mapping time, reconstructing time.]

Conclusion (1)
Using this method:
The document structure is maintained easily and at low cost.
Rebuilding the original document is straightforward.
First-level semantic search is achievable, either on a single document or on all documents.

Conclusion (2)
Method limitations:
Complex semantic search is not easily achievable with this structure.
Document size is limited by memory size, since DOM-based parsing is used.

Future work
Improving the method to support complex semantic search and to differentiate between XML data types (i.e., strings, dates, integers), so that less-than and greater-than queries can be applied.
Testing intensively and comparing the method with other methods in the literature to assess its performance.
Using SAX parsing for XML documents to remove the document size limitation.

Thank You for Your Time
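As an illustration of the reconstruction step, the doc_structure string from Figure 5 and the token rows from Figure 6 can be turned back into an XML tree. The following is a minimal Python sketch under assumptions: the function name rebuild, the TOKENS dictionary standing in for the tokens table, and the use of xml.etree.ElementTree are illustrative choices, not the authors' Visual Basic 6 implementation.

```python
import re
import xml.etree.ElementTree as ET

# doc_structure value copied from Figure 5.
DOC_STRUCTURE = "T99<T100A101A102<T103A104A105T106>T107A108<T109T110T111>>"

# token_id -> (token_name, token_value), copied from Figure 6 (Null becomes None).
TOKENS = {
    99: ("books", None), 100: ("book", None),
    101: ("id", "11210"), 102: ("category", "fiction"),
    103: ("author", "M. John"), 104: ("id", "a1"), 105: ("sex", "m"),
    106: ("name", "Computer Science 101"), 107: ("book", None),
    108: ("id", "11211"), 109: ("author", "A. Mark"),
    110: ("name", "Applied Math 101"), 111: ("subject", "Math"),
}

# One lexeme is either a key (letter plus token_id) or an angle bracket.
LEXEME = re.compile(r"([TA])(\d+)|([<>])")


def rebuild(doc_structure, tokens):
    """Rebuild an ElementTree element from a doc_structure string."""
    root = None
    current = None  # element created by the most recent 'T' key
    parents = []    # stack of elements whose child list is currently open
    for kind, num, bracket in LEXEME.findall(doc_structure):
        if bracket == "<":
            parents.append(current)  # following 'T' keys are children of current
        elif bracket == ">":
            parents.pop()            # child list of the innermost parent is closed
        else:
            name, value = tokens[int(num)]
            if kind == "T":
                if parents:
                    current = ET.SubElement(parents[-1], name)
                else:
                    current = root = ET.Element(name)
                current.text = value
            else:
                # 'A' keys directly follow the 'T' key of the tag they belong to.
                current.set(name, value)
    return root
```

Calling ET.tostring(rebuild(DOC_STRUCTURE, TOKENS), encoding="unicode") serialises the rebuilt tree without the original indentation. A single left-to-right pass with a stack is enough because the key string already records both nesting and sibling order.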