Document 13135584

advertisement
2012 4th International Conference on Computer Engineering and Technology (ICCET 2012)
IPCSIT vol.40 (2012) © (2012) IACSIT Press, Singapore
Siblings Labeling Scheme for Updating XML Trees Dynamically
Hamdi A. Al-Jamimi, Ahmed Barradah and Salahadin Mohammed
King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia
{aljamimi, g199968940, adam}@kfupm.edu.sa
Abstract. Labeling schemes in XML trees have been developed to optimize query retrieval, since they
provide a quick way to determine the type of relationships that are present among the nodes. In order to
efficiently determine structural relationships among XML elements and to avoid re-labeling for updates,
much research about labeling schemes has been conducted, recently. Although, the recent proposed XML
dynamic labeling schemes avoid re-labeling, sometimes they need to do a lot of computations to specify
unique labels for the inserted nodes. In this paper, we review and compare a number of Dewey-based
labeling schemes that support XML data updates. Several significant criteria have been identified for the
evaluation and comparison. Moreover, we propose a novel dynamic labeling scheme, called Siblings
Labeling, for the XML trees. The Siblings Labeling scheme is similar to Dewey coding, however it adds
additional divisions to the node label to support the relationship retrieval perfectly as well as update XML
tree when insertion with a very minimal re-labeling.
Keywords: XML, labelling scheme, XML storage, XML Query processing, XML indexing
1. Introduction
The eXtensible Mark-up Language (XML) has emerged recently as a new standard for the exchanging
and publishing of data over the Web [1]. To facilitate the XML queries, two main techniques have been
proposed structural index and labelling schemes. In that way, documents that follow the XML standard can
be viewed as trees. XML database systems often give each item in the document (node in the tree) a unique
logical identifier (called a label) and use these labels for an efficient processing of queries. Thus, the
labelling schemes can be utilized to optimize query retrieval, since the structural relationships of the nodes,
such as Parent-Child (P-C), Ancestor-Descendant (A-D) as well as document order, can be efficiently
established by comparing their labels. Moreover, the labelling schemes lie at the core of query processing for
many XML database management systems [2, 3]. XML data on the Web are subjected to frequent updates.
Designing labelling schemes for dynamic XML documents is an important problem that has received a lot of
research attention. Existing dynamic labelling schemes, however, often sacrifice query performance and
introduce additional labelling cost to facilitate arbitrary updates even when the documents are seldom
updated [4].
In this paper, we have two main contributions summarized as follows: First, based on significant
evaluation criteria (presented in Section 3), we discuss the core idea of several Dewey-based labeling
schemes, evaluate and compare them. Second, we propose a novel labelling scheme, sibling labelling that
utilizes the Dewey coding. The proposed labeling scheme is designed to efficiently support queries for both
static and dynamic XML documents. Even though the proposed scheme may require very minimal relabeling when updating the XML documents, the structural relationships (A-D, P-C, siblings) of the nodes
can be retrieved easily without overhead computations.
The rest of the paper is organized as follows. Section 2 reviews related work and introduce briefly the
idea of each surveyed labelling scheme. Section 3 presents the evaluation criteria, while Section 4 compares
21
the different labelling schemes against each other. We describe our proposed labelling scheme in details in
Section 5. Then, Section 6 concludes the paper and introduces the possible future work.
2. Background and Related Work
In this section, we introduce briefly the core idea of several Dewey-based labelling schemes including:
DeweyID [5], OrdPath [6], TJFast [7], dynamic Dewey (DDE) [4], compact dynamic Dewey (CDDE) [4],
and Dewey with even numbers [8].
Dewey ID: Tatarinov et al. [5] introduced Dewey coding into XML query processing such that each
node is associated with a vector of numbers that represents the node-ID path from the root to the node.
Dewey ID labels the nth child of a node with an integer n, and this n should be concatenated to the prefix (its
parent’s label). In practice, Dewey ID uses UTF8 to process the delimiter. When a node is inserted, Dewey
ID needs to re-label the sibling nodes after this inserted node and the descendants of these siblings.
OrdPath: ONeil et al. [6] introduced OrdPath, which is a dynamic labelling scheme variant of the
Dewey order. The label of each node is determined by the Dewey order scheme except that it reserves even
and negative integers for later insertions into an existing tree. Also it stores the label of each node as the
compressed binary representation and the ancestor–descendant or parent–child structural relationship
between two nodes is determined by the substring comparison. Because of the reserved even and negative
integers, almost no node re-labeling occurred for new data insertions. When the sizes of the OrdPath codes
overflow which means it must re-label all the existing nodes.
TJFast: One of the other enhancement and extension for Dewey labelling scheme was proposed by Lu et
al. [5]. All the elements names along the path from the root to the element can be derived from the label of
an element alone. Based on extended Dewey they design a novel holistic twig join algorithm, TJFast. To
answer a twig query, TJFast only needs to access the labels of the leaf query nodes. This reduces disk access
and supports an efficient evaluation of queries with wildcards in branching nodes. However, this scheme is
not suitable for the update processing because the labels of whole nodes and the finite state transducer must
be reconstructed after data insertions.
Dewey with even numbers: Theo Harder et al. [8] suggest an insert-friendly labeling scheme based on
Dewey IDs. Each label is composed of divisions that reflect the element id as well as its parents’ IDs. In the
initial document labeling, only odd numbers are used with some gaps (they used a gap of 4) between the
numbers to facilitate the insertion of new labels. In case there is no space between the elements for insertion,
a new division is introduced with an even number which is not counted when computing the level of the
element.
DDE: A DDE [4] is tailored for both static and dynamic XML documents. For static documents, the
labels of DDE are the same as Dewey which yield compact size and high query performance. Unlike
containment labels which have level fields, the level information is implicitly represented by the label. A
DDE label implicitly stores the level information as the number of divisions in that label. This property will
remain true after random insertions and deletions. When updates take place, DDE can completely avoid relabeling and its label quality is most resilient to the number and order of insertions compared to the existing
approaches. Fig.2 demonstrates processing insertions with DDE labels.
CDDE: CDDE [4] is designed to optimize the performance of DDE for insertions. The label format of
CDDE is the same as DDE which is a sequence of divisions separated by ‘.’. Moreover, the initial labeling of
CDDE is the same as DDE. Unlike DDE labels whose first divisions are restricted to be positive decimal
numbers, the first division of a CDDE label can be either positive or negative. We refer to the CDDE labels
with positive first divisions as positive CDDE labels and those with negative first divisions as negative
CDDE labels. Fig.3 shows the insertions process with CDDE labels.
22
3. Evaluation Criteria
Since different labelling schemes have commonalities and viabilities, we attempt here to identify several
evaluation criteria to evaluate and compare the various properties of each labelling scheme.
• Storage Requirement- reflects the amount of storage space required to store the numbers or the
indexes. Also, it represents how efficiently a numbering scheme copes with the growth of the number
of nodes. This is because in most cases the index/number size is a function of the number of nodes.
• Supported Axes- represents the types of relationships between any two nodes that can be derived from
the numbering scheme
• Efficiency of Relationship Retrieval- represents the amount/type of calculation or comparisons
required to identify the relationships between two nodes given their indexes.
• Update Efficiency- indicates how efficient a numbering scheme is in the face of updates (e.g.
insertion, deletion); and whether or not it requires re-labeling of tree nodes.
4. Comparison and discussion
Storage Requirement
In general, the Dewey-based schemes have the same tactic when dealing with initial labeling or static
tree. The label division’s number of a particular node is equal to its level. These schemes differ when update
XML tree after insertions. In Dewey-based schemes the length of the label representation becomes longer by
frequent data insertions. For instance, in Dewey with even numbers labeling scheme the number of divisions
in the labels can grow with the horizontal and vertical growth of the xml tree.
Supported Axes
The different studies labelling schemes support almost all the common axes such as P-C, A-D and
sibling relationships and document order. That is, it is possible to determine if a given label A is the parent/
ancestor/ sibling/ preorder of another label B. In addition, it is possible whether label A precedes another
label B in document order. Moreover, some of these schemes like DDE, CDDE can support another axes like
Lowest Common Ancestor (LCA).
Efficiency of Relationship Retrieval
In the most surveyed schemes, such as Dewey ID, OrdPath, TJFast, and Even labeling schemes, only
simple comparisons are required to retrieve the relationship between any two given indexes the same as
when dealing with initial Dewey coding. In DDE and CDDE, verifying the relationship between given nodes
require a lot of calculations.
Table 1: Evaluation Summary of the Dynamic XML Labeling Schemes
Scheme
Dewey ID
DDE
CDDE
OrdPath
TJFast
Even-labeling
Storage Requirement
Supported Axes
Medium
Medium
Medium
High
Medium
High
Common axes
common axes
common axes
A-D, P-C
A-D, P-C
Common axes
Efficiency of Relationship
Retrieval
Simply retrieved
Perform calculations
Perform more calculations
Simply retrieved
Simply retrieved
Directly
Update Efficiency
re-label the sibling nodes
No re-labeling
No re-labeling
Re-labeling
Re-labeling
Scan and perform calculations
Update Efficiency
The need for XML tree updating arises when deleting existing nodes or inserting new nodes. The
deletion of labels does not affect the order of other labels while the insertions without re-labelling nodes are
not easy task. Essentially, the insertion could be leftmost, rightmost or between consecutive siblings. The
insertion between consecutive nodes vary form scheme to another. Although in the Dewey-based schemes do
not require re-labeling, they might require loading both the previous and following siblings’ indexes and
doing some computations. This kind of operations can be very costly especially when the indexes are huge.
In Table 1, we summarize the evaluation of the studied dynamic XML labeling schemes according to the
considered criteria (introduced in Section3).
5. New approach: Sibling-labeling scheme
23
One of the most important qualities of any re-labeling scheme is its behavior upon tree insertion or
deletion. Most of the techniques in the literature either require massive tree re-labeling after intensive update
operations or involve complex calculations to generate the new labels. Dewey encoding supports all the
common axes as well as retrieves the relationship between given nodes easily form their labels. However,
one huge drawback of Dewey labelling is that it certain update operations are very costly and might require
the complete re-labelling of the whole tree (except the root). Other Dewey-based schemes, as discussed in
Section 4, overcome this drawback and avoid re-labelling but retrieving the different axes is non-trivial task
and needs to perform some computations that may negatively affect the performance.
Motivated by this, we suggest the sibling-labeling scheme that is based on Dewey encoding. Our
technique avoids complex operations and requires at most the re-labeling of two adjacent nodes. Thereby, the
proposed sibling-labeling scheme can reduce the drawbacks related to the Dewey-based schemes
significantly. The basic idea is to add two divisions to the indexes in order to reflect the preorder numbers of
the next and previous siblings. This is shown in Figure 1; the root node is labeled 0.0.1 which means that it is
the first node in the level and does not have any siblings. The first child is labeled 0.2.1.1 where the last two
divisions are the Dewey label of the node and the first two divisions are the preorder numbers of the previous
and next siblings respectively. This way, only the second division of the previous node and the first division
of the next node need to be updated in the worst case. This means the maximum number of nodes re-labeling
required is 2. Figure 1 shows all the possible cases of insertion. In the first case, the node is inserted as the
last child (right most) and only the second division of the previous node needs to be updated. Figure 1b
shows the worst case scenario where the node is inserted between two existing nodes in which case only 2
re-labeling operations are required, the second division of the previous node and the first division of the next
node which both point to the preorder of the newly inserted node.
In terms of update efficiency, the sibling-labeling is more efficient than the Dewey labeling scheme in
the sense that it never requires the update of more than two nodes. The only situation where the Dewey
labeling scheme requires no updates is during the right most insertion in which case a single update is needed
by the sibling-labeling scheme. Of course this is if we exclude the single childe insertion because none of the
techniques should require re-labeling in this case. In comparison to the even-labeling [8] the sibling-labelling
performs very closely. This is due to the fact that in both techniques both the next and previous node indexes
are read. As for the even-labeling, it might need to scan many divisions of these two indexes and then apply
certain rules and sometimes perform certain operations to generate the new label. For example, if n1=
1.9.5.7.5 and n3= 1.9.5.7.16.5 then n2 will have the label 1.9.5.7.11 when inserted between n1 and n3. This
label is generated by scanning n1 and n3 and taking the median of the first differing divisions (5 and 16) to
generate the odd number 11. This kind of operations can be very costly especially when the labels are huge.
The sibling-labeling scheme on the other hand will scan at most the first two divisions of the previous and
next indexes and perform the update operation on one of these divisions as mentioned earlier. As for the
storage efficiency, the sibling-labelling is slightly inferior to the Dewy labelling because the sibling-label
will always have two more divisions than the normal Dewy label. Compared to the even-labelling scheme,
our sibling-labelling scheme can be more costly in case the tree has not undergone any updates. This is
because
Original Tree
a) Left Insertion
c) Right Insertion
Fig.1:All the cases of insertion in the sibling-labeling.
24
d) Middle Insertion
in this case, the even-labeling is identical to the Dewy labeling in terms of storage requirements. After
massive updates on the other hand, the sibling-labeling outperforms the even-labeling since the indexes in
the even-labeling can grow (get inflated) both horizontally (insertion at an existing level) and vertically (as
the height increases). For instance, if n nodes are to be inserted before the node d1= 1.5.2.1 and d1 is the first
child, then with the even-labeling the nth node will have a label that consists of n+1 2s (1.5.2.2.2...n+1.[odd
number]). Also, there are no restrictions on the number of even divisions that can be added to the label nor
there is a restriction on how big the even number is. Tables 2 and Table 3 summarize the above discussion.
Table 2: Storage Requirements
Scheme
Dewey
Even-labeling
Sibling-labeling
Initial tree
Length(dewey_label)
Length(dewey_label)
Length(dewey_label) + 2
Initial tree
Length(dewey_label)
Length(dewey_label) + N
Length(dewey_label) + 2
Table 3: Update Efficiency
Scheme
Dewey
Even-labeling
Sibling-labeling
Left insertion
Re-label all the following and their siblings
Scanning and perform calculations on next
sibling but no re-labeling
1
Right insertion
No re-labeling
No re-labeling
1
Middle insertion
Re-label all the following and their siblings
Scanning and perform calculations on next
& previous siblings but no re-labeling
2
6. Conclusion and Future Work
In this work, we reviewed, evaluated and compared a number of dynamic labeling schemes that support
XML data updates. Specifically, we considered deferent labelling schemes that based on Dewey encoding.
These labelling schemes can be considered superior in terms of update efficiency; however, they are not
prefect in terms of storage requirements. Additionally, we proposed a novel labeling scheme, sibling-labeling
scheme which is based on the famous Dewey coding. By using our scheme two nodes at most can be relabeled. Nonetheless, sibling-labeling is very perfect to support the common axes and retrieve the
relationships between the given nodes straightforwardly. As future work, we plan to develop our own Dewey
compression mechanism or test some of the existing compression techniques in order to reduce the label
storage overhead.
7. Acknowledgments
The authors would like to thank KFUPM for supporting this research.
8. References
[1] T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, and F. Yergeau. Extensible markup language (XML) 1.0
third edition W3C recommendation. 2000.
[2] A. Berglund, S. Boag, D. Chamberlin, M.F. Fernandez, M. Kay, J. Robie, and J. Simon. XML path language
(XPath) 2.0, W3C working draft 16. Technical ReportWD-xpath20-20020816, World Wide Web Consortium,2002.
[3] S. Boag, D. Chamberlin, M. F. Fernandez, D. Florescu, J. Robie, and J. Simon. XQuery 1.0: An XML Query
Language, W3C working draft 16. Technical Report WD-xquery-20020816, World Wide Web Consortium, 2002.
[4] Liang Xu, Tok Wang Ling, Huayu Wu, Zhifeng Bao. DDE: From Dewey to a Fully Dynamic XML Labeling
Scheme. In SIGMOD Conference, 2009, pp719-730.
[5] I. Tatarinov, S. Viglas, K. S. Beyer, J. Shanmugasundaram, E. J. Shekita, and C. Zhang. Storing and querying
ordered XML using a relational database system. In Proc. Of SIGMOD, 2002, pp. 204-215.
[6] P. O'Neil, E. O'Neil, S. Pal, I. Cseri, G. Schaller, and N. Westbury. ORDPATHs: Insert-friendly XML node labels.
In Proceedings of the ACM SIGMOD, 2004, pp. 903- 908.
[7] J. Lu, Ling, T.W., Chan, C., Chen, T. From region encoding to extended Dewey: on efficient processing of XML
twig pattern matching,. In: Proceedings of the VLDB, 2005, pp. 193–204.
[8] T. H¨arder, M. Haustein, C. Mathis, M. Wagner. Node Labeling Schemes for Dynamic XML Documents
Reconsidered. Data & Knowledge Engineering, 2006.
25
Download